Multimodal Emotion Recognition with Target Speaker-Based Facial Embeddings
- Authors
- Heo, Serin; Kyung, Jehyun; Chang, Joon-Hyuk
- Issue Date
- Mar-2025
- Publisher
- Institute of Electrical and Electronics Engineers Inc.
- Keywords
- cross-modal attention; multimodal emotion recognition; query transformer; target speaker
- Citation
- ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp 1 - 5
- Pages
- 5
- Indexed
- SCOPUS
- Journal Title
- ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
- Start Page
- 1
- End Page
- 5
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/207455
- DOI
- 10.1109/ICASSP49660.2025.10888205
- ISSN
- 0736-7791; 1520-6149
- Abstract
- Effective emotion recognition requires approaches that can interpret diverse modalities, particularly in real-world scenarios where data sources such as speech, text, and visual cues are often noisy and incomplete. This study proposes a multimodal emotion recognition system that integrates these three modalities by adding a speaker detection and extraction algorithm for the visual data. A pre-trained Q-Former then interprets the visual signals, guided by designated prompts, to produce facial-related features that significantly improve emotion recognition performance. A cross-modal transformer then unifies the visual, speech, and text embeddings for accurate emotion classification. On the MELD dataset, the system improves accuracy by 2.9% and F1 score by 3.3% over the baseline.
- Appears in Collections
- College of Engineering (Seoul) > Department of Electronic Engineering (Seoul) > 1. Journal Articles