Multimodal Emotion Recognition with Target Speaker-Based Facial Embeddings
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Heo, Serin | - |
| dc.contributor.author | Kyung, Jehyun | - |
| dc.contributor.author | Chang, Joon-Hyuk | - |
| dc.date.accessioned | 2025-05-28T02:00:10Z | - |
| dc.date.available | 2025-05-28T02:00:10Z | - |
| dc.date.issued | 2025-03 | - |
| dc.identifier.issn | 0736-7791 | - |
| dc.identifier.issn | 1520-6149 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/207455 | - |
| dc.description.abstract | Effectively recognizing emotions requires sophisticated approaches for interpreting diverse modalities, particularly in real-world scenarios where multiple data sources, such as speech, text, and visual cues, are often noisy and incomplete. This study proposes a multimodal emotion recognition system that integrates these three modalities and adds a speaker detection and extraction algorithm applied to the visual data. A pre-trained Q-Former then captures and interprets the visual signals, guided by designated prompts, to produce facial-related features that significantly improve emotion recognition performance. A cross-modal transformer then unifies the visual, speech, and text embeddings for emotion classification. Compared to the baseline, the system achieves improvements of 2.9% in accuracy and 3.3% in F1 score on the MELD dataset. | - |
| dc.format.extent | 5 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | Institute of Electrical and Electronics Engineers Inc. | - |
| dc.title | Multimodal Emotion Recognition with Target Speaker-Based Facial Embeddings | - |
| dc.type | Article | - |
| dc.publisher.location | United States | - |
| dc.identifier.doi | 10.1109/ICASSP49660.2025.10888205 | - |
| dc.identifier.scopusid | 2-s2.0-105003892031 | - |
| dc.identifier.bibliographicCitation | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp 1 - 5 | - |
| dc.citation.title | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings | - |
| dc.citation.startPage | 1 | - |
| dc.citation.endPage | 5 | - |
| dc.type.docType | Conference paper | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.subject.keywordAuthor | cross-modal attention | - |
| dc.subject.keywordAuthor | multimodal emotion recognition | - |
| dc.subject.keywordAuthor | query transformer | - |
| dc.subject.keywordAuthor | target speaker | - |
| dc.identifier.url | https://ieeexplore.ieee.org/document/10888205 | - |
