Improving Joint Speech and Emotion Recognition Using Global Style Tokens
- Authors
- Kyung, Jehyun; Seong, Ju-Seok; Choi, Jeong-Hwan; Jeoung, Ye-Rin; Chang, Joon-Hyuk
- Issue Date
- Aug-2023
- Publisher
- International Speech Communication Association
- Keywords
- automatic speech recognition; global style tokens; speech emotion recognition
- Citation
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, v.2023, pp.4528 - 4532
- Indexed
- SCOPUS
- Journal Title
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
- Volume
- 2023
- Start Page
- 4528
- End Page
- 4532
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/191792
- DOI
- 10.21437/Interspeech.2023-2375
- ISSN
- 2308-457X
- Abstract
- Automatic speech recognition (ASR) and speech emotion recognition (SER) are closely related in that the acoustic features of speech, such as pitch, tone, and intensity, can vary according to the speaker's emotional state. Our study focuses on a joint ASR and SER task, in which an emotion token is tagged and recognized along with the text. To further improve joint recognition performance, we propose a novel training method that adopts global style tokens (GSTs). A style embedding extracted from the GST module helps the joint ASR and SER model capture emotional information from speech. Specifically, a conformer-based joint ASR and SER model pre-trained on a large-scale dataset is fine-tuned together with the style embedding to improve both ASR and SER. Experimental results on the IEMOCAP dataset show that the proposed model achieves a word error rate of 15.8% and, on four-class emotion classification, weighted and unweighted accuracies of 75.1% and 76.3%, respectively.
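The core GST mechanism referenced in the abstract is attention over a bank of learnable style tokens: an utterance-level reference embedding queries the token bank, and the attention-weighted sum of tokens is the style embedding. A minimal single-head NumPy sketch (the token count and embedding dimension below are hypothetical, not the paper's settings):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gst_style_embedding(query, tokens):
    """Simplified single-head GST attention.

    query:  (d,) utterance-level reference embedding
    tokens: (n_tokens, d) bank of learnable style tokens
    returns (d,) style embedding = attention-weighted sum of tokens
    """
    scores = tokens @ query / np.sqrt(query.shape[0])  # (n_tokens,) scaled dot products
    weights = softmax(scores)                          # attention weights, sum to 1
    return weights @ tokens                            # convex combination of tokens

# Toy usage with hypothetical sizes: 10 style tokens of dimension 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 8))
query = rng.normal(size=8)
style = gst_style_embedding(query, tokens)
print(style.shape)  # (8,)
```

In the paper's setup this style embedding is fed to the conformer-based joint ASR/SER model during fine-tuning; the multi-head attention and reference encoder of the full GST module are omitted here for brevity.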
- Appears in Collections
- College of Engineering (Seoul) > Department of Electronic Engineering (Seoul) > 1. Journal Articles
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.