H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-Speech
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Seong, Donghyun | - |
| dc.contributor.author | Chang, Joon-Hyuk | - |
| dc.date.accessioned | 2025-02-13T02:00:11Z | - |
| dc.date.available | 2025-02-13T02:00:11Z | - |
| dc.date.issued | 2024-09 | - |
| dc.identifier.issn | 1990-9772 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206475 | - |
| dc.description.abstract | Conversational text-to-speech (TTS) aims to synthesize natural voices appropriate to a situation by considering the context of past conversations as well as the current text. However, analyzing and modeling the context of a conversation remains challenging. Most conversational TTS systems use the content of historical and recent conversations without distinguishing between them and often generate speech that does not fit the situation. Hence, we introduce a novel conversational TTS system, H4C-TTS, that leverages multi-modal historical context to realize contextually appropriate natural speech synthesis. To facilitate conversational context modeling, we design a context encoder that incorporates historical and recent contexts and a multi-modal encoder that processes textual and acoustic inputs. Experimental results demonstrate that the proposed model significantly improves the naturalness and quality of speech in conversational contexts compared with existing conversational TTS systems. | - |
| dc.format.extent | 5 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.title | H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-Speech | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.21437/Interspeech.2024-1480 | - |
| dc.identifier.scopusid | 2-s2.0-85214843687 | - |
| dc.identifier.wosid | 001331850105009 | - |
| dc.identifier.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 4933 - 4937 | - |
| dc.citation.title | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
| dc.citation.startPage | 4933 | - |
| dc.citation.endPage | 4937 | - |
| dc.type.docType | Proceedings Paper | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Computer Science | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
| dc.subject.keywordPlus | 'current | - |
| dc.subject.keywordPlus | Context models | - |
| dc.subject.keywordPlus | Conversational speech | - |
| dc.subject.keywordPlus | Conversational speech synthesis | - |
| dc.subject.keywordPlus | Multi-modal | - |
| dc.subject.keywordPlus | Natural speech | - |
| dc.subject.keywordPlus | Quality of speech | - |
| dc.subject.keywordPlus | Text to speech | - |
| dc.subject.keywordAuthor | conversational speech synthesis | - |
| dc.subject.keywordAuthor | multi-modal | - |
| dc.subject.keywordAuthor | Text-to-speech | - |
