H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-Speech

Seong, Donghyun; Chang, Joon-Hyuk

doi:10.21437/Interspeech.2024-1480

Full metadata record

DC Field	Value	Language
dc.contributor.author	Seong, Donghyun	-
dc.contributor.author	Chang, Joon-Hyuk	-
dc.date.accessioned	2025-02-13T02:00:11Z	-
dc.date.available	2025-02-13T02:00:11Z	-
dc.date.issued	2024-09	-
dc.identifier.issn	1990-9772	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206475	-
dc.description.abstract	Conversational text-to-speech (TTS) aims to synthesize natural voices appropriate to a situation by considering the context of past conversations as well as the current text. However, analyzing and modeling the context of a conversation remains challenging. Most conversational TTS use the content of historical and recent conversations without distinguishing between them and often generate speech that does not fit the situation. Hence, we introduce a novel conversational TTS, H4C-TTS, that leverages multi-modal historical context to realize contextually appropriate natural speech synthesis. To facilitate conversational context modeling, we design a context encoder that incorporates historical and recent contexts and a multi-modal encoder that processes textual and acoustic inputs. Experimental results demonstrate that the proposed model significantly improves the naturalness and quality of speech in conversational contexts compared with existing conversational TTS.	-
dc.format.extent	5	-
dc.language	English	-
dc.language.iso	ENG	-
dc.title	H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-Speech	-
dc.type	Article	-
dc.identifier.doi	10.21437/Interspeech.2024-1480	-
dc.identifier.scopusid	2-s2.0-85214843687	-
dc.identifier.wosid	001331850105009	-
dc.identifier.bibliographicCitation	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 4933 - 4937	-
dc.citation.title	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH	-
dc.citation.startPage	4933	-
dc.citation.endPage	4937	-
dc.type.docType	Proceedings Paper	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalWebOfScienceCategory	Computer Science, Artificial Intelligence	-
dc.subject.keywordPlus	'current	-
dc.subject.keywordPlus	Context models	-
dc.subject.keywordPlus	Conversational speech	-
dc.subject.keywordPlus	Conversational speech synthesis	-
dc.subject.keywordPlus	Multi-modal	-
dc.subject.keywordPlus	Natural speech	-
dc.subject.keywordPlus	Quality of speech	-
dc.subject.keywordPlus	Text to speech	-
dc.subject.keywordAuthor	conversational speech synthesis	-
dc.subject.keywordAuthor	multi-modal	-
dc.subject.keywordAuthor	Text-to-speech	-

Files in This Item: There are no files associated with this item.

Appears in Collections: Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Related Researcher

Chang, Joon-Hyuk: COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)
