TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Seong, Donghyun | - |
| dc.contributor.author | Lee, Hoyoung | - |
| dc.contributor.author | Chang, Joon-Hyuk | - |
| dc.date.accessioned | 2025-02-12T07:00:41Z | - |
| dc.date.available | 2025-02-12T07:00:41Z | - |
| dc.date.issued | 2024-09 | - |
| dc.identifier.issn | 1990-9772 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206466 | - |
| dc.description.abstract | Expressive text-to-speech (TTS) aims to synthesize better human-like speech by incorporating diverse speech styles or emotions. While most expressive TTS models rely on reference speech to condition the style of the generated speech, they often fail to generate speech of regular quality. To ensure consistent speech quality, we propose an expressive TTS conditioned on style representation extracted from the text itself. To implement this text-based style predictor, we design a style module incorporating residual vector quantization. Furthermore, the style representation is enhanced through style-to-text alignment and a mel decoder with style hierarchical layer normalization (SHLN). Our experimental findings demonstrate that our proposed model accurately estimates style representation, enabling the generation of high-quality speech without the need for reference speech. | - |
| dc.format.extent | 5 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.title | TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.21437/Interspeech.2024-1734 | - |
| dc.identifier.scopusid | 2-s2.0-85214799703 | - |
| dc.identifier.wosid | 001331850101185 | - |
| dc.identifier.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 1780-1784 | - |
| dc.citation.title | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
| dc.citation.startPage | 1780 | - |
| dc.citation.endPage | 1784 | - |
| dc.type.docType | Proceedings Paper | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Computer Science | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
| dc.subject.keywordPlus | Condition | - |
| dc.subject.keywordPlus | Expressive speech synthesis | - |
| dc.subject.keywordPlus | Human like | - |
| dc.subject.keywordPlus | Residual vector quantizations | - |
| dc.subject.keywordPlus | Speech emotions | - |
| dc.subject.keywordPlus | Speech models | - |
| dc.subject.keywordPlus | Speech quality | - |
| dc.subject.keywordPlus | Speech style | - |
| dc.subject.keywordPlus | Text alignments | - |
| dc.subject.keywordPlus | Text to speech | - |
| dc.subject.keywordAuthor | expressive speech synthesis | - |
| dc.subject.keywordAuthor | residual vector quantization | - |
| dc.subject.keywordAuthor | Text-to-speech | - |
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.
