TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech
- Authors
- Seong, Donghyun; Lee, Hoyoung; Chang, Joon-Hyuk
- Issue Date
- Sep-2024
- Keywords
- expressive speech synthesis; residual vector quantization; text-to-speech
- Citation
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 1780 - 1784
- Pages
- 5
- Indexed
- SCOPUS
- Journal Title
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
- Start Page
- 1780
- End Page
- 1784
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206466
- DOI
- 10.21437/Interspeech.2024-1734
- ISSN
- 1990-9772
- Abstract
- Expressive text-to-speech (TTS) aims to synthesize more human-like speech by incorporating diverse speech styles or emotions. Most expressive TTS models condition the style of the generated speech on reference speech, but they often fail to produce speech of consistent quality. To ensure consistent speech quality, we propose an expressive TTS model conditioned on a style representation extracted from the text itself. To implement this text-based style predictor, we design a style module that incorporates residual vector quantization. The style representation is further enhanced through style-to-text alignment and a mel decoder with style hierarchical layer normalization (SHLN). Our experiments show that the proposed model accurately estimates the style representation, enabling the generation of high-quality speech without reference speech.
- Appears in Collections
- Seoul College of Engineering > Seoul Department of Electronic Engineering > 1. Journal Articles
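
The abstract names residual vector quantization (RVQ) as the core of the style module but does not spell out the mechanism. The following is a minimal, generic sketch of RVQ in NumPy, not the authors' implementation: each stage picks the nearest code for whatever residual the previous stages left behind, and the quantized vector is the sum of the chosen codes. The function names `rvq_encode` and `rvq_decode`, the dimensions, and the randomly drawn codebooks are all assumptions made for illustration; in a trained model the codebooks would be learned.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage quantizes the remaining residual."""
    residual = x.astype(np.float64)      # astype returns a copy, so x is untouched
    quantized = np.zeros_like(residual)
    indices = []
    for cb in codebooks:                 # cb has shape (num_codes, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))      # nearest code for the current residual
        indices.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]
    return indices, quantized


def rvq_decode(indices, codebooks):
    """Rebuild the quantized vector by summing the selected codes."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))


# Hypothetical setup: 4 stages of 16 codes each over an 8-dimensional style vector.
rng = np.random.default_rng(0)
dim, num_codes, num_stages = 8, 16, 4
codebooks = [
    rng.normal(scale=0.5 ** stage, size=(num_codes, dim))
    for stage in range(num_stages)
]
style = rng.normal(size=dim)             # stand-in for a style embedding
indices, approx = rvq_encode(style, codebooks)
```

Under these assumptions, a style vector is represented by one code index per stage, which is the kind of discrete target a text-based predictor could be trained to estimate in place of extracting style from reference speech.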
