Vector Field Decomposition-based Flow Matching for Zero-Shot Cross-Lingual Text-to-Speech
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Lee, Jaeuk | - |
| dc.contributor.author | Song, Nam-Seok | - |
| dc.contributor.author | Chang, Joon-Hyuk | - |
| dc.date.accessioned | 2025-11-26T08:00:55Z | - |
| dc.date.available | 2025-11-26T08:00:55Z | - |
| dc.date.issued | 2025-05 | - |
| dc.identifier.issn | 1070-9908 | - |
| dc.identifier.issn | 1558-2361 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209338 | - |
| dc.description.abstract | Zero-shot text-to-speech (TTS) has recently achieved remarkable performance by leveraging a speech prompt instead of a speaker embedding, as it provides richer information. However, zero-shot cross-lingual tasks synthesize speech in multiple languages according to a given language ID, regardless of the language of the speech prompt. Consequently, the inherent language-specific characteristics of the speech prompt may conflict with the language ID, potentially affecting the accuracy of language representation in speech. Thus, we propose vector field decomposition-based flow matching that decomposes the vector field into speaker and language components. These components are trained to be activated in different frequency bins, as speaker and language identity are distributed across distinct frequency ranges in speech. This approach is particularly effective for cross-lingual TTS, as it minimizes conflicts between speech prompts and language IDs. As a result, the summation of the two components directly forms the vector field that represents the probability path from a Gaussian distribution to the target data distribution (e.g., mel spectrogram). Experimental results demonstrate that the proposed method outperforms the conventional method in terms of both subjective and objective evaluations. | - |
| dc.format.extent | 5 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.title | Vector Field Decomposition-based Flow Matching for Zero-Shot Cross-Lingual Text-to-Speech | - |
| dc.type | Article | - |
| dc.publisher.location | United States | - |
| dc.identifier.doi | 10.1109/LSP.2025.3571407 | - |
| dc.identifier.scopusid | 2-s2.0-105005878653 | - |
| dc.identifier.wosid | 001579027500007 | - |
| dc.identifier.bibliographicCitation | IEEE Signal Processing Letters, v.32, pp 3560 - 3564 | - |
| dc.citation.title | IEEE Signal Processing Letters | - |
| dc.citation.volume | 32 | - |
| dc.citation.startPage | 3560 | - |
| dc.citation.endPage | 3564 | - |
| dc.type.docType | Article | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scie | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Engineering | - |
| dc.relation.journalWebOfScienceCategory | Engineering, Electrical & Electronic | - |
| dc.subject.keywordPlus | TTS | - |
| dc.subject.keywordAuthor | Flow matching | - |
| dc.subject.keywordAuthor | speech prompt | - |
| dc.subject.keywordAuthor | zero-shot cross-lingual text-to-speech | - |
| dc.identifier.url | https://ieeexplore.ieee.org/document/11006940 | - |
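
The abstract describes a vector field that is the sum of a speaker component and a language component, each active in different frequency bins. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes a straight-line probability path from Gaussian noise to the data, for which the flow matching target is `x1 - x0`. The names `decompose` and `split_bin`, the 8-bin toy "mel frame", and the choice of which band goes to which component are all hypothetical, since the abstract does not specify them.

```python
import random


def target_field(x0, x1):
    """Flow matching target for a straight-line path from x0 to x1 (assumed path)."""
    return [b - a for a, b in zip(x0, x1)]


def decompose(field, split_bin):
    """Split a field into two components with disjoint frequency support.

    Hypothetical rule: bins below split_bin go to the speaker component,
    the remaining bins to the language component.
    """
    speaker_part = [v if i < split_bin else 0.0 for i, v in enumerate(field)]
    language_part = [0.0 if i < split_bin else v for i, v in enumerate(field)]
    return speaker_part, language_part


def euler_integrate(x0, speaker_part, language_part, steps=10):
    """Integrate the summed field from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for _ in range(steps):
        x = [xi + dt * (s + l) for xi, s, l in zip(x, speaker_part, language_part)]
    return x


random.seed(0)
n_bins = 8                                            # toy frame, not a real mel size
x0 = [random.gauss(0.0, 1.0) for _ in range(n_bins)]  # Gaussian noise sample
x1 = [0.1 * i for i in range(n_bins)]                 # stand-in for target data

u = target_field(x0, x1)
speaker_part, language_part = decompose(u, split_bin=3)
x_end = euler_integrate(x0, speaker_part, language_part)
```

Because the two components have disjoint support, their sum reproduces the full field, and integrating that sum carries the noise sample to the target. In a trained model each component would be predicted by a network conditioned on the speech prompt or the language ID rather than derived from the known target as it is here.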
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.
