Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis

Lee, Moa; Lee, Junmo; Chang, Joon-Hyuk

doi:10.1109/TASLP.2022.3156797

Full metadata record

DC Field	Value	Language
dc.contributor.author	Lee, Moa	-
dc.contributor.author	Lee, Junmo	-
dc.contributor.author	Chang, Joon-Hyuk	-
dc.date.accessioned	2022-07-06T08:36:18Z	-
dc.date.available	2022-07-06T08:36:18Z	-
dc.date.created	2022-04-06	-
dc.date.issued	2022-03	-
dc.identifier.issn	2329-9290	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/139269	-
dc.description.abstract	Deep learning-based speech synthesis has evolved by employing a sequence-to-sequence (seq2seq) structure with an attention mechanism. The seq2seq speech synthesis model consists of an encoder for delivering the linguistic features and a decoder for predicting the mel-spectrogram, and learns the alignment between text and speech through the attention mechanism. The decoder predicts the mel-spectrogram by an autoregressive flow that considers the current input and what it has learned from previous inputs. This is beneficial when processing sequential data, as in speech synthesis. However, the recursive generation of speech typically requires extensive training time, which slows the speed of synthesis. To overcome these obstacles, we propose a non-autoregressive framework for fully parallel deep convolutional neural speech synthesis. Firstly, we design a new synthesis paradigm that integrates a time-varying metatemplate (TVMT), whose length is modeled with a separate conditional distribution, to prepare the decoder input. The decoding step converts the TVMT into spectral features, which eliminates the autoregressive flow. Secondly, we propose a structure that uses multiple decoders interconnected by up-down chains with an iterative attention mechanism. The decoder chains distribute the burden of decoding, progressively infusing the information obtained from the training target example into the chains to refine the predicted spectral features at each decoding step. For each decoder, the attention mechanism is repeatedly applied to produce the elaborated alignment between the linguistic features and the TVMT, which is gradually transformed into the spectral features. The proposed architecture substantially improves the synthesis speed, and the resulting speech quality is superior to that of a conventional autoregressive model.	-
dc.language	English	-
dc.language.iso	en	-
dc.publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC	-
dc.title	Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis	-
dc.type	Article	-
dc.contributor.affiliatedAuthor	Chang, Joon-Hyuk	-
dc.identifier.doi	10.1109/TASLP.2022.3156797	-
dc.identifier.scopusid	2-s2.0-85126325429	-
dc.identifier.wosid	000772400900001	-
dc.identifier.bibliographicCitation	IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, v.30, pp.1150 - 1159	-
dc.relation.isPartOf	IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING	-
dc.citation.title	IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING	-
dc.citation.volume	30	-
dc.citation.startPage	1150	-
dc.citation.endPage	1159	-
dc.type.rims	ART	-
dc.type.docType	Article	-
dc.description.journalClass	1	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Acoustics	-
dc.relation.journalResearchArea	Engineering	-
dc.relation.journalWebOfScienceCategory	Acoustics	-
dc.relation.journalWebOfScienceCategory	Engineering, Electrical & Electronic	-
dc.subject.keywordPlus	Chains	-
dc.subject.keywordPlus	Convolution	-
dc.subject.keywordPlus	Data handling	-
dc.subject.keywordPlus	Deep learning	-
dc.subject.keywordPlus	Iterative decoding	-
dc.subject.keywordPlus	Linguistics	-
dc.subject.keywordPlus	Speech synthesis	-
dc.subject.keywordPlus	Spectrographs	-
dc.subject.keywordPlus	Attention mechanisms	-
dc.subject.keywordPlus	Attention-based end-to-end speech synthesis	-
dc.subject.keywordPlus	Auto-regressive	-
dc.subject.keywordPlus	End to end	-
dc.subject.keywordPlus	Iterative decodings	-
dc.subject.keywordPlus	Spectral feature	-
dc.subject.keywordPlus	Spectrograms	-
dc.subject.keywordPlus	Text to speech	-
dc.subject.keywordPlus	Time varying	-
dc.subject.keywordAuthor	Speech synthesis	-
dc.subject.keywordAuthor	Decoding	-
dc.subject.keywordAuthor	Training	-
dc.subject.keywordAuthor	Iterative decoding	-
dc.subject.keywordAuthor	Data models	-
dc.subject.keywordAuthor	Linguistics	-
dc.subject.keywordAuthor	Spectrogram	-
dc.subject.keywordAuthor	text-to-speech	-
dc.subject.keywordAuthor	attention-based end-to-end speech synthesis	-
dc.subject.keywordAuthor	deep learning	-
dc.identifier.url	https://ieeexplore.ieee.org/document/9729756	-

Appears in Collections: Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Related Researcher

Chang, Joon-Hyuk: COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)



