Detailed Information

Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis

Full metadata record
DC Field / Value
dc.contributor.author: Lee, Moa
dc.contributor.author: Lee, Junmo
dc.contributor.author: Chang, Joon-Hyuk
dc.date.accessioned: 2022-07-06T08:36:18Z
dc.date.available: 2022-07-06T08:36:18Z
dc.date.created: 2022-04-06
dc.date.issued: 2022-03
dc.identifier.issn: 2329-9290
dc.identifier.uri: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/139269
dc.description.abstract: Deep learning-based speech synthesis has evolved by employing a sequence-to-sequence (seq2seq) structure with an attention mechanism. A seq2seq speech synthesis model consists of an encoder that delivers the linguistic features and a decoder that predicts the mel-spectrogram, and it learns the alignment between text and speech through the attention mechanism. The decoder predicts the mel-spectrogram through an autoregressive flow that considers the current input together with what it has learned from previous inputs. This is beneficial for processing sequential data, as in speech synthesis. However, the recursive generation of speech typically requires extensive training time and slows synthesis. To overcome these obstacles, we propose a non-autoregressive framework for fully parallel deep convolutional neural speech synthesis. First, we design a new synthesis paradigm that integrates a time-varying metatemplate (TVMT), whose length is modeled with a separate conditional distribution, to prepare the decoder input. The decoding step converts the TVMT into spectral features, which eliminates the autoregressive flow. Second, we propose a structure that uses multiple decoders interconnected by up-down chains with an iterative attention mechanism. The decoder chains distribute the burden of decoding, progressively infusing information obtained from the training target example into the chains to refine the predicted spectral features at each decoding step. For each decoder, the attention mechanism is applied repeatedly to produce an elaborated alignment between the linguistic features and the TVMT, which is gradually transformed into the spectral features. The proposed architecture substantially improves synthesis speed, and the resulting speech quality is superior to that of a conventional autoregressive model.
dc.language: English
dc.language.iso: en
dc.publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.title: Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis
dc.type: Article
dc.contributor.affiliatedAuthor: Chang, Joon-Hyuk
dc.identifier.doi: 10.1109/TASLP.2022.3156797
dc.identifier.scopusid: 2-s2.0-85126325429
dc.identifier.wosid: 000772400900001
dc.identifier.bibliographicCitation: IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, v.30, pp.1150 - 1159
dc.relation.isPartOf: IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING
dc.citation.title: IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING
dc.citation.volume: 30
dc.citation.startPage: 1150
dc.citation.endPage: 1159
dc.type.rims: ART
dc.type.docType: Article
dc.description.journalClass: 1
dc.description.isOpenAccess: N
dc.description.journalRegisteredClass: scie
dc.description.journalRegisteredClass: scopus
dc.relation.journalResearchArea: Acoustics
dc.relation.journalResearchArea: Engineering
dc.relation.journalWebOfScienceCategory: Acoustics
dc.relation.journalWebOfScienceCategory: Engineering, Electrical & Electronic
dc.subject.keywordPlus: Chains
dc.subject.keywordPlus: Convolution
dc.subject.keywordPlus: Data handling
dc.subject.keywordPlus: Deep learning
dc.subject.keywordPlus: Iterative decoding
dc.subject.keywordPlus: Linguistics
dc.subject.keywordPlus: Speech synthesis
dc.subject.keywordPlus: Spectrographs
dc.subject.keywordPlus: Attention mechanisms
dc.subject.keywordPlus: Attention-based end-to-end speech synthesis
dc.subject.keywordPlus: Auto-regressive
dc.subject.keywordPlus: End to end
dc.subject.keywordPlus: Iterative decodings
dc.subject.keywordPlus: Spectral feature
dc.subject.keywordPlus: Spectrograms
dc.subject.keywordPlus: Text to speech
dc.subject.keywordPlus: Time varying
dc.subject.keywordAuthor: Speech synthesis
dc.subject.keywordAuthor: Decoding
dc.subject.keywordAuthor: Training
dc.subject.keywordAuthor: Iterative decoding
dc.subject.keywordAuthor: Data models
dc.subject.keywordAuthor: Linguistics
dc.subject.keywordAuthor: Spectrogram
dc.subject.keywordAuthor: text-to-speech
dc.subject.keywordAuthor: attention-based end-to-end speech synthesis
dc.subject.keywordAuthor: deep learning
dc.identifier.url: https://ieeexplore.ieee.org/document/9729756
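
Note on the method summarized in the abstract above: the decoder consumes a time-varying metatemplate (TVMT) whose length is predicted separately, attends to the encoded linguistic features, and emits all mel-spectrogram frames in parallel, with attention applied repeatedly to refine the alignment. Below is a minimal, hypothetical PyTorch-style sketch of that idea only; the module names, layer choices, dimensions, and the single-decoder simplification are assumptions made for illustration, not the paper's implementation (which chains multiple convolutional decoders with up-down connections).

# Hypothetical sketch of non-autoregressive, TVMT-conditioned decoding.
# All names, shapes, and the single-decoder simplification are assumptions;
# the paper itself uses chained convolutional decoders with iterative attention.
import torch
import torch.nn as nn


class ParallelDecoderSketch(nn.Module):
    def __init__(self, text_dim=256, tvmt_dim=128, mel_dim=80):
        super().__init__()
        # Project the TVMT (one entry per output frame) into the query space.
        self.tvmt_proj = nn.Linear(tvmt_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, num_heads=4, batch_first=True)
        # Convolutional post-net that maps refined frame queries to mel bins.
        self.to_mel = nn.Sequential(
            nn.Conv1d(text_dim, text_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(text_dim, mel_dim, kernel_size=1),
        )

    def forward(self, text_enc, tvmt, n_refine=3):
        # text_enc: (B, T_text, text_dim) encoder outputs (linguistic features)
        # tvmt:     (B, T_mel, tvmt_dim) metatemplate; its length T_mel is
        #           assumed to come from a separate length predictor.
        q = self.tvmt_proj(tvmt)
        for _ in range(n_refine):
            # Iteratively refine the alignment between TVMT frames and text;
            # every frame attends in parallel, so there is no autoregressive loop.
            ctx, _ = self.attn(q, text_enc, text_enc)
            q = q + ctx
        return self.to_mel(q.transpose(1, 2)).transpose(1, 2)  # (B, T_mel, mel_dim)


if __name__ == "__main__":
    dec = ParallelDecoderSketch()
    text = torch.randn(2, 40, 256)   # 40 encoded text tokens
    tvmt = torch.randn(2, 200, 128)  # 200 output frames from a separate length model
    print(dec(text, tvmt).shape)     # torch.Size([2, 200, 80])

Because every TVMT frame queries the text encoding at once, generation of one frame never waits on previously generated frames, which is the source of the synthesis-speed gain the abstract claims over autoregressive decoding.
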
Files in This Item
Go to Link
Appears in Collections
College of Engineering (Seoul) > School of Electronic Engineering (Seoul) > 1. Journal Articles


Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Chang, Joon-Hyuk
COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)