Neural ATSM: Fully Neural Network-based Adaptive Time-Scale Modification Using Sentence-Specific Dynamic Control

Lee, Jaeuk; Jang, Sohee; Chang, Joon-Hyuk

doi:10.21437/Interspeech.2024-2380

Full metadata record

DC Field	Value	Language
dc.contributor.author	Lee, Jaeuk	-
dc.contributor.author	Jang, Sohee	-
dc.contributor.author	Chang, Joon-Hyuk	-
dc.date.accessioned	2025-02-12T08:00:33Z	-
dc.date.available	2025-02-12T08:00:33Z	-
dc.date.issued	2024-09	-
dc.identifier.issn	1990-9772	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206471	-
dc.description.abstract	Adaptive time-scale modification (ATSM) adjusts audio speed adaptively and improves upon previous systems by tailoring the scale for each phoneme in two steps: phoneme positioning via the Montreal Forced Aligner (MFA) and reconstruction with an adaptive speaking rate. However, ATSM's phoneme-specific rate is constant across sentences, and MFA struggles with precise phoneme alignment in synthetic speech. To address this, we propose a fully neural network-based ATSM (Neural ATSM) that dynamically controls each phoneme's speaking rate so that it varies from sentence to sentence. It predicts phoneme-level rates using a speaking rate predictor and flexibly modifies the scales to fit sentence context using Gaussian upsampling and an attention mechanism, ensuring feature similarity with a soft dynamic time warping (Soft-DTW) loss. We also integrate a variational autoencoder (VAE) and flow models for enhanced time-scaled signals. Experimental results show that Neural ATSM outperforms ATSM for real and synthesized speech.	-
dc.format.extent	5	-
dc.language	English	-
dc.language.iso	ENG	-
dc.title	Neural ATSM: Fully Neural Network-based Adaptive Time-Scale Modification Using Sentence-Specific Dynamic Control	-
dc.type	Article	-
dc.identifier.doi	10.21437/Interspeech.2024-2380	-
dc.identifier.scopusid	2-s2.0-85214811259	-
dc.identifier.wosid	001331850105003	-
dc.identifier.bibliographicCitation	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 4903 - 4907	-
dc.citation.title	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH	-
dc.citation.startPage	4903	-
dc.citation.endPage	4907	-
dc.type.docType	Proceedings Paper	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalWebOfScienceCategory	Computer Science, Artificial Intelligence	-
dc.subject.keywordPlus	Speech enhancement	-
dc.subject.keywordAuthor	Adaptive time-scale modification	-
dc.subject.keywordAuthor	attention mechanism	-
dc.subject.keywordAuthor	Gaussian upsampling	-
dc.subject.keywordAuthor	speaking rate predictor	-
dc.identifier.url	https://www.isca-archive.org/interspeech_2024/lee24m_interspeech.html	-

Appears in Collections: Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Related Researcher

Chang, Joon-Hyuk: COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)
