Detailed Information


Neural ATSM: Fully Neural Network-based Adaptive Time-Scale Modification Using Sentence-Specific Dynamic Control

Full metadata record
DC Field: Value
dc.contributor.author: Lee, Jaeuk
dc.contributor.author: Jang, Sohee
dc.contributor.author: Chang, Joon-Hyuk
dc.date.accessioned: 2025-02-12T08:00:33Z
dc.date.available: 2025-02-12T08:00:33Z
dc.date.issued: 2024-09
dc.identifier.issn: 1990-9772
dc.identifier.uri: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206471
dc.description.abstract: Adaptive time-scale modification (ATSM) adaptively adjusts audio speed and improves upon previous systems by tailoring the scale for each phoneme in two steps: phoneme positioning via Montreal forced aligner (MFA) and reconstruction with adaptive speaking rate. However, ATSM's phoneme-specific rate is constant regardless of sentences, and MFA struggles with precise phoneme alignment in synthetic speech. Driven by this, we propose a fully neural network-based ATSM (Neural ATSM) that dynamically controls each phoneme's speaking rate to vary from sentence to sentence. It predicts phoneme-level rates using a speaking rate predictor and flexibly modifies the scales to fit sentence context using Gaussian upsampling and an attention mechanism, ensuring feature similarity with Soft-dynamic time warping (DTW) loss. We also integrate a variational autoencoder (VAE) and flow models for enhanced time-scaled signals. Experimental results show that Neural ATSM outperforms ATSM for real and synthesized speech.
dc.format.extent: 5
dc.language: English
dc.language.iso: ENG
dc.title: Neural ATSM: Fully Neural Network-based Adaptive Time-Scale Modification Using Sentence-Specific Dynamic Control
dc.type: Article
dc.identifier.doi: 10.21437/Interspeech.2024-2380
dc.identifier.scopusid: 2-s2.0-85214811259
dc.identifier.wosid: 001331850105003
dc.identifier.bibliographicCitation: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 4903 - 4907
dc.citation.title: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
dc.citation.startPage: 4903
dc.citation.endPage: 4907
dc.type.docType: Proceedings Paper
dc.description.isOpenAccess: N
dc.description.journalRegisteredClass: scopus
dc.relation.journalResearchArea: Computer Science
dc.relation.journalWebOfScienceCategory: Computer Science, Artificial Intelligence
dc.subject.keywordPlus: Speech enhancement
dc.subject.keywordAuthor: Adaptive time-scale modification
dc.subject.keywordAuthor: attention mechanism
dc.subject.keywordAuthor: Gaussian upsampling
dc.subject.keywordAuthor: speaking rate predictor
dc.identifier.url: https://www.isca-archive.org/interspeech_2024/lee24m_interspeech.html
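
Illustrative note: the abstract names Gaussian upsampling as the step that turns phoneme-level speaking rates into frame-level features. The sketch below is not the authors' implementation. It assumes the commonly used formulation in which each output frame is a softmax-normalized Gaussian mixture of the phoneme features, with each Gaussian centered on the midpoint of its rate-scaled phoneme. The function name `gaussian_upsample`, the fixed `sigma`, the half-frame offset, and the convention that a rate above 1 means faster speech are all assumptions made for illustration.

```python
import math


def gaussian_upsample(features, durations, rates, sigma=1.0):
    """Expand per-phoneme feature vectors to frame level (hypothetical helper).

    features:  list of per-phoneme vectors (lists of floats, equal length)
    durations: original phoneme durations in frames
    rates:     per-phoneme speaking-rate factors (assumed: >1 means faster)
    sigma:     Gaussian width in frames (assumed constant for simplicity)
    """
    # Scale each phoneme's duration by its own rate.
    scaled = [d / r for d, r in zip(durations, rates)]

    # Center of each phoneme = cumulative end position minus half its duration.
    centers = []
    end = 0.0
    for d in scaled:
        end += d
        centers.append(end - d / 2.0)

    n_frames = max(1, round(end))
    dim = len(features[0])
    frames = []
    for t in range(n_frames):
        pos = t + 0.5  # evaluate at the middle of the output frame
        # Log-domain Gaussian scores, shifted by the max for numerical stability.
        scores = [-((pos - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Each frame is a weighted average of the phoneme features.
        frames.append([
            sum(w / total * f[k] for w, f in zip(weights, features))
            for k in range(dim)
        ])
    return frames
```

For example, calling `gaussian_upsample([[0.0], [1.0]], [4, 4], [1.0, 2.0])` keeps the first phoneme at 4 frames and halves the second to 2, giving 6 output frames instead of 8. Because the weights are soft, frames near a phoneme boundary blend neighboring features rather than switching abruptly.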
Appears in Collections
Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Chang, Joon-Hyuk
COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)