Neural ATSM: Fully Neural Network-based Adaptive Time-Scale Modification Using Sentence-Specific Dynamic Control
- Authors
- Lee, Jaeuk; Jang, Sohee; Chang, Joon-Hyuk
- Issue Date
- Sep-2024
- Keywords
- Adaptive time-scale modification; attention mechanism; Gaussian upsampling; speaking rate predictor
- Citation
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 4903-4907
- Pages
- 5
- Indexed
- SCOPUS
- Journal Title
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
- Start Page
- 4903
- End Page
- 4907
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206471
- DOI
- 10.21437/Interspeech.2024-2380
- ISSN
- 1990-9772
- Abstract
- Adaptive time-scale modification (ATSM) adjusts audio speed adaptively and improves upon previous systems by tailoring the scale for each phoneme in two steps: phoneme positioning via the Montreal Forced Aligner (MFA) and reconstruction with an adaptive speaking rate. However, ATSM's phoneme-specific rate is constant regardless of the sentence, and MFA struggles to align phonemes precisely in synthetic speech. Motivated by these limitations, we propose a fully neural network-based ATSM (Neural ATSM) that dynamically controls each phoneme's speaking rate so that it varies from sentence to sentence. It predicts phoneme-level rates using a speaking rate predictor and flexibly modifies the scales to fit the sentence context using Gaussian upsampling and an attention mechanism, ensuring feature similarity with a soft dynamic time warping (Soft-DTW) loss. We also integrate a variational autoencoder (VAE) and flow models to enhance the time-scaled signals. Experimental results show that Neural ATSM outperforms ATSM on both real and synthesized speech.
- Appears in Collections
- Seoul College of Engineering > Seoul School of Convergence Electronic Engineering > 1. Journal Articles
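
The abstract names Gaussian upsampling as the step that stretches or compresses phoneme-level features according to predicted speaking rates, but this record gives no equations. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The function name `gaussian_upsample`, the fixed width `sigma`, and the convention that a rate above 1 shortens a phoneme (duration divided by rate) are all assumptions made for illustration.

```python
import numpy as np


def gaussian_upsample(phoneme_feats, durations, rates, sigma=1.0):
    """Expand phoneme-level features to frame level with Gaussian weights.

    phoneme_feats: (N, D) array, one feature vector per phoneme.
    durations:     (N,) original phoneme durations in frames.
    rates:         (N,) per-phoneme speaking rates (assumed: >1 means faster).
    sigma:         Gaussian width in frames (assumed fixed; a model could learn it).
    Returns (frames, weights) with shapes (T, D) and (T, N).
    """
    phoneme_feats = np.asarray(phoneme_feats, dtype=float)
    # Assumed convention: a faster rate shortens the phoneme.
    scaled = np.asarray(durations, dtype=float) / np.asarray(rates, dtype=float)
    ends = np.cumsum(scaled)
    centers = ends - 0.5 * scaled          # midpoint of each scaled phoneme
    num_frames = max(1, int(round(ends[-1])))
    t = np.arange(num_frames) + 0.5        # midpoint of each output frame

    # Softmax over phonemes of the negative squared distance to each center.
    logits = -((t[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    return weights @ phoneme_feats, weights


# Three phonemes; the middle one is spoken twice as fast as the others.
feats = np.eye(3)
frames, weights = gaussian_upsample(feats, [4, 6, 2], [1.0, 2.0, 1.0])
print(frames.shape)  # (9, 3)
```

Because every frame is a soft mixture of phoneme features rather than a hard copy, the mapping stays differentiable with respect to the rates, which is presumably why such a layer suits a fully neural pipeline.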
