Task-Specific Optimization of Virtual Channel Linear Prediction-Based Speech Dereverberation Front-End for Far-Field Speaker Verification
- Authors
- Yang, Joon-Young; Chang, Joon-Hyuk
- Issue Date
- Sep-2022
- Publisher
- IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
- Keywords
- Noise reduction; Training; Noise measurement; Task analysis; Optimization; Microphones; Reverberation; Deep neural network; offline processing; speaker verification; speech dereverberation; single microphone; virtual acoustic channel expansion; weighted prediction error
- Citation
- IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, v.30, pp.3144 - 3159
- Indexed
- SCIE; SCOPUS
- Journal Title
- IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING
- Volume
- 30
- Start Page
- 3144
- End Page
- 3159
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/173113
- DOI
- 10.1109/TASLP.2022.3205752
- ISSN
- 2329-9290
- Abstract
- Developing a single-microphone speech denoising or dereverberation front-end for robust automatic speaker verification (ASV) in noisy far-field speaking scenarios is challenging. To address this problem, we present a novel front-end design built on a recently proposed extension of the weighted prediction error (WPE) speech dereverberation algorithm, the virtual acoustic channel expansion (VACE)-WPE. This study demonstrates experimentally that, unlike the conventional WPE algorithm, the VACE-WPE can be explicitly trained to cancel out both late reverberation and background noise. To build the front-end, the VACE-WPE is first (pre)trained to preserve the noise components in the input signals and produce "noisy" dereverberated output signals, which inductively biases the front-end to preserve as many noise components as possible and to perform dereverberation only. Subsequently, given a pretrained speaker embedding model, the VACE-WPE is fine-tuned within a task-specific optimization (TSO) framework so that the speaker embedding extracted from the processed signal becomes similar to that extracted from the "noise-free" target signal. Consequently, the front-end is optimized to avoid excessive denoising, thus achieving "generally safe" dereverberation and denoising for far-field ASV. Moreover, to prevent the front-end from adversely affecting unconstrained "in-the-wild" ASV performance under more general, non-far-field conditions, we propose a distortion regularization method within the TSO framework. The effectiveness of the proposed approach is verified on both far-field and in-the-wild ASV benchmarks, demonstrating its superiority over fully neural front-ends and other TSO methods in various cases.
- Appears in Collections
- College of Engineering (Seoul) > School of Electronic Engineering (Seoul) > 1. Journal Articles
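The task-specific optimization idea described in the abstract (pull the speaker embedding of the processed signal toward that of the noise-free target, while regularizing distortion) can be illustrated with a small sketch. This is not the authors' formulation: the abstract does not give the loss, so the cosine-distance embedding term, the mean-squared-error regularizer against the front-end input, the `reg_weight` value, and the names `tso_loss` and `toy_embed` are all assumptions made for illustration only.

```python
import math


def cosine_distance(a, b, eps=1e-8):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def mean_squared_error(a, b):
    """Average squared difference between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def tso_loss(embed_fn, processed, target, front_end_input, reg_weight=0.1):
    """Hypothetical TSO objective (assumed form, not taken from the paper).

    embed_fn        : stand-in for a pretrained speaker embedding model
    processed       : front-end output signal
    target          : "noise-free" target signal
    front_end_input : signal fed to the front-end
    reg_weight      : assumed weight of the distortion penalty
    """
    # Embedding term: make the processed embedding resemble the target embedding.
    embedding_term = cosine_distance(embed_fn(processed), embed_fn(target))
    # Assumed regularizer: penalize deviation of the output from the input.
    distortion_term = mean_squared_error(processed, front_end_input)
    return embedding_term + reg_weight * distortion_term


def toy_embed(signal):
    """Toy embedding used only to make the sketch runnable."""
    half = len(signal) // 2
    return [sum(signal[:half]), sum(signal[half:]), sum(abs(s) for s in signal)]


clean = [0.5, -1.0, 2.0, 0.25]
shifted = [v + 0.1 for v in clean]
loss_identity = tso_loss(toy_embed, clean, clean, clean)
loss_shifted = tso_loss(toy_embed, shifted, clean, clean)
```

In this toy setting the loss is essentially zero when the processed signal equals both the target and the input, and it grows when the output drifts away from them. A real system would replace `toy_embed` with a trained speaker embedding network and backpropagate through the front-end.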