Efficient Speaker Embedding Extraction Using a Twofold Sliding Window Algorithm for Speaker Diarization
- Authors
- Choi, Jeong-Hwan; Jeoung, Ye-Rin; Kim, Ilseok; Chang, Joon-Hyuk
- Issue Date
- Sep-2024
- Keywords
- segmentation; sliding window algorithm; speaker diarization; speaker embedding
- Citation
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 3749-3753
- Pages
- 5
- Indexed
- SCOPUS
- Journal Title
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
- Start Page
- 3749
- End Page
- 3753
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206483
- DOI
- 10.21437/Interspeech.2024-1874
- ISSN
- 1990-9772
- Abstract
- This paper proposes an efficient speaker embedding (SE) extraction method that employs a twofold sliding window algorithm (SWA) for speaker diarization (SD) systems. The first SWA produces non-overlapping short segments, which are fed into the frame-level neural networks of a pre-trained SE model to extract frame-level representations. The second SWA concatenates neighboring frame-level representations along the time axis, which enables an overlap between the concatenated representations. The concatenated representations are then used to extract multiple SEs. Additionally, we propose a fine-tuning strategy that applies a residual adapter and knowledge distillation to a pre-trained SE model to refine the frame-level representations. Experimental results on two SD benchmarks show that the proposed extraction method, combined with a fine-tuned SE model, reduces floating-point operations while maintaining the diarization error rate.
- Appears in Collections
- College of Engineering (Seoul) > Division of Electronic Engineering (Seoul) > 1. Journal Articles
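
The twofold sliding window described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the context-free `frame_level_net` stand-in, the statistics pooling step, and the parameters `seg_len`, `n_concat`, and `hop` are all assumptions made for illustration only.

```python
import numpy as np

def first_swa(features, seg_len):
    """First window: split (T, D) features into non-overlapping segments."""
    n_seg = features.shape[0] // seg_len  # trailing remainder dropped for simplicity
    return features[: n_seg * seg_len].reshape(n_seg, seg_len, features.shape[1])

def frame_level_net(segment, weight):
    """Hypothetical stand-in for the frame-level layers of a pre-trained SE model."""
    return np.tanh(segment @ weight)  # (seg_len, H)

def second_swa(frame_reps, n_concat, hop):
    """Second window: concatenate n_concat neighboring segment representations
    along the time axis. With hop < n_concat, consecutive windows overlap."""
    windows = []
    for start in range(0, len(frame_reps) - n_concat + 1, hop):
        windows.append(np.concatenate(frame_reps[start:start + n_concat], axis=0))
    return windows

def pool(window):
    """Assumed pooling step: mean and standard deviation over time."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0)])

def extract_embeddings(features, weight, seg_len=50, n_concat=3, hop=1):
    segments = first_swa(features, seg_len)
    # Each short segment passes through the frame-level network exactly once.
    frame_reps = [frame_level_net(s, weight) for s in segments]
    return np.stack([pool(w) for w in second_swa(frame_reps, n_concat, hop)])

rng = np.random.default_rng(0)
features = rng.standard_normal((500, 40))  # 500 frames of 40-dim features (made up)
weight = rng.standard_normal((40, 16))     # made-up frame-level weights
embeddings = extract_embeddings(features, weight)
print(embeddings.shape)
```

In this sketch, with `n_concat=3` and `hop=1`, a conventional overlapping window would run the frame-level network on most frames three times, whereas here each frame is processed once and only the cheaper concatenation and pooling are repeated. The stand-in network has no temporal context, so splitting the input does not change its outputs; a real SE model generally does have such context.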
