Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps

Choi, Jeong-Hwan; Yang, Joon-Young; Chang, Joon-Hyuk

doi:10.1109/TASLP.2024.3463491

Full metadata record

DC Field	Value	Language
dc.contributor.author	Choi, Jeong-Hwan	-
dc.contributor.author	Yang, Joon-Young	-
dc.contributor.author	Chang, Joon-Hyuk	-
dc.date.accessioned	2026-05-12T06:00:10Z	-
dc.date.available	2026-05-12T06:00:10Z	-
dc.date.issued	2024-09	-
dc.identifier.issn	2329-9290	-
dc.identifier.issn	2329-9304	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212717	-
dc.description.abstract	Developing a lightweight speaker embedding extractor (SEE) is crucial for the practical implementation of automatic speaker verification (ASV) systems. To this end, we recently introduced broadcasting convolutional neural networks (CNNs)-meet-vision-Transformers (BC-CMT), a lightweight SEE that utilizes broadcasted residual learning (BRL) within the hybrid CNN-Transformer architecture to maintain a small number of model parameters. We proposed BC-CMT-based SEEs of three different sizes: BC-CMT-Tiny, -Small, and -Base. In this study, we extend our previously proposed BC-CMT by introducing an improved model architecture and a training strategy based on knowledge distillation (KD) using self-attention (SA) maps. First, to reduce the computational cost and latency of the BC-CMT, the two-dimensional (2D) SA operations in the BC-CMT, which calculate the SA maps in the frequency–time dimensions, are simplified to 1D SA operations that consider only temporal importance. Moreover, to enhance the SA capability of the BC-CMT, the group convolution layers in the SA block are adjusted to have a smaller number of groups and are combined with the BRL operations. Second, to improve the training effectiveness of the modified BC-CMT-Tiny, the SA maps of a pretrained large BC-CMT-Base are used for the KD to guide those of a smaller BC-CMT-Tiny. Because the attention map sizes of the modified BC-CMT models do not depend on the number of frequency bins or convolution channels, the proposed strategy enables KD between feature maps of different sizes. The experimental results demonstrate that the proposed BC-CMT-Tiny model, which has 271.44K model parameters, achieved a 36.8% reduction in floating point operations on 1 s signals and a 9.3% reduction in equal error rate (EER) on the VoxCeleb1 test set, compared to the conventional BC-CMT-Tiny. The CPU and GPU running times of the proposed BC-CMT-Tiny for 1 to 10 s signals ranged from 29.07 to 146.32 ms and from 36.01 to 206.43 ms, respectively. The proposed KD further reduced the EER by 15.5% with improved attention capability.	-
dc.format.extent	16	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	IEEE Advancing Technology for Humanity	-
dc.title	Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps	-
dc.type	Article	-
dc.publisher.location	United States	-
dc.identifier.doi	10.1109/TASLP.2024.3463491	-
dc.identifier.scopusid	2-s2.0-85205021029	-
dc.identifier.wosid	001342474600002	-
dc.identifier.bibliographicCitation	IEEE/ACM Transactions on Audio, Speech, and Language Processing, v.32, pp 4580 - 4595	-
dc.citation.title	IEEE/ACM Transactions on Audio, Speech, and Language Processing	-
dc.citation.volume	32	-
dc.citation.startPage	4580	-
dc.citation.endPage	4595	-
dc.type.docType	Article	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Acoustics	-
dc.relation.journalResearchArea	Engineering	-
dc.relation.journalWebOfScienceCategory	Acoustics	-
dc.relation.journalWebOfScienceCategory	Engineering, Electrical & Electronic	-
dc.subject.keywordPlus	Binary images	-
dc.subject.keywordPlus	Cellular arrays	-
dc.subject.keywordPlus	Convolutional neural networks	-
dc.subject.keywordPlus	Depth perception	-
dc.subject.keywordPlus	Flow visualization	-
dc.subject.keywordPlus	Graphics processing unit	-
dc.subject.keywordPlus	Image coding	-
dc.subject.keywordPlus	Image compression	-
dc.subject.keywordPlus	Image segmentation	-
dc.subject.keywordPlus	Inference engines	-
dc.subject.keywordPlus	Multilayer neural networks	-
dc.subject.keywordPlus	Personnel training	-
dc.subject.keywordPlus	Photomapping	-
dc.subject.keywordPlus	Radial basis function networks	-
dc.subject.keywordPlus	System-on-chip	-
dc.subject.keywordAuthor	Automatic speaker verification	-
dc.subject.keywordAuthor	knowledge distillation	-
dc.subject.keywordAuthor	lightweight model	-
dc.subject.keywordAuthor	speaker embedding extractor	-
dc.identifier.url	https://ieeexplore.ieee.org/document/10683974	-
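
The abstract's point that temporal (1D) self-attention maps have a size independent of the number of frequency bins or channels, which is what allows distillation between models of different widths, can be illustrated with a small sketch. This is not the authors' implementation: the frequency averaging, the identity query/key projections, and the KL-divergence loss are all assumptions made for illustration, and the names `temporal_attention_map`, `attention_kd_loss`, and `random_features` are hypothetical.

```python
import math
import random


def temporal_attention_map(features):
    """Return a (time x time) row-softmax attention map.

    `features` is a nested list shaped (channels, freq, time).
    Assumption: frequency is averaged out and queries/keys are the
    pooled frames themselves (no learned projections).
    """
    n_ch, n_freq, n_time = len(features), len(features[0]), len(features[0][0])
    # Average over frequency so each frame becomes a vector of length n_ch.
    frames = [
        [sum(features[c][f][t] for f in range(n_freq)) / n_freq for c in range(n_ch)]
        for t in range(n_time)
    ]
    scale = 1.0 / math.sqrt(n_ch)
    attn = []
    for q in frames:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in frames]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn


def attention_kd_loss(teacher_map, student_map, eps=1e-12):
    """Mean over query frames of KL(teacher row || student row).

    Assumption: KL divergence is used here only as one plausible
    distillation loss; the abstract does not name the loss.
    """
    loss = 0.0
    for t_row, s_row in zip(teacher_map, student_map):
        loss += sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(t_row, s_row))
    return loss / len(teacher_map)


def random_features(n_ch, n_freq, n_time, rng):
    """Random stand-in for an intermediate feature map (illustration only)."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(n_time)] for _ in range(n_freq)]
        for _ in range(n_ch)
    ]


rng = random.Random(0)
# Teacher and student differ in channels and frequency bins but share 6 frames.
teacher = temporal_attention_map(random_features(8, 10, 6, rng))
student = temporal_attention_map(random_features(4, 5, 6, rng))
loss = attention_kd_loss(teacher, student)
print(len(teacher), len(teacher[0]), len(student), len(student[0]))
```

Both maps come out as 6 x 6 even though the two feature maps have different channel and frequency sizes, so the loss can be computed directly without any projection layer to match shapes.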

Appears in Collections: Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Related Researcher

Chang, Joon-Hyuk: COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)
