Improved CNN-Transformer Using Broadcasted Residual Learning for Text-Independent Speaker Verification
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Choi, Jeong-Hwan | - |
dc.contributor.author | Yang, Joon-Young | - |
dc.contributor.author | Jeoung, Ye-Rin | - |
dc.contributor.author | Chang, Joon-Hyuk | - |
dc.date.accessioned | 2022-12-20T06:25:17Z | - |
dc.date.available | 2022-12-20T06:25:17Z | - |
dc.date.created | 2022-11-02 | - |
dc.date.issued | 2022-09 | - |
dc.identifier.issn | 2308-457X | - |
dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/173091 | - |
dc.description.abstract | This study proposes a novel speaker embedding extractor architecture that effectively combines convolutional neural networks (CNNs) and Transformers. Based on the recently proposed CNNs-meet-vision-Transformers (CMT) architecture, we propose two strategies for efficient speaker embedding extraction modeling. First, we apply broadcast residual learning techniques to the building blocks of the CMT, allowing us to extract frequency-aware temporal features shared across frequency dimensions with a reduced set of parameters. Second, frequency-statistics-dependent attentive statistics pooling is proposed to aggregate attentive temporal statistics acquired from the means and standard deviations of input feature maps weighted along the frequency axis using an attention mechanism. The experimental results on the VoxCeleb-1 dataset show that the proposed model outperforms several CNN- and Transformer-based models with a similar number of model parameters. Moreover, the effectiveness of the proposed modifications to the CMT architecture is validated through ablation studies. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | International Speech Communication Association | - |
dc.title | Improved CNN-Transformer Using Broadcasted Residual Learning for Text-Independent Speaker Verification | - |
dc.type | Article | - |
dc.contributor.affiliatedAuthor | Chang, Joon-Hyuk | - |
dc.identifier.doi | 10.21437/Interspeech.2022-88 | - |
dc.identifier.scopusid | 2-s2.0-85140099933 | - |
dc.identifier.wosid | 000900724502080 | - |
dc.identifier.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, v.2022-September, pp.2223 - 2227 | - |
dc.relation.isPartOf | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
dc.citation.title | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
dc.citation.volume | 2022-September | - |
dc.citation.startPage | 2223 | - |
dc.citation.endPage | 2227 | - |
dc.type.rims | ART | - |
dc.type.docType | Proceedings Paper | - |
dc.description.journalClass | 1 | - |
dc.description.isOpenAccess | N | - |
dc.description.journalRegisteredClass | scopus | - |
dc.relation.journalResearchArea | Acoustics | - |
dc.relation.journalResearchArea | Audiology & Speech-Language Pathology | - |
dc.relation.journalResearchArea | Computer Science | - |
dc.relation.journalResearchArea | Engineering | - |
dc.relation.journalWebOfScienceCategory | Acoustics | - |
dc.relation.journalWebOfScienceCategory | Audiology & Speech-Language Pathology | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
dc.relation.journalWebOfScienceCategory | Engineering, Electrical & Electronic | - |
dc.subject.keywordPlus | Convolutional neural networks | - |
dc.subject.keywordPlus | Embeddings | - |
dc.subject.keywordPlus | Network architecture | - |
dc.subject.keywordPlus | Speech communication | - |
dc.subject.keywordPlus | Speech recognition | - |
dc.subject.keywordPlus | Deep neural networks | - |
dc.subject.keywordPlus | Attentive statistic pooling | - |
dc.subject.keywordPlus | Building blocks | - |
dc.subject.keywordPlus | Convolutional neural network | - |
dc.subject.keywordPlus | Extraction modeling | - |
dc.subject.keywordPlus | Hybrid deep neural network | - |
dc.subject.keywordPlus | Learning techniques | - |
dc.subject.keywordPlus | Temporal features | - |
dc.subject.keywordPlus | Text-independent speaker verification | - |
dc.subject.keywordPlus | Transformer | - |
dc.subject.keywordAuthor | attentive statistics pooling | - |
dc.subject.keywordAuthor | hybrid deep neural network | - |
dc.subject.keywordAuthor | Text-independent speaker verification | - |
dc.subject.keywordAuthor | Transformer | - |
dc.identifier.url | https://www.isca-speech.org/archive/interspeech_2022/choi22_interspeech.html | - |
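The abstract above names two architectural ideas: broadcasted residual learning inside the CMT building blocks, and a frequency-statistics-dependent attentive statistics pooling layer. The following is a minimal PyTorch sketch of those two ideas as they are described in the abstract; the layer choices, kernel sizes, and attention form are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class BroadcastResidualBlock(nn.Module):
    """Sketch of broadcasted residual learning: a 2D frequency-wise branch plus a
    1D temporal branch computed on frequency-averaged features and broadcast back
    over the frequency axis (assumed layer configuration)."""

    def __init__(self, channels: int):
        super().__init__()
        self.freq_dw = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(1, 0), groups=channels)
        self.temp_dw = nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        y = self.freq_dw(x)                            # frequency-wise depthwise conv
        z = y.mean(dim=2)                              # average over frequency -> (B, C, T)
        z = self.pointwise(self.act(self.temp_dw(z)))  # temporal features shared across frequency
        return x + y + z.unsqueeze(2)                  # broadcast (B, C, 1, T) over frequency


class FreqStatsAttentiveStatsPool(nn.Module):
    """Sketch of frequency-statistics-dependent attentive statistics pooling:
    frequency-axis attention yields per-frame weighted means/standard deviations,
    which are then aggregated over time with attentive statistics pooling."""

    def __init__(self, channels: int, att_dim: int = 128):
        super().__init__()
        self.freq_att = nn.Conv2d(channels, channels, kernel_size=1)
        self.time_att = nn.Sequential(
            nn.Conv1d(2 * channels, att_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(att_dim, 2 * channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        w_f = torch.softmax(self.freq_att(x), dim=2)        # attention over frequency
        mu_f = (w_f * x).sum(dim=2)                          # weighted mean  (B, C, T)
        var_f = (w_f * x.pow(2)).sum(dim=2) - mu_f.pow(2)
        sd_f = var_f.clamp(min=1e-8).sqrt()                  # weighted std   (B, C, T)
        stats = torch.cat([mu_f, sd_f], dim=1)               # (B, 2C, T)

        w_t = torch.softmax(self.time_att(stats), dim=2)     # attention over time
        mu_t = (w_t * stats).sum(dim=2)
        var_t = (w_t * stats.pow(2)).sum(dim=2) - mu_t.pow(2)
        sd_t = var_t.clamp(min=1e-8).sqrt()
        return torch.cat([mu_t, sd_t], dim=1)                # utterance-level vector (B, 4C)


# Example: a (batch=2, channels=64, freq=10, time=50) feature map
# x = torch.randn(2, 64, 10, 50)
# x = BroadcastResidualBlock(64)(x)
# emb = FreqStatsAttentiveStatsPool(64)(x)   # -> shape (2, 256)
```

The frequency-averaged temporal branch is what lets the temporal features be shared across frequency bins with fewer parameters, and the two-stage weighted-statistics pooling is one plausible reading of the pooling described in the abstract.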