Improved CNN-Transformer Using Broadcasted Residual Learning for Text-Independent Speaker Verification
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Choi, Jeong-Hwan | - |
dc.contributor.author | Yang, Joon-Young | - |
dc.contributor.author | Jeoung, Ye-Rin | - |
dc.contributor.author | Chang, Joon-Hyuk | - |
dc.date.accessioned | 2022-12-20T06:25:17Z | - |
dc.date.available | 2022-12-20T06:25:17Z | - |
dc.date.created | 2022-11-02 | - |
dc.date.issued | 2022-09 | - |
dc.identifier.issn | 2308-457X | - |
dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/173091 | - |
dc.description.abstract | This study proposes a novel speaker embedding extractor architecture that effectively combines convolutional neural networks (CNNs) and Transformers. Based on the recently proposed CNNs-meet-vision-Transformers (CMT) architecture, we propose two strategies for efficient speaker embedding extraction modeling. First, we apply broadcast residual learning techniques to the building blocks of the CMT, allowing us to extract frequency-aware temporal features shared across frequency dimensions with a reduced set of parameters. Second, frequency-statistics-dependent attentive statistics pooling is proposed to aggregate attentive temporal statistics acquired from the means and standard deviations of input feature maps weighted along the frequency axis using an attention mechanism. The experimental results on the VoxCeleb-1 dataset show that the proposed model outperforms several CNN- and Transformer-based models with a similar number of model parameters. Moreover, the effectiveness of the proposed modifications to the CMT architecture is validated through ablation studies. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | International Speech Communication Association | - |
dc.title | Improved CNN-Transformer Using Broadcasted Residual Learning for Text-Independent Speaker Verification | - |
dc.type | Article | - |
dc.contributor.affiliatedAuthor | Chang, Joon-Hyuk | - |
dc.identifier.doi | 10.21437/Interspeech.2022-88 | - |
dc.identifier.scopusid | 2-s2.0-85140099933 | - |
dc.identifier.wosid | 000900724502080 | - |
dc.identifier.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, v.2022-September, pp.2223 - 2227 | - |
dc.relation.isPartOf | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
dc.citation.title | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
dc.citation.volume | 2022-September | - |
dc.citation.startPage | 2223 | - |
dc.citation.endPage | 2227 | - |
dc.type.rims | ART | - |
dc.type.docType | Proceedings Paper | - |
dc.description.journalClass | 1 | - |
dc.description.isOpenAccess | N | - |
dc.description.journalRegisteredClass | scopus | - |
dc.relation.journalResearchArea | Acoustics | - |
dc.relation.journalResearchArea | Audiology & Speech-Language Pathology | - |
dc.relation.journalResearchArea | Computer Science | - |
dc.relation.journalResearchArea | Engineering | - |
dc.relation.journalWebOfScienceCategory | Acoustics | - |
dc.relation.journalWebOfScienceCategory | Audiology & Speech-Language Pathology | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
dc.relation.journalWebOfScienceCategory | Engineering, Electrical & Electronic | - |
dc.subject.keywordPlus | Convolutional neural networks | - |
dc.subject.keywordPlus | Embeddings | - |
dc.subject.keywordPlus | Network architecture | - |
dc.subject.keywordPlus | Speech communication | - |
dc.subject.keywordPlus | Speech recognition | - |
dc.subject.keywordPlus | Deep neural networks | - |
dc.subject.keywordPlus | Attentive statistic pooling | - |
dc.subject.keywordPlus | Building blocks | - |
dc.subject.keywordPlus | Convolutional neural network | - |
dc.subject.keywordPlus | Extraction modeling | - |
dc.subject.keywordPlus | Hybrid deep neural network | - |
dc.subject.keywordPlus | Learning techniques | - |
dc.subject.keywordPlus | Temporal features | - |
dc.subject.keywordPlus | Text-independent speaker verification | - |
dc.subject.keywordPlus | Transformer | - |
dc.subject.keywordAuthor | attentive statistics pooling | - |
dc.subject.keywordAuthor | hybrid deep neural network | - |
dc.subject.keywordAuthor | Text-independent speaker verification | - |
dc.subject.keywordAuthor | Transformer | - |
dc.identifier.url | https://www.isca-speech.org/archive/interspeech_2022/choi22_interspeech.html | - |
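The abstract above names two architectural ideas: broadcasted residual learning inside the CMT building blocks, and a frequency-statistics-dependent attentive statistics pooling layer. The following is a minimal PyTorch sketch of those two ideas as they are described in the abstract; the layer choices, kernel sizes, and attention form are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class BroadcastResidualBlock(nn.Module):
    """Sketch of broadcasted residual learning: a 2D frequency-wise branch plus a
    1D temporal branch computed on frequency-averaged features and broadcast back
    over the frequency axis (assumed layer configuration)."""

    def __init__(self, channels: int):
        super().__init__()
        self.freq_dw = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(1, 0), groups=channels)
        self.temp_dw = nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        y = self.freq_dw(x)                            # frequency-wise depthwise conv
        z = y.mean(dim=2)                              # average over frequency -> (B, C, T)
        z = self.pointwise(self.act(self.temp_dw(z)))  # temporal features shared across frequency
        return x + y + z.unsqueeze(2)                  # broadcast (B, C, 1, T) over frequency


class FreqStatsAttentiveStatsPool(nn.Module):
    """Sketch of frequency-statistics-dependent attentive statistics pooling:
    frequency-axis attention yields per-frame weighted means/standard deviations,
    which are then aggregated over time with attentive statistics pooling."""

    def __init__(self, channels: int, att_dim: int = 128):
        super().__init__()
        self.freq_att = nn.Conv2d(channels, channels, kernel_size=1)
        self.time_att = nn.Sequential(
            nn.Conv1d(2 * channels, att_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(att_dim, 2 * channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        w_f = torch.softmax(self.freq_att(x), dim=2)        # attention over frequency
        mu_f = (w_f * x).sum(dim=2)                          # weighted mean  (B, C, T)
        var_f = (w_f * x.pow(2)).sum(dim=2) - mu_f.pow(2)
        sd_f = var_f.clamp(min=1e-8).sqrt()                  # weighted std   (B, C, T)
        stats = torch.cat([mu_f, sd_f], dim=1)               # (B, 2C, T)

        w_t = torch.softmax(self.time_att(stats), dim=2)     # attention over time
        mu_t = (w_t * stats).sum(dim=2)
        var_t = (w_t * stats.pow(2)).sum(dim=2) - mu_t.pow(2)
        sd_t = var_t.clamp(min=1e-8).sqrt()
        return torch.cat([mu_t, sd_t], dim=1)                # utterance-level vector (B, 4C)


# Example: a (batch=2, channels=64, freq=10, time=50) feature map
# x = torch.randn(2, 64, 10, 50)
# x = BroadcastResidualBlock(64)(x)
# emb = FreqStatsAttentiveStatsPool(64)(x)   # -> shape (2, 256)
```

The frequency-averaged temporal branch is what lets the temporal features be shared across frequency bins with fewer parameters, and the two-stage weighted-statistics pooling is one plausible reading of the pooling described in the abstract.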