Detailed Information

Searching for effective preprocessing method and CNN based architecture with efficient channel attention on speech emotion recognition
Open access

Authors
Kim, Byunggun; Kwon, Younghun
Issue Date
Sep-2025
Publisher
Nature Research
Keywords
Convolutional neural network; Data augmentation; Efficient channel attention; Log-Mel spectrogram; Speech emotion recognition
Citation
Scientific Reports, v.15, no.1
Indexed
SCIE
SCOPUS
Journal Title
Scientific Reports
Volume
15
Number
1
URI
https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/126740
DOI
10.1038/s41598-025-19887-7
ISSN
2045-2322
Abstract
Speech emotion recognition (SER) performance has improved steadily as a range of deep learning architectures have been adopted. In particular, convolutional neural network (CNN) models combined with spectrogram preprocessing are the most popular approach in SER. However, how to design an effective and efficient preprocessing method and CNN-based model for SER remains unclear, so more concrete preprocessing methods and CNN-based models need to be identified. First, to find a suitable frequency-time resolution for SER, we prepare eight datasets with different preprocessing settings. Furthermore, to compensate for the lack of emotional feature resolution, we propose a multiple short-time Fourier transform (STFT) preprocessing data augmentation that augments the training data using windows of different sizes. Next, because a CNN's channel filters are central to detecting hidden input features, we focus on the effectiveness of the channel filters for SER. To this end, we design several types of architecture built around a 6-layer CNN model. We also find that efficient channel attention (ECA), which is well known to improve channel feature representation with only a few parameters, can train the channel filters more efficiently for SER. On two SER datasets (Interactive Emotional Dyadic Motion Capture and the Berlin Emotional Speech Database), increasing the frequency resolution when preprocessing emotional speech can improve emotion recognition performance. Consequently, the CNN-based model with only two ECA blocks can exceed the performance of previous SER models. In particular, with STFT data augmentation, our proposed model achieves the highest performance on SER.
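
As a rough illustration of the efficient channel attention (ECA) idea mentioned in the abstract, the sketch below applies an ECA-style gate to a small feature map: per-channel global average pooling, a short 1D convolution across neighbouring channels, a sigmoid, and a channel-wise rescale. It is not the authors' implementation. The function name `eca`, the nested-list layout `[channel][height][width]`, the default kernel size of 3, and the uniform placeholder weights are all assumptions made for the example; in a trained model the kernel weights would be learned, and they are the only parameters the gate adds.

```python
import math


def eca(feature_map, kernel_size=3, weights=None):
    """Apply an ECA-style channel gate to a feature map laid out as [C][H][W].

    Returns (rescaled_feature_map, per_channel_gates).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the window is centred")
    if weights is None:
        # Placeholder weights (assumption); a real model learns these values.
        weights = [1.0 / kernel_size] * kernel_size
    if len(weights) != kernel_size:
        raise ValueError("weights must contain kernel_size values")

    channels = len(feature_map)

    # 1) Squeeze: global average pooling gives one scalar per channel.
    pooled = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]

    # 2) Local cross-channel interaction: 1D convolution along the channel
    #    axis with zero padding, so each channel sees its nearest neighbours.
    pad = kernel_size // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [
        sum(w * padded[c + j] for j, w in enumerate(weights))
        for c in range(channels)
    ]

    # 3) Sigmoid turns each mixed value into a gate in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]

    # 4) Excite: rescale every value in a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(feature_map, gates)
    ]
    return scaled, gates
```

The output has the same shape as the input, so a gate of this kind can be placed after a convolutional layer without changing the rest of the network. The example is written without numerical libraries only to stay self-contained; a practical model would use a deep learning framework.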
Appears in Collections
ETC > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher


Kwon, Young hun
ERICA College of Advanced Convergence (ERICA Major in Intelligent Information and Quantum Engineering)
