Improved Speech Emotion Recognition Focusing on High-Level Data Representations and Swift Feature Extraction Calculation

Abdusalomov, Akmalbek; Kutlimuratov, Alpamis; Nasimov, Rashid; Whangbo, Taeg Keun

Access: Open access

Authors: Abdusalomov, Akmalbek; Kutlimuratov, Alpamis; Nasimov, Rashid; Whangbo, Taeg Keun

Issue Date: Dec-2023

Publisher: TECH SCIENCE PRESS

Keywords: Feature extraction; MFCC; ResNet; speech emotion recognition

Citation: CMC-COMPUTERS MATERIALS & CONTINUA, v.77, no.3, pp. 2915-2933

Pages: 19

Journal Title: CMC-COMPUTERS MATERIALS & CONTINUA

Volume: 77

Number: 3

Start Page: 2915

End Page: 2933

URI: https://scholarworks.bwise.kr/gachon/handle/2020.sw.gachon/90691

DOI: 10.32604/cmc.2023.044466

ISSN: 1546-2218; 1546-2226

Abstract: The performance of a speech emotion recognition (SER) system depends heavily on the efficacy of its feature extraction techniques. This study aimed to advance SER by optimizing feature extraction in two ways: incorporating high-resolution Mel-spectrograms and accelerating the calculation of Mel Frequency Cepstral Coefficients (MFCC). The goal was to improve accuracy by identifying and mitigating shortcomings common in current approaches, and in doing so to raise both the sophistication and the effectiveness of our SER model in identifying emotions in spoken language. The research employed a dual-strategy approach to feature extraction. First, a rapid computation technique for MFCC was implemented and integrated with a Bi-LSTM layer to encode the MFCC features. Second, a pretrained ResNet model was combined with feature stats pooling and dense layers to encode the Mel-spectrogram features. The two feature sets were processed separately and then combined in a Convolutional Neural Network (CNN) with a dense layer to enrich their representation. The model was evaluated on two prominent databases, CMU-MOSEI and RAVDESS, and reached an accuracy of 93.2% on CMU-MOSEI and 95.3% on RAVDESS. These results exceed the accuracy benchmarks established by traditional models in the field of speech emotion recognition.
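
The abstract mentions a stats pooling step between the ResNet encoder and the dense layers but does not define it. The sketch below assumes the common form of statistics pooling, in which a variable-length sequence of frame-level feature vectors is reduced to a single fixed-length vector by concatenating the per-dimension mean and standard deviation over time. The function name `stats_pooling` and the toy input are hypothetical and are not taken from the paper; the authors' actual pooling may differ.

```python
import math
from typing import List, Sequence


def stats_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (time x feature) sequence into one fixed-length vector.

    Assumed form of statistics pooling (not confirmed by the abstract):
    per-dimension means followed by per-dimension population standard
    deviations, so an input with D features yields 2 * D values.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same number of features")

    # Mean of each feature dimension across time.
    means = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    # Population standard deviation of each feature dimension across time.
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Hypothetical feature map: 3 time frames, 2 feature dimensions.
pooled = stats_pooling([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
```

Because the output length depends only on the number of feature dimensions, utterances of different durations map to vectors of the same size, which is what allows a dense layer to follow the pooling step.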

Appears in Collections: ETC > 1. Journal Articles

Related Researcher

Whangbo, Taeg Keun: College of IT Convergence (School of Computer Engineering, Computer Engineering major)
