Detailed Information


Multimodal Prompt Learning in Emotion Recognition Using Context and Audio Information (Open Access)

Authors
Jeong, Eunseo; Kim, Gyunyeop; Kang, Sangwoo
Issue Date
Jul-2023
Publisher
MDPI
Keywords
multimodal; prompt learning; speech emotion recognition; audio processing; natural language processing
Citation
MATHEMATICS, v.11, no.13
Journal Title
MATHEMATICS
Volume
11
Number
13
URI
https://scholarworks.bwise.kr/gachon/handle/2020.sw.gachon/88766
DOI
10.3390/math11132908
ISSN
2227-7390
Abstract
Prompt learning has improved the performance of language models by reducing the gap between pre-training and downstream-task training methods. However, extending prompt learning from language models pre-trained on unimodal data to multimodal sources is difficult, because such models cannot simply be augmented with additional deep-learning layers. In natural-language emotion recognition, better emotional classification can be expected when a model is trained on both audio and text rather than on natural-language text alone. Audio information, such as voice pitch, tone, and intonation, carries cues unavailable in text and thus enables more effective emotion prediction. Using both audio and text therefore allows speech emotion-recognition models to predict emotions better than with semantic information alone. In this paper, in contrast to existing studies that process multimodal data with an additional layer, we propose a method that improves speech emotion recognition through multimodal prompt learning with text-based pre-trained models. The proposed method uses both text and audio information in prompt learning by employing a language model pre-trained on natural-language text. In addition, we propose a method that improves emotion recognition for the current utterance by incorporating the emotions and contextual information of previous utterances into the prompt. The proposed methods were evaluated on the English multimodal dataset MELD and the Korean multimodal dataset KEMDy20. Experiments using both proposed methods achieved an accuracy of 87.49%, an F1 score of 44.16, and a weighted F1 score of 86.28.
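The approach the abstract describes — feeding audio information into a text-pretrained language model alongside a prompt that includes previous utterances and their emotions — can be illustrated with a minimal sketch. This is not the authors' released code: it assumes a RoBERTa backbone from Hugging Face transformers, precomputed audio feature vectors projected into the embedding space as soft prompt tokens, and an illustrative prompt template; all class names, dimensions, and the template wording are hypothetical.

import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class MultimodalPromptSER(nn.Module):
    """Sketch: audio features prepended as soft prompt tokens for a text LM."""
    def __init__(self, num_emotions, audio_dim=768):
        super().__init__()
        self.lm = RobertaModel.from_pretrained("roberta-base")
        hidden = self.lm.config.hidden_size
        # Map audio features into the LM embedding space (an assumption of
        # this sketch; the paper's exact fusion mechanism may differ).
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, input_ids, attention_mask, audio_feats):
        # audio_feats: (batch, n_audio_tokens, audio_dim), e.g. from a speech encoder
        word_emb = self.lm.embeddings.word_embeddings(input_ids)
        audio_emb = self.audio_proj(audio_feats)
        inputs_embeds = torch.cat([audio_emb, word_emb], dim=1)
        audio_mask = torch.ones(audio_feats.shape[:2], dtype=attention_mask.dtype,
                                device=attention_mask.device)
        mask = torch.cat([audio_mask, attention_mask], dim=1)
        out = self.lm(inputs_embeds=inputs_embeds, attention_mask=mask)
        # Classify from the first text position (offset by the audio prompt tokens).
        cls = out.last_hidden_state[:, audio_feats.shape[1], :]
        return self.classifier(cls)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Context-aware prompt: previous utterances with their emotions, followed by
# the current utterance (the wording of the template is illustrative).
history = [("I can't believe it!", "surprise"), ("Please calm down.", "neutral")]
current = "This is the best day of my life."
context = " ".join(f"{u} The emotion was {e}." for u, e in history)
prompt = f"{context} {current} The emotion is {tokenizer.mask_token}."

enc = tokenizer(prompt, return_tensors="pt")
model = MultimodalPromptSER(num_emotions=7)   # e.g., MELD's seven emotion classes
audio_feats = torch.randn(1, 4, 768)          # placeholder audio features
logits = model(enc["input_ids"], enc["attention_mask"], audio_feats)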
Files in This Item
There are no files associated with this item.
Appears in Collections
College of IT Convergence > Department of Software > 1. Journal Articles

Related Researcher

Kang, Sang Woo
College of IT Convergence (Department of Software)