Open Access
M⁴SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition
Author(s) -
Jiajun He,
Xiaohan Shi,
Cheng-Hung Hu,
Jinyi Mi,
Xingfeng Li,
Tomoki Toda
Publication year - 2025
Publication title -
IEEE Transactions on Audio, Speech and Language Processing
Language(s) - English
Resource type - Magazines
eISSN - 2998-4173
DOI - 10.1109/taslpro.2025.3614428
Subject(s) - signal processing and analysis; computing and processing; fields, waves and electromagnetics
Multimodal speech emotion recognition (SER) has emerged as pivotal for improving human–machine interaction. Researchers increasingly leverage both speech and textual information obtained through automatic speech recognition (ASR) to comprehensively recognize speakers' emotional states. Although this approach reduces reliance on human-annotated text data, ASR errors can degrade emotion recognition performance. To address this challenge, in our previous work we introduced two auxiliary tasks, namely ASR error detection and ASR error correction, and proposed a novel multimodal fusion (MF) method for learning modality-specific and modality-invariant representations across different modalities. Building on this foundation, in this paper we introduce two additional training strategies. First, we propose an adversarial network to enhance the diversity of modality-specific representations. Second, we introduce a label-based contrastive learning strategy to better capture emotional features. We refer to the proposed method as M⁴SER and validate its superiority over state-of-the-art methods through extensive experiments on the IEMOCAP and MELD datasets.
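
The label-based contrastive learning strategy named in the abstract is, in spirit, a supervised contrastive objective computed over emotion labels. The sketch below is an illustrative PyTorch formulation of such a loss, not the authors' implementation; the function name, temperature value, and tensor shapes are assumptions made for the example.

import torch
import torch.nn.functional as F

def label_contrastive_loss(embeddings: torch.Tensor,
                           labels: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    # embeddings: (batch, dim) fused multimodal representations (assumed shape).
    # labels:     (batch,) integer emotion labels.
    z = F.normalize(embeddings, dim=1)            # compare in cosine-similarity space
    sim = z @ z.t() / temperature                 # (batch, batch) similarity logits

    # Exclude each sample's similarity with itself.
    self_mask = torch.eye(labels.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))

    # Positives: other in-batch samples sharing the same emotion label.
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    # Row-wise log-probability of each pair under a softmax over the batch.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability of positives per anchor; skip anchors with no positive.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count
    has_pos = pos_mask.any(dim=1)
    return per_anchor[has_pos].mean() if has_pos.any() else per_anchor.new_zeros(())

In a training loop, a term of this form would typically be added, with a weighting coefficient, to the primary emotion-classification loss and the auxiliary ASR-related losses the abstract mentions; that combination and its weights are likewise assumptions here.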
