Open Access
DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation through Instruction-Tuned MLLM
Author(s) - Yuanyuan Sun, Ting Zhou
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3591447
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Multimodal Emotion Recognition in Conversation (MERC) is an active research area that integrates cross-modal understanding and contextual reasoning through text-speech-visual fusion, with applications spanning diverse scenarios, including monitoring student emotions in high-school classroom interactions. Although existing research has made progress on multimodal alignment and dialogue-relation modeling through architectures such as graph neural networks and pre-trained language models, two challenges persist: models tend to overfit individual datasets, and generative approaches remain underexplored. In this study, we pioneer a generative MERC framework based on Multimodal Large Language Models (MLLMs), employing Video-LLaMA, an open-source tri-modal foundation model, for end-to-end multimodal emotion reasoning. We use carefully crafted structured prompts to align emotion semantics with dataset annotations, combined with LoRA for parameter-efficient fine-tuning. Our method achieves a state-of-the-art weighted F1-score of 68.57% on the MELD benchmark. Furthermore, exploratory experiments on dynamic modality combinations and fine-tuning strategies offer actionable insights for MLLM-based MERC research. This work not only advances emotion understanding in dialogues but also highlights MLLMs' potential in complex multimodal reasoning tasks.
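
To make the described setup concrete, the sketch below illustrates the two ingredients the abstract names: a structured prompt that constrains the generative model's output space to the dataset's emotion annotations, and a LoRA configuration for parameter-efficient adapter tuning (here via the Hugging Face PEFT library). The prompt template, target module names, rank, and other hyperparameters are illustrative assumptions, not the paper's exact configuration, which the abstract does not specify.

from peft import LoraConfig, get_peft_model

# The seven emotion labels annotated in MELD.
MELD_LABELS = ["neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear"]

def build_prompt(context_utterances, target_utterance):
    """Structured prompt aligning the generative output with dataset
    annotations by enumerating the legal emotion labels (assumed wording)."""
    history = "\n".join(f"{spk}: {utt}" for spk, utt in context_utterances)
    return (
        "Below is a conversation. Considering the video, audio, and text, "
        "classify the emotion of the final utterance.\n"
        f"{history}\n"
        f"Target utterance: {target_utterance}\n"
        f"Answer with exactly one of: {', '.join(MELD_LABELS)}."
    )

# LoRA adapter configuration; values are typical defaults, not the paper's.
lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections of the LLM backbone (assumed)
    task_type="CAUSAL_LM",
)

# `base_model` would be the language-model backbone of Video-LLaMA; with
# get_peft_model only the injected adapters train, the backbone stays frozen:
# model = get_peft_model(base_model, lora_config)
# model.print_trainable_parameters()

Constraining the answer to the annotated label set is what lets a free-form generative model be scored like a classifier: the generated string is parsed back to one of the seven classes.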
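
The reported metric, weighted F1, can be reproduced from parsed predictions with scikit-learn; the labels below are placeholders, not results from the paper.

from sklearn.metrics import f1_score

y_true = ["joy", "anger", "neutral", "neutral", "sadness"]    # gold annotations
y_pred = ["joy", "neutral", "neutral", "neutral", "sadness"]  # parsed model outputs

# average="weighted" weights each class's F1 by its support, which matters
# on MELD because the label distribution is heavily skewed toward "neutral".
wf1 = f1_score(y_true, y_pred, average="weighted")
print(f"weighted F1 = {wf1:.4f}")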