Open Access
DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation through Instruction-Tuned MLLM
Author(s) - Yuanyuan Sun, Ting Zhou
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3591447
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Multimodal Emotion Recognition in Conversation (MERC) is an active research area that integrates cross-modal understanding and contextual reasoning through text-speech-visual fusion, with applications spanning diverse scenarios, including monitoring student emotions in high-school classroom interactions. Although existing research has made progress on multimodal alignment and dialogue-relation modeling through architectures such as graph neural networks and pre-trained language models, two challenges persist: models tend to overfit individual datasets, and generative approaches remain underexplored. In this study, we pioneer a generative MERC framework based on Multimodal Large Language Models (MLLMs), employing Video-LLaMA, an open-source tri-modal foundation model, for end-to-end multimodal emotion reasoning. We use carefully crafted structured prompts to align emotion semantics with dataset annotations, combined with LoRA for parameter-efficient fine-tuning. Our method achieves a state-of-the-art weighted F1-score of 68.57% on the MELD benchmark. Furthermore, exploratory experiments on dynamic modality combinations and fine-tuning strategies offer actionable insights for MLLM-based MERC research. This work not only advances emotion understanding in dialogues but also highlights MLLMs' potential in complex multimodal reasoning tasks.
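
To make the described setup concrete, the sketch below illustrates the two ingredients the abstract names: a structured prompt that constrains the generative model's output space to the dataset's emotion annotations, and a LoRA configuration for parameter-efficient adapter tuning (here via the Hugging Face PEFT library). The prompt template, target module names, rank, and other hyperparameters are illustrative assumptions, not the paper's exact configuration, which the abstract does not specify.

from peft import LoraConfig, get_peft_model

# The seven emotion labels annotated in MELD.
MELD_LABELS = ["neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear"]

def build_prompt(context_utterances, target_utterance):
    """Structured prompt aligning the generative output with dataset
    annotations by enumerating the legal emotion labels (assumed wording)."""
    history = "\n".join(f"{spk}: {utt}" for spk, utt in context_utterances)
    return (
        "Below is a conversation. Considering the video, audio, and text, "
        "classify the emotion of the final utterance.\n"
        f"{history}\n"
        f"Target utterance: {target_utterance}\n"
        f"Answer with exactly one of: {', '.join(MELD_LABELS)}."
    )

# LoRA adapter configuration; values are typical defaults, not the paper's.
lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections of the LLM backbone (assumed)
    task_type="CAUSAL_LM",
)

# `base_model` would be the language-model backbone of Video-LLaMA; with
# get_peft_model only the injected adapters train, the backbone stays frozen:
# model = get_peft_model(base_model, lora_config)
# model.print_trainable_parameters()

Constraining the answer to the annotated label set is what lets a free-form generative model be scored like a classifier: the generated string is parsed back to one of the seven classes.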
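
The reported metric, weighted F1, can be reproduced from parsed predictions with scikit-learn; the labels below are placeholders, not results from the paper.

from sklearn.metrics import f1_score

y_true = ["joy", "anger", "neutral", "neutral", "sadness"]    # gold annotations
y_pred = ["joy", "neutral", "neutral", "neutral", "sadness"]  # parsed model outputs

# average="weighted" weights each class's F1 by its support, which matters
# on MELD because the label distribution is heavily skewed toward "neutral".
wf1 = f1_score(y_true, y_pred, average="weighted")
print(f"weighted F1 = {wf1:.4f}")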