
Modality-Invariant and -Specific Representations with Crossmodal Transformer for Multimodal Sentiment Analysis
Author(s) -
Qishang Shan,
Xiangsen Wei,
Ziyun Cai
Publication year - 2022
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/2224/1/012024
Subject(s) - crossmodal , computer science , modalities , transformer , artificial intelligence , modal , natural language processing , complementarity , fusion , speech recognition , sentiment analysis
Human emotion judgments usually draw on information from multiple modalities, such as language, audio, and facial expressions and gestures. Because each modality represents information differently, multimodal data exhibit both redundancy and complementarity, so a well-designed fusion approach is essential for accurate sentiment analysis. Inspired by the Crossmodal Transformer used for multimodal fusion in the MulT (Multimodal Transformer) model, this paper adds Crossmodal Transformers to the fusion stage of the MISA (Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis) model to enhance each modality's representation with information from the others, and proposes three MISA-CT models. Evaluated on two publicly available multimodal sentiment analysis datasets, MOSI and MOSEI, the proposed models outperform the original MISA model.
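The core fusion mechanism referenced above, a Crossmodal Transformer block in which one modality's sequence attends to another's, can be sketched roughly as follows. This is a minimal illustration assuming PyTorch; the class name `CrossmodalBlock`, the dimensions, and the layer arrangement are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a crossmodal attention block in the spirit of MulT's
# Crossmodal Transformer (assumed structure, not the paper's exact code).
import torch
import torch.nn as nn


class CrossmodalBlock(nn.Module):
    """One block: a target modality queries a source modality (e.g. text <- audio)."""

    def __init__(self, d_model: int = 128, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, len_t, d_model) provides the queries
        # source: (batch, len_s, d_model) provides the keys and values
        q, kv = self.norm_q(target), self.norm_kv(source)
        enhanced, _ = self.attn(q, kv, kv)   # crossmodal attention
        x = target + enhanced                # residual connection
        return x + self.ffn(x)               # position-wise feed-forward


# Usage: enhance text features with audio information before fusion.
text = torch.randn(8, 50, 128)    # (batch, text_len, d_model)
audio = torch.randn(8, 75, 128)   # (batch, audio_len, d_model)
text_enhanced = CrossmodalBlock()(text, audio)  # shape: (8, 50, 128)
```

In a MISA-style pipeline, blocks like this would sit between the modality-invariant/-specific encoders and the final fusion layer, one block per modality pair, so that each modality's representation is enriched before being concatenated for prediction.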