
An Image Semantic Representation Method Based on Cross-Modal Adaptive Multi-Layer Perceptron
Author(s) - Yang Liu, Xiulei Liu, Chengli Peng
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3598598
Subject(s) - Aerospace; Bioengineering; Communication, Networking and Broadcast Technologies; Components, Circuits, Devices and Systems; Computing and Processing; Engineered Materials, Dielectrics and Plasmas; Engineering Profession; Fields, Waves and Electromagnetics; General Topics for Engineers; Geoscience; Nuclear Engineering; Photonics and Electrooptics; Power, Energy and Industry Applications; Robotics and Control Systems; Signal Processing and Analysis; Transportation
With the development of multimodal information research, recent image captioning methods aim to generate variable-length sentences conditioned on cross-modal signals (e.g., visual and textual). While Transformer-based methods have shown impressive progress in image captioning, they are complex and computationally expensive. Recent work has shown the Multi-Layer Perceptron's (MLP) potential for lower computational cost. However, the fixed parameter weights of an MLP make it difficult for models to adapt to producing variable-length sentences in cross-modal scenarios. To address these challenges, we propose the novel Cross-Modal Adaptive Network (CMANet), which improves the MLP through dynamic parameter-weight settings and bidirectional semantic alignment of cross-modal features while enhancing computational efficiency. Specifically, we design a network architecture consisting of two independent fixed-weight two-layer MLP modules and a dynamic-weight bidirectional linear layer module. Visual and textual features are first processed by the two independent two-layer MLP modules to extract their intrinsic properties, improving the training and computational efficiency of the model through a simple MLP structure. Subsequently, the processed visual and textual features are used as dynamic weight matrices for each other. Through the dynamic-weight bidirectional linear layer, CMANet adapts to the requirement of generating variable-length sentences in cross-modal scenarios, addressing the inability of traditional MLPs to handle variable text lengths in cross-modal feature relationships. Results on the MS-COCO dataset show that CMANet outperforms Transformer-based methods by 2.7% in CIDEr score while reducing the number of parameters and GFLOPs by 42.9% and 69.9%, respectively. Compared with large-scale pre-trained models such as LEMON, CMANet achieves a performance improvement of 1.3%, while its number of parameters is only 3% of LEMON's.
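To make the architecture described above concrete, the following is a minimal PyTorch sketch of the idea: two independent fixed-weight two-layer MLPs extract intra-modal features, and each modality's processed features then serve as a dynamic weight matrix for the other in a bidirectional, batched linear operation that accommodates variable-length text. All module names, dimensions, and the exact mixing rule are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch of the dynamic-weight bidirectional idea, not CMANet's code.
import torch
import torch.nn as nn


class TwoLayerMLP(nn.Module):
    """Fixed-weight two-layer MLP applied independently to one modality."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class BidirectionalDynamicLinear(nn.Module):
    """Each modality's features act as a run-time weight matrix for the other.

    Because the weights are built from the inputs rather than stored as fixed
    parameters, the layer accepts a variable number of text tokens, i.e.
    variable-length sentences.
    """

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.visual_mlp = TwoLayerMLP(dim, hidden)
        self.text_mlp = TwoLayerMLP(dim, hidden)

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, Nv, D) region/grid features; text: (B, Nt, D) token features.
        v = self.visual_mlp(visual)
        t = self.text_mlp(text)

        # Dynamic weight matrices derived from the other modality (assumed form).
        weights_tv = torch.softmax(t @ v.transpose(1, 2), dim=-1)  # (B, Nt, Nv)
        weights_vt = torch.softmax(v @ t.transpose(1, 2), dim=-1)  # (B, Nv, Nt)

        # Bidirectional "linear layers" whose weights come from the other modality.
        text_out = weights_tv @ v + t      # text updated by visual features
        visual_out = weights_vt @ t + v    # visual updated by text features
        return visual_out, text_out


if __name__ == "__main__":
    layer = BidirectionalDynamicLinear(dim=512, hidden=1024)
    vis = torch.randn(2, 49, 512)    # e.g. 7x7 grid features
    txt = torch.randn(2, 17, 512)    # variable-length caption tokens
    v_out, t_out = layer(vis, txt)
    print(v_out.shape, t_out.shape)  # (2, 49, 512) and (2, 17, 512)
```

The key point the sketch illustrates is that the cross-modal weights are recomputed from the inputs on every forward pass, so no fixed weight dimension ties the layer to a particular sentence length.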