Open Access
An Image Semantic Representation Method Based on Cross-Modal Adaptive Multi-Layer Perceptron
Author(s) - Yang Liu, Xiulei Liu, Chengli Peng
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3598598
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
With the development of multimodal information research, recent image captioning aims to generate variable-length sentences conditioned on cross-modal signals (e.g., visual and textual). While Transformer-based methods have shown impressive progress in image captioning, they are complex and computationally expensive. Recent work has shown the Multi-Layer Perceptron's (MLP) potential for lower computational cost. However, the fixed parameter weights of an MLP make it difficult for models to adapt to producing variable-length sentences in cross-modal scenarios. To address these challenges, we propose the novel Cross-Modal Adaptive Network (CMANet), which improves the MLP with dynamic parameter weights and bidirectional semantic alignment of cross-modal features, while enhancing computational efficiency. Specifically, we design a network architecture consisting of two independent fixed-weight two-layer MLP modules and a dynamic-weight bidirectional linear layer module. Visual and textual features are first processed by the two independent two-layer MLP modules to extract their intrinsic properties, improving the training and computational efficiency of the model through a simple MLP structure. The processed visual and textual features are then used as dynamic weight matrices for each other. Through the dynamic-weight bidirectional linear layer, CMANet adapts to the requirement of generating variable-length sentences in cross-modal scenarios, addressing the inability of traditional MLPs with fixed weights to handle variable text lengths in cross-modal feature relationships. Results on the MS-COCO dataset show that CMANet outperforms Transformer-based methods by 2.7% on the CIDEr metric, while reducing the number of parameters and GFLOPs by 42.9% and 69.9%, respectively. Compared with large-scale pre-trained models such as LEMON, CMANet achieves a performance improvement of 1.3% with only 3% of LEMON's parameters.
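
To make the architectural idea concrete, the following is a minimal PyTorch-style sketch of the described structure: two fixed-weight two-layer MLPs process the visual and textual features independently, and the processed features of each modality then act as an input-dependent weight matrix for the other. This is only an illustration based on the abstract, not the authors' implementation; the class names, feature dimensions, activation choice (GELU), and the specific bilinear form used for the dynamic weights are all assumptions.

```python
import torch
import torch.nn as nn


class TwoLayerMLP(nn.Module):
    """Fixed-weight two-layer MLP that extracts intra-modal features (assumed form)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class CrossModalAdaptiveBlock(nn.Module):
    """Sketch of a dynamic-weight bidirectional linear layer: each modality's
    processed features serve as an input-dependent weight matrix for the other,
    so the mapping adapts to a variable number of text tokens."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.visual_mlp = TwoLayerMLP(dim, hidden_dim)
        self.text_mlp = TwoLayerMLP(dim, hidden_dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, Nv, D) image region/grid features
        # text:   (B, Nt, D) caption token features; Nt may differ across batches
        v = self.visual_mlp(visual)
        t = self.text_mlp(text)
        # Input-dependent "weight matrices": t @ v^T and v @ t^T are built from
        # the features themselves rather than stored as fixed nn.Linear weights.
        text_out = torch.bmm(torch.bmm(t, v.transpose(1, 2)), v)    # (B, Nt, D)
        visual_out = torch.bmm(torch.bmm(v, t.transpose(1, 2)), t)  # (B, Nv, D)
        return visual_out, text_out


if __name__ == "__main__":
    block = CrossModalAdaptiveBlock(dim=512, hidden_dim=2048)
    vis = torch.randn(2, 49, 512)    # e.g. a 7x7 grid of visual features
    txt = torch.randn(2, 17, 512)    # a batch of 17-token caption features
    v_out, t_out = block(vis, txt)
    print(v_out.shape, t_out.shape)  # (2, 49, 512) and (2, 17, 512)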
