
An Image Semantic Representation Method Based on Cross-Modal Adaptive Multi-Layer Perceptron
Author(s) - Yang Liu, Xiulei Liu, Chengli Peng
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3598598
Subject(s) - Aerospace; Bioengineering; Communication, Networking and Broadcast Technologies; Components, Circuits, Devices and Systems; Computing and Processing; Engineered Materials, Dielectrics and Plasmas; Engineering Profession; Fields, Waves and Electromagnetics; General Topics for Engineers; Geoscience; Nuclear Engineering; Photonics and Electrooptics; Power, Energy and Industry Applications; Robotics and Control Systems; Signal Processing and Analysis; Transportation
With the development of multimodal information research, recent image captioning methods aim to generate variable-length sentences conditioned on cross-modal signals (e.g., visual and textual). While Transformer-based methods have shown impressive progress in image captioning, they are complex and computationally expensive. Recent work has shown the Multi-Layer Perceptron's (MLP) potential for lower computational cost. However, the fixed parameter weights of an MLP make it difficult for models to adapt to producing variable-length sentences in cross-modal scenarios. To address these challenges, we propose the novel Cross-Modal Adaptive Network (CMANet), which improves the MLP through dynamic parameter-weight settings and bidirectional semantic alignment of cross-modal features while enhancing computational efficiency. Specifically, we design a network architecture consisting of two independent fixed-weight two-layer MLP modules and a dynamic-weight bidirectional linear layer module. Visual and textual features are first processed by the two independent two-layer MLP modules to extract their intrinsic properties, improving the training and computational efficiency of the model through a simple MLP structure. Subsequently, the processed visual and textual features are used as dynamic weight matrices for each other. Through the dynamic-weight bidirectional linear layer, CMANet adapts to the requirement of generating variable-length sentences in cross-modal scenarios, addressing the inability of traditional MLPs to handle variable text lengths in cross-modal feature relationships. Results on the MS-COCO dataset show that CMANet outperforms Transformer-based methods by 2.7% in CIDEr score while reducing the number of parameters and GFLOPs by 42.9% and 69.9%, respectively. Compared with large-scale pre-trained models such as LEMON, CMANet achieves a performance improvement of 1.3%, while its number of parameters is only 3% of LEMON's.
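To make the architecture described above concrete, the following is a minimal PyTorch sketch of the idea: two independent fixed-weight two-layer MLPs extract intra-modal features, and each modality's processed features then serve as a dynamic weight matrix for the other in a bidirectional, batched linear operation that accommodates variable-length text. All module names, dimensions, and the exact mixing rule are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch of the dynamic-weight bidirectional idea, not CMANet's code.
import torch
import torch.nn as nn


class TwoLayerMLP(nn.Module):
    """Fixed-weight two-layer MLP applied independently to one modality."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class BidirectionalDynamicLinear(nn.Module):
    """Each modality's features act as a run-time weight matrix for the other.

    Because the weights are built from the inputs rather than stored as fixed
    parameters, the layer accepts a variable number of text tokens, i.e.
    variable-length sentences.
    """

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.visual_mlp = TwoLayerMLP(dim, hidden)
        self.text_mlp = TwoLayerMLP(dim, hidden)

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, Nv, D) region/grid features; text: (B, Nt, D) token features.
        v = self.visual_mlp(visual)
        t = self.text_mlp(text)

        # Dynamic weight matrices derived from the other modality (assumed form).
        weights_tv = torch.softmax(t @ v.transpose(1, 2), dim=-1)  # (B, Nt, Nv)
        weights_vt = torch.softmax(v @ t.transpose(1, 2), dim=-1)  # (B, Nv, Nt)

        # Bidirectional "linear layers" whose weights come from the other modality.
        text_out = weights_tv @ v + t      # text updated by visual features
        visual_out = weights_vt @ t + v    # visual updated by text features
        return visual_out, text_out


if __name__ == "__main__":
    layer = BidirectionalDynamicLinear(dim=512, hidden=1024)
    vis = torch.randn(2, 49, 512)    # e.g. 7x7 grid features
    txt = torch.randn(2, 17, 512)    # variable-length caption tokens
    v_out, t_out = layer(vis, txt)
    print(v_out.shape, t_out.shape)  # (2, 49, 512) and (2, 17, 512)
```

The key point the sketch illustrates is that the cross-modal weights are recomputed from the inputs on every forward pass, so no fixed weight dimension ties the layer to a particular sentence length.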