Open Access
Multimodal Information Extraction from Visually Rich Documents with Adaptive Graph Integration Network
Author(s) -
Yinhuan Zheng,
Jiabao Chen,
Weiru Zhang
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3619974
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
In the context of factory informatization, there is a growing need to automatically extract information from diverse document images. Multimodal information extraction from visually rich documents (VRDs) remains challenging due to complex text layouts, local noise, and uneven text distribution. Methods such as LayoutLM achieve strong results but incur high computational costs and are not designed to be lightweight. Although GraphReviseIE is relatively lightweight, it still has limitations; for example, it insufficiently exploits spatial relational features. To address these issues, we propose an Adaptive Graph Integration Network (AGIN). Our model introduces a novel 2D spatial relative-orientation positional embedding that more effectively captures spatial relationships among text segments, enriching multimodal representations. We also design a dual-graph adaptive integration mechanism composed of (1) a self-revised graph that captures implicit structural dependencies via node-similarity propagation, and (2) a relation-revised graph that encodes explicit semantic-spatial relations between text segments. These graphs are adaptively consolidated using learnable attention weights, which mitigates the stochastic variation commonly observed in traditional graph construction. Additionally, we adopt a multitask learning framework. Experiments on multiple real-world datasets show that our approach improves information-extraction capability and achieves performance comparable or superior to baseline methods.
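To make the dual-graph adaptive integration concrete, the sketch below is a hypothetical PyTorch reconstruction based only on the abstract; it is not the authors' implementation. The module name DualGraphFusion, the 8-bin orientation discretization, and all layer choices are illustrative assumptions. It builds a self-revised graph from node-feature similarity, a relation-revised graph from an embedded relative orientation between text-segment box centers, fuses the two with learnable attention weights, and runs one round of message passing over the fused graph.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualGraphFusion(nn.Module):
    """Illustrative sketch of a dual-graph adaptive integration step.

    Hypothetical re-creation based solely on the abstract: a self-revised
    graph from node-feature similarity and a relation-revised graph from
    pairwise spatial relations are fused with learnable attention weights,
    then used for one round of message passing.
    """

    def __init__(self, dim: int, num_angle_bins: int = 8):
        super().__init__()
        # Embedding table for discretized relative orientations between
        # text-segment box centers (assumed design, not from the paper).
        self.angle_embed = nn.Embedding(num_angle_bins, dim)
        self.rel_score = nn.Linear(dim, 1)               # scores relation-revised edges
        self.fuse_logits = nn.Parameter(torch.zeros(2))  # learnable fusion weights
        self.msg = nn.Linear(dim, dim)                   # message transform
        self.num_angle_bins = num_angle_bins

    def relative_orientation(self, centers: torch.Tensor) -> torch.Tensor:
        # centers: (N, 2) box centers; returns (N, N) bucketized pairwise angles.
        delta = centers.unsqueeze(1) - centers.unsqueeze(0)   # (N, N, 2)
        angles = torch.atan2(delta[..., 1], delta[..., 0])    # in [-pi, pi]
        bins = (angles + math.pi) / (2 * math.pi) * self.num_angle_bins
        return bins.long().clamp(0, self.num_angle_bins - 1)

    def forward(self, x: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features for N text segments.
        # Self-revised graph: edge weights from node-feature similarity.
        xn = F.normalize(x, dim=-1)
        a_self = F.softmax(xn @ xn.T, dim=-1)                       # (N, N)

        # Relation-revised graph: edge weights from the embedded relative
        # orientation between box centers (explicit spatial relation).
        rel = self.angle_embed(self.relative_orientation(centers))  # (N, N, dim)
        a_rel = F.softmax(self.rel_score(rel).squeeze(-1), dim=-1)  # (N, N)

        # Adaptive integration: convex combination with learnable weights.
        w = F.softmax(self.fuse_logits, dim=0)
        adj = w[0] * a_self + w[1] * a_rel

        # One round of graph message passing with a residual connection.
        return x + F.relu(adj @ self.msg(x))

# Example usage with random features and box centers for 5 segments:
model = DualGraphFusion(dim=64)
out = model(torch.randn(5, 64), torch.rand(5, 2))
print(out.shape)  # torch.Size([5, 64])
```

Because each per-graph adjacency is row-normalized by a softmax, the learnable convex combination keeps the fused adjacency row-stochastic, one simple way to damp the stochastic variation across graph constructions that the abstract mentions.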
