Multimodal Fine-Tuning of LLMs for Robust Document Visual Question Answering
Author(s) -
Sahil Tripathi,
Md Tabrez Nafis,
Imran Hussain,
Abdul Khader Jilani Saudagar
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3615201
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Document Visual Question Answering (DocVQA) requires understanding both the spatial layout and the textual content of a document. Existing vision-language models such as LayoutLM build on multimodal pretraining, but they often lack integration with powerful Large Language Models (LLMs). This work addresses that gap by fine-tuning Flan-T5 on the SP-DocVQA dataset using both text and bounding box information across multiple context categories. This spatial-textual alignment allows the model to attain an ANLS score of 76% using the text modality alone. To integrate visual comprehension, we implement a multimodal pipeline that aligns cropped word images with the LLM embedding space through a novel pretraining task. Additionally, we present two DocVQA strategies that incorporate visual word embeddings to improve document comprehension. Empirical findings indicate that models utilizing bounding box information substantially surpass those employing text-only or layout-aware inputs, especially for spatially grounded questions. In a pre-task evaluation, PT2 outperforms PT1 with significant gains in ANLS (+20%) and Accuracy (+24%), although it exhibits a minor decline in GTIP (–6.1%).
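The abstract reports results in ANLS (Average Normalized Levenshtein Similarity) without defining it. The following is a minimal sketch of how that metric is commonly computed for DocVQA, not the authors' evaluation code. The function names, the lowercase-and-strip normalization, and the 0.5 threshold are assumptions based on the usual benchmark convention; the record itself does not state them.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    Each question may have several accepted answers; the best match counts.
    A normalized distance at or above `threshold` scores zero (assumed 0.5).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()  # assumed normalization
        best = 0.0
        for answer in answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under these assumptions, a prediction of "abcd" against the single accepted answer "abce" scores 0.75: one edit over four characters gives a normalized distance of 0.25, which is below the threshold. A prediction with no characters in common with any accepted answer scores 0.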