Open Access
A Multi-Modal Approach Using a Hybrid Vision Transformer and Temporal Fusion Transformer Model for Stock Price Movement Classification
Author(s) -
Ibanga Kpereobong Friday,
Sarada Prasanna Pati,
Debahuti Mishra
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3589063
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
Stock price movement classification primarily focuses on accurately identifying buy and sell signals, enabling traders to maximize profits through well-timed market entry and exit positions. This study presents and implements a multi-modal deep learning approach to classifying stock price movement. Our approach captures potential price reversals or continuations by utilizing two modalities: candlestick chart patterns and historical price data. Specifically, the proposed framework converts the historical data into 256 × 256 pixel candlestick chart images, allowing both modalities to be integrated and processed effectively. A key innovation is the application of the histogram of oriented gradients (HOG) to extract relevant descriptors, including candlestick colour, body-to-wick proportions, and wick size. Concurrently, the vision transformer (ViT) model partitions each image into non-overlapping 16 × 16 pixel patches, which are treated as input tokens, and extracts salient spatial features through an embedded projection and multi-head self-attention. The temporal fusion transformer (TFT) model then processes the historical features, candlestick chart features, and the extracted HOG features via a decision-level (late feature fusion) strategy that concatenates these inputs to predict short-term price movements over different horizons (1, 3, 7, and 10 days ahead). We systematically evaluate model performance using a time-series cross-validation split to demonstrate the proposed model's efficacy and generalization across eight indices (BSE, IXIC, N225, NIFTY-50, NSE-30, NYSE, S&P 500, and SSE). The results demonstrate the superior performance of our multi-modal approach, achieving an average accuracy, precision, recall, and Matthews correlation coefficient (MCC) of 96.17%, 96.24%, 96.15%, and 0.9367, respectively, across all evaluated indices.
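The two image-processing steps named in the abstract, HOG descriptor extraction and ViT-style tokenization of a 256 × 256 chart into non-overlapping 16 × 16 patches, can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: `hog_descriptor` is a simplified single-cell variant of full HOG, and the random array stands in for a grayscale candlestick image.

```python
import numpy as np

def to_patches(img, patch=16):
    # Split a square image into non-overlapping patch x patch tokens (ViT-style).
    h, w = img.shape
    p = img.reshape(h // patch, patch, w // patch, patch).transpose(0, 2, 1, 3)
    return p.reshape(-1, patch * patch)  # (num_tokens, patch * patch)

def hog_descriptor(img, bins=9):
    # Simplified HOG: gradient magnitudes binned by unsigned orientation
    # over the whole image (full HOG uses cells and block normalisation).
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # orientation in [0, pi)
    idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)
    return hist / (np.linalg.norm(hist) + 1e-9)       # L2-normalised descriptor

img = np.random.rand(256, 256)   # stand-in for a grayscale candlestick chart
tokens = to_patches(img)         # 256 tokens, each of length 256
hog = hog_descriptor(img)        # 9-bin orientation histogram
print(tokens.shape, hog.shape)   # (256, 256) (9,)
```

In the paper's late-fusion setup, vectors like `tokens` (after ViT encoding) and `hog` would be concatenated with the historical price features before the TFT's prediction head.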
Furthermore, the study uses a real-time trading simulation to assess the practical implications of different window sizes (5, 10, and 15 days). A paired t-test is also conducted to statistically validate the proposed model against benchmarks. The analysis provides valuable insights into how short- and long-term traders can effectively leverage the proposed model, highlighting its adaptability to real-world applications.
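The evaluation protocol, time-series cross-validation followed by a paired t-test on per-fold scores, can be sketched as below. This is illustrative only: the expanding-window split scheme, fold sizes, and accuracy values are assumptions for the example, not the paper's configuration or results.

```python
import numpy as np

def time_series_splits(n_samples, n_splits=5, test_size=50):
    # Expanding-window splits: each fold trains on all data preceding its
    # test block, so no future information leaks into training.
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        yield np.arange(test_start), np.arange(test_start, test_start + test_size)

def paired_t_stat(a, b):
    # Paired t-statistic over per-fold scores of two models.
    d = np.asarray(a, float) - np.asarray(b, float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

splits = list(time_series_splits(500))      # 5 folds over 500 observations
# Hypothetical per-fold accuracies for the proposed model vs. a benchmark:
ours = [0.96, 0.95, 0.97, 0.96, 0.96]
base = [0.91, 0.90, 0.93, 0.92, 0.91]
print(paired_t_stat(ours, base))            # large positive t favours "ours"
```

A two-sided p-value would then be read from the t-distribution with `n_folds - 1` degrees of freedom (e.g. via `scipy.stats.ttest_rel`).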
