
DATC-STP: Towards Accurate yet Efficient Spatiotemporal Prediction with Transformer-style CNN
Author(s) -
Hyeonseok Jin,
Kyungbaek Kim
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3573639
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Recently, Multi-In-Multi-Out (MIMO) architectures based on convolutional neural networks (CNNs) or vision transformers (ViTs) have been proposed to overcome the limitations of Single-In-Single-Out (SISO) architectures based on recurrent neural networks (RNNs). These architectures avoid the inherent limitations of RNNs, whose sequential nature degrades performance and hinders parallelization. However, some challenges remain. CNN-based MIMO architectures have difficulty capturing global spatiotemporal information due to the local receptive field of their kernels. Meanwhile, ViT-based MIMO architectures have difficulty capturing local spatiotemporal information and require substantial computational resources due to self-attention. To improve the MIMO architecture while overcoming these limitations, we propose a novel, accurate yet efficient Dual-Attention Transformer-style CNN for Spatiotemporal Prediction (DATC-STP). DATC-STP captures both local and global spatiotemporal information through 3D patch embedding and a Transformer-style CNN. Specifically, the 3D patch embedding extracts local spatiotemporal features and reduces the size of the input data along the temporal, height, and width dimensions. Two Transformer-style CNN-based attention blocks treat spatiotemporal data similarly to images and capture global information with CNNs. This structure makes DATC-STP accurate yet efficient. To demonstrate the effectiveness of DATC-STP, we conduct comprehensive experiments on three widely used benchmark datasets: Moving MNIST, TaxiBJ, and KTH. We show that the proposed DATC-STP achieves both competitive performance and high efficiency. Furthermore, the results of an ablation study demonstrate the usefulness of each component of DATC-STP and highlight the potential of the proposed methods.
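To make the 3D patch embedding step concrete, the sketch below shows how a spatiotemporal sequence can be split into non-overlapping 3D patches and linearly projected, jointly shrinking the temporal, height, and width dimensions before the attention blocks. This is a minimal NumPy illustration of the general technique; the patch sizes (2x4x4), embedding dimension (64), and function name are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def patch_embed_3d(x, pt=2, ph=4, pw=4, dim=64, rng=None):
    """Split a (T, H, W, C) sequence into non-overlapping pt x ph x pw
    patches and project each flattened patch to a `dim`-d embedding.
    Patch sizes and `dim` are illustrative, not from the paper."""
    T, H, W, C = x.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Group frames/pixels into 3D patches: (T/pt, pt, H/ph, ph, W/pw, pw, C).
    x = x.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the intra-patch axes together, then flatten each patch to a vector.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(T // pt, H // ph, W // pw, -1)
    # Random linear projection stands in for the learned embedding layer.
    rng = rng or np.random.default_rng(0)
    proj = rng.standard_normal((x.shape[-1], dim)) / np.sqrt(x.shape[-1])
    return x @ proj  # (T/pt, H/ph, W/pw, dim)

# Example: a 10-frame 64x64 single-channel sequence (Moving MNIST-like shape).
frames = np.zeros((10, 64, 64, 1), dtype=np.float32)
tokens = patch_embed_3d(frames)
print(tokens.shape)  # (5, 16, 16, 64)
```

After this step the downstream blocks operate on a 5x16x16 grid of 64-d tokens instead of raw 10x64x64 pixels, which is what makes the subsequent CNN-based attention both tractable and image-like.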