
LightSTATE: A Generalized Framework for Real-Time Human Activity Detection Using Edge-Based Video Processing and Vision Language Models
Author(s) - Anik Debnath, Yong-Woon Kim, Yung-Cheol Byun
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Journal
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/ACCESS.2025.3574659
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
Human activity detection plays a vital role in applications such as healthcare monitoring, smart environments, and security surveillance. However, traditional methods often rely on computationally intensive models that are unsuitable for edge devices because of their limited resources and high latency. This research introduces LightSTATE, a lightweight hybrid framework for real-time activity detection that integrates edge-based preprocessing with cloud-hosted Vision Language Models (VLMs). The system processes video streams through efficient frame extraction, edge detection, and Root Mean Square Error (RMSE)-based frame filtering to identify significant motion events. The selected frames are transmitted to the cloud for semantic analysis and state recognition, where GPT-4o serves as the vision-language model and generates rich, descriptive captions for human state understanding. Because publicly accessible datasets for this task are scarce, LightSTATE was evaluated on a custom-curated dataset of nine real-world videos containing 634 frames, of which 320 labeled frames were used for testing. The framework detected the predefined states with high accuracy, achieving an average precision of 0.98, an average recall of 0.98, and an average F1-score of 0.98, and the experimental results confirm an average frame processing time of approximately 0.8–0.9 seconds, which is suitable for real-time applications. These results establish LightSTATE as a scalable, efficient, and privacy-preserving framework for continuous, real-time human activity detection. Future work will focus on extending its applicability to dynamic and multi-person scenarios while enhancing system adaptability for diverse use cases.
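To make the two-stage pipeline concrete, the sketch below shows one plausible implementation of the edge-side stage described in the abstract: frames are read from a video stream, reduced to edge maps, and a frame is kept only when the RMSE between its edge map and that of the last kept frame exceeds a threshold. This is a minimal illustration, not the authors' code; the Canny detector, its thresholds, the RMSE_THRESHOLD value, and all function names are assumptions.

```python
# Minimal sketch of an edge-side frame filter in the spirit of LightSTATE.
# Assumptions: OpenCV Canny edges, an illustrative RMSE threshold, and
# BGR input frames; the paper's exact detector and parameters may differ.
import cv2
import numpy as np

RMSE_THRESHOLD = 25.0  # hypothetical threshold; tune per deployment


def edge_map(frame):
    """Convert a BGR frame to a binary edge map."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 100, 200)


def rmse(a, b):
    """Root Mean Square Error between two equally sized edge maps."""
    diff = a.astype(np.float32) - b.astype(np.float32)
    return float(np.sqrt(np.mean(diff ** 2)))


def select_significant_frames(video_path, threshold=RMSE_THRESHOLD):
    """Yield frames whose edge map differs enough from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    prev_edges = None
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            edges = edge_map(frame)
            if prev_edges is None or rmse(edges, prev_edges) > threshold:
                prev_edges = edges
                yield frame  # candidate for upload to the cloud-hosted VLM
    finally:
        cap.release()
```

For the cloud stage, a selected frame could then be captioned by GPT-4o, for example via the OpenAI chat-completions API as sketched below. The prompt wording and state taxonomy are illustrative; the paper's exact prompting strategy is not reproduced here.

```python
# Hypothetical cloud stage: caption one selected frame with GPT-4o.
# The prompt text and helper name caption_frame are illustrative assumptions.
import base64

import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_frame(frame):
    """Send a single BGR frame to GPT-4o and return its textual description."""
    ok, buf = cv2.imencode(".jpg", frame)
    if not ok:
        raise ValueError("frame could not be JPEG-encoded")
    b64 = base64.b64encode(buf.tobytes()).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the human state in this frame "
                         "(e.g., sitting, standing, walking)."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Filtering on the edge device before upload keeps bandwidth and VLM cost proportional to the amount of motion rather than the raw frame rate, which is consistent with the 0.8–0.9 second per-frame latency reported in the abstract.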