
LightSTATE: A Generalized Framework for Real-Time Human Activity Detection Using Edge-Based Video Processing and Vision Language Models
Author(s) - Anik Debnath, Yong-Woon Kim, Yung-Cheol Byun
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Journal
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/ACCESS.2025.3574659
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
Human activity detection plays a vital role in applications such as healthcare monitoring, smart environments, and security surveillance. However, traditional methods often rely on computationally intensive models that are unsuitable for edge devices because of their limited resources and high latency. This research introduces LightSTATE, a lightweight hybrid framework for real-time activity detection that integrates edge-based preprocessing with cloud-hosted Vision Language Models (VLMs). The system processes video streams through efficient frame extraction, edge detection, and Root Mean Square Error (RMSE)-based frame filtering to identify significant motion events. The selected frames are transmitted to the cloud for semantic analysis and state recognition, where GPT-4o serves as the vision-language model and generates rich, descriptive captions for human state understanding. Because publicly accessible datasets for this task are scarce, LightSTATE was evaluated on a custom-curated dataset of nine real-world videos containing 634 frames, of which 320 labeled frames were used for testing. The framework detected the predefined states with high accuracy, achieving an average precision of 0.98, an average recall of 0.98, and an average F1-score of 0.98, and the experimental results confirm an average frame processing time of approximately 0.8–0.9 seconds, which is suitable for real-time applications. These results establish LightSTATE as a scalable, efficient, and privacy-preserving framework for continuous, real-time human activity detection. Future work will focus on extending its applicability to dynamic and multi-person scenarios while enhancing system adaptability for diverse use cases.
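To make the two-stage pipeline concrete, the sketch below shows one plausible implementation of the edge-side stage described in the abstract: frames are read from a video stream, reduced to edge maps, and a frame is kept only when the RMSE between its edge map and that of the last kept frame exceeds a threshold. This is a minimal illustration, not the authors' code; the Canny detector, its thresholds, the RMSE_THRESHOLD value, and all function names are assumptions.

```python
# Minimal sketch of an edge-side frame filter in the spirit of LightSTATE.
# Assumptions: OpenCV Canny edges, an illustrative RMSE threshold, and
# BGR input frames; the paper's exact detector and parameters may differ.
import cv2
import numpy as np

RMSE_THRESHOLD = 25.0  # hypothetical threshold; tune per deployment


def edge_map(frame):
    """Convert a BGR frame to a binary edge map."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 100, 200)


def rmse(a, b):
    """Root Mean Square Error between two equally sized edge maps."""
    diff = a.astype(np.float32) - b.astype(np.float32)
    return float(np.sqrt(np.mean(diff ** 2)))


def select_significant_frames(video_path, threshold=RMSE_THRESHOLD):
    """Yield frames whose edge map differs enough from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    prev_edges = None
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            edges = edge_map(frame)
            if prev_edges is None or rmse(edges, prev_edges) > threshold:
                prev_edges = edges
                yield frame  # candidate for upload to the cloud-hosted VLM
    finally:
        cap.release()
```

For the cloud stage, a selected frame could then be captioned by GPT-4o, for example via the OpenAI chat-completions API as sketched below. The prompt wording and state taxonomy are illustrative; the paper's exact prompting strategy is not reproduced here.

```python
# Hypothetical cloud stage: caption one selected frame with GPT-4o.
# The prompt text and helper name caption_frame are illustrative assumptions.
import base64

import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_frame(frame):
    """Send a single BGR frame to GPT-4o and return its textual description."""
    ok, buf = cv2.imencode(".jpg", frame)
    if not ok:
        raise ValueError("frame could not be JPEG-encoded")
    b64 = base64.b64encode(buf.tobytes()).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the human state in this frame "
                         "(e.g., sitting, standing, walking)."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Filtering on the edge device before upload keeps bandwidth and VLM cost proportional to the amount of motion rather than the raw frame rate, which is consistent with the 0.8–0.9 second per-frame latency reported in the abstract.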