Open Access
Human Action Recognition Based on Integrating Body Pose, Part Shape, and Motion
Author(s) -
Hany El-Ghaish,
Mohamed E. Hussein,
Amin Shoukry,
Rikio Onai
Publication year - 2018
Publication title -
IEEE Access
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.587
H-Index - 127
ISSN - 2169-3536
DOI - 10.1109/access.2018.2868319
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Human action recognition is a challenging problem, especially in the presence of multiple actors in the scene and/or viewpoint variations. In this paper, three modalities, namely, 3-D skeletons, body-part images, and the motion history image (MHI), are integrated into a hybrid deep learning architecture for human action recognition. The three modalities capture the main aspects of an action: body pose, part shape, and body motion. Although the 3-D skeleton modality captures the actor's pose, it lacks information about the shape of the body parts as well as the shape of manipulated objects. This is the reason for including both the body-part images and the MHI as additional modalities. The deployed architecture combines convolutional neural networks (CNNs), long short-term memory (LSTM), and a fine-tuned pre-trained architecture into a hybrid one. It is called MCLP: multi-modal CNN + LSTM + VGG16 pre-trained on ImageNet. The MCLP consists of three sub-models: CL1D (CNN1D + LSTM), CL2D (CNN2D + LSTM), and CMHI (CNN2D for the MHI), which simultaneously extract the spatial and temporal patterns in the three modalities. The decisions of these three sub-models are fused by a late multiply fusion module, which proved to yield better accuracy than the averaging or maximum fusion methods. The proposed combined model and its sub-models have been evaluated both individually and collectively on four public data sets: UTKinect-Action3D, SBU Interaction, Florence 3-D Action, and NTU RGB+D. Our recognition rates outperform the state-of-the-art rates on all the evaluated data sets.
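To make the three-stream design concrete, below is a minimal PyTorch sketch of the architecture the abstract describes. Only the overall structure follows the abstract: a CNN1D + LSTM stream over skeleton sequences (CL1D), a CNN2D + LSTM stream over body-part crops (CL2D), a fine-tuned ImageNet VGG16 over the MHI (CMHI), and late fusion by element-wise multiplication of class scores. All layer widths, input shapes (25 joints x 3 coordinates, 64x64 part crops, a 224x224 three-channel MHI), the motion threshold, and the decay constant are illustrative assumptions, not the authors' published configuration. The MHI helper implements the classic Bobick-Davis recurrence rather than the paper's exact preprocessing.

```python
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models


def motion_history_image(frames, tau=20, thresh=30):
    """Classic MHI recurrence (Bobick & Davis): pixels are bright where motion
    is recent and decay linearly with age. `frames` is a sequence of grayscale
    uint8 arrays; `tau` and `thresh` are assumed values, not from the paper."""
    mhi = np.zeros_like(frames[0], dtype=np.float32)
    for prev, cur in zip(frames, frames[1:]):
        moved = np.abs(cur.astype(np.int16) - prev.astype(np.int16)) > thresh
        mhi = np.where(moved, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # normalized to [0, 1]


class CL1D(nn.Module):
    """CNN1D + LSTM over per-frame 3-D skeleton joint vectors."""
    def __init__(self, joint_dim=75, hidden=128, num_classes=60):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(joint_dim, 64, kernel_size=3, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                    # x: (batch, frames, joint_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])


class CL2D(nn.Module):
    """CNN2D + LSTM over sequences of body-part image crops."""
    def __init__(self, hidden=128, num_classes=60):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                    # x: (batch, frames, 3, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(feats)
        return self.fc(h[-1])


class CMHI(nn.Module):
    """VGG16 with ImageNet weights, fine-tuned on the MHI (replicated to 3 channels)."""
    def __init__(self, num_classes=60):
        super().__init__()
        self.vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.vgg.classifier[6] = nn.Linear(4096, num_classes)

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        return self.vgg(x)


class MCLP(nn.Module):
    """Late multiply fusion: element-wise product of the three streams' scores."""
    def __init__(self, num_classes=60):
        super().__init__()
        self.cl1d = CL1D(num_classes=num_classes)
        self.cl2d = CL2D(num_classes=num_classes)
        self.cmhi = CMHI(num_classes=num_classes)

    def forward(self, skel, parts, mhi):
        p1 = torch.softmax(self.cl1d(skel), dim=1)
        p2 = torch.softmax(self.cl2d(parts), dim=1)
        p3 = torch.softmax(self.cmhi(mhi), dim=1)
        return p1 * p2 * p3


if __name__ == "__main__":
    model = MCLP(num_classes=60)
    scores = model(torch.randn(2, 30, 75),          # skeleton sequences
                   torch.randn(2, 30, 3, 64, 64),   # body-part crop sequences
                   torch.randn(2, 3, 224, 224))     # MHI inputs
    print(scores.shape)                             # torch.Size([2, 60])
```

Multiplying per-stream softmax scores acts like a soft AND over the modalities: a class keeps a high fused score only when all three streams support it, which is one plausible reading of why the abstract reports multiply fusion beating averaging and maximum fusion.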
