Open Access
Capturing the spatio‐temporal continuity for video semantic segmentation
Author(s) - Chen Xin, Wu Aming, Han Yahong
Publication year - 2019
Publication title - IET Image Processing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.401
H-Index - 45
eISSN - 1751-9667
pISSN - 1751-9659
DOI - 10.1049/iet-ipr.2018.6479
Subject(s) - computer science, segmentation, artificial intelligence, feature (linguistics), decoding methods, frame (networking), pattern recognition (psychology), convolutional neural network, encoding (memory), computer vision, representation (politics), image segmentation, feature extraction, fuse (electrical), algorithm, telecommunications, philosophy, linguistics, electrical engineering, politics, political science, law, engineering
In recent years, image semantic segmentation based on convolutional neural networks has seen substantial advances. However, progress in video semantic segmentation has been comparatively slow. Directly applying image segmentation algorithms to each video frame separately overlooks the temporal region continuity inherent in videos. In this study, the authors propose a novel deep neural network architecture with a newly devised spatio-temporal continuity (STC) module for video semantic segmentation. Specifically, the architecture comprises an encoding network, an STC module, and a decoding network. The encoding network extracts a high-level feature map, which the STC module then takes as input to produce the STC feature map. For decoding, the authors use four dilated convolutional layers to obtain a more abstract representation and a deconvolutional layer to increase its spatial size. Finally, they fuse the current and previous feature representations to obtain the class probabilities. The architecture thus receives a sequence of consecutive video frames and outputs the segmentation result for the current frame. The authors extensively evaluate the proposed approach on the CamVid and KITTI datasets; compared with other methods, their approach not only achieves competitive performance but also has lower complexity.
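The abstract names the components but not their internals, so the following PyTorch sketch only illustrates how such an encoder / STC / decoder pipeline might be wired together. All module names, channel counts, the placeholder encoder, the convolutional-GRU reading of the STC module, and the additive fusion of current and previous representations are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class STCModule(nn.Module):
    """Hypothetical STC module: a convolutional recurrent cell that
    carries a hidden state across frames (an assumption, not the paper's design)."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        if h is None:
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new

class STCSegNet(nn.Module):
    def __init__(self, num_classes, feat=256):
        super().__init__()
        # Encoding network: extracts a high-level feature map (placeholder layers).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.stc = STCModule(feat)
        # Decoding network: four dilated conv layers for a more abstract
        # representation, then a deconvolution to enlarge it (per the abstract).
        self.dilated = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(feat, feat, 3, padding=d, dilation=d),
                          nn.ReLU())
            for d in (1, 2, 4, 8)
        ])
        self.deconv = nn.ConvTranspose2d(feat, feat, 4, stride=4)
        self.classifier = nn.Conv2d(feat, num_classes, 1)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W); returns logits for the last frame.
        h, prev_rep, fused = None, None, None
        for t in range(frames.size(1)):
            f = self.encoder(frames[:, t])      # high-level feature map
            h = self.stc(f, h)                  # STC feature map
            rep = self.deconv(self.dilated(h))  # decoded representation
            # Fuse current and previous representations (summation is one guess).
            fused = rep if prev_rep is None else rep + prev_rep
            prev_rep = rep
        return self.classifier(fused)
```

Under these assumptions, a call such as `STCSegNet(num_classes=11)(torch.randn(1, 5, 3, 64, 64))` would return per-pixel class logits for the fifth frame of the clip; 11 classes matches the label set commonly used for CamVid evaluation.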