A Lip Reading Method Based on 3D Convolutional Vision Transformer
Author(s) - Huijuan Wang, Gangqiang Pu, Tingyu Chen
Publication year - 2022
Publication title - IEEE Access
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.587
H-Index - 127
ISSN - 2169-3536
DOI - 10.1109/access.2022.3193231
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Lip reading has received increasing attention in recent years. It infers the content of speech from the movement of the speaker’s lips. The rapid development of deep learning has driven progress in lip reading. However, because lip reading must process continuous video frames, it needs to capture both the correlation between adjacent images and the correlation between long-distance images. Moreover, lip reading recognition focuses mainly on subtle changes in the lips and their immediate surroundings, so it must extract fine-grained features from small-size images. As a result, the performance of machine lip reading is generally low, and research progress has been slow. To improve the performance of machine lip reading, we propose a lip reading method based on a 3D convolutional vision transformer (3DCvT), which combines a vision transformer with 3D convolution to extract the spatio-temporal features of continuous images, taking full advantage of the properties of convolutions and transformers to extract local and global features effectively. The extracted features are then fed into a Bidirectional Gated Recurrent Unit (BiGRU) for sequence modeling. We demonstrate the effectiveness of our method on the large-scale lip reading datasets LRW and LRW-1000 and achieve state-of-the-art performance.
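The abstract describes a pipeline of three stages: a 3D convolutional stem for local spatio-temporal features, a transformer for global dependencies, and a BiGRU for sequence modeling over frames. The sketch below illustrates that flow in PyTorch; the layer counts, kernel sizes, token layout, and 500-class head (LRW-style word classification) are illustrative assumptions, not the authors' exact 3DCvT configuration.

```python
# Minimal sketch of the 3D-conv + transformer + BiGRU pipeline, assuming PyTorch.
# All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn


class LipReadingSketch(nn.Module):
    def __init__(self, num_classes=500, embed_dim=256):
        super().__init__()
        # 3D convolutional stem: local spatio-temporal features of the mouth region
        # (kernel, stride, and channel sizes are assumptions).
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, embed_dim, kernel_size=(5, 7, 7),
                      stride=(1, 4, 4), padding=(2, 3, 3)),
            nn.BatchNorm3d(embed_dim),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),
        )
        # Transformer encoder: global attention over the spatial tokens of each frame.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # BiGRU: sequence modeling across frames.
        self.bigru = nn.GRU(embed_dim, 256, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, num_classes)

    def forward(self, x):                        # x: (B, 1, T, H, W) grayscale mouth clips
        feat = self.conv3d(x)                    # (B, C, T, H', W')
        b, c, t, h, w = feat.shape
        tokens = feat.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        tokens = self.transformer(tokens)        # global spatial attention per frame
        frame_feat = tokens.mean(dim=1).reshape(b, t, c)  # one feature vector per frame
        seq, _ = self.bigru(frame_feat)          # temporal modeling across the clip
        return self.classifier(seq.mean(dim=1))  # word-level prediction


if __name__ == "__main__":
    model = LipReadingSketch()
    clip = torch.randn(2, 1, 29, 88, 88)         # LRW-style 29-frame mouth crops
    print(model(clip).shape)                      # torch.Size([2, 500])
```

The 3D stem handles the correlation between adjacent frames, while the transformer captures long-distance relationships, matching the division of labor described in the abstract.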
