
Open Access
Large-scale unsupervised audio pre-training for video-to-speech synthesis
Author(s)
Triantafyllos Kefalas,
Yannis Panagakis,
Maja Pantic
Publication year
2024
Publication title
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Resource type
Magazines
Publisher
IEEE
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Previous approaches train almost exclusively on audio-visual datasets, i.e., every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets that may not have a corresponding visual modality, such as audiobooks, radio podcasts, and speech recognition datasets. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and does not require labels or corresponding samples from other modalities (visual, text). We demonstrate that this improves the reconstructed speech and that it is an unexplored way to improve the quality of the generator in a cross-modal task while only requiring samples from one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models against existing work.
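The decoder-initialization step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the layer names, sizes, and helper functions (`make_params`, `build_audio_autoencoder`, `build_video_to_speech`, `init_decoder_from`) are hypothetical, parameters are plain Python lists rather than real network weights, and no actual training is performed. It only shows the idea of copying a decoder pre-trained on audio alone into a video-to-speech model that has a decoder of the same shape.

```python
import copy
import random


def make_params(shapes, seed):
    """Create a toy parameter dictionary: name -> list of floats."""
    rng = random.Random(seed)
    return {name: [rng.uniform(-0.1, 0.1) for _ in range(n)]
            for name, n in shapes.items()}


# Hypothetical layer names and sizes; the record does not describe the architecture.
DECODER_SHAPES = {"decoder.hidden": 8, "decoder.output": 4}


def build_audio_autoencoder(seed):
    """Stage 1 model: audio encoder plus audio decoder (audio-only pre-training)."""
    return {
        "audio_encoder": make_params({"audio_encoder.hidden": 8}, seed),
        "decoder": make_params(DECODER_SHAPES, seed + 1),
    }


def build_video_to_speech(seed):
    """Stage 2 model: video encoder plus an audio decoder of the same shape."""
    return {
        "video_encoder": make_params({"video_encoder.hidden": 8}, seed),
        "decoder": make_params(DECODER_SHAPES, seed + 1),
    }


def init_decoder_from(pretrained, target):
    """Copy the pre-trained decoder parameters into the target model's decoder."""
    src, dst = pretrained["decoder"], target["decoder"]
    if {k: len(v) for k, v in src.items()} != {k: len(v) for k, v in dst.items()}:
        raise ValueError("decoder shapes do not match")
    target["decoder"] = copy.deepcopy(src)  # independent copy, safe to fine-tune
    return target


# Stand-in for a model that has already been pre-trained on audio-only data.
audio_model = build_audio_autoencoder(seed=0)
# Video-to-speech model with a randomly initialized decoder.
v2s_model = build_video_to_speech(seed=100)
# Replace the random decoder with the pre-trained one; the video encoder is untouched.
v2s_model = init_decoder_from(audio_model, v2s_model)
```

In a real deep learning framework the same step would typically be a state-dictionary load restricted to the decoder's parameters, followed by training on the audio-visual data.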
Subject(s)
communication, networking and broadcast technologies, computing and processing, general topics for engineers, signal processing and analysis
Keyword(s)
Decoding, Spectrogram, Hidden Markov models, Visualization, Predictive models, Training, Speech recognition, Video-to-speech, speech synthesis, generative adversarial networks (GANs), conformer, pre-training
Language(s)
English
SCImago Journal Rank
0.916
H-Index
56
eISSN
2329-9304
pISSN
2329-9290
DOI
10.1109/taslp.2024.3382500