Research Library

Open Access
SonicVisionLM: Playing Sound with Vision Language Models
Author(s)
Zhifeng Xie,
Shengye Yu,
Mengtian Li,
Qile He,
Chaofeng Chen,
Yu-Gang Jiang
Publication year: 2024
There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to create sound directly from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present SonicVisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision language models. Instead of generating audio directly from video, we use the capabilities of powerful vision language models (VLMs). When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift transforms the challenging task of aligning image and audio into the more well-studied sub-problems of aligning image-to-text and text-to-audio through popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed temporally controlled audio adapters. Our approach surpasses current state-of-the-art methods for converting video to audio, resulting in enhanced synchronization with the visuals and improved alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
Language(s): English
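
The abstract describes a two-stage pipeline: a vision language model first turns the silent video into textual sound-event suggestions, and a text-to-audio diffusion model then synthesizes the matching effects. The sketch below illustrates that idea using off-the-shelf stand-ins (a BLIP image captioner via transformers and AudioLDM via diffusers); the model choices, prompt template, and helper names are assumptions for illustration, not the authors' actual SonicVisionLM components or its temporally controlled audio adapters.

```python
# Minimal sketch of the video -> text -> audio idea described in the abstract.
# The captioner and text-to-audio model below are illustrative stand-ins,
# not the SonicVisionLM implementation.
from PIL import Image
from transformers import pipeline          # image-to-text VLM stand-in
from diffusers import AudioLDMPipeline     # text-to-audio diffusion stand-in

# Stage 1: describe events in sampled video frames as sound-event prompts.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def suggest_sound_events(frame_paths):
    """Turn sampled video frames into short textual sound-event prompts."""
    prompts = []
    for path in frame_paths:
        caption = captioner(Image.open(path))[0]["generated_text"]
        prompts.append(f"the sound of {caption}")   # hypothetical prompt template
    return prompts

# Stage 2: render each suggested event with a text-to-audio diffusion model.
tta = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")

def render_sound_effects(prompts, seconds=5.0):
    """Generate one waveform (numpy array) per textual prompt."""
    return [
        tta(p, num_inference_steps=50, audio_length_in_s=seconds).audios[0]
        for p in prompts
    ]

if __name__ == "__main__":
    frames = ["frame_000.jpg", "frame_030.jpg"]   # hypothetical sampled frames
    clips = render_sound_effects(suggest_sound_events(frames))
```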
