Open Access
M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP
Author(s) - Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3611348
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Contrastive language-audio pre-training (CLAP), which learns audio-language representations by aligning audio and text in a common feature space, has become popular for solving audio tasks. However, CLAP’s audio features lack generalizability, whereas self-supervised learning (SSL) models offer general-purpose features that perform well across diverse audio tasks. We aim to develop a broadly applicable audio representation and hypothesize that a model that learns both general audio features and CLAP features should achieve this goal, which we call a general-purpose audio-language representation. To implement our hypothesis, we propose M2D-CLAP, the first approach to jointly learn effective general audio and CLAP features. It extends the SSL method masked modeling duo (M2D) by incorporating CLAP and utilizes LLM-based sentence embeddings. The training process consists of multiple stages. In the first stage, generalizable audio features are pre-trained via a multitask objective combining M2D and CLAP, with CLAP leveraging LLM-based semantic embeddings to distill semantic knowledge into the audio features. In the following stages, CLAP features are pre-trained and refined with guidance from the learned audio features. Experiments demonstrated that M2D-CLAP learns high-performing general audio features (e.g., an AudioSet mAP of 49.0 and state-of-the-art results on music tasks) and CLAP features, thereby enabling a general-purpose audio-language representation.
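To make the first-stage multitask objective concrete, the PyTorch-style sketch below combines an M2D-style masked-prediction loss with a CLAP-style contrastive loss against precomputed LLM sentence embeddings. This is a rough illustration under our own assumptions, not the authors' implementation: the function names, the MSE form of the masked-prediction term, the loss weighting, and the temperature are all illustrative.

import torch
import torch.nn.functional as F

def multitask_loss(pred_masked, target_masked, audio_emb, text_emb,
                   lambda_clap=1.0, temperature=0.07):
    # pred_masked / target_masked: (B, N_masked, D) online predictions and
    # target-network features for masked patches (M2D branch).
    # audio_emb: (B, E) pooled audio embeddings; text_emb: (B, E) LLM sentence
    # embeddings of the paired captions (CLAP branch).

    # M2D branch: regress target features of masked patches
    # (plain MSE here as a stand-in for the paper's masked-prediction objective).
    loss_m2d = F.mse_loss(pred_masked, target_masked)

    # CLAP branch: symmetric InfoNCE between audio and text embeddings,
    # with matching audio-caption pairs on the diagonal of the similarity matrix.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                      # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device)   # index of the positive pair
    loss_clap = 0.5 * (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.T, labels))

    # Weighted sum; lambda_clap is a hypothetical balancing factor.
    return loss_m2d + lambda_clap * loss_clap

In this reading, the contrastive term is what distills semantic knowledge from the LLM sentence embeddings into the audio features, while the masked-prediction term preserves the general-purpose quality of the SSL representation.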
