M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP
Author(s) -
Daisuke Niizumi,
Daiki Takeuchi,
Masahiro Yasuda,
Binh Thien Nguyen,
Yasunori Ohishi,
Noboru Harada
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3611348
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Contrastive language-audio pre-training (CLAP), which learns audio-language representations by aligning audio and text in a common feature space, has become popular for solving audio tasks. However, CLAP’s audio features lack generalizability, whereas self-supervised learning (SSL) models offer general-purpose features that perform well across diverse audio tasks. We aim to develop a broadly applicable audio representation and hypothesize that a model that learns both general audio and CLAP features should achieve our goal, which we call a general-purpose audio-language representation. To implement our hypothesis, we propose M2D-CLAP, the first approach to jointly learn effective general audio and CLAP features. It extends the SSL method masked modeling duo (M2D) by incorporating CLAP and utilizes LLM-based sentence embeddings. The training process consists of multiple stages. In the first stage, generalizable audio features are pre-trained via a multitask objective combining M2D and CLAP, with CLAP leveraging LLM-based semantic embeddings to distill semantic knowledge into the audio features. In the following stages, CLAP features are pre-trained and refined with guidance from the learned audio features. Experiments demonstrated that M2D-CLAP learns high-performing general audio features (e.g., AudioSet mAP of 49.0, SOTA results in music tasks) and CLAP features, thereby enabling a general-purpose audio-language representation.
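As a rough illustration of the first-stage multitask objective described in the abstract, the sketch below combines a masked-modeling (M2D) loss with a CLAP-style symmetric contrastive loss that aligns audio embeddings to LLM-based sentence embeddings. The function names, loss weighting, and temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired audio and text embeddings (CLAP-style).

    audio_emb, text_emb: tensors of shape (batch, dim); row i of each is a matched pair.
    Temperature value is an assumption, not from the paper.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)             # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)


def first_stage_multitask_loss(m2d_loss, audio_emb, sentence_emb, clap_weight=1.0):
    """Hypothetical combination of the M2D masked-prediction loss with the CLAP alignment loss.

    m2d_loss: scalar loss from the masked modeling duo objective (computed elsewhere).
    sentence_emb: precomputed LLM-based sentence embeddings for the captions.
    clap_weight: assumed weighting hyperparameter between the two objectives.
    """
    return m2d_loss + clap_weight * clap_contrastive_loss(audio_emb, sentence_emb)
```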