Open Access
Three Years of VoiceMOS Challenges: Lessons Learned by the UWB-NTIS-TTS Team
Author(s) - Marie Kunesova, Jindrich Matousek, Jan Lehecka, Jan Svec, Daniel Tihelka, Zdenek Hanzlicek
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3596644
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
Automatic prediction of mean opinion scores (MOS) promises a faster, cheaper alternative to listening tests, yet robust generalization across speakers, languages, and domains remains a significant challenge. This article presents our system designs and experimental results from three years of participation in the VoiceMOS Challenges (2022–2024), covering MOS prediction for synthesized or voice-converted speech and singing voice, including out-of-domain and cross-language conditions. We evaluate six neural architectures (wav2vec 2.0, QuartzNet, CNN-RNN, LDNet, RawNet3, and HiFi-GAN) and their ensembles. Across all tasks, we find that (i) self-supervised acoustic encoders are the most consistently reliable foundation, (ii) ensembling yields rapidly diminishing returns once complementary representations are covered, and (iii) the diversity and balance of training data outweigh architectural complexity. Notably, the indiscriminate fusion strategy that performed well in 2022 degrades under the mismatched French TTS conditions of 2023, emphasizing the importance of out-of-domain validation. Further experiments show that carefully pruned ensembles can modestly outperform the best single model while remaining within real-time constraints. We conclude with several observations to guide the development of computationally efficient, domain-robust MOS prediction systems.
