
Advancing Cancer Research with Synthetic Data Generation in Low-Data Scenarios
Author(s) -
Patricia A. Apellaniz,
Borja Arroyo Galende,
Ana Jimenez,
Juan Parras,
Santiago Zazo
Publication year - 2025
Publication title -
IEEE Journal of Biomedical and Health Informatics
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 1.293
H-Index - 125
eISSN - 2168-2208
pISSN - 2168-2194
DOI - 10.1109/jbhi.2025.3595371
Subject(s) - bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , signal processing and analysis
The scarcity of medical data, particularly in Survival Analysis (SA) for cancer-related diseases, challenges data-driven healthcare research. While Synthetic Tabular Data Generation (STDG) models have been proposed to address this issue, most rely on datasets with abundant samples and so do not reflect real-world limitations. We propose an STDG approach that leverages transfer learning and meta-learning techniques to create an artificial inductive bias that guides generative models trained on limited samples. Initial experiments were conducted on larger classification datasets, allowing us to assess the methodology across varying sample sizes and in abundant versus scarce data scenarios. For cancer-related SA data, we primarily employed clinical utility validation, as divergence-based similarity validation was not feasible. The methodology improved STDG under constrained data conditions, with divergence-based similarity validation proving to be a robust measure of data quality. Conversely, clinical utility validation yielded similar results regardless of sample size, indicating its limitations in statistically confirming effective STDG. In the SA experiments, we observed that altering the task can reveal whether relationships among variables are accurately generated, with most cases benefiting from the proposed methodology. Our study underscores the efficacy of the approach in tackling medical data scarcity by generating high-quality synthetic data under constrained conditions. While divergence-based similarity validation is essential when sufficient data are available, clinical utility validation alone is insufficient and should be complemented by similarity validation. These findings highlight both the potential and the limitations of STDG methodologies in addressing medical data scarcity.
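As a rough illustration of what divergence-based similarity validation can look like, the sketch below estimates a Jensen-Shannon divergence between one real and one synthetic feature from shared histograms. The abstract does not state which divergence or estimator the authors use, so the function name `js_divergence`, the histogram estimator, the bin count, and the toy Gaussian data are all assumptions made for illustration only.

```python
import numpy as np


def js_divergence(real, synth, bins=20):
    """Histogram estimate of the Jensen-Shannon divergence (base 2, in [0, 1]).

    Hypothetical helper: not the estimator used in the paper.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Shared bin edges so both histograms live on the same support.
    edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=bins)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy data (assumed): one "real" feature and two synthetic candidates.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=2000)
good_synth = rng.normal(0.0, 1.0, size=2000)  # same distribution as real
poor_synth = rng.normal(3.0, 1.0, size=2000)  # shifted distribution

js_good = js_divergence(real, good_synth)
js_poor = js_divergence(real, poor_synth)
print(f"JS(real, good) = {js_good:.4f}")
print(f"JS(real, poor) = {js_poor:.4f}")
```

A lower value indicates a synthetic marginal closer to the real one. A per-feature estimate like this ignores relationships among variables, and histogram estimates become noisy when few samples are available, which is consistent with the abstract's remark that divergence-based validation was not feasible for the small SA datasets.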