Calibrating F1 Scores for Fair Performance Comparison of Binary Classification Models with Application to Student Dropout Prediction

Hyeon Gyu Kim; Yoohyun Park

Open Access

Author(s) - Hyeon Gyu Kim, Yoohyun Park

Publication year - 2025

Publication title - IEEE Access

Language(s) - English

Resource type - Magazines

SCImago Journal Rank - 0.587

H-Index - 127

eISSN - 2169-3536

DOI - 10.1109/access.2025.3594735

Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation

The F1 score is widely used to measure the performance of machine learning models. However, it is not invariant to the ratio of the positive class in the training data, π. Depending on the value of π, the F1 score can be underestimated or overestimated, which makes it difficult to compare models fairly. In this study, we discuss how to calibrate the F1 score for fair performance comparison of binary classification models trained on data with different positive class ratios. We first demonstrate that the F1 score is inversely proportional to accuracy as π changes. Based on this relationship, we define the calibrated F1 score as the arithmetic mean of the two measures and name it the F1* score. Because many prior studies report only the F1 score or only accuracy, but not both, we provide additional equations to estimate the expected F1 score or accuracy when only one of the two measures is available. We examined the accuracy of these equations through experiments on a real dataset for student dropout prediction; the mean absolute difference between the derived and actual values was less than 0.01, indicating that the proposed F1* score can calibrate a given F1 score with a high level of accuracy. We also conducted an example analysis comparing the performance of existing models using the F1* score to highlight its efficacy.
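
The abstract defines the F1* score as the arithmetic mean of the F1 score and accuracy. The following is a minimal Python sketch of that definition only, computed from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the paper, and the paper's additional equations for estimating a missing F1 score or accuracy are not reproduced here because the abstract does not state them.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    # Convention chosen for this sketch: return 0.0 when F1 is undefined.
    return 2 * tp / denom if denom else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def f1_star(tp: int, fp: int, fn: int, tn: int) -> float:
    """F1* as described in the abstract: arithmetic mean of F1 and accuracy."""
    return (f1_score(tp, fp, fn) + accuracy(tp, fp, fn, tn)) / 2


if __name__ == "__main__":
    # Hypothetical counts for a test set of 1000 samples with 100 positives.
    tp, fp, fn, tn = 80, 90, 20, 810
    print(f"F1       = {f1_score(tp, fp, fn):.4f}")
    print(f"accuracy = {accuracy(tp, fp, fn, tn):.4f}")
    print(f"F1*      = {f1_star(tp, fp, fn, tn):.4f}")
```

With these hypothetical counts, the F1 score (about 0.59) is well below accuracy (0.89), and F1* falls midway between the two.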
