Open Access
It’s Dangerous to Prompt Alone! Exploring how Fine-tuning GPT-4o Affects Novices’ Programming Error Resolution
Author(s) - Eddie Antonio Santos, Audrey Salmon, Katie Hammer
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3613500
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Large language models (LLMs) like GPT-4o are increasingly being incorporated into computer science classrooms for tasks such as helping students resolve programming error messages. Prior work has found only weak or statistically insignificant evidence that LLM-generated error explanations help students resolve errors faster, especially compared to expert-handwritten error explanations. This work explores how GPT-4o can be influenced through fine-tuning to produce error explanations that resemble expert-handwritten explanations. We evaluate how three error message styles (traditional Python error messages, baseline GPT-4o explanations, fine-tuned GPT-4o explanations) affect novice programmers’ ability to resolve programming errors. Eighty (n = 80) CS1 students completed six debugging tasks in a within-subjects study design that ensured each participant saw each of the three error message styles twice. We measured time-to-fix, number of code reruns, and task failure rate, and then surveyed students’ preferences. Participants resolved errors significantly faster with the fine-tuned GPT-4o messages, but, consistent with prior work, we found no statistically significant difference between the baseline GPT-4o and traditional Python error messages. Error message style had no effect on the number of reruns or on task failures; however, the particular debugging task had a significant effect on both. Despite solving problems faster with the fine-tuned GPT-4o model, participants preferred the baseline GPT-4o output (similar to unmodified ChatGPT). These findings strongly suggest that the nature of the debugging task has a greater effect on novice programmers’ error resolution than whether or not the students use a GenAI system to help them debug.
