Open Access
It’s Dangerous to Prompt Alone! Exploring how Fine-tuning GPT-4o Affects Novices’ Programming Error Resolution
Author(s) - Eddie Antonio Santos, Audrey Salmon, Katie Hammer
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3613500
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Large language models (LLMs) like GPT-4o are increasingly being incorporated into computer science classrooms for tasks such as helping students resolve programming error messages. Prior work has found only weak or statistically insignificant evidence that LLM-generated error explanations help students resolve errors faster, especially compared to expert-handwritten error explanations. This work explores how GPT-4o can be influenced through fine-tuning to produce error explanations that resemble expert-handwritten explanations. We evaluate how three error message styles (traditional Python error messages, baseline GPT-4o explanations, fine-tuned GPT-4o explanations) affect novice programmers’ ability to resolve programming errors. Eighty (n = 80) CS1 students completed six debugging tasks in a within-subjects study design that ensured each participant saw each of the three error message styles twice. We measured time-to-fix, number of code reruns, and task failure rate, and then surveyed students’ preferences. Participants resolved errors significantly faster with the fine-tuned GPT-4o messages, but, consistent with prior work, we found no statistically significant difference between the baseline GPT-4o and traditional Python error messages. Error message style had no effect on the number of reruns or on task failures; however, the particular debugging task had a significant effect on both. Despite solving problems faster with the fine-tuned GPT-4o model, participants preferred the baseline GPT-4o output (similar to unmodified ChatGPT). These findings strongly suggest that the nature of the debugging task has a greater effect on novice programmers’ error resolution than whether or not the students use a GenAI system to help them debug.
