Is This the Best Prompt? Scoring Prompts for Arabic NLP Across LLMs
Author(s) - Dania Refai, Maged S. Al-Shaibani, Irfan Ahmad
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3616181
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Large language models (LLMs) demonstrate impressive capabilities across a range of natural language processing (NLP) tasks. However, they are highly sensitive to prompt design, which significantly affects their ability to align outputs with user intent. Poorly crafted prompts can result in misleading or irrelevant responses, yet selecting the most effective prompt from several candidates remains an open challenge. Despite the growing importance of prompt engineering, there is no comprehensive framework to systematically evaluate prompts across multiple dimensions, such as similarity, performance, efficiency, and consistency, particularly in scenarios where performance can be traded off against computational cost or consistency. In this study, we propose a novel scoring framework to evaluate handcrafted prompts across four essential dimensions: similarity, performance, efficiency (measured by latency, input tokens, and output tokens), and consistency. Considering Arabic, a relatively low-resource, morphologically rich language, as a case study, we evaluated this framework on six diverse text classification tasks: dialect identification, sentiment analysis, offensive language detection, stance detection, emotion detection, and sarcasm detection. Our methodology assesses prompts across multiple LLMs (GPT-4o mini, LLaMA, ALLAM, and Claude 3.5 Haiku), providing valuable insights into model-specific and task-specific performance patterns. Results demonstrate that no single prompt universally excels across all dimensions; rather, optimal prompts vary based on specific task requirements and evaluation priorities. The proposed framework enables the identification of the most effective prompts for each application context while revealing important trade-offs between performance metrics.
By addressing the unique challenges of Arabic NLP, this research not only advances prompt engineering for underrepresented languages but also provides a systematic and adaptable methodology for prompt evaluation that can enhance LLM performance across diverse linguistic contexts, domains, tasks, and model architectures.
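The abstract names four scoring dimensions (similarity, performance, efficiency via latency and token counts, and consistency) but the record does not give the aggregation details. The following is a minimal illustrative sketch of how such a multi-dimension prompt score might be combined; the normalization constants, equal weights, and all metric values here are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch of a four-dimension prompt score (similarity, performance,
# efficiency, consistency). Weights and normalization bounds are illustrative
# assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class PromptMetrics:
    similarity: float      # output-to-reference similarity, in [0, 1]
    performance: float     # task accuracy or F1, in [0, 1]
    latency_s: float       # seconds per response (lower is better)
    input_tokens: int
    output_tokens: int
    consistency: float     # agreement across repeated runs, in [0, 1]

def efficiency_score(m: PromptMetrics,
                     max_latency: float = 10.0,
                     max_tokens: int = 2000) -> float:
    """Map cost metrics to [0, 1], higher = cheaper (illustrative normalization)."""
    lat = max(0.0, 1.0 - m.latency_s / max_latency)
    tok = max(0.0, 1.0 - (m.input_tokens + m.output_tokens) / max_tokens)
    return (lat + tok) / 2

def prompt_score(m: PromptMetrics,
                 weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted aggregate over the four dimensions (equal weights assumed)."""
    dims = (m.similarity, m.performance, efficiency_score(m), m.consistency)
    return sum(w * d for w, d in zip(weights, dims))

# Toy comparison of two candidate prompts (fabricated numbers for illustration).
candidates = {
    "zero-shot": PromptMetrics(0.80, 0.72, 1.2, 150, 40, 0.90),
    "few-shot":  PromptMetrics(0.85, 0.78, 2.5, 600, 45, 0.95),
}
best = max(candidates, key=lambda name: prompt_score(candidates[name]))
```

Reweighting `weights` toward efficiency or consistency would let the same machinery select different "best" prompts per application context, which is the trade-off the abstract highlights.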