Is This the Best Prompt? Scoring Prompts for Arabic NLP Across LLMs
Author(s) - Dania Refai, Maged S. Al-Shaibani, Irfan Ahmad
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3616181
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Large language models (LLMs) demonstrate impressive capabilities across a range of natural language processing (NLP) tasks. However, they are highly sensitive to prompt design, which significantly affects their ability to align outputs with user intent. Poorly crafted prompts can result in misleading or irrelevant responses, yet selecting the most effective prompt from several candidates remains an open challenge. Despite the growing importance of prompt engineering, there is no comprehensive framework to systematically evaluate prompts across multiple dimensions, such as similarity, performance, efficiency, and consistency, particularly in scenarios where performance can be traded off against computational cost or consistency. In this study, we propose a novel scoring framework to evaluate handcrafted prompts across four essential dimensions: similarity, performance, efficiency (measured by latency, input tokens, and output tokens), and consistency. Considering Arabic, a relatively low-resource, morphologically rich language, as a case study, we evaluated this framework on six diverse text classification tasks: dialect identification, sentiment analysis, offensive language detection, stance detection, emotion detection, and sarcasm detection. Our methodology assesses prompts across multiple LLMs (GPT-4o mini, LLaMA, ALLAM, and Claude 3.5 Haiku), providing valuable insights into model-specific and task-specific performance patterns. Results demonstrate that no single prompt universally excels across all dimensions; rather, optimal prompts vary based on specific task requirements and evaluation priorities. The proposed framework enables the identification of the most effective prompts for each application context while revealing important trade-offs between performance metrics.
By addressing the unique challenges of Arabic NLP, this research not only advances prompt engineering for underrepresented languages but also provides a systematic and adaptable methodology for prompt evaluation that can enhance LLM performance across diverse linguistic contexts, domains, tasks, and model architectures.
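The abstract names four scoring dimensions (similarity, performance, efficiency via latency and token counts, and consistency) but the record does not give the aggregation details. The following is a minimal illustrative sketch of how such a multi-dimension prompt score might be combined; the normalization constants, equal weights, and all metric values here are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch of a four-dimension prompt score (similarity, performance,
# efficiency, consistency). Weights and normalization bounds are illustrative
# assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class PromptMetrics:
    similarity: float      # output-to-reference similarity, in [0, 1]
    performance: float     # task accuracy or F1, in [0, 1]
    latency_s: float       # seconds per response (lower is better)
    input_tokens: int
    output_tokens: int
    consistency: float     # agreement across repeated runs, in [0, 1]

def efficiency_score(m: PromptMetrics,
                     max_latency: float = 10.0,
                     max_tokens: int = 2000) -> float:
    """Map cost metrics to [0, 1], higher = cheaper (illustrative normalization)."""
    lat = max(0.0, 1.0 - m.latency_s / max_latency)
    tok = max(0.0, 1.0 - (m.input_tokens + m.output_tokens) / max_tokens)
    return (lat + tok) / 2

def prompt_score(m: PromptMetrics,
                 weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted aggregate over the four dimensions (equal weights assumed)."""
    dims = (m.similarity, m.performance, efficiency_score(m), m.consistency)
    return sum(w * d for w, d in zip(weights, dims))

# Toy comparison of two candidate prompts (fabricated numbers for illustration).
candidates = {
    "zero-shot": PromptMetrics(0.80, 0.72, 1.2, 150, 40, 0.90),
    "few-shot":  PromptMetrics(0.85, 0.78, 2.5, 600, 45, 0.95),
}
best = max(candidates, key=lambda name: prompt_score(candidates[name]))
```

Reweighting `weights` toward efficiency or consistency would let the same machinery select different "best" prompts per application context, which is the trade-off the abstract highlights.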