Calibrating Generative AI for Second Language Writing Assessment: Combining Statistical Validation with Prompt Design

Farzi, Reza

doi:10.61838/japes.2.4.1

Files

JAPES-2-4-code91_260124_103414.pdf (578.29 KB)

Date

2024-12-10

Authors

Farzi, Reza

Creative Commons License

Attribution-NonCommercial 4.0 International

Abstract

Generative artificial intelligence (GenAI) is emerging as a powerful tool in second language writing assessment, offering the potential for rapid, consistent, and scalable evaluation. However, its value depends on whether its scoring reflects the nuanced judgments of experienced human raters. This study introduces the concept of calibration in the context of second language writing assessment, defined as the deliberate and iterative refinement of AI prompts, guided by statistical evidence, to align AI scoring with human evaluative reasoning. Sixty essays produced by 30 upper-intermediate learners of English were evaluated independently by an experienced human rater and by ChatGPT 3.5, using the English for Academic Purposes (EAP) Writing Assessment Rubric. Statistical analyses assessed inter-rater agreement, score consistency, and systematic bias. In the initial baseline stage, ChatGPT 3.5 tended to act as a strict marker, applying the rubric literally and assigning lower scores than the human rater. Across three calibration stages, which included clarifying rubric descriptors, refining interpretive guidance, and incorporating representative scoring examples, the AI scoring moved closer to the human benchmark. Agreement improved from a Cohen’s kappa of 0.52 to 0.89, correlation rose from .76 to .94, and the mean score difference narrowed from -2.45 to -0.95, with the latter difference no longer statistically significant. Qualitative analysis showed a shift from a narrow emphasis on surface errors to a more balanced consideration of accuracy, organization, development, and communicative effectiveness. The results suggest that calibration offers a replicable and evidence-based approach to integrating generative AI into second language writing assessment, enhancing the fairness, validity, and reliability of AI-assisted evaluation.
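The three statistics named in the abstract (agreement, correlation, and mean score difference) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the record does not say whether the kappa was weighted, which correlation coefficient was used, or which significance test was applied, so this sketch uses unweighted Cohen's kappa, Pearson's r, and a paired t statistic. The function names and the score lists are hypothetical and are not data from the study.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters' categorical scores (assumed variant)."""
    n = len(a)
    # Proportion of essays on which both raters gave the same score
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's marginal score counts
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def pearson_r(a, b):
    """Pearson correlation between two equal-length score lists."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def paired_difference(a, b):
    """Mean of (a - b) and the paired t statistic for that difference."""
    diffs = [x - y for x, y in zip(a, b)]
    d_bar = mean(diffs)
    t_stat = d_bar / (stdev(diffs) / sqrt(len(diffs)))
    return d_bar, t_stat


# Hypothetical band scores for eight essays; NOT data from the study
human_scores = [4, 3, 5, 2, 4, 3, 5, 4]
ai_scores = [3, 3, 4, 2, 4, 2, 5, 3]

kappa = cohen_kappa(human_scores, ai_scores)
r = pearson_r(human_scores, ai_scores)
mean_diff, t_stat = paired_difference(ai_scores, human_scores)

print(f"kappa={kappa:.2f}  r={r:.2f}  mean diff (AI - human)={mean_diff:.2f}  t={t_stat:.2f}")
```

In a calibration workflow of the kind the abstract describes, such statistics would be recomputed after each prompt revision. A negative mean difference (AI minus human) corresponds to the "strict marker" pattern reported for the baseline stage.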

Keywords

Generative artificial intelligence, Second language writing assessment, Calibration, Prompt design, Inter-rater reliability, Statistical validation

Citation

Farzi, R. (2024). Calibrating Generative AI for Second Language Writing Assessment: Combining Statistical Validation with Prompt Design. Assessment and Practice in Educational Sciences, 2(4), 1-12.

URI

http://hdl.handle.net/10393/51670

Collections

ILOB - Publications // OLBI - Publications
