In a recent article published in the journal Computers and Education: Artificial Intelligence, researchers investigated the potential of large language models (LLMs) to automate the scoring of essays written by English language learners. Their goal was to evaluate these advanced artificial intelligence (AI) systems as tools for automated essay scoring (AES).
Background
AES is the process of using technology to analyze and evaluate written work, usually by assigning a numerical score. It offers a solution to challenges such as the time, cost, and inconsistency associated with human scoring, but it also raises concerns about validity, reliability, transparency, and ethics.
Traditional AES systems use machine learning to extract specific features of writing, such as grammar, vocabulary, and coherence, and compare them against human-scored essays or predefined criteria. However, these systems are limited by their feature selection, the genres of writing they can handle, and their accessibility and cost.
LLMs are AI systems capable of generating natural language text. Although not specifically designed for AES, they offer advantages over traditional systems, such as versatility and user interaction through chatbots like Chat Generative Pre-trained Transformer (ChatGPT), Bard, and Claude.
About the Research
In this paper, the authors aimed to explore the validity and reliability of generative LLMs in scoring student writing. Their primary goal was to evaluate the performance of four widely used LLMs, Google’s PaLM 2, Anthropic’s Claude 2, and OpenAI’s Generative Pre-trained Transformer 3.5 (GPT-3.5) and GPT-4, in assessing essays written by English language learners.
For this study, the researchers selected 119 essays from an English language university admission and placement test. Each essay was scored twice by each LLM on separate occasions and by two human raters using a holistic rubric. The main metrics for assessing the models' performance were intrarater reliability (consistency of scores given by the same rater over time) and interrater reliability (agreement between scores given by different raters). The authors also evaluated the validity of the LLMs' scores by comparing them to the human ratings. They measured these reliability metrics using the intraclass correlation coefficient (ICC) and Pearson’s correlation.
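The paper does not publish its analysis code, but the two metrics are standard. The following is a minimal sketch of how they might be computed in Python, assuming hypothetical score arrays (llm_run1, llm_run2, human) for a handful of essays; it uses the pingouin library for ICC and SciPy for Pearson's correlation.

import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr

# Hypothetical scores for 5 essays: the same LLM on two occasions, plus human ratings.
llm_run1 = [3, 4, 2, 5, 3]
llm_run2 = [3, 4, 3, 5, 3]
human    = [3, 5, 2, 4, 3]

# Long-format table (one row per essay/occasion pair), as pingouin expects.
df = pd.DataFrame({
    "essay": list(range(5)) * 2,
    "rater": ["run1"] * 5 + ["run2"] * 5,
    "score": llm_run1 + llm_run2,
})

# Intrarater reliability: ICC across the two scoring occasions of the same model.
icc = pg.intraclass_corr(data=df, targets="essay", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])

# Validity: Pearson's correlation between the LLM's scores and the human ratings.
r, p = pearsonr(llm_run1, human)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

The same pattern applies to interrater reliability, simply by treating the LLM and a human rater (or two different LLMs) as the two "raters" in the long-format table.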
The methodology involved a detailed analysis of the models' scoring patterns and their consistency over time. The study also examined potential reasons for any variability in the models' performance and offered insights into the strengths and weaknesses of each LLM in the context of AES.
Research Findings
The outcomes showed that GPT-4 was the most reliable LLM, with excellent intrarater reliability and strong validity: its scores correlated highly with those of human raters, comparable to traditional AES systems. Claude 2 demonstrated good intrarater reliability and moderate validity, while PaLM 2 and GPT-3.5 showed moderate intra- and interrater reliability. Most LLMs, except GPT-3.5, improved their intrarater reliability over time. However, the interrater reliability of GPT-3.5 and GPT-4 decreased slightly over time.
The study also identified limitations in LLM performance, such as scoring on a continuous scale, completing unfinished sentences, hallucinating text features, and showing non-deterministic behavior. These issues could arise from factors like randomness in sampling, temperature settings, token limits, and model updates. Despite their advanced capabilities, LLMs can exhibit variability due to essay topic complexity, training data differences, and the distinct ways humans and AI assess writing.
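To illustrate one of these factors, here is a generic sketch (not the internals of any of the evaluated models) of how temperature rescales an output distribution before sampling. The logits and score bands are invented for illustration; the point is that higher temperatures flatten the distribution, so repeated scoring runs are more likely to diverge.

import numpy as np

rng = np.random.default_rng()

# Hypothetical logits a model might assign to the score bands 1-5 for one essay.
logits = np.array([0.2, 1.1, 2.3, 1.9, 0.4])
scores = np.array([1, 2, 3, 4, 5])

def sample_score(logits, temperature):
    """Apply temperature-scaled softmax, then draw one score from the distribution."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(scores, p=probs)

for t in (0.2, 1.0, 1.5):
    draws = [int(sample_score(logits, t)) for _ in range(10)]
    print(f"temperature={t}: {draws}")

At a low temperature the same score is drawn almost every time, while higher settings produce visibly different scores across runs, which is one plausible source of the run-to-run variability the study observed.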
Applications
This research has significant implications for the future of AES and educational technology. The demonstrated reliability and validity of models like GPT-4 suggest that they can be effectively integrated into educational environments to assist with essay grading. This integration could reduce the grading burden on educators, allowing them to focus more on teaching and providing personalized support to students.
Additionally, the adaptability of generative AI models extends beyond traditional essay assessments. They can be used for a variety of writing tasks, including creative writing and technical reports. Their accessibility and ease of use make them valuable tools for providing formative feedback, enabling students to improve their writing skills through immediate and detailed evaluations.
Conclusion
In summary, the LLMs showed strong potential to transform language assessment practices. The researchers cautioned, however, that these models were not specifically designed for AES, lack full transparency, and are not fully understood. Moving forward, they emphasized the need for further research to evaluate the validity and reliability of LLMs across various contexts, writing genres, and assessment criteria.
Additionally, they underscored the importance of addressing the ethical and pedagogical implications of using LLMs for AES. Furthermore, they suggested that cross-disciplinary collaboration among computational linguists, machine learning experts, and language assessment specialists could help fine-tune LLMs for the specific purpose of assessing language.