In a recent article published in the journal Humanities & Social Sciences Communications, researchers investigated the feasibility of using large language models (LLMs) for automated essay scoring (AES) in non-native Japanese writing, comparing the effectiveness of various LLMs with conventional machine learning-based AES tools.
Background
In recent years, artificial intelligence (AI) has made significant strides in natural language processing (NLP), particularly with the development of LLMs capable of generating fluent and coherent texts. LLMs have been applied to various language assessment tasks, including AES, automated listening tests, and automated oral proficiency assessments.
AES involves using computer programs to evaluate and score written texts based on predefined criteria. This process reduces the cost and time associated with human rating, provides consistent and objective feedback, and enhances the validity and reliability of assessments. It has been widely used for standardized tests, placement tests, and self-study tools.
However, most existing AES systems are designed for English or other European languages, and few studies have addressed the unique challenges of scoring non-native Japanese writing, which has a distinct structure and writing style compared to English.
About the Research
In this paper, the authors explored the potential of LLM-based AES for non-native Japanese writing by comparing the effectiveness of five models: two conventional machine learning-based models (Jess and JWriter), two LLMs [generative pre-trained transformer 4 (GPT-4) and bidirectional encoder representations from transformers (BERT)], and one Japanese local LLM [Open-CALM large model (OCLM)].
Conventional machine learning-based methods rely on predetermined linguistic features, such as lexical richness, syntactic complexity, and text cohesion, to train the model and assign scores. LLMs, on the other hand, use a transformer architecture and large amounts of text data to learn language representations and generate scores based on input prompts. The Japanese local LLM, OCLM, is a pre-trained model specifically designed for Japanese, incorporating the LoRA adapter and GPT-NeoX frameworks to enhance its language processing capabilities.
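To make the contrast concrete, here is a minimal sketch of one "predetermined linguistic feature" of the kind conventional systems such as Jess rely on: lexical richness measured as the type-token ratio. This is an illustrative example, not code from the tools discussed in the study, which combine many such features.

```python
def type_token_ratio(tokens):
    """Lexical richness as type-token ratio: the number of distinct
    word types divided by the total number of tokens in the text."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Example: 4 tokens, 3 distinct types -> ratio 0.75
score = type_token_ratio(["watashi", "wa", "watashi", "desu"])
```

A feature-based scorer would compute dozens of such measures per essay and feed them to a regression or classification model, whereas an LLM receives the raw essay text (plus any prompt context) directly.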
The study utilized a dataset of 1,400 story-writing scripts from the International Corpus of Japanese as a Second Language (I-JAS), drawn from 1,000 participants representing 12 different first languages. The participants completed two story-writing tasks based on 4-panel illustrations, and their Japanese language proficiency levels were assessed by two online tests: the Japanese Computerized Adaptive Test (J-CAT) and the Simple Performance-Oriented Test (SPOT). Researchers used 16 measures to capture writing quality, including lexical richness, syntactic complexity, cohesion, content elaboration, and grammatical accuracy.
Research Findings
The authors conducted a statistical analysis to compare the annotation accuracy and learning-level prediction of all the models. The outcomes showed that GPT-4 outperformed the other models in both aspects, with a quadratic weighted kappa (QWK) of 0.81 and a proportional reduction in mean squared error (PRMSE) of 0.87. BERT and the OCLM achieved similar performance, with QWKs of 0.75 and 0.76, and PRMSEs of 0.81 and 0.82, respectively. Jess and JWriter performed poorly, with QWKs of 0.61 and 0.58, and PRMSEs of 0.69 and 0.66, respectively.
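Quadratic weighted kappa, the headline agreement statistic here, measures how closely two raters agree on an ordinal score scale while penalizing disagreements by the squared distance between the scores they assign. A minimal sketch of the metric follows; the score range is whatever the rating scale uses, and this implementation is a standard textbook formulation rather than the study's own code.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two raters on an ordinal scale.
    1.0 = perfect agreement; 0.0 = chance-level agreement."""
    n = max_rating - min_rating + 1
    # Observed confusion matrix between the two raters
    observed = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating, b - min_rating] += 1
    # Expected matrix under independence (outer product of marginals)
    expected = np.outer(observed.sum(axis=1),
                        observed.sum(axis=0)) / len(rater_a)
    # Quadratic disagreement weights: squared distance between scores
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Identical score vectors yield a QWK of 1.0, while scores that agree no better than chance yield roughly 0.0, so GPT-4's 0.81 indicates substantial agreement with the human raters.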
Additionally, the study compared 18 different models that utilized various prompts, such as the essay text, the essay text with the illustration, the essay text with the illustration and the title, and more. The results indicated that the prompt had a significant influence on the accuracy and reliability of the LLM-based AES.
The best prompt for GPT-4 was the essay text with the illustration and the title, achieving a QWK of 0.81 and a PRMSE of 0.87. The best prompt for BERT was the essay text with the illustration, achieving a QWK of 0.75 and a PRMSE of 0.81. The best prompt for the OCLM was the essay text with the illustration, the title, and the keywords, achieving a QWK of 0.76 and a PRMSE of 0.82.
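The prompt variants above differ only in how much task context accompanies the essay. The sketch below shows one way such a prompt might be assembled for GPT-4's best configuration (essay plus illustration plus title); the wording, score scale, and function name are assumptions for illustration, not the authors' exact prompt.

```python
def build_scoring_prompt(essay: str, illustration_desc: str, title: str) -> str:
    """Assemble an AES prompt combining the essay with task context
    (a textual description of the 4-panel illustration and the title)."""
    return (
        "You are a rater of non-native Japanese writing.\n"
        f"Task title: {title}\n"
        f"Illustration (described): {illustration_desc}\n"
        f"Essay:\n{essay}\n"
        "Assign a holistic score for this essay and reply with the score only."
    )

prompt = build_scoring_prompt(
    essay="昨日、ピクニックに行きました。...",
    illustration_desc="A four-panel story about a picnic interrupted by rain.",
    title="Picnic",
)
```

The study's finding that richer context (illustration, title, keywords) improved accuracy suggests the models use the task description to judge content relevance, not just surface fluency.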
Overall, the GPT-4 model demonstrated a high level of agreement with human raters across various writing proficiency criteria, including lexical richness, syntactic complexity, content, and grammatical accuracy. Notably, the agreement coefficient between GPT-4 and human scoring even surpassed the agreement among human raters themselves, highlighting the potential of GPT-4 to enhance AES by reducing biases and subjectivity.
Applications
The paper demonstrated the feasibility and effectiveness of LLM-based AES for non-native Japanese, which can have various applications in language education and assessment. These systems can be used as self-study tools for learners to practice their writing skills and receive instant, personalized feedback. They can also serve as supplementary tools for teachers, reducing their workload and enhancing teaching quality. Moreover, LLM-based AES can function as alternative or complementary tools for standardized tests, providing more valid and reliable scores.
Conclusion
In summary, the capability of LLM-based AES for non-native Japanese was comprehensively explored by comparing the performance of different models and prompts. Among all the LLMs and other AES tools evaluated, GPT-4 demonstrated the best performance.
Additionally, prompt design was found to be crucial for achieving accurate and reliable evaluations. Future work should focus on improving the LLMs with more data and fine-tuning, investigating the effect of LLM-based AES on learners’ motivation and performance, and developing more user-friendly and interactive AES systems.