GPT-4 Enhances Japanese Essay Scoring

In a recent article published in the journal Humanities & Social Sciences Communications, researchers investigated the feasibility of utilizing large language models (LLMs) for automated essay scoring (AES) in non-native Japanese writing, comparing the effectiveness of various LLMs with conventional machine learning-based AES tools.

Study: GPT-4 Enhances Japanese Essay Scoring. Image Credit: Paper piper/Shutterstock

Background

In recent years, artificial intelligence (AI) has made significant strides in natural language processing (NLP), particularly with the development of LLMs capable of generating fluent and coherent texts. LLMs have been applied to various language assessment tasks, including AES, automated listening tests, and automated oral proficiency assessments.

AES involves using computer programs to evaluate and score written texts based on predefined criteria. This process reduces the cost and time associated with human rating, provides consistent and objective feedback, and enhances the validity and reliability of assessments. It has been widely used for standardized tests, placement tests, and self-study tools.

However, most existing AES systems are designed for English or other European languages, and few studies have addressed the unique challenges of scoring non-native Japanese writing, which has a distinct structure and writing style compared to English.

About the Research

In this paper, the authors explored the potential of LLM-based AES for non-native Japanese writing by comparing the effectiveness of five models: two conventional machine learning-based models (Jess and JWriter), two LLMs [generative pre-trained transformer 4 (GPT-4) and bidirectional encoder representations from transformers (BERT)], and one Japanese local LLM [open-calm large model (OCLM)].

Conventional machine learning-based methods rely on predetermined linguistic features, such as lexical richness, syntactic complexity, and text cohesion, to train the model and assign scores. LLMs, on the other hand, use a transformer architecture and large amounts of text data to learn language representations and generate scores based on input prompts. The Japanese local LLM, OCLM, is a pre-trained model specifically designed for Japanese, incorporating LoRA (low-rank adaptation) adapters and the GPT-NeoX framework to enhance its language processing capabilities.
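To make the contrast concrete, the sketch below shows how an LLM can be prompted to score an essay. It is a minimal illustration only: the `call_llm` callable, the rubric wording, and the 1-6 scale are assumptions and do not reproduce the paper's actual prompts or rating scale.

```python
# Minimal sketch of prompt-based scoring with an LLM (e.g., GPT-4 via a chat API).
# `call_llm` is a placeholder for whatever client sends a prompt and returns text;
# the rubric wording and the 1-6 scale below are illustrative, not the paper's.

SCORING_PROMPT = """You are a rater of essays written by learners of Japanese.
Rate the essay below from 1 (lowest) to 6 (highest) for overall writing quality,
considering lexical richness, syntactic complexity, cohesion, content
elaboration, and grammatical accuracy. Return only the integer score.

Essay:
{essay}
"""

def score_essay(essay_text: str, call_llm) -> int:
    """Send the scoring prompt to the LLM and parse the integer it returns."""
    reply = call_llm(SCORING_PROMPT.format(essay=essay_text))
    return int(reply.strip())
```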

The study utilized a dataset of 1,400 story-writing scripts from the International Corpus of Japanese as a Second Language (I-JAS), written by 1,000 participants representing 12 different first languages. The participants completed two story-writing tasks based on four-panel illustrations, and their Japanese language proficiency was assessed with two online tests: the Japanese Computerized Adaptive Test (J-CAT) and the Simple Performance-Oriented Test (SPOT), an online Japanese placement test. The researchers used 16 measures to capture writing quality, including lexical richness, syntactic complexity, cohesion, content elaboration, and grammatical accuracy.
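As a rough illustration of what such measures look like in code, the sketch below computes two simple proxies: lexical richness as a type-token ratio and syntactic complexity as mean sentence length. These are generic stand-ins, not the paper's 16 measures, and a real pipeline for Japanese would first segment the text with a morphological analyzer such as MeCab, since Japanese is written without spaces.

```python
# Illustrative proxies for two of the writing-quality dimensions mentioned above.
# These are generic examples, not the study's 16 measures. Tokens are assumed to
# come from a Japanese morphological analyzer (e.g., MeCab/fugashi), not shown here.

def type_token_ratio(tokens: list[str]) -> float:
    """Lexical richness proxy: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_length(sentences: list[list[str]]) -> float:
    """Syntactic complexity proxy: average number of tokens per sentence."""
    return sum(len(s) for s in sentences) / len(sentences) if sentences else 0.0
```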

Research Findings

The authors conducted a statistical analysis to compare the annotation accuracy and learning-level prediction of all the models. GPT-4 outperformed the other models on both measures, with a quadratic weighted kappa (QWK) of 0.81 and a proportional reduction of mean squared error (PRMSE) of 0.87. BERT and the OCLM achieved similar performance, with QWKs of 0.75 and 0.76 and PRMSEs of 0.81 and 0.82, respectively. Jess and JWriter performed poorly, with QWKs of 0.61 and 0.58 and PRMSEs of 0.69 and 0.66, respectively.
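Quadratic weighted kappa measures agreement between two sets of ordinal ratings, penalizing large disagreements more heavily than small ones. The sketch below is a generic implementation for integer scores, not the paper's evaluation code; scikit-learn's `cohen_kappa_score(a, b, weights="quadratic")` computes the same quantity.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two sets of integer ratings, with quadratic penalties."""
    rater_a = np.asarray(rater_a, dtype=int)
    rater_b = np.asarray(rater_b, dtype=int)
    n = max_rating - min_rating + 1

    # Observed rating matrix (confusion matrix between the two raters).
    observed = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating, b - min_rating] += 1

    # Expected matrix under independence (outer product of the marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(rater_a)

    # Quadratic disagreement weights: 0 on the diagonal, largest in the corners.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```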

Additionally, the study compared 18 different models that utilized various prompts, such as the essay text alone, the essay text with the illustration, and the essay text with the illustration and the title. The results indicated that prompt design had a significant influence on the accuracy and reliability of LLM-based AES.

The best prompt for GPT-4 was the essay text with the illustration and the title, achieving a QWK of 0.81 and a PRMSE of 0.87. The best prompt for BERT was the essay text with the illustration, achieving a QWK of 0.75 and a PRMSE of 0.81. The best prompt for the OCLM was the essay text with the illustration, the title, and the keywords, achieving a QWK of 0.76 and a PRMSE of 0.82.
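The sketch below illustrates how such prompt variants might be assembled from the available task information. The field names and wording are hypothetical; in particular, whether the illustration is passed as an image or as a textual description depends on the model, and the paper's exact prompt templates are not reproduced here.

```python
# Hypothetical builder for prompt variants like those compared in the study.
# Field names and wording are illustrative; the paper's templates may differ.

def build_prompt(essay: str,
                 illustration: str | None = None,
                 title: str | None = None,
                 keywords: list[str] | None = None) -> str:
    parts = []
    if title:
        parts.append(f"Task title: {title}")
    if illustration:
        parts.append(f"The essay retells this four-panel illustration: {illustration}")
    if keywords:
        parts.append("Expected keywords: " + ", ".join(keywords))
    parts.append(f"Essay to score:\n{essay}")
    return "\n\n".join(parts)

# The best GPT-4 configuration combined essay + illustration + title:
#   build_prompt(essay, illustration=panel_description, title=task_title)
# The best OCLM configuration added keywords as well.
```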

Overall, the GPT-4 model demonstrated a high level of agreement with human raters across various writing proficiency criteria, including lexical richness, syntactic complexity, content, and grammatical accuracy. Notably, the agreement coefficient between GPT-4 and human scoring even surpassed the agreement among human raters themselves, highlighting the potential of GPT-4 to enhance AES by reducing biases and subjectivity.

Applications

The paper demonstrated the feasibility and effectiveness of LLM-based AES for non-native Japanese, which can have various applications in language education and assessment. These systems can be used as self-study tools for learners to practice their writing skills and receive instant, personalized feedback. They can also serve as supplementary tools for teachers, reducing their workload and enhancing teaching quality. Moreover, LLM-based AES can function as alternative or complementary tools for standardized tests, providing more valid and reliable scores.

Conclusion

In summary, the study comprehensively explored the capability of LLM-based AES for non-native Japanese writing by comparing the performance of different models and prompts. Among all the LLMs and conventional AES tools evaluated, GPT-4 demonstrated the best performance.

Additionally, prompt design was found to be crucial for achieving accurate and reliable evaluations. Future work should focus on improving the LLMs with more data and fine-tuning, investigating the effect of LLM-based AES on learners’ motivation and performance, and developing more user-friendly and interactive AES systems.


Written by

Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Osama, Muhammad. (2024, June 12). GPT-4 Enhances Japanese Essay Scoring. AZoAi. Retrieved on December 11, 2024 from https://www.azoai.com/news/20240612/GPT-4-Enhances-Japanese-Essay-Scoring.aspx.

  • MLA

    Osama, Muhammad. "GPT-4 Enhances Japanese Essay Scoring". AZoAi. 11 December 2024. <https://www.azoai.com/news/20240612/GPT-4-Enhances-Japanese-Essay-Scoring.aspx>.

  • Chicago

    Osama, Muhammad. "GPT-4 Enhances Japanese Essay Scoring". AZoAi. https://www.azoai.com/news/20240612/GPT-4-Enhances-Japanese-Essay-Scoring.aspx. (accessed December 11, 2024).

  • Harvard

    Osama, Muhammad. 2024. GPT-4 Enhances Japanese Essay Scoring. AZoAi, viewed 11 December 2024, https://www.azoai.com/news/20240612/GPT-4-Enhances-Japanese-Essay-Scoring.aspx.
