JudgeLM: Scalable Language Models for Evaluating Large Language Models

In a study submitted to the arXiv* preprint server, researchers investigated fine-tuning large language models (LLMs) as scalable judges for evaluating LLMs on open-ended benchmarks. As LLMs such as Chat Generative Pre-trained Transformer (ChatGPT) and GPT-4 demonstrate remarkable capabilities in open-ended tasks, evaluating them with existing benchmarks and metrics becomes challenging. To address this, the researchers propose JudgeLM, a family of scalable LLM judges trained to grade the quality of LLM-generated responses.

Study: JudgeLM: Scalable Language Models for Evaluating Large Language Models. Image credit: Generated using DALL·E 3

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Recent advances in foundation models like GPT-3 and T5 have enabled the creation of powerful instruction-tuned LLMs such as ChatGPT and GPT-4. These models exhibit strong few-shot learning abilities across diverse tasks. However, evaluating their open-ended capabilities with existing benchmarks (like SuperGLUE) and metrics (like BLEU) has proven inadequate. Alternative evaluation methods based on human assessments or closed-source LLMs as judges have downsides such as high cost, bias, privacy concerns, and instability. This underscores the need for reproducible, efficient LLM judge models that can accurately evaluate LLMs in open-ended scenarios.

Data Generation

The dataset comprises over 100,000 samples with seed tasks, LLM responses, and judgments from GPT-4. Seed tasks are drawn from diverse sources to ensure heterogeneity. LLM responses are collected from leading models like LLaMA and Vicuna. Judgments include scores and detailed reasoning for response pairs, with and without reference answers. This high-quality data enables training judges to reliably score responses, with or without external reference context.

The data generation process involves three key steps. First, over 100,000 seed tasks are sampled from diverse instruction-tuning datasets to create a heterogeneous set of questions and prompts. Second, responses to these seed tasks are gathered from 11 popular LLMs, including LLaMA, Vicuna, and Alpaca. Third, these responses are fed alongside the seed tasks into GPT-4 to obtain fine-grained scores and reasoning judgments for pairs of responses. Two judgments are collected per response pair: one with and one without reference answers. This yields a rich training source with judgments adaptable to both settings.
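To make this pipeline concrete, the sketch below assembles a single judge-training record in Python. The prompt wording, the field names, and the ask_teacher stub are illustrative assumptions rather than the authors' exact template; a real pipeline would replace the stub with a call to the GPT-4 API and write the records to a JSON-lines file.

```python
"""A minimal sketch of how one judge-training record might be assembled.
The prompt wording, field names, and the ask_teacher stub are illustrative
assumptions, not the authors' exact pipeline or template."""

# Hypothetical grading prompt sent to the teacher model (GPT-4).
JUDGE_TEMPLATE = """[Question]
{question}

[Answer 1]
{answer_1}

[Answer 2]
{answer_2}
{reference_block}
Rate each answer from 1 to 10. Output the two scores on the first line,
then explain your reasoning."""


def ask_teacher(prompt: str) -> str:
    """Stub for the GPT-4 call; a real pipeline would plug in an API client here."""
    raise NotImplementedError


def make_record(question: str, answer_1: str, answer_2: str,
                reference: str | None = None) -> dict:
    """Build one training record for a response pair, optionally with a reference answer."""
    ref_block = f"\n[Reference Answer]\n{reference}\n" if reference else ""
    prompt = JUDGE_TEMPLATE.format(question=question, answer_1=answer_1,
                                   answer_2=answer_2, reference_block=ref_block)
    return {
        "question": question,
        "answer_1": answer_1,
        "answer_2": answer_2,
        "reference": reference,
        "judgment": ask_teacher(prompt),  # e.g. "8 6\nAnswer 1 is more complete because ..."
    }
```

Calling make_record twice per response pair, once with and once without the reference answer, yields the two judgment variants described above.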

Model Training

JudgeLMs are initialized from base LLM checkpoints such as Vicuna and fine-tuned on the judge dataset using templates that frame judging as a grading task. Multiple JudgeLMs ranging from 7B to 33B parameters are trained to analyze size-capability tradeoffs.

The model training process formulates response scoring as an instruction-following task, leveraging the strong few-shot learning capabilities of modern LLMs. JudgeLMs are initialized with weights from base models such as Vicuna-7B or Vicuna-33B, which provide the foundational language skills. They are then fine-tuned on the released dataset using prompt templates that frame judging as grading paired responses. To study scaling trends, JudgeLMs ranging from 7B to 33B parameters are trained. Larger JudgeLMs generally achieve higher performance but at a proportional increase in computing cost.
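As a rough illustration of this training setup, the sketch below fine-tunes a Vicuna-style base checkpoint on such records with the Hugging Face transformers library. The checkpoint name, the judge_data.jsonl file, the prompt format, and the hyperparameters are assumptions made for the example; the authors' actual recipe, including which tokens contribute to the loss, may differ.

```python
"""A minimal supervised fine-tuning sketch using the Hugging Face transformers
library. Checkpoint name, data file, and hyperparameters are illustrative
assumptions, not the authors' recipe."""

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "lmsys/vicuna-7b-v1.5"  # assumed Vicuna base checkpoint; larger variants scale the same way

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-family tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# judge_data.jsonl is a hypothetical JSON-lines file of {"prompt", "judgment"} records.
dataset = load_dataset("json", data_files="judge_data.jsonl", split="train")

def tokenize(example):
    # Concatenate the grading prompt and the teacher judgment into one sequence.
    # (In practice the prompt tokens are often masked out of the loss.)
    text = example["prompt"] + "\n" + example["judgment"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="judgelm-7b", num_train_epochs=3,
                           per_device_train_batch_size=1, bf16=True),
    train_dataset=tokenized,
    # mlm=False gives plain causal-LM loss: labels are a copy of input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Scaling to the 13B or 33B variants amounts to swapping the base checkpoint and adding the usual distributed-training machinery (for example, DeepSpeed or FSDP).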

Evaluation Protocol

JudgeLMs are evaluated on agreement with GPT-4 judgments and consistency when answers are swapped. Position bias, knowledge bias, and format bias are measured to study inherent limitations. Objective metrics like accuracy and subjective metrics like human alignment are reported on existing and new benchmarks.

The judges are evaluated using a rigorous protocol that assesses both objective and subjective capabilities. Quantitative metrics measure agreement with the GPT-4 teacher judgments and consistency when response positions are swapped. The consistency analysis reveals biases such as position bias, knowledge bias, and format bias, providing insight into the judges' reliability limitations.

To measure human alignment, judgments are compared against expert and crowd annotations on existing and new benchmarks. The evaluation spans diverse settings, including scoring single responses, ranking multiple responses, multimodal tasks, and dialogue evaluation. This multi-faceted protocol evaluates JudgeLMs' capabilities holistically.
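The two quantitative checks at the heart of this protocol, agreement with the GPT-4 teacher and consistency under answer swapping, reduce to simple counting once verdicts are extracted. The sketch below assumes verdicts are encoded as "1", "2", or "tie"; that encoding and the function names are illustrative rather than taken from the paper.

```python
"""A small sketch of the two core evaluation metrics: agreement with GPT-4
verdicts and consistency under answer swapping. The verdict encoding
("1", "2", "tie") is an assumed convention."""

def agreement(judge_verdicts: list[str], gpt4_verdicts: list[str]) -> float:
    """Fraction of samples where the judge's verdict matches GPT-4's."""
    matches = sum(j == g for j, g in zip(judge_verdicts, gpt4_verdicts))
    return matches / len(gpt4_verdicts)

def swap_flip(verdict: str) -> str:
    """What an unbiased verdict should become when the two answers are swapped."""
    return {"1": "2", "2": "1", "tie": "tie"}[verdict]

def consistency(original: list[str], swapped: list[str]) -> float:
    """Fraction of pairs judged the same way regardless of answer order;
    a low value indicates position bias."""
    consistent = sum(s == swap_flip(o) for o, s in zip(original, swapped))
    return consistent / len(original)

# Example: three response pairs judged twice, positions swapped the second time.
print(agreement(["1", "2", "tie"], ["1", "1", "tie"]))    # 0.666...
print(consistency(["1", "2", "tie"], ["2", "2", "tie"]))  # 0.666...: the second pair follows position, not content
```

A consistency score well below 1.0, with verdicts that track answer order, is the signature of position bias; analogous comparisons with and without reference answers or with reformatted prompts probe knowledge and format bias.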

Results

JudgeLMs achieve over 90% agreement with GPT-4, exceeding typical human-to-human agreement. Agreement tends to improve with model scale, with the 33B JudgeLM performing the strongest.

JudgeLMs also set new state-of-the-art results on existing judge benchmarks such as PandaLM, with the 33B JudgeLM notably surpassing the accuracy of its GPT-4 teacher.

JudgeLMs also scale efficiently. Owing to their optimized design, a JudgeLM-7B can judge 5000 responses in just three minutes on eight graphics processing units (GPUs), drastically faster than previous methods, and at over 100 times lower cost than GPT-4. These findings underscore the viability of fine-tuned LLM judges for reliable open-ended LLM evaluation.

JudgeLMs address key pain points of human evaluation, such as cost, bias, and scope constraints. Their quantifiable reliability, efficiency, and customizable nature enable autonomous LLM testing. The study also offers insights into the biases that can degrade judge consistency, informing future work on robust judge architectures. Overall, JudgeLMs provide a scalable and reproducible solution for evaluating modern LLMs rapidly and accurately in the wild, and their continued development promises to accelerate the aligned deployment of increasingly capable LLMs.

Future Outlook

This research opens promising directions for future work on LLM judges. Two key priorities are scaling up judge models and their training datasets. Larger judge models, augmented training data, and techniques such as synthetic data generation hold potential to boost capability further. Testing JudgeLMs on broader tasks and investigating their sample efficiency also merit exploration, and hybrid human-JudgeLM loops could improve robustness. Overall, advancing JudgeLMs as an autonomous, low-cost, and unbiased LLM testing framework could profoundly impact the development of aligned LLMs.

Journal reference:
Zhu, L., Wang, X., & Wang, X. (2023). JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv preprint.

Written by

Aryaman Pattnayak

Aryaman Pattnayak is a Tech writer based in Bhubaneswar, India. His academic background is in Computer Science and Engineering. Aryaman is passionate about leveraging technology for innovation and has a keen interest in Artificial Intelligence, Machine Learning, and Data Science.

