In an article recently submitted to the arXiv* server, researchers introduced LiveBench, a benchmark designed to prevent test set contamination and biases from large language model (LLM) judging and human crowdsourcing.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
LiveBench features frequently updated questions drawn from recent sources, automatic scoring against objective ground-truth values, and a wide variety of challenging tasks. The benchmark includes contamination-free variants of tasks from previous benchmarks such as Big-Bench Hard and the Auxiliary Mathematics Problems and Solutions (AMPS) dataset. Evaluations of both closed and open-source models show that even the top models achieve relatively low accuracy.
Related Work
Previous works have introduced several prominent LLM benchmarks relevant to this study. The Hugging Face Open LLM Leaderboard tracks LLM performance but is prone to test set contamination because of its static question set. Benchmarks such as AlpacaEval, MT-Bench, and Arena-Hard rely on LLM judges and therefore inherit their biases and errors, while human-judged benchmarks such as Chatbot Arena are labor-intensive and variable in quality. Other efforts, such as LiveCodeBench (LCB) and the SEAL benchmark, focus on specific tasks or use private questions. In contrast, LiveBench aims to be both comprehensive and continuously updated.
LiveBench Overview
This section introduces LiveBench as a benchmark comprising six categories: math, coding, reasoning, data analysis, instruction following, and language comprehension. Each category includes two to four tasks with questions from recent information sources or more challenging variants of existing benchmarks.
Tasks typically consist of approximately 50 questions whose difficulty ranges from easy to highly demanding, with the goal of an overall success rate of 30-70% across top models. Prompts within each category are tailored to include zero-shot chain-of-thought instructions, require models to make their best guess when unsure, and ask for the final answer in an easily parseable format delimited by double asterisks.
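Because final answers are delimited by double asterisks, scoring can reduce to extracting the marked span and comparing it to a ground-truth value. The snippet below is a minimal sketch of such parsing and exact-match scoring; the function names and the exact-match rule are illustrative assumptions rather than the benchmark's actual scoring code.

```python
import re


def extract_answer(model_output: str) -> str | None:
    """Return the last **...** span in the model output, if any.

    Assumes (illustratively) that the final bolded span holds the answer.
    """
    matches = re.findall(r"\*\*(.+?)\*\*", model_output, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def score_response(model_output: str, ground_truth: str) -> float:
    """Exact-match scoring against an objective ground-truth value."""
    answer = extract_answer(model_output)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0


print(score_response("Reasoning... so the sum is **42**.", "42"))  # 1.0
```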
The math category incorporates questions from recent high-school competitions, fill-in-the-blank problems derived from the proof-based United States of America Mathematical Olympiad (USAMO), and a more demanding variant of the AMPS dataset. The olympiad-style tasks feature questions from recent competitions that assess arithmetic, algebra, geometry, number theory, and more complex mathematical problem-solving skills. Additionally, the AMPS_hard task includes synthetic questions that are more challenging than those in the original AMPS dataset.
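The idea behind AMPS_hard, i.e., synthetic questions with objectively checkable answers whose parameters are drawn to make them harder than the originals, can be illustrated with a small generator. The sketch below (random higher-degree polynomials differentiated with SymPy) is a hypothetical example of this style of question generation, not LiveBench's actual generation code.

```python
import random

import sympy as sp


def make_synthetic_question(rng: random.Random) -> tuple[str, str]:
    """Generate a differentiation question with an objectively checkable answer.

    Hypothetical illustration: "harder" here simply means larger random
    coefficients and degrees than a typical easy item.
    """
    x = sp.symbols("x")
    degree = rng.randint(4, 7)
    expr = sum(rng.randint(-50, 50) * x**k for k in range(degree + 1))
    question = f"Differentiate f(x) = {sp.sstr(expr)} with respect to x."
    answer = sp.sstr(sp.diff(expr, x))
    return question, answer


q, a = make_synthetic_question(random.Random(0))
print(q)
print("Ground truth:", a)
```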
LiveBench's coding category includes two distinct tasks: an adapted version of the code generation task from LiveCodeBench (LCB) and a novel code completion task. The LCB Generation task evaluates a model's ability to interpret and correctly respond to a competition-level coding prompt, using questions derived from the LiveCodeBench collection. The Completion task, meanwhile, measures the model's ability to finish partially provided correct solutions to LeetCode medium and hard problems sourced from GitHub: the final portion of each solution is omitted, and the LLM is prompted to complete it.
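One way to picture the completion task is as truncating a known-correct solution and asking the model to supply the missing tail. The sketch below shows how such an item might be constructed; the cut-point heuristic, prompt wording, and field names are assumptions for illustration, not LiveBench's exact construction.

```python
def make_completion_item(problem_statement: str, full_solution: str,
                         keep_fraction: float = 0.7) -> dict:
    """Build a code-completion item by truncating a known-correct solution.

    Hypothetical construction: keep roughly the first `keep_fraction` of the
    solution's lines and hold back the rest as the reference completion.
    """
    lines = full_solution.splitlines()
    cut = max(1, int(len(lines) * keep_fraction))
    partial = "\n".join(lines[:cut])
    prompt = (
        f"{problem_statement}\n\n"
        "Below is a partial solution. Complete the remaining code so that the "
        "full solution is correct:\n\n"
        f"{partial}\n"
    )
    return {"prompt": prompt, "reference_completion": "\n".join(lines[cut:])}
```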
The reasoning category of LiveBench includes tasks derived from Big-Bench Hard and Zebra Puzzles. The Web of Lies v2 task expands on a challenge from Big-Bench, requiring models to evaluate the truth value of a Boolean function expressed in natural language, with added deductive elements and misleading clues to heighten the difficulty.
Similarly, the Zebra Puzzles task assesses models' ability to follow constraints and logically deduce information using procedurally generated puzzles. In LiveBench's data analysis category, three tasks evaluate the LLM's skills in data manipulation and interpretation: column type annotation, table reformatting, and table join, each testing the model's capability in a different aspect of handling structured data.
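For the table join task, for example, a model might be asked to map column names in one table to their counterparts in another, which admits a simple objective score such as the fraction of correctly mapped columns. The sketch below is a hypothetical illustration of that kind of scoring, not the metric actually used in the paper.

```python
def score_column_mapping(predicted: dict[str, str],
                         ground_truth: dict[str, str]) -> float:
    """Fraction of ground-truth column pairs the model mapped correctly.

    Hypothetical scoring for a table-join style task; the exact metric in
    LiveBench may differ.
    """
    if not ground_truth:
        return 0.0
    correct = sum(1 for col, target in ground_truth.items()
                  if predicted.get(col) == target)
    return correct / len(ground_truth)


print(score_column_mapping(
    {"cust_id": "customer_id", "amt": "amount"},
    {"cust_id": "customer_id", "amt": "amount", "dt": "date"},
))  # 0.666...
```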
LLM Benchmark Evaluation
The experimental setup involves 49 different LLMs, spanning proprietary, large open-source, and small open-source models. These include various versions of generative pre-trained transformer (GPT) models such as GPT-4 and GPT-3.5, Anthropic models such as Claude-3, Mistral models such as mistral-large-2402 and mistral-small-2402, Google's Gemini-1.5 models, and a range of others from the open-source community, such as DeepSeek-Coder-V2 and Phi-3-small-128k-instruct.
The experiments evaluate these models across all 18 LiveBench tasks using standardized evaluation settings with FastChat templates and bfloat16 precision.
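Concretely, evaluating an open-source model under these settings amounts to loading its weights in bfloat16 and wrapping each question in the model's FastChat conversation template before generation. The sketch below illustrates that setup with the Hugging Face transformers and fschat libraries; the model name is only an example, and this is not the authors' evaluation harness.

```python
import torch
from fastchat.model import get_conversation_template
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # example model, not from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Wrap a LiveBench-style question in the model's FastChat chat template.
conv = get_conversation_template(model_name)
question = ("Think step by step, then give only the final answer wrapped in "
            "double asterisks: what is 17 * 23?")
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```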
The analysis then compares LiveBench's results against established benchmarks. Notably, models show varying strengths across benchmarks, with some demonstrating significantly higher performance on one benchmark than on another.
For instance, models like GPT-4-0125-preview and GPT-4-turbo-2024-04-09 exhibit notably stronger results on Arena-Hard, potentially influenced by biases associated with using GPT-4 as the judging LLM. These findings underscore the importance of comprehensively considering benchmark-specific biases and preferences in evaluating LLM capabilities.
Conclusion
To summarize this work, LiveBench was introduced as an LLM benchmark to address issues like test set contamination and reliance on LLM judging and human crowdsourcing. It was the first benchmark to incorporate regularly updated questions sourced from recent information, with difficulty increasing over time. Answers were objectively scored based on ground-truth values, eliminating the need for LLM judges. LiveBench featured various challenging tasks in math, coding, reasoning, language, instruction following, and data analysis.
Future work for LiveBench will expand task repositories to cover emerging artificial intelligence (AI) and natural language processing (NLP) domains. Efforts will refine evaluation methods for enhanced benchmark robustness, fostering collaboration with the research community to drive innovation and advance LLM capabilities.
Journal reference:
- Preliminary scientific report.
White, C., et al. (2024). LiveBench: A Challenging, Contamination-Free LLM Benchmark. arXiv. DOI: 10.48550/arXiv.2406.19314, https://arxiv.org/abs/2406.19314