Evaluating Large Language Models: A Comprehensive Survey

In a paper submitted to the arXiv* server, researchers from Tianjin University in China presented a meticulous taxonomy and comprehensive literature survey appraising large language models (LLMs) across knowledge and capability, alignment, safety, and a broad range of specialized domains.

Study: Evaluating Large Language Models: A Comprehensive Survey. Image credit: Generated using DALL.E.3

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.

Knowledge and Capability Evaluation

The survey delves into assessments to gauge LLMs' knowledge, reasoning, and tool-learning abilities using an array of question-answering, knowledge completion, and reasoning tasks. Question answering (QA) and knowledge completion tasks are quintessential litmus tests for examining the practical application of knowledge ingrained within LLMs.

Prominent benchmarks include the following:

  • For question answering: the Stanford QA Dataset (SQuAD), Narrative QA (NarrativeQA), and Multiple-Choice Temporal Commonsense (MCTACO)
  • For knowledge completion: the LAMA probing tasks (LAMA), WikiFact Verification (WikiFact), and the Knowledge Memorization Task (KoLA).
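To make the scoring concrete, the snippet below sketches the exact-match and token-level F1 metrics commonly used for extractive QA benchmarks such as SQuAD; the normalization rules and example strings are simplified assumptions rather than any benchmark's official evaluation script.

    import re
    import string
    from collections import Counter

    def normalize(text):
        # Lowercase, strip punctuation and articles, collapse whitespace
        # (a simplified version of common SQuAD-style answer normalization).
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction, reference):
        return float(normalize(prediction) == normalize(reference))

    def token_f1(prediction, reference):
        pred_tokens = normalize(prediction).split()
        ref_tokens = normalize(reference).split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    # Hypothetical model output versus a gold answer.
    print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
    print(token_f1("in Paris, France", "Paris"))            # partial credit: 0.5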

Reasoning evaluations encompass commonsense, logical, multi-hop, and mathematical reasoning, providing rigorous assessments of LLMs' meta-reasoning capabilities. Commonsense reasoning datasets include:

  • Social Interaction QA (Social IQA)
  • TimeDial Commonsense Reasoning (TIMEDIAL)
  • Physical Interaction QA (PIQA), which examines the ability to apply situational assumptions and everyday knowledge.
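Because these commonsense benchmarks are largely multiple-choice, evaluation often reduces to comparing the option a model selects against the gold label. The loop below is a minimal, hypothetical accuracy harness; ask_model is a placeholder for whatever prompting or log-likelihood scoring procedure is actually used, and the example item is invented for illustration.

    # Minimal multiple-choice accuracy loop (illustrative only).
    examples = [
        # (question, options, index of the gold answer); toy item, not from PIQA.
        ("To dry wet shoes quickly, you should...",
         ["put them in the freezer", "stuff them with newspaper"], 1),
    ]

    def ask_model(question, options):
        # Placeholder: a real harness would prompt the LLM, or compare
        # per-option log-likelihoods, and return the chosen option index.
        return 1

    correct = sum(ask_model(q, opts) == gold for q, opts, gold in examples)
    print(f"accuracy = {correct / len(examples):.2f}")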

Logical reasoning assessments leverage natural language inference (NLI), reading comprehension, and text generation datasets, including the following:

  • Stanford NLI (SNLI)
  • Reading Comprehension with Logic (ReClor)
  • LogicInference.

Multi-hop reasoning benchmarks like Hotpot QA (HotpotQA) and Hybrid QA (HybridQA) demand traversing chains of facts.

Mathematical reasoning datasets such as the following assess calculation proficiency:

  • Mathematical QA (MathQA)
  • Joint Entrance Examination (JEE) Benchmark (JEEBench)
  • MATH 401 Arithmetic Dataset (MATH 401).
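Scoring such math benchmarks typically means extracting a final numeric answer from the model's free-form response and comparing it with the reference within a tolerance. The helper below is one hedged way to do that; the regular expression and tolerance are assumptions, not any dataset's official checker.

    import re

    def extract_number(text):
        # Take the last number mentioned in the model's answer (a common heuristic).
        matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return float(matches[-1]) if matches else None

    def math_correct(model_output, reference, tol=1e-6):
        value = extract_number(model_output)
        return value is not None and abs(value - float(reference)) <= tol

    print(math_correct("The total is 1,234 apples.", "1234"))  # True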

Toxicity is evaluated using the following datasets:

  • Offensive Language Identification Dataset (OLID)
  • Sarcasm, Offensive and Toxic Language Identification (SOLID)
  • Korean Dataset for Offensive Language Identification (KODOLI)
  • RealToxicityPrompts
  • Harmful Questions (HarmfulQ)
  • QA and question generation (QAQG)
  • Adversarial General Language Understanding Evaluation benchmark (AdvGLUE).
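Benchmarks such as RealToxicityPrompts are usually scored by passing model continuations through an external toxicity classifier and reporting how often outputs cross a toxicity threshold. The sketch below assumes the transformers library and the publicly available unitary/toxic-bert checkpoint; the label name and 0.5 threshold are illustrative assumptions, not the survey's protocol.

    # Hedged sketch: rate model continuations with an off-the-shelf toxicity classifier.
    from transformers import pipeline

    toxicity_clf = pipeline("text-classification", model="unitary/toxic-bert")

    continuations = [
        "Thanks for asking; here is a polite answer.",
        "Some hypothetical model continuation to be checked.",
    ]

    # top_k=None returns a score for every label; the "toxic" label name is
    # specific to this checkpoint and should be treated as an assumption.
    all_scores = toxicity_clf(continuations, top_k=None)

    def is_toxic(label_scores, threshold=0.5):
        return any(r["label"] == "toxic" and r["score"] > threshold for r in label_scores)

    toxic_rate = sum(is_toxic(scores) for scores in all_scores) / len(continuations)
    print(f"fraction flagged toxic: {toxic_rate:.2f}")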

Recent works have also appraised LLMs' burgeoning proficiency in tool use for search engines, online shopping, robotic control, and more. Evaluations focus on tool manipulation using existing datasets and specialized benchmarks like ToolAlpaca. Tool creation skills are evaluated through datasets covering diverse problem-solving scenarios.
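A common way to check tool manipulation automatically is to compare the tool call a model emits against a reference call. The comparison below is a purely hypothetical sketch of that idea; the JSON call format and field names are assumptions, not the schema of ToolAlpaca or any other benchmark.

    import json

    def tool_call_matches(model_output, expected):
        # Parse the model's emitted tool call (assumed to be JSON) and compare
        # the tool name and arguments against the reference call.
        try:
            call = json.loads(model_output)
        except json.JSONDecodeError:
            return False
        return (call.get("tool") == expected["tool"]
                and call.get("arguments") == expected["arguments"])

    expected = {"tool": "search", "arguments": {"query": "weather in Tianjin"}}
    model_output = '{"tool": "search", "arguments": {"query": "weather in Tianjin"}}'
    print(tool_call_matches(model_output, expected))  # True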

Alignment Evaluation

To ascertain that LLM-generated content remains aligned with expectations, the survey provides an exhaustive exploration of ethics, bias, toxicity, and truthfulness assessments. In ethics evaluations, datasets apply guidelines defined by experts, crowdsourcing, or AI assistance. Moral foundations theory offers an expert ethos, while crowdsourced judgments from Reddit forums provide a more democratic approach. Recent efforts use LLMs to aid dataset creation through drafted scenarios and annotations.

Bias evaluations scrutinize societal biases propagated through various downstream tasks, including coreference resolution, translation, sentiment analysis, and relation extraction. Datasets are purpose-built to uncover biases in LLMs, analyze preferences reflecting stereotypes, and employ automatic metrics, human ratings, and multiple-choice questions.
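One widely used automatic metric compares a model's preference (for example, its log-likelihood) for a stereotypical sentence against an anti-stereotypical counterpart. The sketch below uses GPT-2 purely as a stand-in scorer, and the sentence pair is a toy template rather than an item from a real bias dataset.

    # Hedged sketch: stereotype-preference rate from sentence log-probabilities.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def sentence_logprob(sentence):
        # Total log-probability of the sentence under the language model.
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)
        # out.loss is the mean negative log-likelihood over the predicted tokens.
        return -out.loss.item() * (ids.shape[1] - 1)

    pairs = [("The doctor said he was busy.", "The doctor said she was busy.")]
    prefers = sum(sentence_logprob(s) > sentence_logprob(a) for s, a in pairs)
    print(f"stereotype preference rate: {prefers / len(pairs):.2f}")

Across a large set of template pairs, a rate near 0.5 would indicate no systematic preference.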

Truthfulness evaluations utilize question-answering, dialogue, and summarization datasets containing unanswerable questions or annotations of factual consistency. NLI, QAQG, and LLM-based methods automatically verify the factual accuracy between system outputs and source texts.
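One hedged way to implement such a check is to cast it as NLI, treating the source text as the premise and the generated statement as the hypothesis, and requiring a confident entailment prediction. The sketch below assumes the transformers library and the public roberta-large-mnli checkpoint; the 0.5 threshold and label names are tied to that checkpoint and should be read as assumptions.

    # Hedged sketch: NLI-based factual-consistency check between a source text
    # and a model-generated claim.
    from transformers import pipeline

    nli = pipeline("text-classification", model="roberta-large-mnli")

    source = "The survey was written by researchers from Tianjin University."
    claim = "Researchers at Tianjin University authored the survey."

    # Recent transformers versions accept premise/hypothesis pairs as dicts.
    results = nli([{"text": source, "text_pair": claim}])
    result = results[0]
    consistent = result["label"] == "ENTAILMENT" and result["score"] > 0.5
    print(result, consistent)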

Safety Evaluation

The survey discusses meticulous LLM safety evaluations from the perspective of robustness and risks. Robustness benchmarks assess model stability against disturbances in prompts, tasks, and alignment. Prompt robustness is gauged using adversarial prompts and typos. Task robustness extends across translation, QA, classification, and more using AdvGLUE and perturbed versions of existing datasets. Alignment robustness is evaluated by measuring vulnerability to jailbreak prompts that elicit unethical behavior.
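Prompt robustness, in particular, is often quantified by perturbing an input (for example with character-level typos) and measuring how much task accuracy drops relative to the clean prompt. The perturbation below is a simple illustration; the swap rate and the idea of differencing clean and perturbed accuracy describe a typical setup, not the survey's specific procedure.

    import random

    def add_typos(text, rate=0.05, seed=0):
        # Randomly swap adjacent letters to simulate noisy user input.
        rng = random.Random(seed)
        chars = list(text)
        for i in range(len(chars) - 1):
            if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    prompt = "Summarize the following article in one sentence."
    print(add_typos(prompt, rate=0.2))

    # A robustness score could then be reported as accuracy on clean prompts
    # minus accuracy on perturbed prompts, using whatever task metric applies.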

Emerging works have begun evaluating risks as LLMs approach advanced capabilities. Behaviors such as power-seeking and situational awareness are surfaced through QA datasets and environment simulations. LLMs are also assessed as agents in interactive environments using toolkits to build customized simulations.

Specialized LLM Evaluation

The survey highlights specialized LLM evaluations tailored for applications in biology, medicine, education, law, computer science, and finance.

In biology and medicine, evaluations include real-world medical licensing exams, scientific question-answering datasets, and human assessments of diagnostic quality. Evaluations also emulate real-world scenarios like patient triage, consultation, and evidence summarization.

Education evaluations assess teaching skills using ratings of pedagogical dialogue quality and feedback generation. Learning is evaluated by comparing LLM-generated hints and feedback to human tutors. Legal evaluations include bar exam assessments and legal reading comprehension datasets. LLMs are also evaluated on real applications such as legal judgment summarization and term explanations using both automatic metrics and lawyers' assessments.

Computer science evaluations focus on code generation, using functional correctness tests and human assessments of generated programs and explanations. Finance evaluations utilize exams, domain-specialized QA, and rating conversations with a financial robo-advisor.
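Functional correctness for code generation is commonly reported as pass@k: the probability that at least one of k sampled programs passes all unit tests. The helper below implements the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) from n samples with c passes; the example numbers are illustrative.

    from math import comb

    def pass_at_k(n, c, k):
        # n generated samples per problem, c of which pass all unit tests.
        # Unbiased estimate of P(at least one of k random samples passes).
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 10 samples per problem, 3 of them passed the tests.
    print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
    print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917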

Future Outlook

The survey concludes by underscoring promising directions for LLM evaluation research. Proposed future trajectories include comprehensive risk evaluation through controlled simulations, thorough agent testing in diverse interactive environments, dynamic benchmarking with rapidly updated tests, and enhancement-oriented assessments that provide actionable feedback.

The exhaustive taxonomy and detailed literature review aim to galvanize progress in LLM evaluation, guiding judicious development to maximize societal benefit. The authors reiterate that rigorous, comprehensive evaluation frameworks will be integral to cultivating LLMs that are helpful, harmless, and honest.

Journal reference: Guo, Z., et al. (2023). Evaluating Large Language Models: A Comprehensive Survey. arXiv preprint arXiv:2310.19736.

Written by

Aryaman Pattnayak

Aryaman Pattnayak is a Tech writer based in Bhubaneswar, India. His academic background is in Computer Science and Engineering. Aryaman is passionate about leveraging technology for innovation and has a keen interest in Artificial Intelligence, Machine Learning, and Data Science.
