In a paper submitted to the arXiv* server, researchers from Tianjin University in China presented a meticulous taxonomy and comprehensive literature survey appraising and scrutinizing large language models (LLMs) across a broad spectrum of capabilities, dimensions, and specialized domains.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Knowledge and Capability Evaluation
The survey delves into assessments to gauge LLMs' knowledge, reasoning, and tool-learning abilities using an array of question-answering, knowledge completion, and reasoning tasks. Question answering (QA) and knowledge completion tasks are quintessential litmus tests for examining the practical application of knowledge ingrained within LLMs.
Prominent benchmarks include the following:
- Stanford Question Answering Dataset (SQuAD)
- Narrative QA (NarrativeQA)
- Multiple-Choice Temporal Commonsense (MCTACO)
- LAMA probing tasks (LAMA)
- WikiFact Verification (WikiFact)
- Knowledge Memorization Task (KoLA)
SQuAD, NarrativeQA, and MCTACO are used to probe question-answering proficiency, while LAMA, WikiFact, and KoLA evaluate knowledge completion (a minimal QA scoring sketch follows the list).
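To make this concrete, the sketch below shows one common way a SQuAD-style benchmark is scored: validation questions are posed to the model and its answers are compared against the references with exact-match and F1 metrics. It is a minimal illustration using the Hugging Face `datasets` and `evaluate` libraries, with a placeholder `ask_llm` function standing in for whatever model is being tested; it is not the survey authors' own pipeline.

```python
# Minimal sketch of SQuAD-style QA scoring (not the survey's own code).
# Assumes the Hugging Face `datasets` and `evaluate` libraries; `ask_llm`
# is a hypothetical stand-in for the model under evaluation.
from datasets import load_dataset
import evaluate

def ask_llm(question: str, context: str) -> str:
    # Placeholder for a real LLM call; as a trivial baseline, echo the
    # first sentence of the context.
    return context.split(".")[0]

squad = load_dataset("squad", split="validation[:50]")  # small slice for illustration
metric = evaluate.load("squad")                          # reports exact match and F1

predictions, references = [], []
for ex in squad:
    answer = ask_llm(ex["question"], ex["context"])
    predictions.append({"id": ex["id"], "prediction_text": answer})
    references.append({"id": ex["id"], "answers": ex["answers"]})

print(metric.compute(predictions=predictions, references=references))
```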
Reasoning evaluations encompass commonsense, logical, multi-hop, and mathematical reasoning, providing rigorous assessments of LLMs' meta-reasoning capabilities. Commonsense reasoning datasets include:
- Social Interaction QA (Social IQA)
- TimeDial Commonsense Reasoning (TIMEDIAL)
- Physical Interaction QA (PIQA)
Together, these datasets examine the ability to apply situational assumptions and everyday knowledge; a likelihood-based scoring sketch for such multiple-choice items appears below.
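A common way to score multiple-choice commonsense items is to compare the likelihood the model assigns to each candidate answer. The sketch below illustrates this with GPT-2 as a stand-in for the model under evaluation and a PIQA-style item; averaging the loss over the full prompt-plus-option sequence is a simplification, not the survey's prescribed protocol.

```python
# Likelihood-based scoring for a PIQA-style multiple-choice item. GPT-2 is a
# stand-in for the model under test; averaging loss over the whole sequence
# (prompt plus option) is a simplification of length-normalized scoring.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_loss(prompt: str, option: str) -> float:
    """Average per-token cross-entropy of `prompt + option` under the model."""
    ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

goal = "To separate egg whites from the yolk, you should"
options = [
    "use a squeeze bottle to suck up the yolk.",   # plausible solution
    "pour the egg into a hot frying pan.",         # physically unhelpful
]
prediction = min(options, key=lambda opt: option_loss(goal, opt))
print("Model prefers:", prediction)
```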
Logical reasoning assessments leverage natural language inference (NLI), reading comprehension, and text generation datasets, including the following:
- Stanford NLI (SNLI)
- Reading Comprehension with Logic (ReClor)
- LogicInference.
Multi-hop reasoning benchmarks like Hotpot QA (HotpotQA) and Hybrid QA (HybridQA) demand traversing chains of facts.
Mathematical reasoning datasets such as the following assess calculation proficiency (a minimal answer-checking sketch appears after the list):
- Mathematical QA (MathQA)
- Joint Entrance Examination (JEE) Benchmark (JEEBench)
- MATH 401 Arithmetic Dataset (MATH 401).
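Arithmetic-style datasets such as MATH 401 are typically scored by extracting the model's final numeric answer and comparing it to the reference. The sketch below illustrates this kind of answer checking; the extraction heuristic, tolerance, and toy data are illustrative assumptions rather than the survey's exact procedure.

```python
# Sketch of arithmetic-style answer checking for datasets like MATH 401:
# pull the last number out of the model's output and compare it to the
# reference within a tolerance. Heuristic and data are illustrative only.
import re

def last_number(text: str) -> float | None:
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def arithmetic_accuracy(outputs: list[str], references: list[float],
                        tol: float = 1e-6) -> float:
    correct = 0
    for output, ref in zip(outputs, references):
        pred = last_number(output)
        if pred is not None and abs(pred - ref) <= tol:
            correct += 1
    return correct / len(references)

# Toy usage with made-up outputs: accuracy is 0.5 here.
print(arithmetic_accuracy(["So the result is 42.", "I get 3.5"], [42.0, 3.0]))
```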
The survey also covers toxicity detection, which draws on the following datasets (an automatic scoring sketch appears after the list):
- Offensive Language Identification Dataset (OLID)
- Sarcasm, Offensive and Toxic Language Identification (SOLID)
- Korean Dataset for Offensive Language Identification (KODOLI)
- RealToxicityPrompts
- Harmful Questions (HarmfulQ)
- QA and question generation (QAQG)
- Adversarial General Language Understanding Evaluation benchmark (AdvGLUE).
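Toxicity on prompts such as RealToxicityPrompts is usually quantified by running an automatic classifier over model continuations. The sketch below uses the toxicity measurement from the Hugging Face `evaluate` library (which wraps a hate-speech classifier) on harmless placeholder continuations; it is an illustration, not any benchmark's official scoring code.

```python
# Sketch of automatic toxicity scoring for model continuations, in the spirit
# of RealToxicityPrompts-style evaluation. Uses the `toxicity` measurement from
# the Hugging Face `evaluate` library (which wraps a hate-speech classifier);
# the continuations below are harmless placeholders.
import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

continuations = [
    "Thanks for asking; here is a polite and helpful answer.",
    "An example continuation produced by the model under test.",
]

scores = toxicity.compute(predictions=continuations)["toxicity"]
print({"mean_toxicity": sum(scores) / len(scores), "max_toxicity": max(scores)})
```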
Recent works have also appraised LLMs' burgeoning proficiency in tool use for search engines, online shopping, robotic control, and more. Evaluations focus on tool manipulation using existing datasets and specialized benchmarks like ToolAlpaca. Tool creation skills are evaluated through datasets covering diverse problem-solving scenarios.
Alignment Evaluation
To ascertain that LLM-generated content remains aligned with human expectations, the survey provides an exhaustive exploration of ethics, bias, toxicity, and truthfulness assessments. In ethics evaluations, datasets apply guidelines defined by experts, crowdsourcing, or AI assistance. Moral foundations theory offers an expert-defined framework, while crowdsourced judgments from Reddit forums provide a more democratic approach. Recent efforts use LLMs to aid dataset creation by drafting scenarios and annotations.
Bias evaluations scrutinize societal biases propagated through various downstream tasks, including coreference resolution, translation, sentiment analysis, and relation extraction. Datasets are purpose-built to uncover biases in LLMs, analyze preferences reflecting stereotypes, and employ automatic metrics, human ratings, and multiple-choice questions.
Truthfulness evaluations utilize question-answering, dialogue, and summarization datasets containing unanswerable questions or annotations of factual consistency. NLI-based, QAQG-based, and LLM-based methods automatically verify factual consistency between system outputs and source texts; a minimal NLI-based sketch appears below.
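As an illustration of the NLI route, the sketch below checks whether a source text entails each sentence of a generated summary and reports the entailed fraction. The choice of roberta-large-mnli and the sentence-level aggregation are assumptions made for this example, not the survey's prescribed setup.

```python
# Sketch of NLI-based factual-consistency checking: does the source entail each
# sentence of the generated summary? The model choice (roberta-large-mnli) and
# the entailed-fraction aggregation are assumptions made for this example.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def consistency_score(source: str, summary_sentences: list[str]) -> float:
    """Fraction of summary sentences judged to be entailed by the source."""
    pairs = [{"text": source, "text_pair": s} for s in summary_sentences]
    results = nli(pairs)
    entailed = sum(1 for r in results if r["label"] == "ENTAILMENT")
    return entailed / len(summary_sentences)

source = "The committee met on Monday and approved the budget for next year."
summary = ["The budget was approved.", "The committee rejected all proposals."]
print(consistency_score(source, summary))  # expected to flag the second sentence
```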
Safety Evaluation
The survey discusses meticulous LLM safety evaluations from the perspective of robustness and risks. Robustness benchmarks assess model stability against disturbances in prompts, tasks, and alignment. Prompt robustness is gauged using adversarial prompts and typos. Task robustness extends across translation, QA, classification, and more using AdvGLUE and perturbed versions of existing datasets. Alignment robustness is evaluated by measuring vulnerability to jailbreak prompts that elicit unethical behavior.
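A simple way to probe prompt robustness is to compare accuracy on clean prompts against typo-perturbed versions of the same prompts. The sketch below implements one such perturbation (random adjacent-character swaps) and an accuracy-gap measure; both the perturbation scheme and the hypothetical `ask_llm` callable are illustrative assumptions, not the survey's method.

```python
# Sketch of prompt-robustness probing: swap a few adjacent characters to
# simulate typos and compare exact-match accuracy on clean vs. perturbed
# prompts. The perturbation scheme and the `ask_llm` callable are illustrative.
import random

def add_typos(prompt: str, n_swaps: int = 2, seed: int = 0) -> str:
    if len(prompt) < 2:
        return prompt
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_gap(prompts: list[str], answers: list[str], ask_llm) -> float:
    """Drop in exact-match accuracy when prompts are typo-perturbed."""
    clean = sum(ask_llm(p).strip() == a for p, a in zip(prompts, answers))
    noisy = sum(ask_llm(add_typos(p)).strip() == a for p, a in zip(prompts, answers))
    return (clean - noisy) / len(prompts)
```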
Emerging works have begun evaluating risks as LLMs approach advanced capabilities. Behaviors such as power-seeking and situational awareness are surfaced through QA datasets and environment simulations. LLMs are also assessed as agents in interactive environments using toolkits to build customized simulations.
Specialized LLM Evaluation
The survey highlights specialized LLM evaluations tailored for applications in biology, medicine, education, law, computer science, and finance.
In biology and medicine, evaluations include real-world medical licensing exams, scientific question-answering datasets, and human assessments of diagnostic quality. Evaluations also emulate real-world scenarios like patient triage, consultation, and evidence summarization.
Education evaluations assess teaching skills using ratings of pedagogical dialogue quality and feedback generation. Learning is evaluated by comparing LLM-generated hints and feedback to human tutors. Legal evaluations include bar exam assessments and legal reading comprehension datasets. LLMs are also evaluated on real applications such as legal judgment summarization and term explanations using both automatic metrics and lawyers' assessments.
Computer science evaluations focus on code generation, using functional correctness tests and human assessments of generated programs and explanations. Finance evaluations utilize exams, domain-specialized QA, and rating conversations with a financial robo-advisor.
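Functional-correctness testing of generated code typically follows a HumanEval-style recipe: execute each generated solution against hidden unit tests and report the fraction that passes. The simplified sketch below conveys the idea; real harnesses run untrusted code in a sandboxed subprocess with timeouts, which this illustration omits.

```python
# Simplified HumanEval-style functional-correctness check: run each generated
# solution against assert-based unit tests and report the passing fraction.
# Real harnesses execute untrusted code in a sandboxed subprocess with timeouts;
# do not run model-generated code this way outside an isolated environment.
def passes_tests(generated_code: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the assert-based tests against it
        return True
    except Exception:
        return False

def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """Each sample pairs one generated solution with its test suite."""
    return sum(passes_tests(code, tests) for code, tests in samples) / len(samples)

# Toy usage with a hand-written "generated" solution.
sample = ("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5\n")
print(pass_at_1([sample]))  # 1.0
```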
Future Outlook
The survey concludes by underscoring promising directions for LLM evaluation research. Proposed future trajectories include comprehensive risk evaluation through controlled simulations, thorough agent testing in diverse interactive environments, dynamic benchmarking with rapidly updated tests, and enhancement-oriented assessments that provide actionable feedback.
The exhaustive taxonomy and detailed literature review aim to galvanize progress in LLM evaluation, guiding judicious development to maximize societal benefit. The authors reiterate that rigorous, comprehensive evaluation frameworks will be integral to cultivating LLMs that are helpful, harmless, and honest.
Journal reference:
- Preliminary scientific report.
Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B., & Xiong, D. (2023, October 31). Evaluating Large Language Models: A Comprehensive Survey. arXiv. https://doi.org/10.48550/arXiv.2310.19736, https://arxiv.org/abs/2310.19736