In a paper submitted to the arXiv* server, researchers from Tianjin University in China presented a meticulous taxonomy and comprehensive literature survey appraising and scrutinizing large language models (LLMs) across a broad spectrum of capabilities, dimensions, and specialized domains.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Knowledge and Capability Evaluation
The survey delves into assessments to gauge LLMs' knowledge, reasoning, and tool-learning abilities using an array of question-answering, knowledge completion, and reasoning tasks. Question answering (QA) and knowledge completion tasks are quintessential litmus tests for examining the practical application of knowledge ingrained within LLMs.
Prominent benchmarks include the following:
- Stanford Question Answering Dataset (SQuAD)
- Narrative QA (NarrativeQA)
- Multiple-Choice Temporal Commonsense (MCTACO)
- LAMA probing tasks (LAMA)
- WikiFact Verification (WikiFact)
- Knowledge Memorization Task (KoLA)
SQuAD, NarrativeQA, and MCTACO are used to probe question-answering proficiency, while LAMA, WikiFact, and KoLA evaluate knowledge completion (a minimal QA scoring sketch follows the list).
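To make this concrete, the sketch below shows one common way a SQuAD-style benchmark is scored: validation questions are posed to the model and its answers are compared against the references with exact-match and F1 metrics. It is a minimal illustration using the Hugging Face `datasets` and `evaluate` libraries, with a placeholder `ask_llm` function standing in for whatever model is being tested; it is not the survey authors' own pipeline.

```python
# Minimal sketch of SQuAD-style QA scoring (not the survey's own code).
# Assumes the Hugging Face `datasets` and `evaluate` libraries; `ask_llm`
# is a hypothetical stand-in for the model under evaluation.
from datasets import load_dataset
import evaluate

def ask_llm(question: str, context: str) -> str:
    # Placeholder for a real LLM call; as a trivial baseline, echo the
    # first sentence of the context.
    return context.split(".")[0]

squad = load_dataset("squad", split="validation[:50]")  # small slice for illustration
metric = evaluate.load("squad")                          # reports exact match and F1

predictions, references = [], []
for ex in squad:
    answer = ask_llm(ex["question"], ex["context"])
    predictions.append({"id": ex["id"], "prediction_text": answer})
    references.append({"id": ex["id"], "answers": ex["answers"]})

print(metric.compute(predictions=predictions, references=references))
```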
Reasoning evaluations encompass commonsense, logical, multi-hop, and mathematical reasoning, providing rigorous assessments of LLMs' meta-reasoning capabilities. Commonsense reasoning datasets include:
- Social Interaction QA (Social IQA)
- TimeDial Commonsense Reasoning (TIMEDIAL)
- Physical Interaction QA (PIQA)
Together, these datasets examine the ability to apply situational assumptions and everyday knowledge; a likelihood-based scoring sketch for such multiple-choice items appears below.
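A common way to score multiple-choice commonsense items is to compare the likelihood the model assigns to each candidate answer. The sketch below illustrates this with GPT-2 as a stand-in for the model under evaluation and a PIQA-style item; averaging the loss over the full prompt-plus-option sequence is a simplification, not the survey's prescribed protocol.

```python
# Likelihood-based scoring for a PIQA-style multiple-choice item. GPT-2 is a
# stand-in for the model under test; averaging loss over the whole sequence
# (prompt plus option) is a simplification of length-normalized scoring.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_loss(prompt: str, option: str) -> float:
    """Average per-token cross-entropy of `prompt + option` under the model."""
    ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

goal = "To separate egg whites from the yolk, you should"
options = [
    "use a squeeze bottle to suck up the yolk.",   # plausible solution
    "pour the egg into a hot frying pan.",         # physically unhelpful
]
prediction = min(options, key=lambda opt: option_loss(goal, opt))
print("Model prefers:", prediction)
```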
Logical reasoning assessments leverage natural language inference (NLI), reading comprehension, and text generation datasets, including the following:
- Stanford NLI (SNLI)
- Reading Comprehension with Logic (ReClor)
- LogicInference.
Multi-hop reasoning benchmarks like Hotpot QA (HotpotQA) and Hybrid QA (HybridQA) demand traversing chains of facts.
Mathematical reasoning datasets such as the following assess calculation proficiency (a minimal answer-checking sketch appears after the list):
- Mathematical QA (MathQA)
- Joint Entrance Examination (JEE) Benchmark (JEEBench)
- MATH 401 Arithmetic Dataset (MATH 401).
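Arithmetic-style datasets such as MATH 401 are typically scored by extracting the model's final numeric answer and comparing it to the reference. The sketch below illustrates this kind of answer checking; the extraction heuristic, tolerance, and toy data are illustrative assumptions rather than the survey's exact procedure.

```python
# Sketch of arithmetic-style answer checking for datasets like MATH 401:
# pull the last number out of the model's output and compare it to the
# reference within a tolerance. Heuristic and data are illustrative only.
import re

def last_number(text: str) -> float | None:
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def arithmetic_accuracy(outputs: list[str], references: list[float],
                        tol: float = 1e-6) -> float:
    correct = 0
    for output, ref in zip(outputs, references):
        pred = last_number(output)
        if pred is not None and abs(pred - ref) <= tol:
            correct += 1
    return correct / len(references)

# Toy usage with made-up outputs: accuracy is 0.5 here.
print(arithmetic_accuracy(["So the result is 42.", "I get 3.5"], [42.0, 3.0]))
```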
The survey also covers toxicity detection, which draws on the following datasets (an automatic scoring sketch appears after the list):
- Offensive Language Identification Dataset (OLID)
- Sarcasm, Offensive and Toxic Language Identification (SOLID)
- Korean Dataset for Offensive Language Identification (KODOLI)
- RealToxicityPrompts
- Harmful Questions (HarmfulQ)
- QA and question generation (QAQG)
- Adversarial General Language Understanding Evaluation benchmark (AdvGLUE).
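Toxicity on prompts such as RealToxicityPrompts is usually quantified by running an automatic classifier over model continuations. The sketch below uses the toxicity measurement from the Hugging Face `evaluate` library (which wraps a hate-speech classifier) on harmless placeholder continuations; it is an illustration, not any benchmark's official scoring code.

```python
# Sketch of automatic toxicity scoring for model continuations, in the spirit
# of RealToxicityPrompts-style evaluation. Uses the `toxicity` measurement from
# the Hugging Face `evaluate` library (which wraps a hate-speech classifier);
# the continuations below are harmless placeholders.
import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

continuations = [
    "Thanks for asking; here is a polite and helpful answer.",
    "An example continuation produced by the model under test.",
]

scores = toxicity.compute(predictions=continuations)["toxicity"]
print({"mean_toxicity": sum(scores) / len(scores), "max_toxicity": max(scores)})
```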
Recent works have also appraised LLMs' burgeoning proficiency in tool use for search engines, online shopping, robotic control, and more. Evaluations focus on tool manipulation using existing datasets and specialized benchmarks like ToolAlpaca. Tool creation skills are evaluated through datasets covering diverse problem-solving scenarios.
Alignment Evaluation
To ascertain that LLM-generated content remains aligned with human expectations, the survey provides an exhaustive exploration of ethics, bias, toxicity, and truthfulness assessments. In ethics evaluations, datasets apply guidelines defined by experts, crowdsourcing, or AI assistance. Moral foundations theory offers an expert-defined framework, while crowdsourced judgments from Reddit forums provide a more democratic approach. Recent efforts use LLMs to aid dataset creation by drafting scenarios and annotations.
Bias evaluations scrutinize societal biases propagated through various downstream tasks, including coreference resolution, translation, sentiment analysis, and relation extraction. Datasets are purpose-built to uncover biases in LLMs, analyze preferences reflecting stereotypes, and employ automatic metrics, human ratings, and multiple-choice questions.
Truthfulness evaluations utilize question-answering, dialogue, and summarization datasets containing unanswerable questions or annotations of factual consistency. NLI-based, QAQG-based, and LLM-based methods automatically verify factual consistency between system outputs and source texts; a minimal NLI-based sketch appears below.
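As an illustration of the NLI route, the sketch below checks whether a source text entails each sentence of a generated summary and reports the entailed fraction. The choice of roberta-large-mnli and the sentence-level aggregation are assumptions made for this example, not the survey's prescribed setup.

```python
# Sketch of NLI-based factual-consistency checking: does the source entail each
# sentence of the generated summary? The model choice (roberta-large-mnli) and
# the entailed-fraction aggregation are assumptions made for this example.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def consistency_score(source: str, summary_sentences: list[str]) -> float:
    """Fraction of summary sentences judged to be entailed by the source."""
    pairs = [{"text": source, "text_pair": s} for s in summary_sentences]
    results = nli(pairs)
    entailed = sum(1 for r in results if r["label"] == "ENTAILMENT")
    return entailed / len(summary_sentences)

source = "The committee met on Monday and approved the budget for next year."
summary = ["The budget was approved.", "The committee rejected all proposals."]
print(consistency_score(source, summary))  # expected to flag the second sentence
```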
Safety Evaluation
The survey discusses meticulous LLM safety evaluations from the perspective of robustness and risks. Robustness benchmarks assess model stability against disturbances in prompts, tasks, and alignment. Prompt robustness is gauged using adversarial prompts and typos. Task robustness extends across translation, QA, classification, and more using AdvGLUE and perturbed versions of existing datasets. Alignment robustness is evaluated by measuring vulnerability to jailbreak prompts that elicit unethical behavior.
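A simple way to probe prompt robustness is to compare accuracy on clean prompts against typo-perturbed versions of the same prompts. The sketch below implements one such perturbation (random adjacent-character swaps) and an accuracy-gap measure; both the perturbation scheme and the hypothetical `ask_llm` callable are illustrative assumptions, not the survey's method.

```python
# Sketch of prompt-robustness probing: swap a few adjacent characters to
# simulate typos and compare exact-match accuracy on clean vs. perturbed
# prompts. The perturbation scheme and the `ask_llm` callable are illustrative.
import random

def add_typos(prompt: str, n_swaps: int = 2, seed: int = 0) -> str:
    if len(prompt) < 2:
        return prompt
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_gap(prompts: list[str], answers: list[str], ask_llm) -> float:
    """Drop in exact-match accuracy when prompts are typo-perturbed."""
    clean = sum(ask_llm(p).strip() == a for p, a in zip(prompts, answers))
    noisy = sum(ask_llm(add_typos(p)).strip() == a for p, a in zip(prompts, answers))
    return (clean - noisy) / len(prompts)
```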
Emerging works have begun evaluating risks as LLMs approach advanced capabilities. Behaviors such as power-seeking and situational awareness are surfaced through QA datasets and environment simulations. LLMs are also assessed as agents in interactive environments using toolkits to build customized simulations.
Specialized LLM Evaluation
The survey highlights specialized LLM evaluations tailored for applications in biology, medicine, education, law, computer science, and finance.
In biology and medicine, evaluations include real-world medical licensing exams, scientific question-answering datasets, and human assessments of diagnostic quality. Evaluations also emulate real-world scenarios like patient triage, consultation, and evidence summarization.
Education evaluations assess teaching skills using ratings of pedagogical dialogue quality and feedback generation. Learning is evaluated by comparing LLM-generated hints and feedback to human tutors. Legal evaluations include bar exam assessments and legal reading comprehension datasets. LLMs are also evaluated on real applications such as legal judgment summarization and term explanations using both automatic metrics and lawyers' assessments.
Computer science evaluations focus on code generation, using functional correctness tests and human assessments of generated programs and explanations. Finance evaluations utilize exams, domain-specialized QA, and rating conversations with a financial robo-advisor.
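Functional-correctness testing of generated code typically follows a HumanEval-style recipe: execute each generated solution against hidden unit tests and report the fraction that passes. The simplified sketch below conveys the idea; real harnesses run untrusted code in a sandboxed subprocess with timeouts, which this illustration omits.

```python
# Simplified HumanEval-style functional-correctness check: run each generated
# solution against assert-based unit tests and report the passing fraction.
# Real harnesses execute untrusted code in a sandboxed subprocess with timeouts;
# do not run model-generated code this way outside an isolated environment.
def passes_tests(generated_code: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the assert-based tests against it
        return True
    except Exception:
        return False

def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """Each sample pairs one generated solution with its test suite."""
    return sum(passes_tests(code, tests) for code, tests in samples) / len(samples)

# Toy usage with a hand-written "generated" solution.
sample = ("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5\n")
print(pass_at_1([sample]))  # 1.0
```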
Future Outlook
The survey concludes by underscoring promising directions for LLM evaluation research. Proposed future trajectories include comprehensive risk evaluation through controlled simulations, thorough agent testing in diverse interactive environments, dynamic benchmarking with rapidly updated tests, and enhancement-oriented assessments that provide actionable feedback.
The exhaustive taxonomy and detailed literature review aim to galvanize progress in LLM evaluation, guiding judicious development to maximize societal benefit. The authors reiterate that rigorous, comprehensive evaluation frameworks will be integral to cultivating LLMs that are helpful, harmless, and honest.
Journal reference:
- Preliminary scientific report.
Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B., & Xiong, D. (2023, October 31). Evaluating Large Language Models: A Comprehensive Survey. arXiv. https://doi.org/10.48550/arXiv.2310.19736, https://arxiv.org/abs/2310.19736