Evaluating Language Models with SKILL-MIX: A Novel Approach

In an article recently submitted to the arXiv* preprint server, the authors discussed the evolving role of Large Language Models (LLMs) and proposed a novel evaluation method called "SKILL-MIX" to assess how flexibly these models can combine learned skills. The evaluation asks a model to generate text that combines random subsets of skills from a given list, revealing differences among LLM capabilities that traditional rankings do not capture. The authors argued that this methodology could seed a broader ecosystem for evaluating future AI models, marking a shift in how these AI agents are assessed.

Study: Evaluating Language Models with SKILL-MIX: A Novel Approach. Image credit: Generated using DALL.E.3

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Prior Research

As LLMs transition from statistical language models into versatile AI agents, existing evaluation methods have not kept pace. These evaluations, largely designed around academic-style reasoning tasks, are susceptible to training-data contamination and to "cramming" for leaderboard performance, and the secrecy surrounding training data makes it difficult to verify whether model-generated text is original. A pressing need therefore exists for evaluation methods that are more relevant, contamination-resistant, scalable, and comprehensible.

Designing SKILL-MIX

Picking Skills and Topics: To develop the SKILL-MIX evaluation, the authors selected a set of 101 language skills and 100 topics as follows. They focused on basic language skills that have a Wikipedia entry and whose definitions are understandable to the average college student. Skills were initially gathered from textbooks on logical reasoning, rhetoric, and theory of mind; those that were either difficult to combine with other skills or too specialized to apply across a broad range of topics were eliminated. For each skill, the authors created a description and an illustrative example, sourcing them from textbooks or Wikipedia and occasionally modifying them for clarity.

To compile the list of 100 topics, an initial list was narrowed down based on the unigram frequency of each topic and its synonyms, measured using the Google Ngram Viewer. Topics had to have an average unigram frequency of around 10⁻⁶, high enough to ensure reasonable coverage in contemporary datasets while keeping the chance that any particular combination of k skills already appears in the context of the topic low.
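As a rough illustration of this filtering step, the sketch below keeps only candidate topics whose unigram frequency falls in a band around 10⁻⁶. The topic names, frequency values, and cutoffs are hypothetical placeholders; the article does not publish the authors' exact thresholds or topic list.

```python
# Hypothetical sketch of the topic-filtering step described above.
# Frequencies and the cutoff band are illustrative, not values from the study.
candidate_topics = {
    "gardening": 2.1e-6,
    "sewing": 1.3e-6,
    "quantum chromodynamics": 4.0e-8,  # too rare: thin coverage in contemporary corpora
    "time": 9.5e-4,                    # too common: skills easily co-occur with it
}

LOW, HIGH = 5e-7, 5e-6  # assumed band around the ~1e-6 target frequency

selected = [topic for topic, freq in candidate_topics.items() if LOW <= freq <= HIGH]
print(selected)  # ['gardening', 'sewing']
```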

Since the primary goal of the evaluation is to assess general-purpose text generation capabilities rather than specific skills and topics, only ten skills and ten topics were released to avoid potential "cramming."
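One way to see why withholding most skills and topics blunts "cramming" is to count the possible prompts: with 101 skills, 100 topics, and k skills drawn per prompt, there are roughly C(101, k) × 100 distinct (skill subset, topic) combinations, far too many to memorize. A quick, self-contained check of that count:

```python
from math import comb

N_SKILLS, N_TOPICS = 101, 100  # counts reported in the article

for k in range(2, 6):
    n_prompts = comb(N_SKILLS, k) * N_TOPICS
    print(f"k={k}: {n_prompts:,} possible (skill subset, topic) prompts")
# k=2 already yields 505,000 combinations, and the count grows rapidly with k.
```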

Evaluation Procedure: The SKILL-MIX evaluation consists of two parts: generation and grading. In the generation phase, a language model, referred to as the "student," is provided with a set of k skills, their definitions, and a topic. The student’s task is to generate natural text that demonstrates these k skills within the context of the given topic. After the student develops the text, a separate grading language model, known as the "Grader," evaluates it.
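A minimal sketch of this two-part procedure, assuming a generic `query_model(model, prompt)` helper for calling either model, might look as follows; the prompt wording paraphrases the article rather than reproducing the authors' actual prompts.

```python
import random

def skill_mix_round(student, grader, skills, topics, k, query_model):
    """One SKILL-MIX round: sample k skills and a topic, generate, then grade.

    `query_model(model, prompt)` is a hypothetical helper returning the model's
    text; `skills` maps skill names to definitions, `topics` is a list of topics.
    """
    chosen = random.sample(sorted(skills.items()), k)  # k (name, definition) pairs
    topic = random.choice(topics)

    skill_block = "\n".join(f"- {name}: {definition}" for name, definition in chosen)
    generation_prompt = (
        f"Produce a short, natural piece of text about '{topic}' that "
        f"demonstrates ALL {k} of the following skills:\n{skill_block}"
    )
    student_text = query_model(student, generation_prompt)

    grading_prompt = (
        f"Grade the text below. Does it correctly exhibit each of these {k} skills, "
        f"stay on the topic '{topic}', and remain coherent?\n\n"
        f"Skills:\n{skill_block}\n\nText:\n{student_text}"
    )
    return student_text, query_model(grader, grading_prompt)
```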

Models Used for Generation: The researchers used a range of language models for text generation in the SKILL-MIX evaluation, preferring instruction-tuned models that respond reliably to specific prompts. These included Large Language Model Meta AI (LLaMA)-2-7B-Chat, LLaMA-2-13B-Chat, LLaMA-2-70B-Chat, Generative Pre-trained Transformer (GPT)-3.5-turbo, GPT-4, Falcon-180B-Chat, Xwin-LM-70B-V0.1, Mistral-7B-Instruct-v0.1, Qwen-14B-Chat, and Tigerbot-70B-Chat.

Models Used for Grading: Not all language models are proficient at grading—some struggle to recognize the presence of skills, even when correctly demonstrated. Therefore, LLaMA-2-70B-Chat and GPT-4 were selected for grading after manual spot-checking to ensure their alignment with human grading standards.

Generation Prompt: Researchers prompted the student model with the list of k selected skills, providing full definitions and illustrative examples, and instructed it to produce a brief text illustrating all k skills in the context of the provided topic. The evaluation allowed the student to review and improve its initial answer, which often led to significantly better second responses. The prompts were designed to facilitate this, instructing the student to separate its answer and explanation using the labels "Answer" and "Explanation".
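Assuming the same hypothetical `query_model` helper as above, the revision round and the labeled "Answer"/"Explanation" output might be handled roughly like this; the label parsing is deliberately simple and would likely need to be more forgiving for real model outputs.

```python
def split_answer_explanation(response: str):
    """Split a response into its 'Answer' and 'Explanation' parts, assuming the
    plain labels described above are present."""
    answer, explanation = response, ""
    if "Explanation:" in response:
        answer, explanation = response.split("Explanation:", 1)
    answer = answer.replace("Answer:", "", 1)
    return answer.strip(), explanation.strip()

def generate_with_revision(student, prompt, query_model):
    """Ask for an initial answer, then ask the student model to review and
    improve it, mirroring the two-round setup described above."""
    first = query_model(student, prompt)
    revision_prompt = (
        prompt
        + "\n\nHere is your previous attempt:\n" + first
        + "\n\nReview it and give an improved answer, again using the "
        + "'Answer' and 'Explanation' labels."
    )
    second = query_model(student, revision_prompt)
    return split_answer_explanation(second)
```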

In the auto-grading process for SKILL-MIX(k) prompts, the student model's responses were evaluated against several criteria: correct use of each of the k skills, relevance to the given topic, adherence to a length limit of at most k - 1 sentences, and overall coherence of the text. Researchers assigned partial credit for partial compliance with these criteria. Grading was carried out using GPT-4 and LLaMA-2-70B-Chat, with spot-checking by the authors.

The grading process was refined during a trial run to account for prompt sensitivity. GPT-4 and LLaMA-2 proved to be broadly reliable graders but exhibited some difficulty with basic arithmetic, so the authors adopted a modified grading prompt that requests separate scores for individual components, including skill use, topic relevance, and text coherence, and then consolidated those component scores with a Python script, resulting in more accurate grading.
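The article does not spell out the authors' exact point scheme, but a minimal sketch of such a consolidation script, assuming equal weight for each component and fractional credit for the skills, could look like this:

```python
def consolidate_scores(skill_flags, on_topic, num_sentences, k, coherent):
    """Combine the Grader's component judgments into one score with partial credit.

    `skill_flags` lists, for each of the k skills, whether the Grader judged it
    correctly used. The weighting here is illustrative, not the authors' scheme.
    """
    skill_points = sum(skill_flags) / len(skill_flags)       # fraction of skills shown
    topic_points = 1.0 if on_topic else 0.0                  # stayed on topic
    length_points = 1.0 if num_sentences <= k - 1 else 0.0   # at most k - 1 sentences
    coherence_points = 1.0 if coherent else 0.0              # reads as coherent text
    return skill_points + topic_points + length_points + coherence_points  # max 4.0

# Example: 2 of 3 skills demonstrated, on topic, 2 sentences, coherent text
print(consolidate_scores([True, True, False], True, 2, k=3, coherent=True))  # ≈ 3.67
```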

Experimental Results

Ablation experiments used to refine SKILL-MIX demonstrate its responsiveness to the evolving landscape of language models. By fine-tuning the evaluation criteria, SKILL-MIX proves its value as a robust benchmark for assessing the proficiency of advanced models. This adaptability will be essential for maintaining rigorous evaluations in the dynamic field of natural language processing, ensuring that benchmarks keep pace with model advancements.

Conclusion

In summary, SKILL-MIX is a valuable evaluation tool for assessing language models' general capabilities and compositional skills. Testing models with randomly chosen combinations of skills and topics challenges them to handle novel scenarios. Human evaluations did exhibit some variance; overall, however, the rankings aligned with the perceived quality of the proprietary models, and larger models tended to achieve higher saturation points.

The study also raises the concern that some open models may be tuned toward leaderboard performance, a practice referred to as "cramming." GPT-4's performance on SKILL-MIX indicates that it goes beyond mere "stochastic parrot" behavior. This adaptive evaluation approach may also extend to multi-modal AI models in the future.



Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

