In an article posted to the Meta research website, researchers addressed the critical need for comprehensive tools to assess and improve the fairness of generative large language models (LLMs). With a focus on benchmarking and mitigation, the study compared prompt-based bias and toxicity metrics across diverse demographic axes and LLM families, providing insights for practitioners.
Background
The rapid advancement of generative LLMs has drawn increased attention to the potential ethical risks associated with their deployment. Previous studies have revealed that base LLMs exhibit substantial social biases related to gender, race, and other demographic attributes. Importantly, these biases tend to escalate as models scale in size, posing challenges to achieving equitable and unbiased natural language generation. While certain post hoc techniques, relying on human feedback, have shown promise in mitigating bias, the extent to which these approaches genuinely eliminate biases rather than merely concealing them remains unclear.
This study took a pivotal step by concentrating on base LLMs, that is, models in their foundational state before fine-tuning techniques such as reinforcement learning from human feedback (RLHF) are applied. The primary objective was to understand the core social biases inherent in these models so that mitigations could be targeted at their source. To address the complexities of bias, the paper introduced a novel definition, interpreting bias as the proportion of demographic subgroups for which the frequency of toxic or negative-regard generations exceeded an acceptable threshold. This definition aligned with the principle of demographic parity, serving as a benchmark for equality and fairness within natural language processing contexts.
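As a concrete illustration of this definition, the short sketch below computes such a bias score from per-subgroup labels; the subgroup names, toy labels, and threshold value are illustrative assumptions, not the paper's exact implementation.

```python
def bias_score(labels_by_subgroup, threshold):
    """Fraction of subgroups whose rate of problematic (toxic or negative-regard)
    generations exceeds the acceptable threshold; 0 means every subgroup stays within it."""
    exceeding = sum(
        1
        for labels in labels_by_subgroup.values()
        if sum(labels) / len(labels) > threshold
    )
    return exceeding / len(labels_by_subgroup)

# Toy per-subgroup labels from a toxicity or regard classifier (1 = problematic, 0 = benign).
labels = {
    "women": [0, 0, 1, 0, 1, 0],
    "men": [0, 0, 0, 0, 1, 0],
    "nonbinary people": [1, 0, 1, 0, 1, 0],
}
print(bias_score(labels, threshold=0.2))  # 2 of 3 subgroups exceed 20% -> 0.666...
```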
As existing evaluations often focus on a limited set of demographic axes, such as binary gender, this work expanded its scope by evaluating LLMs across an extended suite of metrics and demographic categories. The aim was to enable direct model comparison and, consequently, advance the development of effective bias and toxicity mitigation techniques. The paper introduced new datasets, including AdvPromptSet and HolisticBiasR, to enhance benchmarking and mitigation studies, thereby contributing to a more nuanced understanding of biases in generative LLMs.
Methods
The study evaluated five families of generative LLMs – GPT-2, OPT, BlenderBot 3, BLOOM, and LLaMa – focusing specifically on base models without reinforcement learning from human or AI feedback. Multiple model sizes within each family were tested to assess how performance, bias, and toxicity vary with scale. The analysis also examined the frequencies of demographic terms in the LLMs' training corpora to better understand potential biases originating from the data. The study introduced the ROBBIE benchmark suite, which combined multiple prompt datasets and classifiers to assess LLMs' outputs for bias and toxicity.
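As a rough illustration of this prompt-based evaluation flow, the sketch below generates a continuation for each prompt and scores it with an off-the-shelf toxicity classifier. The GPT-2 and unitary/toxic-bert checkpoints are stand-ins chosen for availability, not necessarily the models or classifiers used in ROBBIE.

```python
from transformers import pipeline

# Stand-in checkpoints; ROBBIE evaluates several LLM families with its own set of classifiers.
generator = pipeline("text-generation", model="gpt2")
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def score_prompts(prompts, max_new_tokens=30):
    """Generate one continuation per prompt and record the toxicity classifier's top label."""
    results = []
    for prompt in prompts:
        full_text = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)[0]["generated_text"]
        continuation = full_text[len(prompt):]
        pred = toxicity(continuation)[0]  # label names and thresholds depend on the chosen classifier
        results.append({
            "prompt": prompt,
            "continuation": continuation,
            "label": pred["label"],
            "score": pred["score"],
        })
    return results

print(score_prompts(["The new neighbors introduced themselves and"]))
```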
Two new datasets were introduced: (1) AdvPromptSet, a comprehensive adversarial text prompt set with over 197,000 prompts spanning varying toxicity levels and more than 24 demographic identity groups, and (2) HolisticBiasR, an extension of the Regard dataset that replaced demographic noun phrases with terms from the HolisticBias dataset, expanding the demographic categories considered. The evaluation metrics included automatic bias and toxicity assessments across datasets such as Regard, RealToxicityPrompts, BOLD, and ToxiGen.
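The HolisticBiasR construction can be pictured as simple template substitution: demographic noun phrases in Regard-style prompt templates are swapped for descriptor-plus-noun phrases drawn from HolisticBias. The templates and noun phrases below are made-up examples for illustration, not the released data.

```python
# Toy Regard-style templates with a placeholder for the demographic noun phrase.
templates = [
    "{group} was described as",
    "{group} was known for",
    "{group} worked as",
]

# A few illustrative HolisticBias-style noun phrases; the real dataset covers
# hundreds of descriptors across many demographic axes.
noun_phrases = [
    "a Deaf woman",
    "an elderly man",
    "a nonbinary person",
    "a first-generation immigrant",
]

prompts = [t.format(group=np) for t in templates for np in noun_phrases]
print(len(prompts))   # 12 prompts in this toy example
print(prompts[0])     # "a Deaf woman was described as"
```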
To contextualize bias and toxicity measurements, the study also reported on the generative capabilities and inference efficiency of each model, assessing generation quality, token throughput, latency, and device memory utilization. Additionally, the paper explored the effectiveness of three bias and toxicity mitigation techniques – prompting with hand-written templates and automatic prompt revision, self-debiasing, and adversarial triggering – across various models, metrics, and demographic axes. The research provided valuable insights for understanding and addressing ethical concerns associated with these models.
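Of these three mitigation techniques, self-debiasing is perhaps the least self-explanatory. The sketch below shows the core idea under simplifying assumptions (greedy decoding, a single hand-written "toxic" prefix, an exponential penalty); it differs in detail from the original self-debiasing method and from whatever exact variant the paper evaluates.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Prefix that encourages the *undesired* behavior; tokens it makes more likely get penalized.
BIAS_PREFIX = "The following text contains rude, disrespectful, or toxic language:\n"

@torch.no_grad()
def self_debias_generate(prompt, max_new_tokens=30, scale=10.0):
    plain = tok(prompt, return_tensors="pt").input_ids
    biased = tok(BIAS_PREFIX + prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        p_plain = torch.softmax(model(plain).logits[0, -1], dim=-1)
        p_biased = torch.softmax(model(biased).logits[0, -1], dim=-1)
        delta = p_plain - p_biased
        # Down-weight tokens that become more likely when the toxic prefix is present.
        penalty = torch.where(delta < 0, torch.exp(scale * delta), torch.ones_like(delta))
        next_id = torch.argmax(p_plain * penalty).view(1, 1)
        plain = torch.cat([plain, next_id], dim=-1)
        biased = torch.cat([biased, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(plain[0], skip_special_tokens=True)

print(self_debias_generate("The protest turned ugly when"))
```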
Results
The results covered toxicity, negative regard, and bias across different model sizes and families, including GPT-2, BlenderBot 3, and others. Bias was measured by comparing each subgroup's rate of problematic generations against a background rate, with the expectation that a fair model keeps every subgroup within the acceptable threshold. AdvPromptSet drew its examples from the Jigsaw Unintended Bias in Toxicity Classification datasets.
Results showed that toxicity and negative regard often increase with model size, though not consistently. The BiasScore metric quantified these disparities, and the researchers explored fine-grained and intersectional biases using the AdvPromptSet and HolisticBiasR datasets. Additionally, the mitigation techniques – prompting, self-debiasing, and adversarial triggering – were tested for their effectiveness in reducing toxicity, negative regard, and bias.
Root cause analysis examined the frequencies of demographic terms in the training corpora, revealing disparities in how often terms such as "female" and "male" appear across datasets. The results highlighted the complexities of bias mitigation, with effectiveness varying across models and pointing to the need for more advanced methods as models grow larger. Human evaluation results, performance metrics, and the root cause analysis together contributed to a comprehensive understanding of biases in LLMs.
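This kind of root cause analysis can be approximated with straightforward term counting over a corpus sample. In the sketch below, the term list and the tiny in-memory corpus are placeholders for the much larger descriptor vocabulary and pretraining corpora the paper examines.

```python
import re
from collections import Counter

# Illustrative demographic terms; the paper counts a far larger descriptor vocabulary.
TERMS = ["female", "male", "woman", "man", "nonbinary", "transgender"]

def count_demographic_terms(documents, terms=TERMS):
    """Count whole-word occurrences of each demographic term across a list of documents."""
    patterns = {t: re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in terms}
    counts = Counter()
    for doc in documents:
        for term, pat in patterns.items():
            counts[term] += len(pat.findall(doc))
    return counts

corpus_sample = [
    "The male lead and the female lead share the scene.",
    "A transgender author wrote the essay; a nonbinary critic reviewed it.",
]
print(count_demographic_terms(corpus_sample))
# e.g. Counter({'female': 1, 'male': 1, 'nonbinary': 1, 'transgender': 1, 'woman': 0, 'man': 0})
```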
Conclusion
In conclusion, the study identified varying rates of toxicity and bias in language models across different prompt datasets, emphasizing the importance of assessing intersectional demographics. Mitigation techniques like self-debiasing were effective in smaller models, while prompting was more effective in larger ones. The analysis of training corpora revealed an underrepresentation of gender and sex minority terms, potentially contributing to biases against LGBTQ+ groups.
The authors called for the ongoing expansion of datasets and the continual evolution of subgroup labels to reflect societal change. Acknowledging that toxicity tends to escalate with model size, they suggested that future work should include benchmarking RLHF-tuned models and the broader adoption of multi-metric bias measurements.