In an article published in the journal Scientific Reports, researchers explored the impact of noise in human-labeled benchmarks for machine commonsense reasoning (CSR), a pivotal area in artificial intelligence (AI) research.
Conducting detailed noise audits under different experimental conditions, the study revealed significant levels of noise (level, pattern, and system noise) in both high-quality and crowdsourced labeling settings. The findings suggested that noise significantly affected performance estimates of CSR systems, challenging the prevalent reliance on a single ground truth in AI benchmarking practices.
Background
The field of machine CSR has garnered significant attention in AI research, especially with the rise of large language models (LLMs). However, while numerous benchmarks have been developed to evaluate the performance of AI systems in this area, there has been comparatively little focus on the quality of the human-labeled datasets used for benchmarking.
Previous research has largely overlooked the presence and impact of noise in these labeled datasets, focusing more on bias. This paper addressed this gap by proposing a comprehensive noise audit of human-labeled benchmarks in machine CSR. Unlike previous studies, which often presumed a single ground truth without considering noise, this research aimed to quantify the types and amounts of noise present in labeled datasets.
By conducting noise audits under different experimental conditions, including both laboratory settings and online crowdsourced scenarios, the study provided insights into the variability of human judgment and its implications for evaluating the performance of CSR systems. Through this novel approach, the paper filled a critical gap in the understanding of the reliability and robustness of human-labeled benchmarks in AI research.
Annotation Methodology and Dataset Description
In this study, the researchers examined human judgment in CSR, aiming to understand the impact of noise on data annotations. They utilized two benchmark datasets, theoretically grounded CSR (TG-CSR) and commonsense validation and explanation (ComVE), to explore noise in different contexts. TG-CSR focused on lab-based scenarios, while ComVE represented a more realistic crowdsourced setting. TG-CSR comprised eight datasets across four contexts: vacationing abroad, camping, bad weather, and dental cleaning.
Each context presented various prompts aimed at assessing commonsense knowledge. Annotators were tasked with labeling prompts in either multiple-choice or true-false formats, providing diverse perspectives. The researchers employed Amazon Mechanical Turk (MTurk) to re-annotate the ComVE dataset, where annotators rated sentences as plausible or implausible. This approach aimed to evaluate a system's ability to discern between plausible and implausible statements.
To analyze noise, the researchers employed a comprehensive framework encompassing level noise, pattern noise, system noise, and residual noise. They compared noise levels across different formats and datasets, shedding light on the intricacies of human judgment. Furthermore, the authors explored the impact of filtering annotators based on the number of labels each provided. By varying the filtering criteria, the researchers investigated how noise levels fluctuated, providing insights into quality-control mechanisms for crowdsourced tasks.
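To make the decomposition concrete, the sketch below shows one common way such an audit can be computed from an annotator-by-prompt label matrix, following the standard breakdown in which system noise splits into level and pattern components. The function name and toy matrix are illustrative assumptions; the paper's exact estimators may differ.

```python
import numpy as np

def noise_decomposition(labels: np.ndarray) -> dict:
    """Decompose disagreement in an annotator x prompt label matrix.

    labels: 2-D array (n_annotators, n_prompts) of numeric judgments,
    e.g. 0/1 plausibility ratings. Assumes a complete matrix; this is
    a textbook-style decomposition, not the authors' exact estimator.
    """
    judge_means = labels.mean(axis=1)          # each annotator's average label
    # System noise: average, over prompts, of the variance among annotators.
    system_noise = labels.var(axis=0).mean()
    # Level noise: variance of annotators' overall means (some annotators
    # are systematically more lenient or strict than others).
    level_noise = judge_means.var()
    # Pattern noise: annotator x prompt interaction remaining once level
    # differences are removed (system = level + pattern for a full matrix).
    pattern_noise = system_noise - level_noise
    return {"system": system_noise, "level": level_noise, "pattern": pattern_noise}

# Toy example: 4 annotators labeling 6 true/false prompts.
toy = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 1, 1],
])
print(noise_decomposition(toy))
```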
In addition to assessing noise, the study evaluated humans as if they were CSR systems. Each annotator was treated as an independent system, and their accuracy was evaluated against a reference ground truth derived from the majority labels provided by the other annotators. Finally, the authors extended the analysis to assess the performance of ChatGPT (Chat Generative Pre-trained Transformer), an LLM, on the ComVE dataset. By comparing ChatGPT's performance under different levels of noise, the researchers highlighted how noise influences performance estimates for AI systems.
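This annotator-as-system evaluation can be illustrated with a short leave-one-out sketch, in which each annotator is scored against the majority vote of the remaining annotators. The helper below is a hypothetical simplification (for example, its tie-breaking rule is an assumption), not the authors' exact procedure.

```python
import numpy as np

def leave_one_out_accuracy(labels: np.ndarray) -> np.ndarray:
    """Score each annotator as if it were a CSR system.

    labels: (n_annotators, n_prompts) binary matrix. For annotator i,
    the reference label on each prompt is the majority vote of the
    *other* annotators; accuracy is agreement with that reference.
    """
    n_annotators, _ = labels.shape
    accuracies = np.empty(n_annotators)
    for i in range(n_annotators):
        others = np.delete(labels, i, axis=0)
        # Majority label per prompt (ties broken toward 1 here).
        reference = (others.mean(axis=0) >= 0.5).astype(int)
        accuracies[i] = (labels[i] == reference).mean()
    return accuracies
```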
Findings and Insights
The results revealed consistently non-zero levels of noise across all datasets, with lower levels observed in the true-false format compared to multiple-choice. This aligned with expectations that a clearer scale leads to less noise. While differences in noise levels were minimal across different contexts, pattern noise was prevalent, indicating disagreement among annotators on labeling. This suggested varying difficulty levels among prompts, which current benchmarks struggled to address with only a single ground-truth label per prompt.
Further analysis performed on the ComVE dataset reaffirmed the dominance of pattern noise, with filtering based on annotator reliability showing a minimal impact on noise levels. Accuracy estimates for both TG-CSR and ComVE datasets demonstrated high performance but with substantial confidence intervals due to noise, highlighting the challenge of evaluating systems on CSR benchmarks. MTurk participants who provided more labels tended to have narrower confidence intervals, but significant variation persisted even with hundreds of labels provided.
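One simple way to see why annotators who provided more labels obtain tighter accuracy estimates is a percentile bootstrap over their per-prompt correctness indicators: resampling fewer labels yields wider intervals. The paper does not specify its exact interval construction, so the sketch below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_accuracy_ci(correct: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap CI for an annotator's (or system's) accuracy.

    correct: 1-D array of 0/1 indicators, one per labeled prompt.
    Fewer labels generally produce a wider interval, mirroring the
    observation that workers with more labels had narrower intervals.
    """
    n = len(correct)
    samples = rng.choice(correct, size=(n_boot, n), replace=True).mean(axis=1)
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)
```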
Regression analysis revealed a negative trend between accuracy and certain types of noise, indicating the influence of noise on performance estimates. Lastly, ChatGPT's performance on different noise partitions from ComVE showed significant differences, underscoring the impact of noise on model performance. These findings emphasized the importance of accounting for noise in evaluating AI systems on CSR tasks and suggested avenues for improving benchmark construction and evaluation methodologies.
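The noise-partition comparison can be sketched as follows: prompts are split by annotator disagreement into low- and high-noise subsets, and a system's predictions are scored separately on each. The median split and function names here are illustrative assumptions, not the authors' exact partitioning scheme.

```python
import numpy as np

def accuracy_by_noise_partition(labels: np.ndarray,
                                reference: np.ndarray,
                                predictions: np.ndarray) -> dict:
    """Compare a system's accuracy on low- vs high-disagreement prompts.

    labels:      (n_annotators, n_prompts) binary annotator matrix.
    reference:   (n_prompts,) ground-truth labels (e.g., majority vote).
    predictions: (n_prompts,) system outputs, e.g., from an LLM.
    Prompts are split at the median per-prompt disagreement.
    """
    disagreement = labels.var(axis=0)              # per-prompt annotator variance
    low_noise = disagreement <= np.median(disagreement)
    correct = (predictions == reference)
    return {
        "low_noise_accuracy": correct[low_noise].mean(),
        "high_noise_accuracy": correct[~low_noise].mean(),
    }
```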
Insights and Implications
Benchmark datasets were crucial for training and evaluating CSR systems, yet they faced challenges such as data quality and reliance on a single ground truth. In this study, the researchers audited noise in two datasets, TG-CSR and ComVE, revealing significant amounts of level, pattern, and system noise. While level noise was consistent, pattern noise varied across prompts, indicating diverse human interpretations.
Interestingly, TG-CSR showed reduced noise with a binary format. Filtering in ComVE reduced noise but sacrificed dataset size. The findings underscored the impact of noise on accuracy estimates, where even small variations affected system performance. Notably, without proper noise audits, differences in leaderboard rankings might stem from noise rather than genuine improvements. This study emphasized the importance of understanding and addressing noise in CSR benchmarks to ensure reliable evaluations.
Conclusion
In conclusion, the researchers illuminated the pervasive impact of noise in human-labeled benchmarks for machine CSR. By quantifying and analyzing noise levels across various datasets and experimental conditions, they underscored the need to reconsider benchmarking practices in AI research.
Recognizing the inevitability of noise, the findings advocated for more nuanced evaluation methodologies and the acknowledgment of multiple ground truths to ensure robust and reliable performance assessments. These insights not only advanced the understanding of CSR evaluation but also have broader implications for the wider field of machine learning, which relies heavily on human-labeled data.
Journal reference:
- Kejriwal, M., Santos, H., Shen, K., Mulvehill, A. M., & McGuinness, D. L. (2024). A noise audit of human-labeled benchmarks for machine commonsense reasoning. Scientific Reports, 14(1), 8609. https://doi.org/10.1038/s41598-024-58937-4, https://www.nature.com/articles/s41598-024-58937-4