Can Language Models Stop Making Stuff Up? New OpenAI Benchmark Puts AI to the Test

Discover how SimpleQA is testing the limits of language models by measuring accuracy on straightforward questions, pushing the next generation of AI to reduce false answers and sharpen reliability.

Research: Measuring short-form factuality in large language models. Image Credit: Shutterstock AI

In an article recently posted to the OpenAI website, researchers introduced Simple Question-Answering (SimpleQA), a benchmark designed to assess how accurately language models (LMs) answer concise, fact-based questions. They focused on questions with clear, unambiguous answers to streamline grading, and collected the questions adversarially against Generative Pre-trained Transformer 4 (GPT-4) so the benchmark stays challenging for frontier models. The framework also tests whether models can recognize the limits of their own knowledge, answering when confident and abstaining when uncertain, marking a shift toward supporting more advanced model development.

Background

The quest to improve factual accuracy and reliability in LMs has spurred the development of new benchmarks to assess model responses. Many current models, including some of the most advanced, frequently produce answers lacking accuracy or clear substantiation, a phenomenon known as model "hallucination." Existing benchmarks, like TriviaQA and Natural Questions, have become less challenging for today's leading models and no longer effectively measure progress in factual reliability.

To address this, the researchers presented SimpleQA, a new benchmark specifically designed for evaluating short, accurate responses to fact-based questions. SimpleQA included 4,326 diverse questions, carefully designed to reduce complexity while ensuring high accuracy and low variability across runs. Unlike previous benchmarks that primarily evaluated long-form or open-ended answers, SimpleQA isolated factuality in brief, fact-seeking responses. This narrow focus provides a clearer picture of model accuracy on simple queries, though it remains to be seen if these findings generalize to more complex contexts. By open-sourcing SimpleQA, the authors contributed a foundational tool aimed at evaluating and enhancing factuality in frontier models.

Data Collection and Verification

The SimpleQA dataset was constructed in multiple stages to ensure accuracy and relevance, with AI trainers writing precise, fact-seeking question-answer pairs. To qualify, each question had to have a single, verifiable answer, eliminating questions that could produce varied or subjective responses. For example, rather than asking an open-ended question like "Where did Barack and Michelle Obama meet?", which could be answered correctly at several levels of detail, trainers asked for a specific entity, such as "in which city" or "at which company" the two met.

Trainers also avoided references that would quickly become outdated by specifying details without using terms like “as of 2023.” Each question was paired with an evidence-backed reference answer, verified independently by a second AI trainer, and only retained if both trainers agreed. Following these quality controls, only questions that stumped at least one of four GPT models during testing were kept, ensuring SimpleQA's difficulty level.
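As a rough illustration of this filtering step, the sketch below keeps a question only when at least one of several model attempts misses the verified reference answer. It is a minimal toy version: the string comparison stands in for the ChatGPT-based grading the actual pipeline uses, and the example answers are invented.

```python
def is_hard_enough(question, reference_answer, model_answers):
    """Adversarial filter: keep a question only if at least one model
    attempt fails to match the verified reference answer.

    `model_answers` is a list of answers, one per model attempt; the
    simple string comparison below stands in for the ChatGPT-based
    grading used in the real pipeline.
    """
    def matches(answer):
        return answer.strip().lower() == reference_answer.strip().lower()

    return any(not matches(ans) for ans in model_answers)


# Toy example: three of four hypothetical attempts are right, one is wrong,
# so the question is retained as sufficiently difficult.
attempts = ["Sidley Austin", "Sidley Austin", "Harvard", "Sidley Austin"]
print(is_hard_enough("At which company did Barack and Michelle Obama meet?",
                     "Sidley Austin", attempts))  # True -> keep the question
```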

For additional quality control, SimpleQA used ChatGPT classifiers to flag ambiguous or error-prone questions, which trainers then revised. After the dataset was completed, a third trainer cross-verified a random subset, revealing a 3% error rate mainly due to ambiguities or conflicting sources, which highlights the difficulty of achieving complete consensus even among verified data sources.

SimpleQA's questions cover a diverse range of topics, from science to art and politics, with topic labels assigned post hoc by ChatGPT; nearly a third of the reference answers are dates. Wikipedia was the predominant source, supplemented by other reputable sites, ensuring a broad knowledge base. The grading system used ChatGPT to classify each response as correct, incorrect, or not attempted, and these counts were aggregated into an overall F-score that balances how often a model answers correctly overall with how accurate it is on the questions it chooses to attempt. This grading method is key to ensuring that SimpleQA provides a dependable measure of model accuracy and thus helps address key challenges in improving model factuality.
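The aggregation behind these metrics can be illustrated with a short sketch. The snippet below is a minimal, assumed reconstruction: given per-question grades of correct, incorrect, or not attempted, it computes the share answered correctly overall, accuracy on attempted questions, and their harmonic mean as an F-score; the exact weighting in the paper may differ.

```python
from collections import Counter


def simpleqa_scores(grades):
    """Aggregate graded responses into SimpleQA-style metrics.

    `grades` is a list of labels, one per question, each being
    "correct", "incorrect", or "not_attempted" (in the benchmark
    itself, a ChatGPT grader assigns these labels).
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    overall_correct = counts["correct"] / total  # recall-like term
    correct_given_attempted = (counts["correct"] / attempted) if attempted else 0.0  # precision-like term

    # Harmonic mean of the two terms, analogous to an F-score.
    if overall_correct + correct_given_attempted == 0:
        f_score = 0.0
    else:
        f_score = (2 * overall_correct * correct_given_attempted
                   / (overall_correct + correct_given_attempted))
    return overall_correct, correct_given_attempted, f_score


# Example: 6 correct, 2 incorrect, 2 not attempted out of 10 questions.
grades = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
print(simpleqa_scores(grades))  # (0.6, 0.75, ~0.667)
```

Under this kind of scheme, declining to answer lowers the overall-correct term but not the accuracy-on-attempted term, which is why abstaining can be a sensible strategy for a model that is genuinely uncertain.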

Model Performance and Calibration

The evaluation analyzed the performance and calibration of several OpenAI and Anthropic models on the SimpleQA dataset. As expected, larger models such as GPT-4o outperformed smaller versions such as GPT-4o-mini, and similar trends held across Anthropic's Claude models. Claude models, however, tended to attempt fewer questions than GPT-4o; declining to answer lowered the share of questions they got right overall, and with it their F-scores, without necessarily reducing their accuracy on the questions they did attempt.

Calibration testing was also central, specifically assessing how well models aligned confidence with accuracy. Calibration was measured in two ways: by directly asking models to state their confidence in their answers, and by evaluating the consistency of answers across repeated attempts at the same question. Results showed that larger models such as o1-preview and GPT-4o were generally better calibrated than their smaller counterparts, although all models tended to overstate their confidence, pointing to room for improvement. The second measure, answer consistency across 100 repeated attempts at each question, confirmed the same ranking: models such as o1-preview showed the highest calibration on both stated confidence and consistency.
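The first of these calibration measures can be illustrated with a small sketch. The code below bins answers by the model's stated confidence and compares each bin's average confidence with its observed accuracy; the binning scheme and toy data are assumptions for illustration, not the paper's exact procedure, but a well-calibrated model should show stated confidence close to accuracy in every bin.

```python
def calibration_table(records, n_bins=5):
    """Compare stated confidence with observed accuracy per confidence bin.

    `records` is a list of (stated_confidence, is_correct) pairs, with
    confidence in [0, 1]. Returns (mean_confidence, accuracy, count) for
    each non-empty bin; a well-calibrated model has mean confidence close
    to accuracy in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        bins[idx].append((conf, correct))

    table = []
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        table.append((round(mean_conf, 2), round(accuracy, 2), len(bucket)))
    return table


# Toy data: the model claims ~90% confidence but is right only ~70% of the time.
records = [(0.9, True)] * 7 + [(0.9, False)] * 3
print(calibration_table(records))  # [(0.9, 0.7, 10)] -> overconfident
```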

Conclusion

In conclusion, the researchers introduced SimpleQA as a benchmark for evaluating the factual accuracy of LMs on concise, fact-focused questions. Unlike previous benchmarks, SimpleQA's questions are designed for clarity and verifiability, reducing grading complexity. The dataset of 4,326 questions was collected adversarially so that it remains challenging enough to drive improvements in factuality.

Evaluations of several OpenAI and Anthropic models showed that larger models tended to perform better but still exhibited calibration issues, particularly overconfident answers. The study underscores the importance of better calibration in future models and establishes SimpleQA as a valuable tool for driving advancements in the factual reliability of AI systems.


Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.

