A breakthrough tool called FactSelfCheck zooms in on the smallest units of AI-generated content—individual facts—to pinpoint and fix hallucinations with surgical precision, offering a powerful upgrade to how we catch and correct falsehoods in language models.
Research: FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs.

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In a major stride toward improving the factual accuracy of AI-generated content, researchers from Wrocław University of Science and Technology in Poland and the University of Technology Sydney have unveiled a cutting-edge method called FactSelfCheck. This new tool, described in a preprint* paper published on arXiv, zeroes in on hallucinations generated by large language models (LLMs) like GPT-4 or Claude—not just at the sentence or passage level, as current systems do, but at the level of individual facts.
Why AI Hallucinations Are a Real-World Problem
Hallucinations are a growing concern in the AI world. These occur when models output information that may sound plausible but is entirely false or unsubstantiated. While sometimes these errors are obvious, in other contexts—particularly in health, law, or science—they can be dangerously misleading. As AI continues to be embedded into professional tools and public services, the need to detect and correct hallucinated content becomes critical. Most existing detection systems aim to classify entire sentences or paragraphs as factual or false. The downside to this approach is that a single sentence can contain multiple facts, some true and some hallucinated. Treating the entire sentence as wrong doesn't offer the nuance needed for precise error correction.
Zooming in on the Individual Fact
FactSelfCheck addresses this shortfall by dissecting AI-generated text into its most basic units of meaning, individual facts, and testing each one for consistency. The system first generates multiple samples of AI responses to the same prompt, then extracts structured facts from these samples in the form of "triples," such as (Albert Einstein, born in, Ulm), where a subject, relation, and object capture a single claim. It then compares the presence and consistency of each fact across the different samples. If a particular fact reappears regularly across many outputs, it is likely reliable; if it shows up rarely or not at all, it is flagged as a potential hallucination.
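To make the idea concrete, here is a minimal sketch of that consistency check in Python. It is not the authors' implementation: it assumes the triples have already been extracted from the main response and from each sampled response (in the actual system, an LLM handles both the sampling and the extraction), and the toy data is purely illustrative.

```python
# Minimal sketch of fact-level consistency scoring, assuming triples are
# already extracted as (subject, relation, object) tuples.

def fact_consistency_scores(main_facts, sample_facts_per_response):
    """Score each fact from the main response by the fraction of sampled
    responses that also contain it; low scores suggest hallucination."""
    n_samples = len(sample_facts_per_response)
    return {
        fact: sum(fact in facts for facts in sample_facts_per_response) / n_samples
        for fact in main_facts
    }

# Toy usage: the birthplace fact appears in every sample, the fabricated one in none.
main_facts = {
    ("Albert Einstein", "born in", "Ulm"),
    ("Albert Einstein", "won", "Fields Medal"),   # fabricated fact
}
samples = [
    {("Albert Einstein", "born in", "Ulm"), ("Albert Einstein", "won", "Nobel Prize")},
    {("Albert Einstein", "born in", "Ulm")},
    {("Albert Einstein", "born in", "Ulm"), ("Albert Einstein", "worked at", "ETH Zurich")},
]
print(fact_consistency_scores(main_facts, samples))
# e.g. {('Albert Einstein', 'born in', 'Ulm'): 1.0,
#       ('Albert Einstein', 'won', 'Fields Medal'): 0.0}
```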
Built for Compatibility, Accuracy, and Efficiency
What’s particularly noteworthy is that this new method operates as a black-box system: it doesn't require any access to the inner workings of the AI models themselves, making it usable even for closed models like GPT-4. It also doesn't depend on external databases or prior training. Because it is zero-shot and non-parametric, FactSelfCheck can be deployed across domains right out of the box, without domain-specific training data or parameter tuning. This universality sets it apart from many state-of-the-art methods that either need finely tuned parameters or rely on external knowledge sources.
The researchers developed two main variants of FactSelfCheck. One version builds knowledge graphs from the text to assess consistency through structured comparisons. The other operates directly on raw text, skipping the graph-building process to save on computational resources. Despite the simplification, the text-based variant remains highly effective and, in some cases, outperforms more complex approaches. The knowledge-graph-based variant even uses semantic reasoning, asking an LLM whether a fact is supported by a generated graph, going beyond literal matches.
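The text-based variant can be pictured as an LLM "judge" voting on whether each sampled passage supports a given fact, with the votes then averaged. The sketch below is a hypothetical illustration of that idea, not the paper's actual prompts; ask_llm is a placeholder for whichever chat-completion client is used.

```python
# Hypothetical sketch of the text-based variant: an LLM judge is asked whether
# each sampled passage supports the fact, and the yes/no votes are averaged.

def supports(fact, passage, ask_llm):
    """Semantic check: does the passage support the fact, even if reworded?"""
    subject, relation, obj = fact
    prompt = (
        f"Does the following text support the fact ({subject}, {relation}, {obj})? "
        "Answer 'yes' or 'no'.\n\n"
        f"Text:\n{passage}"
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

def text_based_score(fact, sampled_passages, ask_llm):
    """Fraction of sampled responses whose text supports the fact."""
    votes = [supports(fact, passage, ask_llm) for passage in sampled_passages]
    return sum(votes) / len(votes)
```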
Strong Performance in Benchmark Testing
To evaluate their method, the team used the WikiBio GPT-3 Hallucination Dataset, a collection of Wikipedia-style biographies curated specifically to test hallucination detection systems. Compared against the leading benchmark tool, SelfCheckGPT, FactSelfCheck delivered nearly equivalent accuracy in sentence-level detection, achieving an area under the precision-recall curve (AUC-PR) of 92.45 versus 93.60 for SelfCheckGPT. For passage-level detection, it also showed strong correlation scores, signaling that its granular, fact-focused design can scale up to higher levels of analysis when needed. The researchers primarily tested their method with Llama-3.1-70B-Instruct and used GPT-4o for the correction experiments, showcasing its compatibility across different models.
Precision Correction Unlocks New Possibilities
The real strength of FactSelfCheck emerged in a series of experiments where the system was used not only to detect hallucinations but also to correct them. When given a list of specific hallucinated facts rather than vague indications of problematic sentences, a separate AI model was far more successful at rewriting responses to be accurate. The researchers found that corrections based on fact-level insights led to a 35 percent increase in factual content, compared with just 8 percent for sentence-level approaches. This finding underscores a simple but powerful point: the more precisely the hallucination is diagnosed, the better it can be fixed.
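One way to picture this correction step is that the detector's low-scoring triples are listed explicitly in a rewrite request for a second model. The snippet below is an illustrative sketch under that assumption, not the researchers' exact prompt.

```python
# Illustrative sketch of fact-guided correction: flagged triples are named
# explicitly so the correcting model knows exactly which claims to fix.

def build_correction_prompt(response, flagged_facts):
    """Assemble a rewrite request that lists each suspect fact explicitly."""
    fact_lines = "\n".join(f"- ({s}, {r}, {o})" for s, r, o in flagged_facts)
    return (
        "The following answer appears to contain hallucinated facts:\n"
        f"{fact_lines}\n\n"
        "Rewrite the answer so that these claims are corrected or removed, "
        "while leaving the rest of the text unchanged.\n\n"
        f"Answer:\n{response}"
    )
```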
Challenges Still to Overcome
While the tool shows enormous promise, the researchers acknowledge current limitations. For one, their evaluation had to rely on sentence-level labels since there are no public datasets yet that offer annotations at the fact level. This means their system's core strength—fact-level accuracy—can’t be fully validated in current benchmarks. Creating such datasets will be essential for the next phase of research. Another limitation is computational cost. FactSelfCheck requires generating multiple responses from a language model and then running those outputs through further analysis. Although the pipeline is efficient in its own right, it demands significantly more resources than simpler sentence-level methods. That said, the team believes this cost is justified by the precision and reliability their tool offers, especially in high-stakes applications.
A Roadmap for Future Innovation
Looking ahead, they see opportunities to optimize the pipeline further by merging prompts, improving fact extraction, and potentially incorporating verified knowledge from external sources such as Wikipedia or Wikidata. Such enhancements could allow the system to work faster and verify facts against real-world information, further reducing the chance of undetected hallucinations.
FactSelfCheck is also well-positioned to serve as a transparency tool for AI developers and policymakers. As generative AI models become increasingly embedded in society, from assisting doctors to drafting legislation, the risks of subtle misinformation grow. A tool that can identify not just that an answer is wrong but exactly which claim is incorrect and why brings us closer to making AI systems truly accountable.
Pushing AI Toward Truthfulness, One Fact at a Time
The team has made the code for FactSelfCheck openly available, inviting further development and integration into existing AI systems. This openness aligns with the broader movement in AI ethics toward explainable, trustworthy systems that can be audited by users and researchers alike. With regulatory conversations heating up globally and questions around AI reliability becoming ever more urgent, FactSelfCheck arrives at a pivotal moment.
By shifting the focus from vague sentence judgments to sharp, structured fact analysis, the researchers behind FactSelfCheck have made a compelling case for a more forensic approach to understanding AI output. Their work not only raises the bar for hallucination detection but also opens the door to smarter correction tools and safer applications of generative AI. It’s a reminder that while AI may not always tell the truth, we now have a better way to tell when it’s lying.

Journal reference:
- Preliminary scientific report. Sawczyn, A., Binkowski, J., Janiak, D., Gabrys, B., & Kajdanowicz, T. (2025). FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs. arXiv. https://arxiv.org/abs/2503.17229