Despite advancements, AI chatbots still struggle to grasp complex historical narratives, revealing biases and limitations that hinder their reliability in academic research.
For the past decade, complexity scientist Peter Turchin and his collaborators have been working to bring together the most current and structured body of knowledge about human history in one place: the Seshat Global History Databank. Over the past year, together with computer scientist Maria del Rio-Chanona, he began to wonder whether artificial intelligence chatbots could help historians and archaeologists gather data and better understand the past. As a first step, they wanted to assess the AI tools' understanding of historical knowledge.
In collaboration with an international team of experts from institutions such as the Complexity Science Hub, the University of Oxford, and the Alan Turing Institute, they evaluated the historical knowledge of advanced AI models such as GPT-4, Llama, and Gemini.
"Large language models (LLMs), such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded by replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited," says Turchin, who leads the Complexity Science Hub's (CSH) research group on social complexity and collapse.
Artificial "Intelligence" is Domain-Specific
"One surprising finding, which emerged from this study, was just how bad these models were. This result shows that artificial 'intelligence' is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others," adds Turchin.
The study's results were presented recently at NeurIPS, the premier annual AI conference, held in Vancouver. The study used a four-choice question format that required models to determine whether historical facts were directly evidenced or inferred, a crucial distinction in historical analysis. GPT-4 Turbo, the best-performing model, scored 46% on the test. According to Turchin and his team, although this is an improvement over the 25% baseline of random guessing, it highlights considerable gaps in AI's understanding of historical knowledge.
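For readers curious how such scores relate to the 25% random-guess baseline, the minimal sketch below (not the authors' evaluation code; the option labels and toy data are purely illustrative) computes a balanced accuracy over a four-option question set and shows that a random guesser hovers near 25%.

```python
import random
from collections import defaultdict

# Hypothetical four-option answer labels, loosely mirroring the
# evidenced/inferred distinction described in the article.
OPTIONS = ["present", "absent", "inferred present", "inferred absent"]

def balanced_accuracy(records):
    """Average per-class accuracy, so rare answer classes count equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, predicted in records:
        total[gold] += 1
        correct[gold] += int(gold == predicted)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example: random guessing on four options lands near the 25% baseline.
random.seed(0)
toy = [(random.choice(OPTIONS), random.choice(OPTIONS)) for _ in range(10_000)]
print(f"Random-guess balanced accuracy ≈ {balanced_accuracy(toy):.2%}")
```

Against that baseline, GPT-4 Turbo's 46% is clearly better than chance but far from expert-level performance.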
"I thought the AI chatbots would do a lot better," says del Rio-Chanona, the study's corresponding author. "History is often viewed as facts, but sometimes interpretation is necessary to make sense of it," adds del Rio-Chanona, an external faculty member at CSH and an assistant professor at University College London.
Setting a Benchmark for LLMs
This new assessment, the first of its kind, challenged the AI systems to answer graduate- and expert-level questions similar to those recorded in Seshat, using a multi-shot approach that provided worked examples to guide the models' responses. Seshat is a vast, evidence-based resource that compiles historical knowledge on more than 600 societies worldwide, spanning over 36,000 data points and more than 2,700 scholarly references.
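As a rough illustration of what a multi-shot (few-shot) setup looks like in practice, the hypothetical sketch below prepends a couple of solved example questions before the target question so the model sees the expected answer format; the wording, example questions, and helper names are assumptions for illustration, not the prompts used in HiST-LLM.

```python
# Hypothetical few-shot ("multi-shot") prompt builder; the example
# questions and phrasing are illustrative, not taken from the study.
EXAMPLES = [
    ("Did society X in period Y have a written legal code? "
     "Options: A) present B) absent C) inferred present D) inferred absent",
     "A"),
    ("Did society Z in period W mint its own coinage? "
     "Options: A) present B) absent C) inferred present D) inferred absent",
     "D"),
]

def build_multishot_prompt(question: str) -> str:
    """Prepend solved examples so the model sees the expected answer format."""
    parts = ["Answer with a single letter (A, B, C, or D).\n"]
    for q, a in EXAMPLES:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

print(build_multishot_prompt(
    "Did society Q in period R maintain a professional standing army? "
    "Options: A) present B) absent C) inferred present D) inferred absent"
))
```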
"We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge," explains first author Jakob Hauser, a resident scientist at CSH. "The Seshat Databank allows us to go beyond 'general knowledge' questions. A key component of our benchmark is testing whether AI models can distinguish between direct and indirect evidence, which is vital for historical accuracy."
Disparities Across Time Periods and Geographic Regions
The benchmark also reveals other important insights into how well current chatbots, seven models from the Gemini, OpenAI, and Llama families, comprehend global history. For instance, they answered questions about ancient history most accurately, particularly for the period from 8000 BCE to 3000 BCE. However, accuracy dropped sharply for more recent periods; for GPT-4 Turbo it fell to 38.7% for the period between 1500 CE and 2000 CE.
In addition, the results highlight disparities in model performance across geographic regions. GPT-4 Turbo outperformed the other models in six of eight global regions, while Llama performed best in North America and OpenAI's models excelled in Latin America. Both the OpenAI and Llama models performed comparatively poorly in Sub-Saharan Africa, and Llama also struggled in Oceania. According to the study, this suggests potential biases in the training data, which may overemphasize certain historical narratives while neglecting others.
Better on Legal Systems, Worse on Discrimination
The benchmark also found differences in performance across categories. Models performed best on legal systems and social complexity. GPT-4 Turbo achieved the highest scores in seven out of ten categories, while Llama performed relatively well in social complexity and warfare. "But they struggled with topics such as discrimination and social mobility," says del Rio-Chanona.
"The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They're great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they're not yet up to the task," adds del Rio-Chanona. According to the benchmark, the model that performed best was GPT-4 Turbo, with a balanced accuracy of 46%, while the weakest was Llama-3.1-8B, with 33.6%.
Next Steps
Del Rio-Chanona and the other researchers from CSH, the University of Oxford, and the Alan Turing Institute are committed to expanding the dataset and improving the benchmark. According to Hauser, they plan to include more data from underrepresented regions and incorporate more complex historical questions.
"We plan to continue refining the benchmark by integrating additional data points from diverse regions, especially the Global South. We are also collaborating with non-English-speaking institutions to enhance data representation and mitigate biases. We also look forward to testing more recent LLM models, such as o3, to see if they can bridge the gaps identified in this study," says Hauser.
The CSH scientist emphasizes that the benchmark's findings can be valuable to both historians and AI developers. For historians, archaeologists, and social scientists, knowing the strengths and limitations of AI chatbots can help guide their use in historical research. For AI developers, these results highlight areas for improvement, particularly in mitigating regional biases and enhancing the models' ability to handle complex, nuanced historical knowledge.
Journal reference:
- “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM),” by Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Federica Villa, James S. Bennett, Daniel Hoyer, Pieter Francois, Peter Turchin, and R. Maria del Rio-Chanona, presented at the NeurIPS conference in Vancouver in December.