AI can excel at certain tasks, like coding or creating a podcast. But a new paper finds that it struggles to pass a high-stakes history exam.
A team of researchers has created a new benchmark to test three leading large language models (LLMs), OpenAI's GPT-4, Meta's Llama, and Google's Gemini, on historical questions. The benchmark, Hist-LLM, tests the correctness of answers against the Seshat Global History Databank, a database of historical knowledge named after the ancient Egyptian goddess of wisdom.
The results, presented at last month's NeurIPS AI conference, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it achieved only about 46% accuracy, not much higher than random guessing.
“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They are great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they are not yet up to the task,” said Maria del Rio-Chanona, a co-author of the paper and an associate professor of computer science at University College London.
The researchers shared with TechCrunch sample history questions that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present in ancient Egypt during a specific time period. The LLM answered yes, but the technology did not appear in Egypt until 1,500 years later.
Why are LLMs so bad at answering technical history questions when they can be so good at answering very complicated questions about things like coding? Del Rio-Chanona said it is likely because LLMs tend to extrapolate from historical data that is very prominent, and find it difficult to retrieve more obscure historical knowledge.
For example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a specific historical period. The correct answer is no, but the LLM incorrectly answered that it did, likely because there is plenty of public information about other ancient empires, such as Persia, having standing armies.
“If you get told A and B 100 times, and C one time, and then get asked a question about C, you might just remember A and B and try to extrapolate from that,” del Rio-Chanona said.
The researchers identified other trends, including that the OpenAI and Llama models performed worse for some regions, such as sub-Saharan Africa, suggesting possible bias in their training data.
The results show that LLMs are still no substitute for humans when it comes to certain domains, said Peter Turchin, a professor at CSH who led the study.
But the researchers still hold out hope that LLMs can help historians in the future. They are working on refining their benchmark by including more data from underrepresented regions and adding more complex questions.
“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential for these models to aid historical research,” the paper states.