Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations




Hallucinations, or factually incorrect answers, continue to plague large language models (LLMs). Models falter especially when they are given more complex tasks and when users are looking for specific, highly detailed answers.

It's a challenge data scientists have struggled to overcome, and now researchers from Google DeepMind say they have come a step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs' ability to generate factually accurate responses grounded in long-form documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers to prompts.

Along with the new benchmark, the researchers have released a FACTS leaderboard to the Kaggle data science community.

As of this week, Gemini 2.0 Flash topped the leaderboard with an 83.6% factuality score. Others in the top nine include Google's Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic's Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI's GPT-4o, 4o-mini, o1-mini and o1-preview. All of these scored above 61.7%.

The researchers say that the leaderboard will be actively maintained and continuously updated to include new models and their different versions.

“We believe that this benchmark fills a gap in evaluating a wider variety of factuality-related model behaviors, as opposed to benchmarks that focus on narrower use cases… such as summarization only,” the researchers write in a technical paper published this week.

Eliminating inaccurate responses

Ensuring factual accuracy in LLM responses is difficult due to modeling (architecture, training and inference) and measurement (evaluation methodology, data and metrics) factors. Typically, the researchers point out, pre-training focuses on predicting the next token given previous tokens.

“While this objective may teach models salient world knowledge, it does not directly optimize the model towards the various factuality scenarios, instead encouraging the model to generate generally plausible text,” the researchers write.

To address this, the FACTS Grounding dataset comprises 1,719 examples – 860 public and 859 private – each requiring a long-form response based on the context in a provided document. Each example includes three components (sketched in code after this list):

  • A system prompt (system_instruction) with general directives and the instruction to respond only based on the provided context;
  • A task (user_request) containing a specific question to be answered;
  • A long document (context_document) with the necessary information.
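As a rough illustration of how these three fields fit together, a single example could be represented as follows; the field names match those above, but the dictionary structure, prompt assembly and sample values are assumptions for illustration, not the dataset's actual format:

```python
# Hypothetical sketch of one FACTS Grounding example, using the three fields
# described above. The field names follow the article; the dict structure,
# prompt assembly and sample values are illustrative assumptions, not the
# dataset's actual format.
example = {
    "system_instruction": (
        "Answer the user's request using only information found in the "
        "provided context document. Do not rely on outside knowledge."
    ),
    "user_request": "Summarize the main reasons the company's Q3 revenue declined.",
    "context_document": "<full text of the company's annual financial report>",
}

# The evaluated model sees all three parts and must return a long-form answer
# grounded entirely in context_document.
prompt = (
    f"{example['system_instruction']}\n\n"
    f"Document:\n{example['context_document']}\n\n"
    f"Request: {example['user_request']}"
)
print(prompt)
```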

To succeed and be labeled “accurate,” the model must process the long-form document and generate a subsequent long-form response that is both comprehensive and fully attributable to the document. Responses are labeled “inaccurate” if the model's claims are not directly supported by the document and are not highly relevant or useful.

For example, a user might ask a model to summarize the main reasons why a company's revenue decreased in Q3, and provide it with detailed information including the company's annual financial report covering quarterly earnings, expenses, planned investments and market analysis.

If a model then returned, say, “The company faced challenges in Q3 that affected its revenue,” it would be considered inaccurate.

“The response avoids specifying any reasons, such as market trends, increased competition or operational obstacles, which would likely be in the document,” the researchers note. “It also does not demonstrate an attempt to engage with or extract relevant details.”

In contrast, if a user prompts, “What are some tips on saving money?” and provides a collection of money-saving tips for college students, a correct answer would be highly detailed: “Take advantage of free activities on campus, buy groceries in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”

DeepMind uses LLMs to judge LLMs

To allow for diverse inputs, the researchers included documents of varying lengths, up to 32,000 tokens (or the equivalent of roughly 20,000 words). These cover areas including finance, technology, retail, medicine and law. User requests are similarly broad, including Q&A generation, summarization and rewriting requests.

Each example is judged in two phases. First, responses are evaluated for eligibility: if they do not sufficiently address the user's request, they are disqualified. Second, responses must be free of hallucinations and fully grounded in the provided documents.

These factuality scores are calculated by three different LLM judges – specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet – which determine individual scores based on the percentage of accurate model outputs. The final factuality determination is then based on the average of the three judges' scores.
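A minimal sketch of that two-stage aggregation, assuming each judge returns a binary grounding verdict per response (1 = fully grounded, 0 = unsupported) and that disqualified responses simply score zero; the function name and data layout here are hypothetical, and the judges' actual prompting templates are defined in the paper:

```python
# Hypothetical sketch of the two-stage scoring described above.
# judge_verdicts maps judge name -> per-response grounding verdicts
# (1 = fully grounded, 0 = hallucinated/unsupported); eligible marks whether
# each response sufficiently addressed the user request (phase one).
def factuality_score(judge_verdicts: dict[str, list[int]], eligible: list[bool]) -> float:
    per_judge_scores = []
    for judge, verdicts in judge_verdicts.items():
        # Phase 1: ineligible responses are disqualified (count as 0).
        # Phase 2: eligible responses count only if this judge found them grounded.
        accurate = [v if ok else 0 for v, ok in zip(verdicts, eligible)]
        per_judge_scores.append(sum(accurate) / len(accurate))
    # Final score: average of the individual judges' accuracy percentages.
    return sum(per_judge_scores) / len(per_judge_scores)

score = factuality_score(
    {
        "gemini-1.5-pro": [1, 0, 1],
        "gpt-4o": [1, 1, 1],
        "claude-3.5-sonnet": [1, 0, 1],
    },
    eligible=[True, True, False],
)
print(f"Aggregate factuality score: {score:.1%}")
```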

The researchers note that models are often biased towards other members of their own model family – with an average increase of about 3.23% – so the mix of different judges was essential to help ensure responses were indeed factual.

In the end, the researchers emphasize that factuality and grounding are key factors for the future success and usefulness of LLMs. “We believe that comprehensive benchmarking techniques, along with ongoing research and development, will continue to improve AI systems,” they write.

However, they also admit: “We are aware that progress can quickly overtake benchmarks, so the launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”


