LLM Hallucination Index: RAG Special
Osman Javed, VP of Marketing
Pratik Bhavsar, Galileo Labs
July 29, 2024

We are excited to introduce the second installment of Galileo's Hallucination Index, the RAG Special!

The LLM landscape has changed a lot since we launched our first Hallucination Index in November 2023, with larger, more powerful open- and closed-source models being announced monthly. Since then, two things happened: the term "hallucinate" became Dictionary.com's Word of the Year, and Retrieval-Augmented Generation (RAG) has become the leading method for building AI solutions. And while models continue to increase in size and performance, the risk of hallucinations remains.

For our second installment of the Hallucination Index, we tested 22 of the leading foundation models from OpenAI, Anthropic, Meta, Google, and others in real-world RAG-based use cases.

Context Length and Model Performance

Given the growing popularity of RAG, understanding how context length impacts model performance was a key focus. We tested the models across three scenarios:

  • Short Context: Models were provided with less than 5,000 tokens, equivalent to a few pages of information.
  • Medium Context: Models received between 5,000 and 25,000 tokens, comparable to a book chapter.
  • Long Context: Models handled 40,000 to 100,000 tokens, akin to the content of an entire book.
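
To make these buckets concrete, here is a minimal Python sketch (our own illustration, not the evaluation harness used for the Index) that counts tokens with OpenAI's tiktoken library and maps a retrieved context to one of the scenarios above. The choice of the cl100k_base encoding is an assumption, and note the 25,000-40,000 token gap between the medium and long ranges as listed.

```python
# A minimal sketch (not Galileo's actual harness) of bucketing a RAG prompt
# into the three context-length scenarios above. Token counts are computed
# with tiktoken's cl100k_base encoding; thresholds mirror the ones listed.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def context_bucket(context: str) -> str:
    """Classify retrieved context as short, medium, or long."""
    n_tokens = len(ENC.encode(context))
    if n_tokens < 5_000:
        return "short"       # a few pages of information
    if n_tokens <= 25_000:
        return "medium"      # roughly a book chapter
    if 40_000 <= n_tokens <= 100_000:
        return "long"        # an entire book
    return "unbucketed"      # sizes outside the tested ranges

# Example: bucket a concatenation of retrieved chunks before evaluation.
retrieved_chunks = ["...chunk 1...", "...chunk 2..."]
print(context_bucket("\n\n".join(retrieved_chunks)))
```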

Open-Source vs. Closed-Source Models

The open-source vs. closed-source software debate has raged on since the Free Software Movement (FSM) of the 1980s, and it has reached a fever pitch during the LLM arms race. The common assumption is that closed-source LLMs, with their access to proprietary training data, will perform better, but we wanted to put this assumption to the test.

Measuring Performance with Galileo's Context Adherence Evaluation Model

LLM performance was measured using Galileo's Context Adherence evaluation model. Context Adherence uses ChainPoll, a proprietary method created by Galileo Labs, to measure how closely a model's response sticks to the information it was given, helping spot when the model makes up information that is not in the retrieved context.
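
The core idea behind ChainPoll, as described in Galileo Labs' research, is to poll an LLM judge several times with chain-of-thought prompts and aggregate the yes/no votes. The sketch below is our own simplified illustration of that idea, not the proprietary metric: `ask_judge` is a hypothetical stand-in for an LLM completion call, and the prompt wording and poll count are assumptions.

```python
# A simplified, ChainPoll-style adherence score: ask a judge LLM the same
# yes/no question several times with chain-of-thought reasoning and average
# the votes. `ask_judge` is a hypothetical wrapper around whatever LLM API
# you use; Galileo's production metric is proprietary and more involved.
from typing import Callable

JUDGE_PROMPT = """Context:
{context}

Response:
{response}

Think step by step, then answer with a final line of exactly "yes" or "no":
is every claim in the response supported by the context?"""

def context_adherence(
    context: str,
    response: str,
    ask_judge: Callable[[str], str],  # hypothetical: prompt in, completion out
    n_polls: int = 5,
) -> float:
    """Return the fraction of judge completions whose verdict is 'yes'."""
    prompt = JUDGE_PROMPT.format(context=context, response=response)
    votes = 0
    for _ in range(n_polls):
        completion = ask_judge(prompt)
        # Use only the final line so the chain-of-thought text is ignored.
        verdict = completion.strip().splitlines()[-1].strip().lower()
        votes += verdict.startswith("yes")
    return votes / n_polls  # 1.0 = fully adherent, 0.0 = likely hallucination
```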

We hope this index helps AI builders make informed decisions about which LLM is best suited for their particular use case and need.

With that, dig into the rankings, model-specific insights, and our methodology at www.rungalileo.io/hallucinationindex.