LLM Hallucination Index

RAG SPECIAL

Brought to you by Galileo

A Ranking & Evaluation Framework For LLM Hallucinations


Welcome to the Hallucination Index!

The LLM landscape has changed a lot since we launched our first Hallucination Index in November 2023, with larger, more powerful open- and closed-source models being announced monthly. Since then, two things have happened: the term "hallucinate" became Dictionary.com’s Word of the Year, and Retrieval-Augmented Generation (RAG) has become one of the leading methods for building AI solutions. And while the parameters and context lengths of these models continue to grow, the risk of hallucinations remains.

Our new Index evaluates how well 22 of the leading models adhere to given context, helping developers make informed decisions about balancing price and performance. We rigorously tested top LLMs with inputs ranging from 1,000 to 100,000 tokens to answer the question of how well they perform across short, medium, and long context lengths. So let's dive into the insights. Welcome to the new Hallucination Index - RAG Special!

About the Index

What?

Providing additional context has emerged as a new way to improve RAG performance and reduce reliance on retrieval from vector databases. So we tested each LLM across three scenarios, each with a different context length.

Short Context

Less than 5k tokens
equivalent to RAG on a few pages

Medium Context

5k to 25k tokens
equivalent to RAG on a book chapter

Long Context

40k to 100k tokens
equivalent to RAG on a book

Learn more about task type selection
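For reference, the three buckets above correspond to simple token-count thresholds. The sketch below shows one way to assign an input to a bucket; the tokenizer choice (tiktoken's cl100k_base encoding) is an assumption for illustration, since the Index does not state which tokenizer was used for counting.

```python
# Minimal sketch: bucket a RAG input into the three scenario lengths above.
# Assumption: tiktoken's cl100k_base encoding is used purely for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def context_bucket(text: str) -> str:
    n_tokens = len(enc.encode(text))
    if n_tokens < 5_000:
        return "short"            # RAG on a few pages
    if n_tokens <= 25_000:
        return "medium"           # RAG on a book chapter
    if 40_000 <= n_tokens <= 100_000:
        return "long"             # RAG on a book
    return "out_of_range"         # 25k-40k is not covered by the Index's buckets

print(context_bucket("Some retrieved context ..."))
```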

How?

We followed these steps when testing each LLM:

1. We gathered diverse datasets reflecting real-world scenarios across three different context lengths.

2. We employed our high-performance evaluation metric, Context Adherence, to measure factual accuracy and closed-domain hallucinations: cases where the model said things that were not provided in the context data. (A minimal sketch of this style of evaluation follows below.)

Learn more about the Context Adherence evaluation metric and the ChainPoll evaluation method.
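For intuition, here is a minimal sketch of a ChainPoll-style Context Adherence score: an LLM judge is polled several times with chain-of-thought reasoning, and its yes/no verdicts are averaged into a score between 0 and 1. The judge model, prompt wording, and number of polls below are illustrative assumptions, not the exact configuration used for the Index.

```python
# Minimal ChainPoll-style sketch: poll an LLM judge several times with
# chain-of-thought, then average the yes/no verdicts into a 0-1 score.
# Judge model, prompt wording, and poll count are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a RAG response.

Context:
{context}

Response:
{response}

Think step by step, then finish with exactly "VERDICT: yes" if every claim
in the response is supported by the context, or "VERDICT: no" otherwise."""

def context_adherence(context: str, response: str, n_polls: int = 5) -> float:
    votes = []
    for _ in range(n_polls):
        completion = client.chat.completions.create(
            model="gpt-4o",   # illustrative choice of judge model
            temperature=1.0,  # sampling diversity across polls
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(context=context, response=response),
            }],
        )
        text = completion.choices[0].message.content.lower()
        votes.append("verdict: yes" in text)
    # Fraction of polls that judged the response adherent to the context.
    return sum(votes) / n_polls
```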

10 closed-source models

12 open-source models

3 RAG tasks

Trends

01. Open source is closing the gap

While closed-source models still offer the best performance thanks to proprietary training data, open-source models like Gemma, Llama, and Qwen continue to improve in hallucination performance without the cost barriers of their closed-source counterparts.

02. What context length?

We were surprised to find that models perform particularly well with extended context lengths, without losing quality or accuracy, which reflects how far model training and architecture have come.

03. Larger is not always better

In certain cases, smaller models outperformed larger ones. Gemini-1.5-flash-001 is a case in point, suggesting that efficiency in model design can sometimes outweigh scale.

04. Anthropic outperforms OpenAI

During testing, Anthropic's latest Claude 3.5 Sonnet scored close to perfect, beating out o1 and GPT-4o in shorter-context scenarios while being cost-effective.

Top Models for RAG Applications

Short Context RAG (<5k tokens)

Best closed-source model: claude-3-5-sonnet (score: 0.97)
Best open-source model: llama-3.1-405b-instruct (score: 0.97)
Best affordable model: gemini-1.5-flash (score: 0.94)

Medium Context RAG (5k to 25k tokens)

Best closed-source model: gemini-1.5-flash (score: 1.00)
Best open-source model: llama-3.1-8b-instruct (score: 1.00)
Best affordable model: gemini-1.5-flash (score: 1.00)

Long Context RAG (40k to 100k tokens)

Best closed-source model: gpt-4o-mini (score: 1.00)
Best affordable model: gemini-1.5-flash (score: 0.92)

Model Performance on RAG Tasks

Short Context RAG (SCR)

Less than 5k tokens

The Short Context RAG task seeks to identify the most effective model for understanding contexts of up to 5k tokens. Its primary goal is to detect any loss of information or reasoning capability within these contexts. Similar to referencing select pages of a book, this setup is especially suitable for tasks that demand domain-specific knowledge.


Pricing for open source models is based on Together.ai as of July 10

Result Snapshot

Best closed-source model: claude-3-5-sonnet (0.97)
Lagging closed-source model: claude-3-haiku (0.92)
Best open-source model: llama-3.1-405b-instruct (0.97)
Lagging open-source model: qwen2-7b-instruct (0.75)
Best performance for the cost: gemini-1.5-flash (0.94)

RAG Task Examples

Model: Llama-3.1-70b-chat
Is response correct? Yes
Context Adherence: 1

Prompt: For every 100 females age 18 and over, there were 94.5 males.

Response: Based on the information, females are the larger gender group. For every 100 females, there were 97.3 males, and for every 100 females age 18 and over, there were 94.5 males. This indicates that there are more females than males in the population.

Expected answer: Female

Model: Gemma-2-9b-it
Is response correct? No
Context Adherence: 0

Prompt: For every 100 females age 18 and over, there were 94.5 males.

Response: Male

Expected answer: Female

Medium Context RAG (MCR)

5k to 25k tokens

The Medium Context RAG task aims to determine the most effective model for comprehending long contexts spanning from 5k to 25k tokens. It focuses on identifying any loss of information and reasoning ability within these extensive contexts. Additionally, we experiment with a prompting technique known as Chain-of-Note to improve performance, as it has proven effective for short contexts. This task is akin to doing RAG on a few book chapters.
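For reference, Chain-of-Note prompting asks the model to first write a short reading note for each retrieved document and only then compose its answer from those notes. The template below is an illustrative sketch of the idea, not the exact prompt used in the Index.

```python
# Illustrative Chain-of-Note prompt template: the model writes a reading note
# per retrieved document, then answers using only those notes. The wording is
# an assumption; the Index's exact prompt is not reproduced here.
CHAIN_OF_NOTE_TEMPLATE = """Task: answer the question using the retrieved documents.

Question: {question}

{documents}

First, write a brief note for each document stating what it says about the
question (or "not relevant" if it says nothing). Then, based only on your
notes, give the final answer. If the documents do not contain the answer,
say so explicitly.

Notes:"""

def build_chain_of_note_prompt(question: str, documents: list[str]) -> str:
    doc_block = "\n\n".join(
        f"Document {i + 1}: {doc}" for i, doc in enumerate(documents)
    )
    return CHAIN_OF_NOTE_TEMPLATE.format(question=question, documents=doc_block)
```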


Pricing for open source models is based on Together.ai as of July 10

Result Snapshot

Best closed-source model: gemini-1.5-flash (1.00)
Lagging closed-source model: claude-3-haiku (0.96)
Best open-source model: llama-3.1-8b-instruct (1.00)
Lagging open-source model: dbrx-instruct (0.95)
Best performance for the cost: gemini-1.5-flash (1.00)

Recall heatmap

This heatmap shows the model's ability to recall information in different parts of the context. The x-axis represents the length of the context during the experiment, and the y-axis represents the location of the information. Green indicates successful recall, while red indicates failure.

Heatmap for claude-3-5-sonnet-20240620
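A minimal sketch of how such a recall heatmap can be produced: a known fact (the "needle") is planted at different depths in filler text of varying length, and the model is asked to retrieve it. The ask_model helper is a caller-supplied function wrapping whichever LLM is under test, and the needle text is an illustrative assumption.

```python
# Minimal recall-heatmap sketch: plant a "needle" fact at different depths in
# filler context of varying length, query the model, and record success
# (1 = recalled, shown green; 0 = missed, shown red).
from typing import Callable
import numpy as np

NEEDLE = "The secret code for the vault is 74312."   # illustrative needle fact
QUESTION = "What is the secret code for the vault?"

def build_context(filler_words: list[str], n_words: int, depth: float) -> str:
    words = (filler_words * (n_words // len(filler_words) + 1))[:n_words]
    insert_at = int(len(words) * depth)
    return " ".join(words[:insert_at] + [NEEDLE] + words[insert_at:])

def recall_heatmap(ask_model: Callable[[str, str], str],
                   filler_words: list[str],
                   lengths: list[int],
                   depths: list[float]) -> np.ndarray:
    grid = np.zeros((len(depths), len(lengths)))
    for i, depth in enumerate(depths):          # y-axis: location of the fact
        for j, length in enumerate(lengths):    # x-axis: context length
            context = build_context(filler_words, length, depth)
            answer = ask_model(context, QUESTION)
            grid[i, j] = 1.0 if "74312" in answer else 0.0
    return grid
```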

RAG Task Examples

Model: Gemini Flash
Is response correct? Yes
Context Adherence: 1

Prompt: Export controls targeting GPUs and semiconductors associated with AI, which have been imposed and are increasingly likely to be further tightened, would further restrict our ability to export our technology, products, or services even though competitors may not be subject to similar restrictions, creating a competitive disadvantage for us and negatively impacting our business and financial results.

Response: The answer is **2.7%**. This information is found in **Document 2**, which states: "In fiscal year 2024, our overall turnover rate was 2.7%."

Expected answer: 2.7%

Model: Qwen-2-7b-Instruct
Is response correct? No
Context Adherence: 0

Prompt: Export controls targeting GPUs and semiconductors associated with AI, which have been imposed and are increasingly likely to be further tightened, would further restrict our ability to export our technology, products, or services even though competitors may not be subject to similar restrictions, creating a competitive disadvantage for us and negatively impacting our business and financial results.

Response: The documents do not provide the information about the percentage of employees who left the company during FY 2024.

Expected answer: 2.7%

Long Context RAG (LCR)

40k to 100k tokens

The Long Context RAG task aims to determine the most effective model for comprehending very long contexts of 40k to 100k tokens, again checking for any loss of information or reasoning ability. This task is akin to doing RAG on an entire book.


Result Snapshot

Best closed-source model: gpt-4o-mini (1.00)
Lagging closed-source model: claude-3-haiku (0.70)
Best performance for the cost: gpt-4o-mini (1.00)

Recall heatmap

As the name suggests, this heatmap indicates the model's success in recalling information at different locations in the context. Green signifies success, while red indicates failure.

Heatmap for claude-3-opus-20240229

RAG Task Examples

Model: Claude 3.5 Sonnet
Is response correct? Yes
Context Adherence: 1

Prompt: Export controls targeting GPUs and semiconductors associated with AI, which have been imposed and are increasingly likely to be further tightened, would further restrict our ability to export our technology, products, or services even though competitors may not be subject to similar restrictions, creating a competitive disadvantage for us and negatively impacting our business and financial results.

Response: According to the document, NVIDIA's overall employee turnover rate in fiscal year 2024 was 2.7%.

Expected answer: 2.7%

Model: Mistral Large
Is response correct? No
Context Adherence: 0

Prompt: Export controls targeting GPUs and semiconductors associated with AI, which have been imposed and are increasingly likely to be further tightened, would further restrict our ability to export our technology, products, or services even though competitors may not be subject to similar restrictions, creating a competitive disadvantage for us and negatively impacting our business and financial results.

Response: The overall turnover rate was 5.3%.

Expected answer: 2.7%

Model comparison: claude-3-5-sonnet vs. claude-3-haiku

Developer: Anthropic / Anthropic
License: NA (private model) / NA (private model)
Model parameters: NA (private model) / NA (private model)
Supported context length: 200k / 200k
$/M prompt tokens: 3.00 / 0.25
$/M response tokens: 15.00 / 1.25
Prompt cost compared to Claude 3.5 Sonnet: 1.00x / 0.08x
Prompt cost compared to gpt-4o-2024-08-06: 1.20x / 0.10x
Prompt cost compared to Gemini 1.5 Pro: 0.86x / 0.07x
Prompt cost compared to Meta-Llama-3.1-70B-Instruct: 3.41x / 0.28x
Short context RAG: 0.97 / 0.92
Medium context RAG: 1.00 / 0.96
Long context RAG: 1.00 / 0.70
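The cost-insight rows are simple ratios of prompt price per million tokens; the short snippet below reproduces the 1.00x and 0.08x figures from the prices listed above.

```python
# Reproduce the "prompt cost compared to Claude 3.5 Sonnet" ratios from the
# listed $/M prompt-token prices.
prompt_price_per_m = {"claude-3-5-sonnet": 3.00, "claude-3-haiku": 0.25}
baseline = prompt_price_per_m["claude-3-5-sonnet"]

for model, price in prompt_price_per_m.items():
    print(f"{model}: {price / baseline:.2f}x")
# claude-3-5-sonnet: 1.00x
# claude-3-haiku: 0.08x
```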

Performance on datasets

Here is the performance of the models on each dataset. The datasets were selected to test different capabilities, ranging from robustness to noise to the ability to do math.

Read the full report