A Ranking & Evaluation Framework For LLM Hallucinations
The LLM landscape has changed a lot since we launched our first Hallucination Index in November 2023, with larger, more powerful open- and closed-source models announced monthly. Since then, two things have happened: the term "hallucinate" became Dictionary.com's Word of the Year, and Retrieval-Augmented Generation (RAG) has become one of the leading methods for building AI solutions. And while the parameters and context lengths of these models continue to grow, the risk of hallucinations remains.
Our new Index evaluates how well 22 of the leading models adhere to given context, helping developers make informed decisions about balancing price and performance. We rigorously tested top LLMs with inputs ranging from 1,000 to 100,000 tokens to answer one question: how well do they perform across short, medium, and long context lengths? Let's dive into the insights. Welcome to the new Hallucination Index - RAG Special!
Providing more context has emerged as a new way to improve RAG performance and reduce reliance on vector databases. So, we tested each LLM across three scenarios, each with a different context length (a short bucketing sketch follows the list below):
- Short Context: less than 5k tokens, equivalent to RAG on a few pages
- Medium Context: 5k to 25k tokens, equivalent to RAG on a book chapter
- Long Context: 40k to 100k tokens, equivalent to RAG on a book
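For concreteness, the bucketing can be reproduced with a standard tokenizer. The sketch below uses the open tiktoken tokenizer, which is an assumption for illustration; the report does not state which tokenizer was used to count tokens.

```python
import tiktoken  # assumed tokenizer; the report does not specify which one was used


def context_bucket(text: str, encoding_name: str = "cl100k_base") -> str:
    """Assign a RAG test input to the short / medium / long bucket by token count."""
    n_tokens = len(tiktoken.get_encoding(encoding_name).encode(text))
    if n_tokens < 5_000:
        return "short"      # RAG on a few pages
    if n_tokens <= 25_000:
        return "medium"     # RAG on a book chapter
    if 40_000 <= n_tokens <= 100_000:
        return "long"       # RAG on a whole book
    return "unbucketed"     # 25k-40k and >100k inputs fall outside the three scenarios
```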
We followed these steps when testing each LLM:
1. We gathered diverse datasets reflecting real-world scenarios across three different context lengths.
2. We employed a high-performance evaluation metric, Context Adherence, to measure factual accuracy and closed-domain hallucinations - cases where the model said things that were not provided in the context data.
Learn more about the Context Adherence evaluation metric and the ChainPoll evaluation method.
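To give a concrete picture of the evaluation step, here is a minimal sketch of a ChainPoll-style context-adherence check: a judge model is polled several times with a chain-of-thought prompt asking whether the response is fully supported by the context, and the yes-votes are averaged into a 0-1 score. The judge model, prompt wording, and vote count below are illustrative assumptions, not Galileo's exact implementation.

```python
from openai import OpenAI  # judge model choice is an assumption for illustration

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG response.

Context:
{context}

Response:
{response}

Think step by step, then answer on the last line with exactly "SUPPORTED"
if every claim in the response is backed by the context, otherwise "UNSUPPORTED"."""


def context_adherence(context: str, response: str, n_polls: int = 5) -> float:
    """ChainPoll-style score: fraction of judge votes that find the response supported."""
    votes = 0
    for _ in range(n_polls):
        judgment = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed judge; the Index uses its own evaluation model
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(context=context, response=response)}],
            temperature=1.0,  # diversity across polls is what makes repeated polling informative
        ).choices[0].message.content
        lines = (judgment or "").strip().splitlines()
        last_line = lines[-1].strip().upper() if lines else ""
        votes += 1 if last_line == "SUPPORTED" else 0
    return votes / n_polls
```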
The Index spans both closed-source and open-source models, evaluated on RAG tasks at three context lengths.
While closed-source models still offer the best performance thanks to proprietary training data, open-source models like Gemma, Llama, and Qwen continue to improve in hallucination performance without the cost barriers of their closed-source counterparts.
We were surprised to find that models perform particularly well with extended context lengths, without losing quality or accuracy, reflecting how far model training and architecture have come.
In certain cases, smaller models outperformed larger ones. For example, Gemini-1.5-flash-001 outperformed larger models, which suggests that efficiency in model design can sometimes outweigh scale.
During testing, Anthropic's latest Claude 3.5 Sonnet scored close to perfect, beating out o1 and GPT-4o in shorter-context scenarios while being more cost-effective.
Best performing model
Claude 3.5 Sonnet, thanks to great performance on all tasks and context support up to 200k tokens.
Best performance for the cost
GPT-4o-mini, thanks to near-flawless performance on all tasks at an affordable price.
Best performing open-source model
Qwen2-72B-Instruct, thanks to great performance on short- and medium-context RAG and context support up to 128k tokens.
Short Context RAG: less than 5k tokens
The Short Context RAG task seeks to identify the most efficient model for understanding contexts of up to 5k tokens. Its primary goal is to detect any loss of information or reasoning capability within these contexts. Similar to referencing select pages in a book, this task is especially suitable for use cases that demand domain-specific knowledge.
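To make the setup concrete, here is a minimal sketch of how a short-context QA item can be posed to a model and graded with a naive exact-match check. The prompt template, the "Answer:" convention, and the helper names are illustrative assumptions; the Index itself grades responses with the Context Adherence metric described above.

```python
def build_short_context_prompt(passages: list[str], question: str) -> str:
    """Pack a handful of retrieved passages (under 5k tokens total) into one QA prompt."""
    context = "\n\n".join(f"Document {i + 1}:\n{p}" for i, p in enumerate(passages))
    return (
        f"Use only the documents below to answer the question.\n\n{context}\n\n"
        f"Question: {question}\nEnd your reply with a line of the form 'Answer: <answer>'."
    )


def is_correct(model_reply: str, expected: str) -> bool:
    """Naive exact-match grading on the final 'Answer:' line (illustrative only)."""
    for line in reversed(model_reply.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip().lower() == expected.strip().lower()
    return False
```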
Pricing for open-source models is based on Together.ai as of July 10.
Model: Llama-3.1-70b-chat
Is response correct? Yes
Context Adherence: 1
Context (excerpt): For every 100 females age 18 and over, there were 94.5 males.
Model response: Based on the information, females are the larger gender group. For every 100 females, there were 97.3 males, and for every 100 females age 18 and over, there were 94.5 males. This indicates that there are more females than males in the population.
Expected answer: Female
Model: Gemma-2-9b-it
Is response correct? No
Context Adherence: 0
Context (excerpt): For every 100 females age 18 and over, there were 94.5 males.
Model response: Male
Expected answer: Female
Medium Context RAG: 5k to 25k tokens
The Medium Context RAG task aims to determine the most effective model for comprehending contexts spanning 5k to 25k tokens. It focuses on identifying any loss of information and reasoning ability within these extensive contexts. Additionally, we experiment with a prompting technique known as Chain-of-Note, which has improved performance on short contexts, to see whether it helps here as well. This task is akin to doing RAG on a few book chapters.
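Chain-of-Note asks the model to first write a short reading note for each retrieved document before committing to an answer, which makes it easier to discount irrelevant or noisy passages. The template below is a minimal sketch of that idea; the exact wording used in the Index is not published here, so treat the prompt text as an assumption.

```python
CHAIN_OF_NOTE_PROMPT = """Task: Answer the question using the documents below.

{documents}

Question: {question}

Instructions:
1. For each document, write a one-sentence note on whether it is relevant
   to the question and what it says about it.
2. Then, based only on the relevant notes, answer the question.
3. If no document contains the answer, say so explicitly.

Notes and answer:"""


def chain_of_note_prompt(documents: list[str], question: str) -> str:
    """Assemble a Chain-of-Note style prompt from retrieved documents (illustrative wording)."""
    docs = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    return CHAIN_OF_NOTE_PROMPT.format(documents=docs, question=question)
```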
Pricing for open-source models is based on Together.ai as of July 10.
This heatmap shows the model's ability to recall information in different parts of the context. The x-axis represents the length of the context during the experiment, and the y-axis represents the location of the information. Green indicates successful recall, while red indicates failure.
Heatmap for claude-3-5-sonnet-20240620
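A heatmap like this can be produced with a needle-in-a-haystack style probe: a known fact is planted at different depths inside filler context of varying length, and the model is asked to recall it. The helper below is a minimal sketch of that idea; the function names, the characters-per-token approximation, and the substring recall check are illustrative assumptions, not the Index's exact harness.

```python
def make_recall_probe(filler: str, needle: str, depth: float, target_tokens: int,
                      approx_chars_per_token: int = 4) -> str:
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end) of a context
    trimmed to roughly `target_tokens` tokens."""
    body = filler[: target_tokens * approx_chars_per_token]
    cut = int(len(body) * depth)
    return body[:cut] + "\n" + needle + "\n" + body[cut:]


def recalled(model_answer: str, expected_fact: str) -> bool:
    """A cell is green if the expected fact appears in the answer, red otherwise."""
    return expected_fact.lower() in model_answer.lower()
```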
Model: Gemini Flash
Is response correct? Yes
Context Adherence: 1
Context (excerpt): Export controls targeting GPUs and semiconductors associated with AI, which have been imposed and are increasingly likely to be further tightened, would further restrict our ability to export our technology, products, or services even though competitors may not be subject to similar restrictions, creating a competitive disadvantage for us and negatively impacting our business and financial results.
Model response: The answer is **2.7%**. This information is found in **Document 2**, which states: "In fiscal year 2024, our overall turnover rate was 2.7%.
Expected answer: 2.7%
Model: Qwen-2-7b-Instruct
Is response correct? No
Context Adherence: 0
Context (excerpt): Export controls targeting GPUs and semiconductors associated with AI, which have been imposed and are increasingly likely to be further tightened, would further restrict our ability to export our technology, products, or services even though competitors may not be subject to similar restrictions, creating a competitive disadvantage for us and negatively impacting our business and financial results.
Model response: The documents do not provide the information about the percentage of employees who left the company during FY 2024.
Expected answer: 2.7%
Long Context RAG: 40k to 100k tokens
The Long Context RAG task evaluates how well models comprehend contexts from 40k to 100k tokens, akin to doing RAG on an entire book.
This heatmap shows the model's ability to recall information in different parts of the context. The x-axis represents the length of the context during the experiment, and the y-axis represents the location of the information. Green indicates successful recall, while red indicates failure.
Heatmap for claude-3-opus-20240229
Model: Claude 3.5 Sonnet
Is response correct? Yes
Context Adherence: 1
Context (excerpt): Export controls targeting GPUs and semiconductors associated with AI, which have been imposed and are increasingly likely to be further tightened, would further restrict our ability to export our technology, products, or services even though competitors may not be subject to similar restrictions, creating a competitive disadvantage for us and negatively impacting our business and financial results.
Model response: According to the document, NVIDIA's overall employee turnover rate in fiscal year 2024 was 2.7%.
Expected answer: 2.7%
Model: Mistral Large
Is response correct? No
Context Adherence: 0
Context (excerpt): Export controls targeting GPUs and semiconductors associated with AI, which have been imposed and are increasingly likely to be further tightened, would further restrict our ability to export our technology, products, or services even though competitors may not be subject to similar restrictions, creating a competitive disadvantage for us and negatively impacting our business and financial results.
Model response: The overall turnover rate was 5.3%.
Expected answer: 2.7%
| | | Claude 3.5 Sonnet | |
|---|---|---|---|
| Model Info | Developer | Anthropic | Anthropic |
| | License | NA (private model) | NA (private model) |
| | Model Parameters | NA (private model) | NA (private model) |
| | Supported context length | 200k | 200k |
| | $/M prompt tokens | 3 | 0.25 |
| | $/M response tokens | 15 | 1.25 |
| Cost insight | Prompt cost compared to Claude 3.5 Sonnet | 1.00x | 0.08x |
| | Prompt cost compared to gpt-4o-2024-08-06 | 1.20x | 0.10x |
| | Prompt cost compared to Gemini 1.5 Pro | 0.86x | 0.07x |
| | Prompt cost compared to Meta-Llama-3.1-70B-Instruct | 3.41x | 0.28x |
| Performance | Short context RAG | 0.97 | 0.92 |
| | Medium context RAG | 1.00 | 0.96 |
| | Long context RAG | 1.00 | 0.70 |
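The "Cost insight" rows are simple ratios: this model's $/M prompt tokens divided by the reference model's $/M prompt tokens. The snippet below reproduces that arithmetic; the $2.50/M reference price for gpt-4o-2024-08-06 is inferred from the 1.20x ratio in the table, not a figure quoted in the report.

```python
def prompt_cost_ratio(model_price_per_m: float, reference_price_per_m: float) -> float:
    """'Prompt cost compared to X' = this model's $/M prompt tokens divided by X's."""
    return model_price_per_m / reference_price_per_m


# Claude 3.5 Sonnet ($3/M prompt tokens) relative to itself -> 1.00x, as in the table.
print(f"{prompt_cost_ratio(3.0, 3.0):.2f}x")  # 1.00x
# The table's 1.20x vs gpt-4o-2024-08-06 implies a reference price near $2.50/M
# (an inferred figure, not stated in the report): 3.0 / 2.5 = 1.20x.
print(f"{prompt_cost_ratio(3.0, 2.5):.2f}x")  # 1.20x
```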
Here is the performance of models on each dataset. The datasets were selected to test different capabilities, ranging from robustness to noise to the ability to do math.