Mastering RAG: How To Evaluate LLMs For RAG

Welcome to yet another deep dive post in our Mastering RAG series!

Previously, we dove into the intricacies of building enterprise-level RAG systems, exploring RAG architecture, implementation strategies, and best practices for deploying these systems effectively. This time we aim to provide a comprehensive guide on how to assess the performance of LLMs in RAG systems, covering essential dimensions, metrics, and benchmarks. Whether you're a seasoned practitioner or new to the field, this guide will equip you with the knowledge and tools needed to ensure your RAG systems are robust, accurate, and reliable.

Short Introduction to RAG

RAG enhances the capabilities of LLMs by integrating a retrieval component. This component fetches relevant documents from a large corpus, which the LLM then uses to generate accurate and contextually appropriate responses. Unlike traditional LLMs that rely solely on pre-trained knowledge, RAG systems dynamically incorporate external information, making them more versatile and reliable. For example, a RAG system can retrieve the latest research papers for a medical query, ensuring that the response is based on the most current information available.

While fine-tuning involves adapting a pre-trained model to a specific task or domain through additional training, RAG systems leverage external databases to fetch relevant information in real-time. This approach mitigates the need for extensive fine-tuning and allows the model to access the most current information, reducing the risk of generating outdated or incorrect responses. For instance, a fine-tuned model on financial data may become outdated as time passes, whereas a RAG system can retrieve the latest financial statements to provide up-to-date answers.

Dimensions to Evaluate for RAG

When evaluating LLMs for RAG, several dimensions need to be considered to ensure comprehensive assessment:

Instruct or Chat

The first dimension involves determining whether the model is designed for instructional purposes or conversational interactions. Instructional models are typically more straightforward, focusing on providing information or performing tasks based on direct queries. Conversational models, on the other hand, need to handle multi-turn dialogues, maintain context, and provide coherent and relevant responses throughout the interaction. For example, a conversational model should be able to remember previous interactions and provide contextually relevant answers in a customer support scenario.

Context Length

The ability of a model to handle varying context lengths is crucial. Short context lengths are easier to manage but may not provide sufficient information for complex queries. Long context lengths, while more informative, pose challenges in terms of memory and processing power. Evaluating how well a model performs across different context lengths helps in understanding its robustness and scalability. For instance, a legal document may require a model to process thousands of tokens to provide a comprehensive answer.

Domain

Different domains have unique requirements and challenges. For instance, legal documents require precise language and adherence to specific terminologies, while medical texts demand high accuracy and sensitivity to context. Evaluating a model's performance across various domains ensures its versatility and reliability in real-world applications. For example, a model trained on general knowledge may not perform well in specialized domains like finance or healthcare without proper evaluation and adaptation.

Tabular Data QA

Handling tabular data is another critical dimension. Many real-world applications involve data presented in tables, requiring the model to perform operations like filtering, sorting, and numerical calculations. Evaluating the model's ability to comprehend and reason over tabular data is essential for tasks in finance, healthcare, and other data-intensive fields. For instance, answering a question about financial reports may require the model to interpret and analyze complex tables.

Robustness to Noise

Noise robustness measures the model's ability to filter out irrelevant information and focus on the pertinent details. This is particularly important in scenarios where the retrieved documents contain a mix of relevant and irrelevant information. For example, a model should be able to extract useful information from a noisy dataset containing both relevant research papers and unrelated articles.

Counterfactual Robustness

Counterfactual robustness assesses the model's ability to identify and handle incorrect or misleading information in the retrieved documents. This ensures that the model can provide accurate responses even when faced with erroneous data. For instance, a model should be able to detect and disregard incorrect information in a news article to provide a reliable answer.

Negative Rejection

Negative rejection evaluates whether the model can recognize when it does not have sufficient information to answer a query and appropriately decline to provide an answer. This is crucial for maintaining the reliability and trustworthiness of the system. For example, a model should be able to indicate that it cannot answer a question about a recent event if it does not have access to up-to-date information.

Information Integration

Information integration measures the model's ability to synthesize information from multiple documents to provide a comprehensive answer. This is particularly important for complex queries that cannot be answered by a single document. For instance, answering a question about the impact of a new law may require integrating information from multiple legal texts and expert opinions.

Information Update

Evaluating the model's ability to handle information which becomes stale with the help of fresh context is needed for generative web search. This ensures that the model can provide up-to-date and accurate responses. For example, a model should be able to handle the latest news articles to answer a question about current events accurately discarding its internal memory from pretraining.

Challenges in RAG Evaluation

Everyone wants great evals. However, the process is fraught with difficulties, including subjective biases, high costs, and technical limitations. This section delves into the key challenges faced in LLM evaluation, focusing on issues with vibe-check based evaluations, the use of LLMs as judges, and inherent biases that can skew results.

Issues with Vibe-Check Based Evaluation

The "vibe check" approach to evaluating LLMs involves subjective human judgments, often through A/B testing with crowd workers. While this method simulates real-world interactions and helps rank models based on usefulness and harmlessness, it has serious limitations.

Expensive: Conducting human evaluation is expensive and time-consuming, requiring coordination with annotators, custom web interfaces, detailed annotation instructions, data analysis, and considerations for employing crowdworkers.

The recent paper - Human Feedback is not Gold Standard - highlights limitations and potential biases in using human preference scores for LLM evaluation and training, calling for more nuanced and objective approaches to assess model performance.

Impact of Confounders: The study explores the influence of two confounding factors — assertiveness and complexity — on human evaluations. They use instruction-tuned models to generate outputs varying in these dimensions and find that more assertive outputs tend to be perceived as more factually accurate, regardless of their actual factual content. This suggests that human evaluators might be misled by the confidence with which information is presented.

Subjectivity and Bias in Preference Scores: The authors hypothesize that preference scores, which are used to rate the quality of LLM outputs, are subjective and prone to biases. This implies that what one person prefers might not be universally agreed upon and could introduce unintended biases into the evaluation process.

Coverage of Crucial Error Criteria: While preference scores generally cover a wide range of evaluation criteria, they tend to under-represent certain critical aspects, notably factual accuracy. This means that models might be rated favorably even if they produce factually incorrect information as long as human evaluators prefer the output.

Assertiveness in Model Outputs: The authors present preliminary evidence that training models using human feedback can disproportionately increase the assertiveness of their outputs. This raises concerns about the potential for models to become overly confident in their responses, which could further mislead users!

Issues with LLM-as-a-Judge Approaches

LLMs can be used to evaluate other LLMs, but they come with their own set of challenges.

Correlation with human judgments: LLMs does not always correlate with human judgments, limiting their effectiveness in real-world scenarios.

Affordability: Proprietary models are not always affordable, making them impractical to integrate into evaluation pipelines.

Latency: LLMs can have high response times, which might be impractical for large-scale or real-time evaluation needs.

Compliance: Proprietary LLMs often lack transparency about their training data, raising concerns about fairness and compliance.

Biases in LLM-as-a-Judge Approach

Biases in LLMs can have significant implications, affecting how they generate responses and evaluate text. Let's delve into some of these biases based on the provided notes:

Nepotism Bias

LLM evaluators inherently favor text generated by themselves.

Paper - LLM Evaluators Recognize and Favor Their Own Generations

Fallacy Oversight Bias

This bias entails overlooking logical fallacies within arguments. LLMs might accept conclusions without critically evaluating the evidence, potentially propagating flawed or misleading information.

Paper - Humans or LLMs as the Judge? A Study on Judgement Biases

Authority Bias

Attributing greater credibility to statements from perceived authorities, irrespective of the evidence, characterizes this bias. LLMs might uncritically accept expert opinions without adequate scrutiny.

Paper - Humans or LLMs as the Judge? A Study on Judgement Biases

Beauty Bias

LLMs might favor aesthetically pleasing text, potentially overlooking the accuracy or reliability of the content.

Paper - Humans or LLMs as the Judge? A Study on Judgement Biases

Verbosity Bias

LLMs might equate quantity of information with quality, potentially prioritizing verbose text over succinct and accurate content.

Paper - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Positional Bias

LLMs might exhibit a bias towards information placement(beginning or end of a document may be deemed more important), potentially impacting text interpretation.

Paper - Large Language Models are not Fair Evaluators

Attention Bias (for lengthy text)

LLMs can sometimes miss contextual information present in the middle of the lengthy text. This bias suggests that the model may focus more on the beginning and end of passages, potentially leading to incomplete understanding or interpretation of the text.

Paper - Lost in the Middle: How Language Models Use Long Contexts

Sycophancy

LLM assistants tend to agree with the user even when the user is mistaken. They can change their correct answer to incorrect if they are asked, “Are you sure?”

Paper - Towards Understanding Sycophancy in Language Models

Metrics for Evaluating LLM in RAG

RAG is a mature topic now and several metrics have been developed to evaluate RAG systems comprehensively. Let's have a look at some of the most used metrics in the industry.

RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems. Faithfulness in RAGAS refers to the extent to which the generated answer can be inferred from the retrieved context. It uses a two step process to compute Faithfulness.

1. Breakdown: The LLM breaks the response into smaller statements. Here, the LLM only sees the response, not the context.

2. Verification: The LLM then checks each statement against the context to see if they match. In this step, the LLM sees the context but not the original response, only the statements from step 1.

The scores for each statement (0 for inconsistent, 1 for consistent) are averaged to produce a final score. This approach has many issues, which we will discuss in a later section.

Trulens

TruLens offers a Groundedness which is similar to our Context Adherence and RAGAS Faithfulness. It evaluates whether a response is consistent with the provided context.

TruLens Groundedness works as follows:

The response is split up into sentences.
An LLM is given the list of sentences, along with the context. It is asked to:
quote the part of the context (if any) that supports the sentence
rate the "information overlap" between each sentence and the context on a 0-to-10 scale.
The scores are mapped to a range from 0 to 1, and averaged to produce an overall score.

We've observed several failure modes in this procedure which will be covered in a later section.

ChainPoll - Context Adherence

ChainPoll based Context Adherence is a novel approach to hallucination detection that provides an 85% correlation with human feedback. It outperforms other methods like SelfCheckGPT, GPTScore, G-Eval, and TRUE across various benchmark tasks. ChainPoll is faster, more cost-effective, and provides human-readable verbal justifications for its judgments.

It combines two core ideas: Chain-of-Thought (CoT) prompting and polling the model multiple times. Here's a brief overview of what ChainPoll is and how it works.

ChainPoll = Chain + Poll

- Chain: Chain-of-Thought (CoT) prompting

- Poll: Prompting an LLM multiple times

Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting asks the LLM to explain its reasoning step-by-step before giving the final answer. It works because it mimics how humans solve complex problems. When faced with a tough question, we often think out loud or go through the steps before arriving at an answer. CoT allows the LLM to do the same, improving its accuracy

Polling

Polling involves asking the LLM the same question multiple times and then aggregating the responses. This helps to identify and filter out random errors or hallucinations that the model might generate.

If you ask an LLM the same question multiple times, you might get a mix of correct and incorrect answers. However, correct answers tend to cluster around the right solution, while incorrect ones are more random. By averaging the responses, ChainPoll can highlight the most reliable answer, making the evaluation more robust.

Self-Consistency Vs ChainPoll

ChainPoll is similar to the self-consistency method, which uses majority voting to pick the most common answer. However, ChainPoll goes a step further by averaging the answers to provide a score reflecting the model's certainty level.

Example:

- If you ask whether a response is consistent with a set of documents, you might get:

- Yes (supported)

- No (not supported)

- ChainPoll averages these responses to give a score of 0.67, indicating that the answer is likely but not certain.

This nuanced scoring helps capture the model's confidence level, providing more detailed insights.

Efficiency and Cost

ChainPoll involves requesting multiple responses, which might seem slow and expensive. However, we use batch requests to LLM APIs to generate responses more efficiently. For example, with the OpenAI API, a batch request for three responses from the same prompt is billed for all output tokens but only counts the input tokens once. This reduces the cost significantly, making ChainPoll both fast and cost-effective.

LLMs Used with ChainPoll

By default, we use OpenAI's GPT-4o-mini. While it may be less accurate than GPT-4o, it's faster and cheaper. Using ChainPoll with GPT-4o-mini closes much of the accuracy gap while keeping costs low. However, while evaluating a powerful model like Claude 3.5 Sonnet, it becomes necessary to use a more powerful LLM like GPT-4o for evaluation with ChainPoll.

For a deeper, more technical look at the research behind ChainPoll Context Adherence, check out our paper - ChainPoll: A High-Efficacy Method for LLM Hallucination Detection.

Luna

Galileo Luna is a family of Evaluation Foundation Models (EFM) fine-tuned specifically for hallucination detection in RAG settings. Luna not only outperforms GPT and commercial evaluation frameworks but also significantly reduces cost and latency, making it an ideal candidate for industry LLM applications.

Luna excels on the RAGTruth dataset and shows excellent generalization capabilities. Luna's lightweight nature, combined with significant gains in cost and inference speed, makes it a highly efficient solution for industry applications.

Generalization: Luna outperforms RAGAS & GPT across different industry verticals.

Cost and Latency: Luna achieves a 97% reduction in cost and a 96% reduction in latency compared to GPT-3.5-based approaches.

Multi-task Training

Luna EFMs conduct multiple evaluations using a single input, thanks to multi-task training. This allows EFMs to share granular insights and predictions, leading to more robust and accurate evaluations.

Data Augmentation

Each Luna EFM is trained on large, high-quality datasets spanning various industries and use cases. We enrich our training dataset with synthetic data and data augmentations to improve domain coverage and generalization.

Latency Optimizations

Luna is optimized to process up to 16k input tokens in under one second on an NVIDIA L4 GPU. This is achieved through deploying an ONNX-traced model on an NVIDIA Triton server with a TensorRT backend, leveraging Triton’s Business Logic Scripting (BLS) for efficient resource allocation.

Long Context Support

Luna effectively detects hallucinations in long RAG contexts. Luna is optimized to process up to 128k tokens within milliseconds on general-purpose GPUs like A10G.

RAG Metric Comparison

ChainPoll vs. RAGAS Faithfulness

RAGAS uses a Faithfulness score similar to Galileo's Context Adherence score. Both scores aim to check if a response matches the information in a given context. RAGAS breaks a response into statements, validates each in isolation and gives a final score. This method can fail in several ways that ChainPoll avoids.

Handling Refusals

RAGAS doesn't handle refusal answers well. Sometimes, an LLM will say, "I don't know" or "Sorry, that wasn't mentioned in the context." RAGAS always assigns these answers a score of 0, which is unhelpful. If the information isn't in the context, it's better for the LLM to say so rather than make something up. ChainPoll handles these cases gracefully, checking if the refusal is consistent with the context.

For example, if the LLM responds, "The provided context does not contain information about where the email was published. Therefore, it is not possible to determine where the email was published based on the given passages," Galileo's Context Adherence would score this as 1, with an explanation that the context did not contain the necessary information.

Lack of Explanations

RAGAS generates internal explanations but doesn't show them to the user. These explanations are often brief and less informative than those from ChainPoll. ChainPoll provides detailed, step-by-step explanations, making it easier to understand why a response was scored a certain way.

Misleading Statement Breakdowns

Breaking a response into separate statements can ignore how different parts of the response are related. For example, consider a dataset related to Covid-19. An LLM was asked, "What important risk factors to infection were found during the second case-controlled study?" It responded with, "The important risk factors to infection found during the second case-controlled study were hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use." This response was incorrect because, while these factors were mentioned in the documents, they were not identified as risk factors in the second case-controlled study.

Galileo's Context Adherence caught this error and gave the response a score of 0. Here’s the explanation:

"The response claims that the important risk factors to infection found during the second case-controlled study were hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use. To verify this claim, we need to check the specific risk factors mentioned in the second case-controlled study document. However, the second case-controlled study document does not provide specific risk factors related to infection. It mainly focuses on data collection and limitations of the study. Therefore, the claim that hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use were important risk factors found during the study is not supported by the documents."

RAGAS, however, assigned this response a perfect score of 1.0. The breakdown into statements missed the fact that these risk factors were not identified in the second case-controlled study. Here are the four statements RAGAS generated from the response, along with its reasoning:

Statement 1/4
The second case-controlled study identified several important risk factors to infection.

Reasoning
The passage mentions that a case-control study was conducted to identify risk factors for multi-drug resistant infection in the pediatric intensive care unit (PICU).

Verdict
1 (Consistent)

---

Statement 2/4
These risk factors include hospitalization in the preceding 90 days.

Reasoning
The passage states that hospitalization in the preceding 90 days was a risk factor for infection with a resistant pathogen.

Verdict
1 (Consistent)

---

Statement 3/4
Residency in a nursing home was also found to be a significant risk factor.

Reasoning
The passage mentions that residency in a nursing home was an independent predictor of infection with a resistant pathogen.

Verdict
1 (Consistent)

---

Statement 4/4
Additionally, antibiotic use was identified as an important risk factor.

Reasoning
The passage states that antibiotic use was one of the main contents collected and analyzed in the study.

Verdict
1 (Consistent)

When RAGAS broke down the response into statements, it omitted key information that made the answer inconsistent. Some of the statements are about the second case-controlled study, and some are about risk factors. Taken in isolation, each of these statements is arguably true. But none of them captures the claim that the original LLM got wrong: that these risk factors were identified, not just in any study, but in the second case-controlled study.

ChainPoll allows the LLM to assess the entire input at once and come to a holistic judgment of it. By contrast, RAGAS fragments its reasoning into a sequence of disconnected steps, performed in isolation and without access to complete information. This causes RAGAS to miss subtle or complex errors, like the one in the example above. But, given the increasing intelligence of today's LLMs, subtle and complex errors are precisely the ones you need to be worried about.

ChainPoll vs. TruLens

TruLens offers a Groundedness score, similar to Galileo's Context Adherence and RAGAS Faithfulness. However, there are significant differences in how these scores are calculated.

No Chain-of-Thought Reasoning

TruLens doesn't use chain-of-thought reasoning. Instead, it asks the LLM to quote parts of the context that support each sentence and rate the "information overlap" on a 0-to-10 scale. This approach can lead to errors. For example, in the same Covid-19 study scenario, TruLens might quote a passage mentioning the risk factors but fail to check if they were identified in the second case-controlled study.

Inconsistent Grading System

TruLens uses a 0-to-10 rating scale without clear guidelines. This can lead to inconsistent scores. For example, the LLM might rate the same evidence as 8/10 one time and 7/10 another time, even if the evidence hasn't changed. Galileo has found that LLMs produce more reliable results when asked for a simple ‘Yes or No’ answer.

Formatting Issues

TruLens can get confused by formatting. For example, if the context includes multiple paragraphs separated by line breaks, TruLens might generate malformed output.

Here’s an example where the response was malformed:

Bacillus subtilis can be used as an oral vaccine by expressing the COE antigen of PEDV and delivering it to the GI tract to stimulate mucosal immune responses. It can protect antigens from the harsh digestive environment and enhance immune regulation and gut health in newborn piglets. Additionally, it has adjuvant activity and can deliver heterologous antigens to the GI tract, providing additional immunity stimulation

There are three sentences here. TruLens is supposed to assign a score to each one, together with a supporting quotation. But here is what it actually produced:

- Statement 0: "Bacillus subtilis can be used as an oral vaccine by expressing the COE antigen of PEDV and delivering it to the GI tract to stimulate mucosal immune responses." (Score: 0)

- Statement 1: "Oral administration can protect the gut and stimulate the common mucosal immune system." (Score: 10)

- Statement 2: "Orally administered B. subtilis could enhance immune regulation and gut health in pigs." (Score: 7)

- Statement 3: "B. subtilis is widely used as a vaccine delivery system for its unique characteristics." (Score: 10)

STATEMENT 0:
Statement Sentence: Bacillus subtilis can be used as an oral vaccine by expressing the COE antigen of PEDV and delivering it to the GI tract to stimulate mucosal immune responses. 
Supporting Evidence: NOTHING FOUND 
Score: 0


STATEMENT 1:
Statement Sentence: Oral administration can protect the gut and stimulate the common mucosal immune system. 
Supporting Evidence: As a superior way of mucosal immunization, oral administration can protect the gut and stimulate the common mucosal immune system. 
Score: 10 

Statement Sentence: Oral immunization often delivers large amounts of antigens to prevent diarrheal diseases. 
Supporting Evidence: Oral immunization often delivers large amounts of antigens to prevent the diarrheal diseases. 
Score: 10 

Statement Sentence: Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Supporting Evidence: Currently, Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Score: 10 

Statement Sentence: Orally administered Bacillus subtilis can enhance immune regulation and gut health in pigs. 
Supporting Evidence: Moreover, oral administration of B. subtilis could elicit humoral and cellular immune responses to the maintenance of gut homeostasis by dendritic cells. 
Score: 5


STATEMENT 2:
Statement Sentence: Oral administration can protect the gut and stimulate the common mucosal immune system. 
Supporting Evidence: As a superior way of mucosal immunization, oral administration can protect the gut and stimulate the common mucosal immune system. 
Score: 10 

Statement Sentence: Orally administered B. subtilis could enhance immune regulation and gut health in pigs. 
Supporting Evidence: Moreover, oral administration of B. subtilis could elicit humoral and cellular immune responses to the maintenance of gut homeostasis by dendritic cells. 
Score: 7 

Statement Sentence: B. subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Supporting Evidence: Currently, Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Score: 10

After the first statement, things go off the rails. The sentences listed under "Statement 1" and "Statement 2" don't appear in the response at all. And, nonsensically, the LLM has written multiple "Statement Sentences" under each of the "STATEMENT" headings.

In a case like this, the TruLens codebase assumes that each STATEMENT heading only has one score under it, and ends up picking the first one listed. Here, it ended up with the scores [0, 10, 10] for the three statements. But the latter two scores are nonsense—they're not about sentences from the response at all.

We tracked this issue down to formatting. Our context included multiple paragraphs and documents, which were separated by line breaks. It turns out that TruLens' prompt format also uses line breaks to delimit sections of the prompt. Apparently, the LLM became confused by which line breaks meant what. Replacing line breaks with spaces fixed the problem in this case. But you shouldn't have to worry about this kind of thing at all. Line breaks are not an exotic edge case, after all.

The prompt formats we use for Galileo ChainPoll based metrics involve a more robust delimiting strategy, including reformatting your output in some cases if needed. This prevents issues like this from arising with ChainPoll.

RAG Benchmarks

Several benchmarks have been developed to evaluate and rank LLMs used in RAG systems.

ChatRAG-Bench

ChatRAG-Bench is a collection of datasets designed to evaluate the model's capability in conversational QA and RAG. It covers a wide range of documents and question types, requiring models to generate responses from retrieved context, comprehend and reason over tables, conduct arithmetic calculations, and indicate when questions cannot be answered within the context.

Long Document Datasets

ChatRAG-Bench includes long document datasets that cannot fit into LLMs with a sequence length of 4K or 8K tokens. Examples include:

Doc2Dial (D2D): A document-grounded conversational QA dataset covering domains like DMV, SSA, VA, and Student Aid.

QuAC: Based on Wikipedia documents, containing unanswerable cases.

QReCC: An open-domain conversational QA dataset across multiple sources.

TopiOCQA (TCQA): Requires the agent to search the entire Wikipedia for answers.

INSCIT: Studies cases where user questions are under-specified and require clarification.

For Doc2Dial, QuAC, and QReCC, documents are segmented into 300-word chunks, and the top-5 relevant chunks are retrieved as context for each user question. For TopiOCQA and INSCIT, smaller chunks are used, and the top-20 chunks are retrieved to provide similar context length.

Short Document Datasets

ChatRAG-Bench also includes short document datasets that can fit into LLMs with a sequence length of 4K tokens. Examples include:

CoQA: A conversational QA dataset with short passages covering various domains.

DoQA: Covers domains like cooking, travel, and movies from Stack Exchange forums and contains unanswerable cases.

ConvFinQA (CFQA): Based on financial reports, involving arithmetic calculations.

SQA: Grounded on documents containing only a single table from Wikipedia.

HybriDial (HDial): Contains both Wikipedia tabular and textual data.

Results

This benchmark evaluation assesses various LLMs (Large Language Models) for their performance on ChatRAG-Bench, a suite of task-specific metrics typically used to evaluate Retrieval-Augmented Generation (RAG) performance. The table provides mean scores across different datasets or tasks for several models. Here’s a detailed commentary on the results:

Benchmark Overview

The table includes nine different models: ChatQA-1.0-7 B, Command-R-Plus, Llama3-instruct-70 b, GPT-4-0613, GPT-4-Turbo, ChatQA-1.0-70 B, ChatQA-1.5-8 B, and ChatQA-1.5-70B.

Top Performers

ChatQA-1.5-70B consistently scores among the highest across most tasks, indicating strong performance. Particularly notable scores include:

- ConvFinQA: 81.88

- SQA: 83.82

GPT-4-Turbo also shows impressive results for ConvFinQA (84.16) and DoQA (51.94)

Average Performance

The overall averages indicate that ChatQA-1.5-70B (58.25) performs better than the other models, followed by ChatQA-1.5-8B (55.17) and GPT-4-Turbo (53.99). Excluding the HybriDial dataset, the trend remains consistent with ChatQA-1.5-70B leading (57.14), demonstrating its robustness even without the influence of particular datasets.

Specific Observations

For the SQA dataset, the highest scores are for ChatQA-1.5-70B (83.82) and ChatQA-1.5-8B (73.28), indicating strong model performance in structured querying tasks. In the HybriDial task, high scores for GPT-4-Turbo (56.44) and ChatQA models suggest their effectiveness in handling dialogue contexts requiring both retrieval and generation. The INSCIT dataset shows lower scores across the board, with the maximum being 36.34 from GPT-4-0613, indicating room for improvement in information synthesis tasks.

Insights on Dataset Performance

ConvFinQA emerges as a task where several models, notably GPT-4-Turbo and ChatQA-1.5-70B, excel, reaching upwards of 81. Conversely, datasets like INSCIT and QuAC have generally seen lower performance, suggesting they are more challenging and could be focal points for improving model capability.

Evaluation

ChatRAG-Bench evaluation dataset using the following metrics as per their ChatQA paper.

F1-Score: The F1 score is calculated based on token matching with the ground truth answer. It measures the overlap between the model's generated answer and the correct answer at the token level. F1 is the harmonic mean of precision and recall. The F1 score is used as an evaluation metric on all datasets except ConvfinQA.

Exact Match Accuracy: This metric measures the percentage of model responses that perfectly match the ground truth answer, word for word. Exact Match Accuracy as evaluation metric for ConvFinQA dataset.

Although a widely accepted metric, the F1 score used for ChatRAG-Bench has limitations—especially in capturing the semantic nuances and contexts of generated responses.

Sentence Variability: Different answers like "No, she was not happy," "No, she was sad," will receive different F1 scores despite the same meaning.

Partial Matches: Answers that do not perfectly align with the ground truth, such as "She used a can of orange paint to paint herself orange" versus "she painted herself," will score less than 1 despite their relevance, highlighting the token-level matching imperfection.

These issues underscore the need for a more holistic and context-aware metric to better gauge model competence.

Hallucination Index

The Hallucination Index is an ongoing initiative to evaluate and rank the largest and most popular LLMs based on their propensity to hallucinate in RAG tasks.

Methodology

The evaluation process challenges the models' abilities to stay on task and provide accurate, contextually relevant responses. It evaluates the most popular LLMs available today. These models are selected based on surveys of popular LLM repositories, leaderboards, and industry surveys.

LLMs are tested across three common task types – short, medium, and long context lengths, each presenting unique challenges and benefits.

Short Context RAG

For short context lengths, the evaluation utilizes a variety of demanding datasets to test the robustness of models. One key methodology is ChainPoll with GPT-4o, which leverages the strong reasoning power of GPT series models. By using a chain of thought technique to poll the model multiple times, ChainPoll quantifies potential hallucinations along with explanations.

DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs.

Microsoft MS Macro: Contains queries and paragraphs with relevance labels.

HotpotQA: Requires finding and reasoning over multiple supporting documents.

ConvFinQA: Focuses on numerical reasoning in conversational QA.

Medium and Long Context RAG

The methodology for medium and long context lengths focuses on the models' ability to comprehensively understand extensive texts. Text is extracted from recent documents, divided into chunks, and one chunk is designated as the "needle chunk." Retrieval questions are constructed to test the model's ability to find and use the needle chunk within varying context lengths:

Medium context lengths: 5k, 10k, 15k, 20k, 25k tokens.

Long context lengths: 40k, 60k, 80k, 100k tokens.

The task design ensures that all text in the context is from a single domain, responses are correct with short context, and questions cannot be answered from pre-training memory. This approach measures the influence of the needle chunk's position and avoids standard datasets to prevent test leakage.

Prompting Techniques

The evaluation also experiments with a prompting technique known as Chain-of-Note, which has shown promise in improving performance for short contexts. This technique is tested to see if it similarly benefits medium and long contexts.

Evaluation

Context Adherence evaluates the degree to which a model's response aligns with the given context, serving as a metric to gauge closed-domain hallucinations. A higher Context Adherence score indicates that the response contains only information from the provided context, while a lower score suggests the inclusion of information not in the context.

The Context Adherence score is calculated using ChainPoll which leverages the strong reasoning abilities of LLMs to assess the accuracy of responses. This approach not only quantifies potential hallucinations but also provides context-based explanations. It works for both answerable and unanswerable cases.

The final score is calculated as the mean of the scores for each task dataset. The dataset score is the mean of the ChainPoll score for each sample.

Limitations of Benchmarks

While each benchmark provides valuable insights, they also have their shortcomings.

ChatRAG-Bench

Conversational Focus: This benchmark primarily evaluates conversational RAG, which may not fully capture the model's performance in single-turn RAG scenarios.

Short Contexts: The evaluation is limited to context length of 10k, potentially overlooking the model's capabilities with longer contexts.

Evaluation Metrics: It uses F1 or exact match metrics for automated evaluation, which can sometimes over-penalize or under-penalize the model, leading to inaccurate scores.

Hallucination Index

Single-Turn Focus: This benchmark focuses on single-turn RAG capabilities, leaving its performance in multi-turn chat scenarios uncertain.

Negative Rejection: It does not account for the model's ability to reject incorrect answers when no relevant information is available.

General Limitations

Domain Coverage: Both benchmarks cover only a few domains, which may not accurately represent performance in specialized fields like medical or legal domains.

Dataset Leakage: Both benchmarks are built on open datasets, which can lead to data leakage during the training of the models being evaluated.

By understanding these limitations, users can better interpret the results of these benchmarks and make more informed decisions about the performance and reliability of their LLMs in various RAG scenarios.

Evaluate LLMs for RAG

Now that we understand the challenges of evaluating RAG systems, lets look at the basic steps involved in evaluation.

Load the embedding model
Create a new vector index
Vectorise chunks and add to the index
Load the chain
Prepare Galileo callback with metrics
Run the chain with questions to generate the answer
Sync data to the Galileo console

Have a look at our end-to-end examples for fixing common challenges of building RAG:

Best LLMs for RAG

Before we conclude, I'd like to share a project we've been working on for some time. As an evaluation company, we're frequently asked by our customers which LLM is best for RAG. However, this question doesn't have a simple answer.

Barely a day goes by without a new LLM being released. While private models continue to improve, enterprises are increasingly curious about whether open-source alternatives have caught up; specifically, they want to know if open-source models are robust enough to handle production-level RAG tasks.

Another growing concern in RAG is the significant cost of vector databases and whether it can be eliminated with long context RAG. To make your selection process easier, we’ve analyzed the top LLMs for RAG tasks, both open and closed source, with insights into their performance across different context lengths.

How Should an Enterprise Evaluate RAG?

This brings us to the final section, where we want to share a brief plan for evaluating enterprises' RAG systems.

Defining Clear Objectives: Understand the specific requirements and objectives of the RAG system within your enterprise. For example, a healthcare organization may prioritize accuracy and up-to-date information in medical queries.

Selecting Appropriate Benchmarks: Choose benchmarks that align with your objectives and cover the relevant domains and task types. For instance, a financial institution may benefit from using benchmarks that include financial reports and numerical reasoning tasks.

Comprehensive Testing: Conduct thorough testing across various dimensions, including context length, domain, and performance metrics. Use a combination of benchmarks like ChatRAG-Bench and Hallucination Index to get a holistic view of the model's capabilities.

Continuous Monitoring and Updating: Regularly monitor the performance of the RAG system and update it as needed to ensure it remains accurate and reliable. Implement a feedback loop to incorporate new data and improve the system continuously.

Evaluation Tooling: Utilize specialized tools and frameworks to automate and streamline the evaluation process. High fidelity metrics like ChainPoll can help detect hallucinations and improve answer accuracy. These tools can save time and provide more consistent and objective evaluations compared to manual assessments.

Human Evaluation: Incorporate human evaluations to assess the quality of the generated responses and identify areas for improvement. Human evaluators can provide insights into the nuances of the responses that automated metrics might miss. For instance, in legal or medical domains, experts can evaluate the relevance and accuracy of the information provided by the model.

Scenario-Based Testing: Create specific scenarios that reflect real-world use cases within your enterprise. For example, a customer support system might be tested on its ability to handle a variety of customer queries, while a research assistant might be evaluated on its ability to retrieve and summarize the latest research papers. Scenario-based testing ensures that the RAG system performs well in practical applications.

Domain-Specific Customization: Customize the evaluation process to account for the unique challenges and requirements of your domain. This might involve creating custom datasets, developing specialized metrics, or tailoring benchmarks to better reflect the specific tasks and information needs of your enterprise.

Error Analysis: Conduct detailed error analysis to understand the types of mistakes the RAG system makes. This involves categorizing errors, identifying common failure points, and analyzing why these errors occur. By understanding the root causes of errors, you can make targeted improvements to the system, enhancing its overall performance and reliability.

Conclusion

By considering various dimensions of RAG evaluation and leveraging comprehensive metrics, enterprises can build reliable RAG systems. This not only enhances the system's performance but also builds trust and confidence among users. Have a great time building RAG!

Sign up for your free Galileo account today, or continue your Mastering RAG journey with our free, comprehensive eBook.

Welcome to yet another deep dive post in our Mastering RAG series!

Previously, we dove into the intricacies of building enterprise-level RAG systems, exploring RAG architecture, implementation strategies, and best practices for deploying these systems effectively. This time we aim to provide a comprehensive guide on how to assess the performance of LLMs in RAG systems, covering essential dimensions, metrics, and benchmarks. Whether you're a seasoned practitioner or new to the field, this guide will equip you with the knowledge and tools needed to ensure your RAG systems are robust, accurate, and reliable.

Short Introduction to RAG

RAG enhances the capabilities of LLMs by integrating a retrieval component. This component fetches relevant documents from a large corpus, which the LLM then uses to generate accurate and contextually appropriate responses. Unlike traditional LLMs that rely solely on pre-trained knowledge, RAG systems dynamically incorporate external information, making them more versatile and reliable. For example, a RAG system can retrieve the latest research papers for a medical query, ensuring that the response is based on the most current information available.

While fine-tuning involves adapting a pre-trained model to a specific task or domain through additional training, RAG systems leverage external databases to fetch relevant information in real-time. This approach mitigates the need for extensive fine-tuning and allows the model to access the most current information, reducing the risk of generating outdated or incorrect responses. For instance, a fine-tuned model on financial data may become outdated as time passes, whereas a RAG system can retrieve the latest financial statements to provide up-to-date answers.

Dimensions to Evaluate for RAG

When evaluating LLMs for RAG, several dimensions need to be considered to ensure comprehensive assessment:

Instruct or Chat

The first dimension involves determining whether the model is designed for instructional purposes or conversational interactions. Instructional models are typically more straightforward, focusing on providing information or performing tasks based on direct queries. Conversational models, on the other hand, need to handle multi-turn dialogues, maintain context, and provide coherent and relevant responses throughout the interaction. For example, a conversational model should be able to remember previous interactions and provide contextually relevant answers in a customer support scenario.

Context Length

The ability of a model to handle varying context lengths is crucial. Short context lengths are easier to manage but may not provide sufficient information for complex queries. Long context lengths, while more informative, pose challenges in terms of memory and processing power. Evaluating how well a model performs across different context lengths helps in understanding its robustness and scalability. For instance, a legal document may require a model to process thousands of tokens to provide a comprehensive answer.

Domain

Different domains have unique requirements and challenges. For instance, legal documents require precise language and adherence to specific terminologies, while medical texts demand high accuracy and sensitivity to context. Evaluating a model's performance across various domains ensures its versatility and reliability in real-world applications. For example, a model trained on general knowledge may not perform well in specialized domains like finance or healthcare without proper evaluation and adaptation.

Tabular Data QA

Handling tabular data is another critical dimension. Many real-world applications involve data presented in tables, requiring the model to perform operations like filtering, sorting, and numerical calculations. Evaluating the model's ability to comprehend and reason over tabular data is essential for tasks in finance, healthcare, and other data-intensive fields. For instance, answering a question about financial reports may require the model to interpret and analyze complex tables.

Robustness to Noise

Noise robustness measures the model's ability to filter out irrelevant information and focus on the pertinent details. This is particularly important in scenarios where the retrieved documents contain a mix of relevant and irrelevant information. For example, a model should be able to extract useful information from a noisy dataset containing both relevant research papers and unrelated articles.

Counterfactual Robustness

Counterfactual robustness assesses the model's ability to identify and handle incorrect or misleading information in the retrieved documents. This ensures that the model can provide accurate responses even when faced with erroneous data. For instance, a model should be able to detect and disregard incorrect information in a news article to provide a reliable answer.

Negative Rejection

Negative rejection evaluates whether the model can recognize when it does not have sufficient information to answer a query and appropriately decline to provide an answer. This is crucial for maintaining the reliability and trustworthiness of the system. For example, a model should be able to indicate that it cannot answer a question about a recent event if it does not have access to up-to-date information.

Information Integration

Information integration measures the model's ability to synthesize information from multiple documents to provide a comprehensive answer. This is particularly important for complex queries that cannot be answered by a single document. For instance, answering a question about the impact of a new law may require integrating information from multiple legal texts and expert opinions.

Information Update

Evaluating the model's ability to handle information which becomes stale with the help of fresh context is needed for generative web search. This ensures that the model can provide up-to-date and accurate responses. For example, a model should be able to handle the latest news articles to answer a question about current events accurately discarding its internal memory from pretraining.

Challenges in RAG Evaluation

Everyone wants great evals. However, the process is fraught with difficulties, including subjective biases, high costs, and technical limitations. This section delves into the key challenges faced in LLM evaluation, focusing on issues with vibe-check based evaluations, the use of LLMs as judges, and inherent biases that can skew results.

Issues with Vibe-Check Based Evaluation

The "vibe check" approach to evaluating LLMs involves subjective human judgments, often through A/B testing with crowd workers. While this method simulates real-world interactions and helps rank models based on usefulness and harmlessness, it has serious limitations.

Expensive: Conducting human evaluation is expensive and time-consuming, requiring coordination with annotators, custom web interfaces, detailed annotation instructions, data analysis, and considerations for employing crowdworkers.

The recent paper - Human Feedback is not Gold Standard - highlights limitations and potential biases in using human preference scores for LLM evaluation and training, calling for more nuanced and objective approaches to assess model performance.

Impact of Confounders: The study explores the influence of two confounding factors — assertiveness and complexity — on human evaluations. They use instruction-tuned models to generate outputs varying in these dimensions and find that more assertive outputs tend to be perceived as more factually accurate, regardless of their actual factual content. This suggests that human evaluators might be misled by the confidence with which information is presented.

Subjectivity and Bias in Preference Scores: The authors hypothesize that preference scores, which are used to rate the quality of LLM outputs, are subjective and prone to biases. This implies that what one person prefers might not be universally agreed upon and could introduce unintended biases into the evaluation process.

Coverage of Crucial Error Criteria: While preference scores generally cover a wide range of evaluation criteria, they tend to under-represent certain critical aspects, notably factual accuracy. This means that models might be rated favorably even if they produce factually incorrect information as long as human evaluators prefer the output.

Assertiveness in Model Outputs: The authors present preliminary evidence that training models using human feedback can disproportionately increase the assertiveness of their outputs. This raises concerns about the potential for models to become overly confident in their responses, which could further mislead users!

Issues with LLM-as-a-Judge Approaches

LLMs can be used to evaluate other LLMs, but they come with their own set of challenges.

Correlation with human judgments: LLMs does not always correlate with human judgments, limiting their effectiveness in real-world scenarios.

Affordability: Proprietary models are not always affordable, making them impractical to integrate into evaluation pipelines.

Latency: LLMs can have high response times, which might be impractical for large-scale or real-time evaluation needs.

Compliance: Proprietary LLMs often lack transparency about their training data, raising concerns about fairness and compliance.

Biases in LLM-as-a-Judge Approach

Biases in LLMs can have significant implications, affecting how they generate responses and evaluate text. Let's delve into some of these biases based on the provided notes:

Nepotism Bias

LLM evaluators inherently favor text generated by themselves.

Paper - LLM Evaluators Recognize and Favor Their Own Generations

Fallacy Oversight Bias

This bias entails overlooking logical fallacies within arguments. LLMs might accept conclusions without critically evaluating the evidence, potentially propagating flawed or misleading information.

Paper - Humans or LLMs as the Judge? A Study on Judgement Biases

Authority Bias

Attributing greater credibility to statements from perceived authorities, irrespective of the evidence, characterizes this bias. LLMs might uncritically accept expert opinions without adequate scrutiny.

Paper - Humans or LLMs as the Judge? A Study on Judgement Biases

Beauty Bias

LLMs might favor aesthetically pleasing text, potentially overlooking the accuracy or reliability of the content.

Paper - Humans or LLMs as the Judge? A Study on Judgement Biases

Verbosity Bias

LLMs might equate quantity of information with quality, potentially prioritizing verbose text over succinct and accurate content.

Paper - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Positional Bias

LLMs might exhibit a bias towards information placement(beginning or end of a document may be deemed more important), potentially impacting text interpretation.

Paper - Large Language Models are not Fair Evaluators

Attention Bias (for lengthy text)

LLMs can sometimes miss contextual information present in the middle of the lengthy text. This bias suggests that the model may focus more on the beginning and end of passages, potentially leading to incomplete understanding or interpretation of the text.

Paper - Lost in the Middle: How Language Models Use Long Contexts

Sycophancy

LLM assistants tend to agree with the user even when the user is mistaken. They can change their correct answer to incorrect if they are asked, “Are you sure?”

Paper - Towards Understanding Sycophancy in Language Models

Metrics for Evaluating LLM in RAG

RAG is a mature topic now and several metrics have been developed to evaluate RAG systems comprehensively. Let's have a look at some of the most used metrics in the industry.

RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems. Faithfulness in RAGAS refers to the extent to which the generated answer can be inferred from the retrieved context. It uses a two step process to compute Faithfulness.

1. Breakdown: The LLM breaks the response into smaller statements. Here, the LLM only sees the response, not the context.

2. Verification: The LLM then checks each statement against the context to see if they match. In this step, the LLM sees the context but not the original response, only the statements from step 1.

The scores for each statement (0 for inconsistent, 1 for consistent) are averaged to produce a final score. This approach has many issues, which we will discuss in a later section.

Trulens

TruLens offers a Groundedness which is similar to our Context Adherence and RAGAS Faithfulness. It evaluates whether a response is consistent with the provided context.

TruLens Groundedness works as follows:

The response is split up into sentences.
An LLM is given the list of sentences, along with the context. It is asked to:
quote the part of the context (if any) that supports the sentence
rate the "information overlap" between each sentence and the context on a 0-to-10 scale.
The scores are mapped to a range from 0 to 1, and averaged to produce an overall score.

We've observed several failure modes in this procedure which will be covered in a later section.

ChainPoll - Context Adherence

ChainPoll based Context Adherence is a novel approach to hallucination detection that provides an 85% correlation with human feedback. It outperforms other methods like SelfCheckGPT, GPTScore, G-Eval, and TRUE across various benchmark tasks. ChainPoll is faster, more cost-effective, and provides human-readable verbal justifications for its judgments.

It combines two core ideas: Chain-of-Thought (CoT) prompting and polling the model multiple times. Here's a brief overview of what ChainPoll is and how it works.

ChainPoll = Chain + Poll

- Chain: Chain-of-Thought (CoT) prompting

- Poll: Prompting an LLM multiple times

Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting asks the LLM to explain its reasoning step-by-step before giving the final answer. It works because it mimics how humans solve complex problems. When faced with a tough question, we often think out loud or go through the steps before arriving at an answer. CoT allows the LLM to do the same, improving its accuracy

Polling

Polling involves asking the LLM the same question multiple times and then aggregating the responses. This helps to identify and filter out random errors or hallucinations that the model might generate.

If you ask an LLM the same question multiple times, you might get a mix of correct and incorrect answers. However, correct answers tend to cluster around the right solution, while incorrect ones are more random. By averaging the responses, ChainPoll can highlight the most reliable answer, making the evaluation more robust.

Self-Consistency Vs ChainPoll

ChainPoll is similar to the self-consistency method, which uses majority voting to pick the most common answer. However, ChainPoll goes a step further by averaging the answers to provide a score reflecting the model's certainty level.

Example:

- If you ask whether a response is consistent with a set of documents, you might get:

- Yes (supported)

- No (not supported)

- ChainPoll averages these responses to give a score of 0.67, indicating that the answer is likely but not certain.

This nuanced scoring helps capture the model's confidence level, providing more detailed insights.

Efficiency and Cost

ChainPoll involves requesting multiple responses, which might seem slow and expensive. However, we use batch requests to LLM APIs to generate responses more efficiently. For example, with the OpenAI API, a batch request for three responses from the same prompt is billed for all output tokens but only counts the input tokens once. This reduces the cost significantly, making ChainPoll both fast and cost-effective.

LLMs Used with ChainPoll

By default, we use OpenAI's GPT-4o-mini. While it may be less accurate than GPT-4o, it's faster and cheaper. Using ChainPoll with GPT-4o-mini closes much of the accuracy gap while keeping costs low. However, while evaluating a powerful model like Claude 3.5 Sonnet, it becomes necessary to use a more powerful LLM like GPT-4o for evaluation with ChainPoll.

For a deeper, more technical look at the research behind ChainPoll Context Adherence, check out our paper - ChainPoll: A High-Efficacy Method for LLM Hallucination Detection.

Luna

Galileo Luna is a family of Evaluation Foundation Models (EFM) fine-tuned specifically for hallucination detection in RAG settings. Luna not only outperforms GPT and commercial evaluation frameworks but also significantly reduces cost and latency, making it an ideal candidate for industry LLM applications.

Luna excels on the RAGTruth dataset and shows excellent generalization capabilities. Luna's lightweight nature, combined with significant gains in cost and inference speed, makes it a highly efficient solution for industry applications.

Generalization: Luna outperforms RAGAS & GPT across different industry verticals.

Cost and Latency: Luna achieves a 97% reduction in cost and a 96% reduction in latency compared to GPT-3.5-based approaches.

Multi-task Training

Luna EFMs conduct multiple evaluations using a single input, thanks to multi-task training. This allows EFMs to share granular insights and predictions, leading to more robust and accurate evaluations.

Data Augmentation

Each Luna EFM is trained on large, high-quality datasets spanning various industries and use cases. We enrich our training dataset with synthetic data and data augmentations to improve domain coverage and generalization.

Latency Optimizations

Luna is optimized to process up to 16k input tokens in under one second on an NVIDIA L4 GPU. This is achieved through deploying an ONNX-traced model on an NVIDIA Triton server with a TensorRT backend, leveraging Triton’s Business Logic Scripting (BLS) for efficient resource allocation.

Long Context Support

Luna effectively detects hallucinations in long RAG contexts. Luna is optimized to process up to 128k tokens within milliseconds on general-purpose GPUs like A10G.

RAG Metric Comparison

ChainPoll vs. RAGAS Faithfulness

RAGAS uses a Faithfulness score similar to Galileo's Context Adherence score. Both scores aim to check if a response matches the information in a given context. RAGAS breaks a response into statements, validates each in isolation and gives a final score. This method can fail in several ways that ChainPoll avoids.

Handling Refusals

RAGAS doesn't handle refusal answers well. Sometimes, an LLM will say, "I don't know" or "Sorry, that wasn't mentioned in the context." RAGAS always assigns these answers a score of 0, which is unhelpful. If the information isn't in the context, it's better for the LLM to say so rather than make something up. ChainPoll handles these cases gracefully, checking if the refusal is consistent with the context.

For example, if the LLM responds, "The provided context does not contain information about where the email was published. Therefore, it is not possible to determine where the email was published based on the given passages," Galileo's Context Adherence would score this as 1, with an explanation that the context did not contain the necessary information.

Lack of Explanations

RAGAS generates internal explanations but doesn't show them to the user. These explanations are often brief and less informative than those from ChainPoll. ChainPoll provides detailed, step-by-step explanations, making it easier to understand why a response was scored a certain way.

Misleading Statement Breakdowns

Breaking a response into separate statements can ignore how different parts of the response are related. For example, consider a dataset related to Covid-19. An LLM was asked, "What important risk factors to infection were found during the second case-controlled study?" It responded with, "The important risk factors to infection found during the second case-controlled study were hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use." This response was incorrect because, while these factors were mentioned in the documents, they were not identified as risk factors in the second case-controlled study.

Galileo's Context Adherence caught this error and gave the response a score of 0. Here’s the explanation:

"The response claims that the important risk factors to infection found during the second case-controlled study were hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use. To verify this claim, we need to check the specific risk factors mentioned in the second case-controlled study document. However, the second case-controlled study document does not provide specific risk factors related to infection. It mainly focuses on data collection and limitations of the study. Therefore, the claim that hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use were important risk factors found during the study is not supported by the documents."

RAGAS, however, assigned this response a perfect score of 1.0. The breakdown into statements missed the fact that these risk factors were not identified in the second case-controlled study. Here are the four statements RAGAS generated from the response, along with its reasoning:

Statement 1/4
The second case-controlled study identified several important risk factors to infection.

Reasoning
The passage mentions that a case-control study was conducted to identify risk factors for multi-drug resistant infection in the pediatric intensive care unit (PICU).

Verdict
1 (Consistent)

---

Statement 2/4
These risk factors include hospitalization in the preceding 90 days.

Reasoning
The passage states that hospitalization in the preceding 90 days was a risk factor for infection with a resistant pathogen.

Verdict
1 (Consistent)

---

Statement 3/4
Residency in a nursing home was also found to be a significant risk factor.

Reasoning
The passage mentions that residency in a nursing home was an independent predictor of infection with a resistant pathogen.

Verdict
1 (Consistent)

---

Statement 4/4
Additionally, antibiotic use was identified as an important risk factor.

Reasoning
The passage states that antibiotic use was one of the main contents collected and analyzed in the study.

Verdict
1 (Consistent)

When RAGAS broke down the response into statements, it omitted key information that made the answer inconsistent. Some of the statements are about the second case-controlled study, and some are about risk factors. Taken in isolation, each of these statements is arguably true. But none of them captures the claim that the original LLM got wrong: that these risk factors were identified, not just in any study, but in the second case-controlled study.

ChainPoll allows the LLM to assess the entire input at once and come to a holistic judgment of it. By contrast, RAGAS fragments its reasoning into a sequence of disconnected steps, performed in isolation and without access to complete information. This causes RAGAS to miss subtle or complex errors, like the one in the example above. But, given the increasing intelligence of today's LLMs, subtle and complex errors are precisely the ones you need to be worried about.

ChainPoll vs. TruLens

TruLens offers a Groundedness score, similar to Galileo's Context Adherence and RAGAS Faithfulness. However, there are significant differences in how these scores are calculated.

No Chain-of-Thought Reasoning

TruLens doesn't use chain-of-thought reasoning. Instead, it asks the LLM to quote parts of the context that support each sentence and rate the "information overlap" on a 0-to-10 scale. This approach can lead to errors. For example, in the same Covid-19 study scenario, TruLens might quote a passage mentioning the risk factors but fail to check if they were identified in the second case-controlled study.

Inconsistent Grading System

TruLens uses a 0-to-10 rating scale without clear guidelines. This can lead to inconsistent scores. For example, the LLM might rate the same evidence as 8/10 one time and 7/10 another time, even if the evidence hasn't changed. Galileo has found that LLMs produce more reliable results when asked for a simple ‘Yes or No’ answer.

Formatting Issues

TruLens can get confused by formatting. For example, if the context includes multiple paragraphs separated by line breaks, TruLens might generate malformed output.

Here’s an example where the response was malformed:

Bacillus subtilis can be used as an oral vaccine by expressing the COE antigen of PEDV and delivering it to the GI tract to stimulate mucosal immune responses. It can protect antigens from the harsh digestive environment and enhance immune regulation and gut health in newborn piglets. Additionally, it has adjuvant activity and can deliver heterologous antigens to the GI tract, providing additional immunity stimulation

There are three sentences here. TruLens is supposed to assign a score to each one, together with a supporting quotation. But here is what it actually produced:

- Statement 0: "Bacillus subtilis can be used as an oral vaccine by expressing the COE antigen of PEDV and delivering it to the GI tract to stimulate mucosal immune responses." (Score: 0)

- Statement 1: "Oral administration can protect the gut and stimulate the common mucosal immune system." (Score: 10)

- Statement 2: "Orally administered B. subtilis could enhance immune regulation and gut health in pigs." (Score: 7)

- Statement 3: "B. subtilis is widely used as a vaccine delivery system for its unique characteristics." (Score: 10)

STATEMENT 0:
Statement Sentence: Bacillus subtilis can be used as an oral vaccine by expressing the COE antigen of PEDV and delivering it to the GI tract to stimulate mucosal immune responses. 
Supporting Evidence: NOTHING FOUND 
Score: 0


STATEMENT 1:
Statement Sentence: Oral administration can protect the gut and stimulate the common mucosal immune system. 
Supporting Evidence: As a superior way of mucosal immunization, oral administration can protect the gut and stimulate the common mucosal immune system. 
Score: 10 

Statement Sentence: Oral immunization often delivers large amounts of antigens to prevent diarrheal diseases. 
Supporting Evidence: Oral immunization often delivers large amounts of antigens to prevent the diarrheal diseases. 
Score: 10 

Statement Sentence: Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Supporting Evidence: Currently, Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Score: 10 

Statement Sentence: Orally administered Bacillus subtilis can enhance immune regulation and gut health in pigs. 
Supporting Evidence: Moreover, oral administration of B. subtilis could elicit humoral and cellular immune responses to the maintenance of gut homeostasis by dendritic cells. 
Score: 5


STATEMENT 2:
Statement Sentence: Oral administration can protect the gut and stimulate the common mucosal immune system. 
Supporting Evidence: As a superior way of mucosal immunization, oral administration can protect the gut and stimulate the common mucosal immune system. 
Score: 10 

Statement Sentence: Orally administered B. subtilis could enhance immune regulation and gut health in pigs. 
Supporting Evidence: Moreover, oral administration of B. subtilis could elicit humoral and cellular immune responses to the maintenance of gut homeostasis by dendritic cells. 
Score: 7 

Statement Sentence: B. subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Supporting Evidence: Currently, Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Score: 10

After the first statement, things go off the rails. The sentences listed under "Statement 1" and "Statement 2" don't appear in the response at all. And, nonsensically, the LLM has written multiple "Statement Sentences" under each of the "STATEMENT" headings.

In a case like this, the TruLens codebase assumes that each STATEMENT heading only has one score under it, and ends up picking the first one listed. Here, it ended up with the scores [0, 10, 10] for the three statements. But the latter two scores are nonsense—they're not about sentences from the response at all.

We tracked this issue down to formatting. Our context included multiple paragraphs and documents, which were separated by line breaks. It turns out that TruLens' prompt format also uses line breaks to delimit sections of the prompt. Apparently, the LLM became confused by which line breaks meant what. Replacing line breaks with spaces fixed the problem in this case. But you shouldn't have to worry about this kind of thing at all. Line breaks are not an exotic edge case, after all.

The prompt formats we use for Galileo ChainPoll based metrics involve a more robust delimiting strategy, including reformatting your output in some cases if needed. This prevents issues like this from arising with ChainPoll.

RAG Benchmarks

Several benchmarks have been developed to evaluate and rank LLMs used in RAG systems.

ChatRAG-Bench

ChatRAG-Bench is a collection of datasets designed to evaluate the model's capability in conversational QA and RAG. It covers a wide range of documents and question types, requiring models to generate responses from retrieved context, comprehend and reason over tables, conduct arithmetic calculations, and indicate when questions cannot be answered within the context.

Long Document Datasets

ChatRAG-Bench includes long document datasets that cannot fit into LLMs with a sequence length of 4K or 8K tokens. Examples include:

Doc2Dial (D2D): A document-grounded conversational QA dataset covering domains like DMV, SSA, VA, and Student Aid.

QuAC: Based on Wikipedia documents, containing unanswerable cases.

QReCC: An open-domain conversational QA dataset across multiple sources.

TopiOCQA (TCQA): Requires the agent to search the entire Wikipedia for answers.

INSCIT: Studies cases where user questions are under-specified and require clarification.

For Doc2Dial, QuAC, and QReCC, documents are segmented into 300-word chunks, and the top-5 relevant chunks are retrieved as context for each user question. For TopiOCQA and INSCIT, smaller chunks are used, and the top-20 chunks are retrieved to provide similar context length.

Short Document Datasets

ChatRAG-Bench also includes short document datasets that can fit into LLMs with a sequence length of 4K tokens. Examples include:

CoQA: A conversational QA dataset with short passages covering various domains.

DoQA: Covers domains like cooking, travel, and movies from Stack Exchange forums and contains unanswerable cases.

ConvFinQA (CFQA): Based on financial reports, involving arithmetic calculations.

SQA: Grounded on documents containing only a single table from Wikipedia.

HybriDial (HDial): Contains both Wikipedia tabular and textual data.

Results

This benchmark evaluation assesses various LLMs (Large Language Models) for their performance on ChatRAG-Bench, a suite of task-specific metrics typically used to evaluate Retrieval-Augmented Generation (RAG) performance. The table provides mean scores across different datasets or tasks for several models. Here’s a detailed commentary on the results:

Benchmark Overview

The table includes nine different models: ChatQA-1.0-7 B, Command-R-Plus, Llama3-instruct-70 b, GPT-4-0613, GPT-4-Turbo, ChatQA-1.0-70 B, ChatQA-1.5-8 B, and ChatQA-1.5-70B.

Top Performers

ChatQA-1.5-70B consistently scores among the highest across most tasks, indicating strong performance. Particularly notable scores include:

- ConvFinQA: 81.88

- SQA: 83.82

GPT-4-Turbo also shows impressive results for ConvFinQA (84.16) and DoQA (51.94)

Average Performance

The overall averages indicate that ChatQA-1.5-70B (58.25) performs better than the other models, followed by ChatQA-1.5-8B (55.17) and GPT-4-Turbo (53.99). Excluding the HybriDial dataset, the trend remains consistent with ChatQA-1.5-70B leading (57.14), demonstrating its robustness even without the influence of particular datasets.

Specific Observations

For the SQA dataset, the highest scores are for ChatQA-1.5-70B (83.82) and ChatQA-1.5-8B (73.28), indicating strong model performance in structured querying tasks. In the HybriDial task, high scores for GPT-4-Turbo (56.44) and ChatQA models suggest their effectiveness in handling dialogue contexts requiring both retrieval and generation. The INSCIT dataset shows lower scores across the board, with the maximum being 36.34 from GPT-4-0613, indicating room for improvement in information synthesis tasks.

Insights on Dataset Performance

ConvFinQA emerges as a task where several models, notably GPT-4-Turbo and ChatQA-1.5-70B, excel, reaching upwards of 81. Conversely, datasets like INSCIT and QuAC have generally seen lower performance, suggesting they are more challenging and could be focal points for improving model capability.

Evaluation

ChatRAG-Bench evaluation dataset using the following metrics as per their ChatQA paper.

F1-Score: The F1 score is calculated based on token matching with the ground truth answer. It measures the overlap between the model's generated answer and the correct answer at the token level. F1 is the harmonic mean of precision and recall. The F1 score is used as an evaluation metric on all datasets except ConvfinQA.

Exact Match Accuracy: This metric measures the percentage of model responses that perfectly match the ground truth answer, word for word. Exact Match Accuracy as evaluation metric for ConvFinQA dataset.

Although a widely accepted metric, the F1 score used for ChatRAG-Bench has limitations—especially in capturing the semantic nuances and contexts of generated responses.

Sentence Variability: Different answers like "No, she was not happy," "No, she was sad," will receive different F1 scores despite the same meaning.

Partial Matches: Answers that do not perfectly align with the ground truth, such as "She used a can of orange paint to paint herself orange" versus "she painted herself," will score less than 1 despite their relevance, highlighting the token-level matching imperfection.

These issues underscore the need for a more holistic and context-aware metric to better gauge model competence.

Hallucination Index

The Hallucination Index is an ongoing initiative to evaluate and rank the largest and most popular LLMs based on their propensity to hallucinate in RAG tasks.

Methodology

The evaluation process challenges the models' abilities to stay on task and provide accurate, contextually relevant responses. It evaluates the most popular LLMs available today. These models are selected based on surveys of popular LLM repositories, leaderboards, and industry surveys.

LLMs are tested across three common task types – short, medium, and long context lengths, each presenting unique challenges and benefits.

Short Context RAG

For short context lengths, the evaluation utilizes a variety of demanding datasets to test the robustness of models. One key methodology is ChainPoll with GPT-4o, which leverages the strong reasoning power of GPT series models. By using a chain of thought technique to poll the model multiple times, ChainPoll quantifies potential hallucinations along with explanations.

DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs.

Microsoft MS Macro: Contains queries and paragraphs with relevance labels.

HotpotQA: Requires finding and reasoning over multiple supporting documents.

ConvFinQA: Focuses on numerical reasoning in conversational QA.

Medium and Long Context RAG

The methodology for medium and long context lengths focuses on the models' ability to comprehensively understand extensive texts. Text is extracted from recent documents, divided into chunks, and one chunk is designated as the "needle chunk." Retrieval questions are constructed to test the model's ability to find and use the needle chunk within varying context lengths:

Medium context lengths: 5k, 10k, 15k, 20k, 25k tokens.

Long context lengths: 40k, 60k, 80k, 100k tokens.

The task design ensures that all text in the context is from a single domain, responses are correct with short context, and questions cannot be answered from pre-training memory. This approach measures the influence of the needle chunk's position and avoids standard datasets to prevent test leakage.

Prompting Techniques

The evaluation also experiments with a prompting technique known as Chain-of-Note, which has shown promise in improving performance for short contexts. This technique is tested to see if it similarly benefits medium and long contexts.

Evaluation

Context Adherence evaluates the degree to which a model's response aligns with the given context, serving as a metric to gauge closed-domain hallucinations. A higher Context Adherence score indicates that the response contains only information from the provided context, while a lower score suggests the inclusion of information not in the context.

The Context Adherence score is calculated using ChainPoll which leverages the strong reasoning abilities of LLMs to assess the accuracy of responses. This approach not only quantifies potential hallucinations but also provides context-based explanations. It works for both answerable and unanswerable cases.

The final score is calculated as the mean of the scores for each task dataset. The dataset score is the mean of the ChainPoll score for each sample.

Limitations of Benchmarks

While each benchmark provides valuable insights, they also have their shortcomings.

ChatRAG-Bench

Conversational Focus: This benchmark primarily evaluates conversational RAG, which may not fully capture the model's performance in single-turn RAG scenarios.

Short Contexts: The evaluation is limited to context length of 10k, potentially overlooking the model's capabilities with longer contexts.

Evaluation Metrics: It uses F1 or exact match metrics for automated evaluation, which can sometimes over-penalize or under-penalize the model, leading to inaccurate scores.

Hallucination Index

Single-Turn Focus: This benchmark focuses on single-turn RAG capabilities, leaving its performance in multi-turn chat scenarios uncertain.

Negative Rejection: It does not account for the model's ability to reject incorrect answers when no relevant information is available.

General Limitations

Domain Coverage: Both benchmarks cover only a few domains, which may not accurately represent performance in specialized fields like medical or legal domains.

Dataset Leakage: Both benchmarks are built on open datasets, which can lead to data leakage during the training of the models being evaluated.

By understanding these limitations, users can better interpret the results of these benchmarks and make more informed decisions about the performance and reliability of their LLMs in various RAG scenarios.

Evaluate LLMs for RAG

Now that we understand the challenges of evaluating RAG systems, lets look at the basic steps involved in evaluation.

Load the embedding model
Create a new vector index
Vectorise chunks and add to the index
Load the chain
Prepare Galileo callback with metrics
Run the chain with questions to generate the answer
Sync data to the Galileo console