
Mastering RAG: 4 Metrics to Improve Performance
Feb 14, 2024
Explore our research-backed evaluation metric for RAG – read our paper on Chainpoll.
Retrieval Augmented Generation (RAG) has become the technique of choice for domain-specific generative AI systems. But despite its popularity, the complexity of RAG systems poses challenges for evaluation and optimization, often requiring labor-intensive trial-and-error with limited visibility.
So, how can AI builders improve the performance of their RAG systems? Is there a better way? Before we dive into a powerful new approach, let’s recap the core components of a RAG system and why teams choose RAG to begin with.
A Brief Intro To RAG
Let’s start with a basic understanding of how a RAG system works.
RAG works by dynamically retrieving relevant context from external sources, integrating it with user queries, and feeding the retrieval-augmented prompt to an LLM for generating responses.
To build the system, we must first set up the vector database with the external data by chunking the text, embedding the chunks, and loading them into the vector database. Once this is complete, we can orchestrate the following steps in real time to generate the answer for the user:
Retrieve: Embedding the user query into the vector space to retrieve relevant context from an external knowledge source.
Augment: Integrating the user query and the retrieved context into a prompt template.
Generate: Feeding the retrieval-augmented prompt to the LLM for the final response generation.
An enterprise RAG system consists of dozens of components, such as storage, orchestration, and observability. Each is a large topic in itself that deserves its own comprehensive post. Thankfully, we’ve written just that earlier in our Mastering RAG series.
But you can build a basic RAG system with only a vector database, LLM, embedding model, and orchestration tool.
Vector database: A Vector DB, like Pinecone or Weaviate, stores vector embeddings of our external source documents.
LLM: Language models such as OpenAI’s GPT models or Llama serve as the foundation for generating responses.
Embedding model: Often derived from the same model family as the LLM, the embedding model plays a crucial role in creating meaningful text representations.
Orchestration tool: An orchestration tool such as LangChain, LlamaIndex, or DSPy manages the workflow and interactions between components.
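To make the retrieve, augment, generate loop concrete, here is a minimal sketch in Python. It assumes an already-populated Pinecone index with each chunk’s text stored under a "text" metadata key, and uses OpenAI for both embeddings and generation; the index name, prompt wording, and model choices are illustrative rather than prescriptive.

```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()                      # assumes OPENAI_API_KEY is set
index = Pinecone().Index("products")   # hypothetical, already-populated index

def answer(query: str, k: int = 5) -> str:
    # Retrieve: embed the user query and fetch the k most similar chunks
    query_vec = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    results = index.query(vector=query_vec, top_k=k, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in results.matches)

    # Augment: drop the retrieved context and the query into a prompt template
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # Generate: feed the retrieval-augmented prompt to the LLM
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```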
Advantages of RAG
Why go with RAG to begin with? To better understand RAG, we recently broke down the pros and cons of RAG vs. fine-tuning. Here are some of the top benefits of choosing RAG.
Dynamic data environments
RAG excels in dynamic data environments by continuously querying external sources, ensuring the information used for responses remains current without the need for frequent model retraining.
Hallucination resistance
RAG significantly reduces the likelihood of hallucinations by grounding each response in retrieved evidence. This enhances the reliability and accuracy of generated responses, especially in contexts where misinformation is detrimental.
Transparency and trust
RAG systems offer transparency by breaking down the response generation into distinct stages. This transparency provides users with insights into data retrieval processes, fostering trust in the generated outputs.
Ease of implementation
Implementing RAG requires much less expertise than fine-tuning. While setting up retrieval mechanisms, integrating external data sources, and ensuring data freshness can be complex, various pre-built RAG frameworks and tools simplify the process significantly.
Challenges in RAG Systems
Despite its advantages, RAG evaluation, experimentation, and observability are notably manual and labor-intensive. The inherent complexity of RAG systems, with numerous moving parts, makes optimization and debugging challenging, especially within intricate operational chains.
Limited chunking evaluation
It’s difficult to assess the impact of chunking on RAG system outputs, hindering efforts to enhance overall performance.
Embedding model evaluation
Opaque downstream effects make evaluating the effectiveness of the embedding model particularly challenging.
LLM evaluation - contextual ambiguity
Balancing the role of context in RAG systems involves a tradeoff between the risk of hallucinations and the risk of providing insufficient context for user queries.
LLM evaluation - prompt optimization
Various prompting techniques have been developed to enhance RAG performance, but determining the most effective one for the data remains challenging.
Inconsistent evaluation metrics
The absence of standardized metrics makes it tough to comprehensively assess all components of RAG systems, impeding a holistic understanding of the system’s performance.
RAG Evaluation
To solve these problems, Galileo’s RAG analytics facilitate faster, smarter development by providing detailed RAG evaluation metrics with unmatched visibility. Our four cutting-edge metrics help AI builders optimize and evaluate both the LLM and retriever sides of their RAG systems.
Chunk Attribution: A chunk-level boolean metric that measures whether a ‘chunk’ was used to compose the response.
Chunk Utilization: A chunk-level float metric that measures how much of the chunk text was used to compose the response.
Completeness: A response-level metric that measures how much of the provided context was used to generate the response.
Context Adherence: A response-level metric that measures whether the output of the LLM adheres to (or is grounded in) the provided context.
Without further ado, let's see things in action!
Example: Q&A RAG System
Let's put it all together by building our own RAG system. We’ll use an example of a question-answering system for beauty products. We’ll start by extracting questions from the product descriptions using GPT-3.5-turbo, and subsequently utilize these questions in our RAG system to generate answers. We’ll evaluate the RAG system performance using GenAI Studio and our previously mentioned RAG analytics metrics – Context Adherence, Completeness, Chunk Attribution, and Chunk Utilization.
Here's a breakdown of the steps we’ll take to build our Q&A system:
1. Prepare the Vector Database
2. Generate Questions with GPT
3. Define our QA Chain
4. Choose Galileo Scorers
5. Evaluate RAG Chain
6. RAG Experimentation
Prepare The Vector Database
First, we have to prepare our vector database. Let’s install the dependencies required for the RAG evaluation.
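The exact requirements file isn’t reproduced here, but an install along these lines covers the tools referenced in the rest of this walkthrough (the package list is our assumption based on what the post mentions):

```bash
pip install langchain langchain-openai langchain-pinecone langchain-community \
    openai pinecone-client sentence-transformers pandas promptquality
```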
Dataset
We obtained a subset of data from Kaggle, specifically sourced from the BigBasket (e-commerce) website. This dataset encompasses details about various consumer goods, and we narrowed it down by selecting only 500 products for analysis.
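The loading step itself is a couple of lines of pandas. The CSV filename and the column names used here (product, description) are assumptions based on the public BigBasket dataset; adjust them to your local copy.

```python
import pandas as pd

# Load the BigBasket product dump (filename and columns are assumptions)
df = pd.read_csv("bigbasket_products.csv")

# Keep only the columns we need and take the first 500 products
df = df[["product", "description"]].dropna().head(500)
print(df.shape)
```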
Chunking
For chunking, we use the RecursiveCharacterTextSplitter with its default settings: a chunk_size of 4,000 and a chunk_overlap of 200. Because our descriptions are shorter than 4,000 characters, no actual splitting occurs, leaving 50 chunks; we’re using these settings deliberately to illustrate the problems that default settings can cause.
We define some common utils for the experiments.
Let's chunk the data using config 1 (the default settings above). To ensure that queries containing the product name align with the description chunks, we append the product name at the beginning of each chunk.
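A sketch of that chunking step with LangChain’s RecursiveCharacterTextSplitter; the helper function and the exact chunk format (product name on the first line) are illustrative.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Config 1: the splitter's default settings
splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)

def build_chunks(df):
    chunks = []
    for _, row in df.iterrows():
        for piece in splitter.split_text(row["description"]):
            # Prepend the product name so name-based queries match the chunk
            chunks.append(f"{row['product']}\n{piece}")
    return chunks

chunks = build_chunks(df)
print(len(chunks))
```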
We use Pinecone’s serverless vector database with the cosine similarity metric, adding documents to the index via the Pinecone Python client.
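Here is roughly what the indexing looks like with the Pinecone v3 client; the index name, cloud/region, and the use of OpenAI’s text-embedding-3-small are assumptions for illustration, and batching the upserts would be faster in practice.

```python
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

client = OpenAI()
pc = Pinecone()  # reads PINECONE_API_KEY from the environment

pc.create_index(
    name="beauty-products",            # hypothetical index name
    dimension=1536,                    # matches text-embedding-3-small
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("beauty-products")

# Embed each chunk and upsert it with its text stored as metadata
for i, chunk in enumerate(chunks):
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=chunk
    ).data[0].embedding
    index.upsert(
        vectors=[{"id": str(i), "values": vec, "metadata": {"text": chunk}}]
    )
```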
This completes our vector DB setup!
Generate Questions With GPT
We need questions to conduct the evaluation, but our dataset consists only of product descriptions. We can either write test questions for the chatbot manually or use an LLM to generate them. To make our lives easier, we harness GPT-3.5-turbo with a purpose-built prompt.
Let's load the dataset again.
We take a few-shot approach to creating synthetic questions, directing the model to generate five distinct, challenging questions from each product description. The model is instructed to include the exact product name from the description in every question.
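A sketch of that generation step with the OpenAI client; the prompt below is a simplified zero-shot stand-in for the few-shot prompt described above, and the parsing of numbered lines is an assumption about the output format.

```python
import re
from openai import OpenAI

client = OpenAI()

QUESTION_PROMPT = """You write test questions for a product Q&A bot.
Given a product description, generate 5 distinct, challenging questions.
Each question must include the exact product name from the description.

Description:
{description}

Questions:"""

def generate_questions(description: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": QUESTION_PROMPT.format(description=description)}],
        temperature=0.7,
    )
    text = resp.choices[0].message.content
    # One question per line; strip any leading "1." / "2)" numbering
    return [re.sub(r"^\s*\d+[\.\)]\s*", "", line).strip()
            for line in text.splitlines() if line.strip()]

questions = [q for d in df["description"] for q in generate_questions(d)]
```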
Define Our QA Chain
We build a standard RAG QA chain, using GPT-3.5-turbo as the LLM and the same vector DB for retrieval.
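A minimal version of that chain in LangChain Expression Language (LCEL); the retriever wiring assumes the Pinecone index built earlier (with chunk text under the "text" metadata key), and the prompt text is illustrative.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Wrap the existing Pinecone index as a LangChain retriever (k is tuned later)
vectorstore = PineconeVectorStore(
    index_name="beauty-products",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```

Calling qa_chain.invoke("Is Brightening Night Cream suitable for all skin types?") returns the generated answer as a string.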
Choose Galileo Scorers
Galileo’s promptquality library ships with numerous scorers. We’ll choose evaluation metrics that capture system performance, including latency and safety metrics like PII, toxicity, and tone, along with the four RAG metrics.
Custom scorer
In certain situations, a user may need a custom metric that better aligns with business requirements. In these cases, adding a custom scorer alongside the existing scorers is straightforward.
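A sketch of the scorer selection, assuming the promptquality client (imported as pq) exposes a Scorers enum and a CustomScorer helper. The specific enum member names below are assumptions on our part, so check the Galileo documentation for the canonical identifiers in your version.

```python
import promptquality as pq

# RAG, system, and safety metrics (enum member names are assumptions)
scorers = [
    pq.Scorers.context_adherence,
    pq.Scorers.completeness_gpt,
    pq.Scorers.chunk_attribution_utilization_gpt,
    pq.Scorers.latency,
    pq.Scorers.pii,
    pq.Scorers.toxicity,
    pq.Scorers.tone,
]

# A custom scorer: score each response by its length, aggregated as an average
def executor(row) -> float:
    # row is the logged chain step; .response is our assumption of the field name
    return float(len(row.response))

def aggregator(scores, indices) -> dict:
    return {"Average Response Length": sum(scores) / max(len(scores), 1)}

scorers.append(
    pq.CustomScorer(name="Response Length", executor=executor, aggregator=aggregator)
)
```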
Now that we have everything ready let’s move on to evaluation.
Evaluate RAG Chain
To begin, load the modules and log in to the Galileo console through the console URL. A popup will appear, prompting you to copy the secret key and paste it into your IDE or terminal.
[ Contact us to get started with your Galileo setup ]
Load all the generated questions and randomly select 100 of them for the evaluation.
Load the chain and set up the handler with tags as you experiment with prompts, tuning various parameters. You might conduct experiments using different models, model versions, vector stores, and embedding models. Utilize Run Tags to effortlessly log any run details you wish to review later in the Galileo Evaluation UI.
Let's evaluate each question by generating answers and, finally, push the LangChain data to the Galileo console to kick off the metric calculations.
All we need to do is pass our evaluation handler as a callback to the chain's invoke call.
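Putting the evaluation loop together: log in, attach the Galileo callback to each chain invocation, and publish the run. The GalileoPromptCallback and finish() usage below reflects the promptquality API as we understand it; the console URL and project name are placeholders, and the scorers, questions, and qa_chain come from the earlier snippets.

```python
import random
import promptquality as pq

# Log in to your Galileo console; paste the secret key when prompted
pq.login("https://console.your-galileo-cluster.com")

# Randomly select 100 of the generated questions for evaluation
eval_questions = random.sample(questions, 100)

galileo_handler = pq.GalileoPromptCallback(
    project_name="rag-beauty-qa",   # hypothetical project name
    scorers=scorers,
)
# Run Tags for model, chunker, k, etc. can also be attached to the run here.

for question in eval_questions:
    qa_chain.invoke(question, config={"callbacks": [galileo_handler]})

# Push the chain data to the Galileo console and trigger metric calculation
galileo_handler.finish()
```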
This brings us to the most exciting part of the build… 🥁🥁🥁
RAG Experimentation
Now that we have built the system with many tunable parameters, let’s run some experiments to improve it. The project view below shows the four RAG metrics for all runs.
We can also analyze system metrics for each run, helping us improve cost and latency. Additionally, safety-related metrics like PII and toxicity help monitor potentially harmful outputs.
Finally, we can examine the tags to understand the particular configuration utilized for each experiment.
Now let’s look at the experiments we conducted to improve the performance of our RAG system.
Select the embedding model
Initially, we conduct experiments to determine the optimal encoder. Keeping the sentence tokenizer, LLM (GPT-3.5-turbo), and k (20) constant, we assess four different encoders (a sketch of how we swap them in follows the list):
1. all-mpnet-base-v2 (dim 768)
2. all-MiniLM-L6-v2 (dim 384)
3. text-embedding-3-small (dim 1536)
4. text-embedding-3-large (dim 3072)
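Swapping the encoder is the only change between these runs. A sketch, assuming the sentence-transformers models are wrapped with LangChain’s HuggingFaceEmbeddings, the OpenAI models with OpenAIEmbeddings, and each encoder gets its own Pinecone index (the index naming is illustrative).

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

encoders = {
    "all-mpnet-base-v2": HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"),
    "all-MiniLM-L6-v2": HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    "text-embedding-3-small": OpenAIEmbeddings(model="text-embedding-3-small"),
    "text-embedding-3-large": OpenAIEmbeddings(model="text-embedding-3-large"),
}

# Each encoder needs its own index because the vector dimensions differ;
# the same 100-question set is then evaluated against every retriever.
for name, embedding in encoders.items():
    vectorstore = PineconeVectorStore(index_name=f"beauty-{name.lower()}", embedding=embedding)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
    # ...rebuild qa_chain with this retriever and re-run the evaluation loop
```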
Our guiding metric is context adherence, which measures hallucinations. The metrics for these four experiments are presented in the last four rows of the table above. Among them, text-embedding-3-small achieves the highest context adherence score, making it the winner for further optimization.
Within the run, it becomes evident that certain workflows (examples) exhibit low adherence scores.
To troubleshoot the issue, we can drill into the workflow. The image below shows the workflow view with the inputs and outputs of the chain: on the left, the hierarchy of chains; on the right, the workflow metrics; and in the center, the I/O for each chain.
Poor generation quality is frequently linked to inadequate retrieval. To decide how to move forward, let’s analyze the quality of the chunks returned by retrieval. The attribute-to-output indicator (see screenshot below) tells us whether a chunk was actually used in the generation process.
In our example, the question is "Is Brightening Night Cream suitable for all skin types?" Examining the chunks, none explicitly states that "Brightening Night Cream" is suitable for all skin types. This presents a classic case of hallucination resulting in low context adherence. The following provides a detailed explanation of why this generation received a low context adherence score.
“0.00% GPT judges said the model's response was grounded or adhering to the context. This was the rationale one of them gave: The claim that the Brightening Night Cream is suitable for all skin types is not fully supported by the documents. While some products mention that the Brightening Night Cream is suitable for all skin types, not all instances explicitly state this. Therefore, based on the provided context, it is unclear if the Brightening Night Cream is universally suitable for all skin types.”
Select the right chunker
Next, we keep the same embedding model (text-embedding-3-small), LLM (gpt-3.5-turbo), and k (20), and try recursive chunking with a chunk size of 200 and a chunk overlap of 50. This alone leads to a 4% improvement in adherence. Isn’t that amazing?
Improving top k
From the experiments, we observe that chunk attribution remains in the single digits, hovering around 8%. This indicates that fewer than 10% of the retrieved chunks are actually useful. Recognizing this opportunity, we run an experiment with a reduced top-k value of 15 instead of 20. The results show attribution increasing from 8.9% to 12.9% and adherence improving from 87.3% to 88.3%. We’ve now reduced costs while improving performance!
The cost drops from $0.126 to $0.098, a substantial reduction of roughly 22%!
Improve cost and latency
Now, let's embark on one final experiment to really push the envelope. We adopt our latest and best configuration, utilizing text-embedding-3-small, recursive chunking with a chunk size of 200 and a chunk overlap of 50. Additionally, we adjust the k value to 15 and switch the LLM to gpt-3.5-turbo-0125 (the latest release from OpenAI).
The results are quite surprising: a significant 22% reduction in latency and a substantial 50% decrease in cost. However, this comes with a tradeoff, as adherence drops from 88.3% to 84.3%.
Like many situations, users need to consider the tradeoff between performance, cost, and latency for their specific use case. They can opt for a high-performance system with a higher cost or choose a more economical solution with slightly reduced performance.
Recap
We’ve now demonstrated how Galileo’s GenAI Studio gives you unmatched visibility into your RAG workflows. As we saw, the RAG and system-level metrics streamline the selection of configurations and enable ongoing experimentation to maximize performance while minimizing cost and latency.
In only an hour, we reduced hallucinations, increased retrieval speed, and cut costs in half!
Watch a recording of this Q&A example to see GenAI Studio in action.
Conclusion
The complexity of RAG demands innovative solutions for evaluation and optimization. While the benefits of RAG over fine-tuning make it an attractive choice, manual and time-consuming evaluation and experimentation limit its potential for AI teams.
Galileo's RAG analytics offer a transformative approach, providing unparalleled visibility into RAG systems and simplifying evaluation to improve RAG performance. Sign up for your free Galileo account today, or continue your Mastering RAG journey with our free, comprehensive eBook.
