
Mastering RAG: 4 Metrics to Improve Performance
Feb 14, 2024
Explore our research-backed evaluation metric for RAG – read our paper on Chainpoll.
Retrieval Augmented Generation (RAG) has become the technique of choice for domain-specific generative AI systems. But despite its popularity, the complexity of RAG systems poses challenges for evaluation and optimization, often requiring labor-intensive trial-and-error with limited visibility.
So, how can AI builders improve the performance of their RAG systems? Is there a better way? Before we dive into a powerful new approach, let’s recap the core components of a RAG system and why teams choose RAG to begin with.
A Brief Intro To RAG
Let’s start with a basic understanding of how a RAG system works.
RAG works by dynamically retrieving relevant context from external sources, integrating it with user queries, and feeding the retrieval-augmented prompt to an LLM for generating responses.
To build the system, we must first set up the vector database with the external data by chunking the text, embedding the chunks, and loading them into the vector database. Once this is complete, we can orchestrate the following steps in real time to generate the answer for the user:
Retrieve: Embedding the user query into the vector space to retrieve relevant context from an external knowledge source.
Augment: Integrating the user query and the retrieved context into a prompt template.
Generate: Feeding the retrieval-augmented prompt to the LLM for the final response generation.
An enterprise RAG system consists of dozens of components, such as storage, orchestration, and observability. Each is a large topic in itself that deserves its own comprehensive post. Thankfully, we’ve written just that earlier in our Mastering RAG series.
But you can build a basic RAG system with only a vector database, LLM, embedding model, and orchestration tool.
Vector database: A Vector DB, like Pinecone or Weaviate, stores vector embeddings of our external source documents.
LLM: Language models such as OpenAI’s GPT models or Llama serve as the foundation for generating responses.
Embedding model: Often derived from the same model family as the LLM, the embedding model plays a crucial role in creating meaningful text representations.
Orchestration tool: An orchestration tool such as LangChain, LlamaIndex, or DSPy manages the workflow and interactions between components.
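To make the retrieve, augment, generate loop concrete, here is a minimal sketch in Python. It assumes an already-populated Pinecone index with each chunk’s text stored under a "text" metadata key, and uses OpenAI for both embeddings and generation; the index name, prompt wording, and model choices are illustrative rather than prescriptive.

```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()                      # assumes OPENAI_API_KEY is set
index = Pinecone().Index("products")   # hypothetical, already-populated index

def answer(query: str, k: int = 5) -> str:
    # Retrieve: embed the user query and fetch the k most similar chunks
    query_vec = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    results = index.query(vector=query_vec, top_k=k, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in results.matches)

    # Augment: drop the retrieved context and the query into a prompt template
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # Generate: feed the retrieval-augmented prompt to the LLM
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```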
Advantages of RAG
Why go with RAG to begin with? To better understand RAG, we recently broke down the pros and cons of RAG vs. fine-tuning. Here are some of the top benefits of choosing RAG.
Dynamic data environments
RAG excels in dynamic data environments by continuously querying external sources, ensuring the information used for responses remains current without the need for frequent model retraining.
Hallucination resistance
RAG significantly reduces the likelihood of hallucinations by grounding each response in retrieved evidence. This enhances the reliability and accuracy of generated responses, especially in contexts where misinformation is detrimental.
Transparency and trust
RAG systems offer transparency by breaking down the response generation into distinct stages. This transparency provides users with insights into data retrieval processes, fostering trust in the generated outputs.
Ease of implementation
Implementing RAG requires much less expertise than fine-tuning. While setting up retrieval mechanisms, integrating external data sources, and ensuring data freshness can be complex, various pre-built RAG frameworks and tools simplify the process significantly.
Challenges in RAG Systems
Despite its advantages, RAG evaluation, experimentation, and observability are notably manual and labor-intensive. The inherent complexity of RAG systems, with numerous moving parts, makes optimization and debugging challenging, especially within intricate operational chains.
Limited chunking evaluation
It’s difficult to assess the impact of chunking on RAG system outputs, hindering efforts to enhance overall performance.
Embedding model evaluation
Opaque downstream effects make evaluating the effectiveness of the embedding model particularly challenging.
LLM evaluation - contextual ambiguity
Balancing the role of context in RAG systems involves a tradeoff between the risk of hallucinations and the risk of providing insufficient context for user queries.
LLM evaluation - prompt optimization
Various prompting techniques have been developed to enhance RAG performance, but determining the most effective one for the data remains challenging.
Inconsistent evaluation metrics
The absence of standardized metrics makes it tough to comprehensively assess all components of RAG systems, impeding a holistic understanding of the system’s performance.
RAG Evaluation
To solve these problems, Galileo’s RAG analytics facilitate faster, smarter development by providing detailed RAG evaluation metrics with unmatched visibility. Our four cutting-edge metrics help AI builders optimize and evaluate both the LLM and retriever sides of their RAG systems.
Chunk Attribution: A chunk-level boolean metric that measures whether a ‘chunk’ was used to compose the response.
Chunk Utilization: A chunk-level float metric that measures how much of the chunk text was used to compose the response.
Completeness: A response-level metric that measures how much of the provided context was used to generate the response.
Context Adherence: A response-level metric that measures whether the output of the LLM adheres to (or is grounded in) the provided context.
Without further ado, let's see things in action!
Example: Q&A RAG System
Let's put it all together by building our own RAG system. We’ll use an example of a question-answering system for beauty products. We’ll start by extracting questions from the product descriptions using GPT-3.5-turbo, and subsequently utilize these questions in our RAG system to generate answers. We’ll evaluate the RAG system performance using GenAI Studio and our previously mentioned RAG analytics metrics – Context Adherence, Completeness, Chunk Attribution, and Chunk Utilization.
Here's a breakdown of the steps we’ll take to build our Q&A system:
1. Prepare the Vector Database
2. Generate Questions with GPT
3. Define our QA Chain
4. Choose Galileo Scorers
5. Evaluate RAG Chain
6. RAG Experimentation
Prepare The Vector Database
First, we have to prepare our vector database. Let’s install the dependencies required for the RAG evaluation.
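The exact requirements file isn’t reproduced here, but an install along these lines covers the tools referenced in the rest of this walkthrough (the package list is our assumption based on what the post mentions):

```bash
pip install langchain langchain-openai langchain-pinecone langchain-community \
    openai pinecone-client sentence-transformers pandas promptquality
```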
Dataset
We obtained a subset of data from Kaggle, specifically sourced from the BigBasket (e-commerce) website. This dataset encompasses details about various consumer goods, and we narrowed it down by selecting only 500 products for analysis.
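The loading step itself is a couple of lines of pandas. The CSV filename and the column names used here (product, description) are assumptions based on the public BigBasket dataset; adjust them to your local copy.

```python
import pandas as pd

# Load the BigBasket product dump (filename and columns are assumptions)
df = pd.read_csv("bigbasket_products.csv")

# Keep only the columns we need and take the first 500 products
df = df[["product", "description"]].dropna().head(500)
print(df.shape)
```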
Chunking
For chunking, we use the RecursiveCharacterTextSplitter with its default settings: a chunk_size of 4,000 and a chunk_overlap of 200. Because our descriptions are shorter than 4,000 characters, no actual splitting occurs, leaving 50 chunks; we’re using these settings deliberately to illustrate the problems that default settings can cause.
We define some common utils for the experiments.
Let's chunk the data using config 1 (the default settings above). To ensure that queries containing the product name align with the description chunks, we append the product name at the beginning of each chunk.
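A sketch of that chunking step with LangChain’s RecursiveCharacterTextSplitter; the helper function and the exact chunk format (product name on the first line) are illustrative.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Config 1: the splitter's default settings
splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)

def build_chunks(df):
    chunks = []
    for _, row in df.iterrows():
        for piece in splitter.split_text(row["description"]):
            # Prepend the product name so name-based queries match the chunk
            chunks.append(f"{row['product']}\n{piece}")
    return chunks

chunks = build_chunks(df)
print(len(chunks))
```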
We use Pinecone’s serverless vector database with the cosine similarity metric, adding documents to the index via the Pinecone Python client.
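Here is roughly what the indexing looks like with the Pinecone v3 client; the index name, cloud/region, and the use of OpenAI’s text-embedding-3-small are assumptions for illustration, and batching the upserts would be faster in practice.

```python
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

client = OpenAI()
pc = Pinecone()  # reads PINECONE_API_KEY from the environment

pc.create_index(
    name="beauty-products",            # hypothetical index name
    dimension=1536,                    # matches text-embedding-3-small
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("beauty-products")

# Embed each chunk and upsert it with its text stored as metadata
for i, chunk in enumerate(chunks):
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=chunk
    ).data[0].embedding
    index.upsert(
        vectors=[{"id": str(i), "values": vec, "metadata": {"text": chunk}}]
    )
```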
This completes our vector DB setup!
Generate Questions With GPT
We need questions to conduct the evaluation, but our dataset consists only of product descriptions. We can either write test questions for the chatbot manually or use an LLM to generate them. To make our lives easier, we harness GPT-3.5-turbo with a purpose-built prompt.
Let's load the dataset again.
We take a few-shot approach to creating synthetic questions, directing the model to generate five distinct, challenging questions from each product description. The model is instructed to include the exact product name from the description in every question.
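A sketch of that generation step with the OpenAI client; the prompt below is a simplified zero-shot stand-in for the few-shot prompt described above, and the parsing of numbered lines is an assumption about the output format.

```python
import re
from openai import OpenAI

client = OpenAI()

QUESTION_PROMPT = """You write test questions for a product Q&A bot.
Given a product description, generate 5 distinct, challenging questions.
Each question must include the exact product name from the description.

Description:
{description}

Questions:"""

def generate_questions(description: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": QUESTION_PROMPT.format(description=description)}],
        temperature=0.7,
    )
    text = resp.choices[0].message.content
    # One question per line; strip any leading "1." / "2)" numbering
    return [re.sub(r"^\s*\d+[\.\)]\s*", "", line).strip()
            for line in text.splitlines() if line.strip()]

questions = [q for d in df["description"] for q in generate_questions(d)]
```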
Define Our QA Chain
We build a standard RAG QA chain, using GPT-3.5-turbo as the LLM and the same vector DB for retrieval.
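A minimal version of that chain in LangChain Expression Language (LCEL); the retriever wiring assumes the Pinecone index built earlier (with chunk text under the "text" metadata key), and the prompt text is illustrative.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Wrap the existing Pinecone index as a LangChain retriever (k is tuned later)
vectorstore = PineconeVectorStore(
    index_name="beauty-products",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```

Calling qa_chain.invoke("Is Brightening Night Cream suitable for all skin types?") returns the generated answer as a string.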
Choose Galileo Scorers
Galileo’s promptquality library ships with numerous scorers. We’ll choose evaluation metrics that capture system performance, including latency and safety metrics like PII, toxicity, and tone, along with the four RAG metrics.
Custom scorer
In certain situations, a user may need a custom metric that better aligns with business requirements. In these cases, adding a custom scorer alongside the existing scorers is straightforward.
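A sketch of the scorer selection, assuming the promptquality client (imported as pq) exposes a Scorers enum and a CustomScorer helper. The specific enum member names below are assumptions on our part, so check the Galileo documentation for the canonical identifiers in your version.

```python
import promptquality as pq

# RAG, system, and safety metrics (enum member names are assumptions)
scorers = [
    pq.Scorers.context_adherence,
    pq.Scorers.completeness_gpt,
    pq.Scorers.chunk_attribution_utilization_gpt,
    pq.Scorers.latency,
    pq.Scorers.pii,
    pq.Scorers.toxicity,
    pq.Scorers.tone,
]

# A custom scorer: score each response by its length, aggregated as an average
def executor(row) -> float:
    # row is the logged chain step; .response is our assumption of the field name
    return float(len(row.response))

def aggregator(scores, indices) -> dict:
    return {"Average Response Length": sum(scores) / max(len(scores), 1)}

scorers.append(
    pq.CustomScorer(name="Response Length", executor=executor, aggregator=aggregator)
)
```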
Now that we have everything ready let’s move on to evaluation.
Evaluate RAG Chain
To begin, load the modules and log in to the Galileo console through the console URL. A popup will appear, prompting you to copy the secret key and paste it into your IDE or terminal.
[ Contact us to get started with your Galileo setup ]
Load all the generated questions and randomly select 100 of them for the evaluation.
Load the chain and set up the handler with tags as you experiment with prompts, tuning various parameters. You might conduct experiments using different models, model versions, vector stores, and embedding models. Utilize Run Tags to effortlessly log any run details you wish to review later in the Galileo Evaluation UI.
Let's evaluate each question by generating answers and, finally, push the LangChain data to the Galileo console to kick off the metric calculations.
All we need to do is pass our evaluation handler as a callback to the chain's invoke call.
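Putting the evaluation loop together: log in, attach the Galileo callback to each chain invocation, and publish the run. The GalileoPromptCallback and finish() usage below reflects the promptquality API as we understand it; the console URL and project name are placeholders, and the scorers, questions, and qa_chain come from the earlier snippets.

```python
import random
import promptquality as pq

# Log in to your Galileo console; paste the secret key when prompted
pq.login("https://console.your-galileo-cluster.com")

# Randomly select 100 of the generated questions for evaluation
eval_questions = random.sample(questions, 100)

galileo_handler = pq.GalileoPromptCallback(
    project_name="rag-beauty-qa",   # hypothetical project name
    scorers=scorers,
)
# Run Tags for model, chunker, k, etc. can also be attached to the run here.

for question in eval_questions:
    qa_chain.invoke(question, config={"callbacks": [galileo_handler]})

# Push the chain data to the Galileo console and trigger metric calculation
galileo_handler.finish()
```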
This brings us to the most exciting part of the build… 🥁🥁🥁
RAG Experimentation
Now that we have built the system with many tunable parameters, let’s run some experiments to improve it. The project view below shows the four RAG metrics for all runs.
We can also analyze system metrics for each run, helping us improve cost and latency. Additionally, safety-related metrics like PII and toxicity help monitor potentially harmful outputs.
Finally, we can examine the tags to understand the particular configuration utilized for each experiment.
Now let’s look at the experiments we conducted to improve the performance of our RAG system.
Select the embedding model
Initially, we conduct experiments to determine the optimal encoder. Keeping the sentence tokenizer, LLM (GPT-3.5-turbo), and k (20) constant, we assess four different encoders (a sketch of how we swap them in follows the list):
1. all-mpnet-base-v2 (dim 768)
2. all-MiniLM-L6-v2 (dim 384)
3. text-embedding-3-small (dim 1536)
4. text-embedding-3-large (dim 3072)
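Swapping the encoder is the only change between these runs. A sketch, assuming the sentence-transformers models are wrapped with LangChain’s HuggingFaceEmbeddings, the OpenAI models with OpenAIEmbeddings, and each encoder gets its own Pinecone index (the index naming is illustrative).

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

encoders = {
    "all-mpnet-base-v2": HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"),
    "all-MiniLM-L6-v2": HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    "text-embedding-3-small": OpenAIEmbeddings(model="text-embedding-3-small"),
    "text-embedding-3-large": OpenAIEmbeddings(model="text-embedding-3-large"),
}

# Each encoder needs its own index because the vector dimensions differ;
# the same 100-question set is then evaluated against every retriever.
for name, embedding in encoders.items():
    vectorstore = PineconeVectorStore(index_name=f"beauty-{name.lower()}", embedding=embedding)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
    # ...rebuild qa_chain with this retriever and re-run the evaluation loop
```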
Our guiding metric is context adherence, which measures hallucinations. The metrics for these four experiments are presented in the last four rows of the table above. Among them, text-embedding-3-small achieves the highest context adherence score, making it the winner for further optimization.
Within the run, it becomes evident that certain workflows (examples) exhibit low adherence scores.
To troubleshoot the issue, we can drill into the workflow. The image below shows the workflow view with the inputs and outputs of the chain: on the left, the hierarchy of chains; on the right, the workflow metrics; and in the center, the I/O for each chain.
Poor generation quality is frequently linked to inadequate retrieval. To decide how to move forward, let’s analyze the quality of the chunks returned by retrieval. The attribute-to-output indicator (see screenshot below) tells us whether a chunk was actually used in the generation process.
In our example, the question is "Is Brightening Night Cream suitable for all skin types?" Examining the chunks, none explicitly states that "Brightening Night Cream" is suitable for all skin types. This presents a classic case of hallucination resulting in low context adherence. The following provides a detailed explanation of why this generation received a low context adherence score.
“0.00% GPT judges said the model's response was grounded or adhering to the context. This was the rationale one of them gave: The claim that the Brightening Night Cream is suitable for all skin types is not fully supported by the documents. While some products mention that the Brightening Night Cream is suitable for all skin types, not all instances explicitly state this. Therefore, based on the provided context, it is unclear if the Brightening Night Cream is universally suitable for all skin types.”
Select the right chunker
Next, we keep the same embedding model (text-embedding-3-small), LLM (gpt-3.5-turbo), and k (20), and try recursive chunking with a chunk size of 200 and a chunk overlap of 50. This alone leads to a 4% improvement in adherence. Isn’t that amazing?
Improving top k
From the experiments, we observe that chunk attribution remains in the single digits, hovering around 8%. This indicates that fewer than 10% of the retrieved chunks are actually useful. Recognizing this opportunity, we run an experiment with a reduced top-k value of 15 instead of 20. The results show attribution increasing from 8.9% to 12.9% and adherence improving from 87.3% to 88.3%. We’ve now reduced costs while improving performance!
The cost drops from $0.126 to $0.098, a substantial reduction of roughly 22%!
Improve cost and latency
Now, let's embark on one final experiment to really push the envelope. We adopt our latest and best configuration, utilizing text-embedding-3-small, recursive chunking with a chunk size of 200 and a chunk overlap of 50. Additionally, we adjust the k value to 15 and switch the LLM to gpt-3.5-turbo-0125 (the latest release from OpenAI).
The results are quite surprising: a significant 22% reduction in latency and a substantial 50% decrease in cost. However, this comes with a tradeoff, as adherence drops from 88.3% to 84.3%.
Like many situations, users need to consider the tradeoff between performance, cost, and latency for their specific use case. They can opt for a high-performance system with a higher cost or choose a more economical solution with slightly reduced performance.
Recap
We’ve now demonstrated how Galileo’s GenAI Studio gives you unmatched visibility into your RAG workflows. As we saw, the RAG and system-level metrics streamline the selection of configurations and enable ongoing experimentation to maximize performance while minimizing cost and latency.
In only an hour, we reduced hallucinations, increased retrieval speed, and cut costs in half!
Watch a recording of this Q&A example to see GenAI Studio in action.
Conclusion
The complexity of RAG demands innovative solutions for evaluation and optimization. While the benefits of RAG over fine-tuning make it an attractive choice, manual and time-consuming evaluation and experimentation limit its potential for AI teams.
Galileo's RAG analytics offer a transformative approach, providing unparalleled visibility into RAG systems and simplifying evaluation to improve RAG performance. Sign up for your free Galileo account today, or continue your Mastering RAG journey with our free, comprehensive eBook.
