Mastering RAG: How To Observe Your RAG Post-Deployment
Apr 4, 2024
If you’ve reached this point, you’re well on your way to becoming a RAG master. Congrats! 🎉
But just because your RAG system is live and performing well doesn’t mean your job is done. Observing and monitoring systems post-deployment is crucial for identifying potential risks and maintaining reliability. As any seasoned developer knows, the true test of a system’s resilience is its ability to adapt and evolve over time, and that is exactly where post-deployment observation and monitoring come in.
Buckle up to master RAG observation like never before!
GenAI Monitoring vs Observability
Though often conflated, monitoring and observability are related but distinct aspects of the GenAI lifecycle. Conventional monitoring entails tracking predetermined metrics to assess system health and performance, while GenAI observability offers insight into the inputs and outputs of a workflow, along with every intervening step.
For example, in the context of RAG, observability lets users drill into a particular node, such as the retriever node, to get a comprehensive view of all the chunks it retrieved. This functionality proves invaluable when debugging executions, enabling users to trace subpar responses back to the specific step where errors occurred.
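To make that concrete, here is a minimal sketch of the kind of node-level view observability surfaces, assuming a LangChain-style vector store; `vectordb` and the query are placeholders rather than anything from this walkthrough:

```python
# Illustrative only: inspect the chunks the retriever returns for one query.
# `vectordb` is a placeholder for whatever vector store your pipeline uses.
retriever = vectordb.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("What is our refund policy?")

for doc in docs:
    print(doc.metadata.get("source"), "->", doc.page_content[:120])
```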
Four Key Aspects of GenAI Observability
Let’s dive deeper into the distinct parts of a comprehensive GenAI observability platform.
Chain Execution Information
Observing the execution of the processing chain, especially in the context of LangChain LLM chains, is crucial for understanding system behavior and identifying points of failure. This entails tracking the flow of data and operations within the chain, from the retrieval of context to the generation of responses.
Retrieved Context
Observing the retrieved context from your optimized vector database is essential for assessing the relevance and adequacy of information provided to the language model. This involves tracking the retrieval process, including the selection and presentation of context to the model.
ML Metrics
ML metrics provide insights into the performance and behavior of the language model itself, including aspects such as adherence to context.
System Metrics
System metrics provide insights into the operational health and performance of the RAG deployment infrastructure, including aspects such as resource utilization, latency, and error rates.
By effectively observing these four aspects, teams can gain comprehensive insights into RAG performance and behavior.
RAG Risks in Production
In production environments, RAG systems encounter numerous challenges and risks that can undermine their performance and reliability, from system failures to inherent limitations in model behavior. Let’s review some of these potential risks.
Evaluation Complexity
In the post-deployment phase of RAG systems, evaluating performance becomes increasingly complex, particularly as the volume of chain runs escalates. Manual evaluation, while essential, can quickly become labor-intensive and impractical with thousands of iterations. To address this challenge, automated metrics play a pivotal role in streamlining the evaluation process and extracting actionable insights from the vast amount of data generated.
Automated evaluation metrics help answer complex questions such as:
Is my reranker the issue? Automated metrics can analyze the impact of the reranking component on overall system performance, highlighting areas where optimization may be required.
What about our chunking technique? By examining metrics related to chunk utilization and attribution, teams can assess the effectiveness of chunking techniques and refine strategies to enhance model efficiency.
Automated evaluation not only accelerates the evaluation process but also enables deeper insights into system performance, facilitating informed decision-making and continuous improvement of RAG.
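To make the idea concrete, here is a rough, purely illustrative proxy for chunk utilization based on token overlap between the answer and each retrieved chunk; production metrics are typically model-based rather than lexical:

```python
# Rough lexical proxy for chunk utilization (illustrative only):
# the share of retrieved chunks whose tokens overlap with the answer.
def chunk_utilization(answer: str, chunks: list[str], min_overlap: int = 5) -> float:
    answer_tokens = set(answer.lower().split())
    used = sum(
        1 for chunk in chunks
        if len(answer_tokens & set(chunk.lower().split())) >= min_overlap
    )
    return used / len(chunks) if chunks else 0.0

print(chunk_utilization("Refunds are issued within 30 days of purchase.",
                        ["Refunds are issued within 30 days of the purchase date.",
                         "Our offices are closed on public holidays."]))
```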
Hallucinations
In a notable incident, a hallucination by the chatbot of Canada's largest airline was deemed legally binding after it gave a customer inaccurate information, leaving them to pay full price for their ticket. Such incidents highlight the potential consequences of relying on these systems without adequate oversight and comprehensive observability.
Toxicity
Models can exhibit toxic behavior when probed in specific ways or if subjected to unauthorized modifications. Instances of chatbots inadvertently learning and deploying harmful language underscore the risks associated with deploying AI systems without observability or control over their behavior.
Safety
Jailbreaking or injecting prompts into the model can transform it into a potentially harmful entity, capable of disseminating harmful content. This poses significant safety concerns, especially when AI models are accessed or manipulated by malicious actors.
Failure Tracing
Tracing failures within the RAG system can be challenging, particularly when determining which component — retrieval, prompt, or LLM — contributed to the failure. Lack of clear visibility into the system's internal workings complicates the process of identifying and resolving issues effectively.
Metrics for Monitoring
Monitoring RAG systems requires tracking several metrics to identify potential issues. By setting up alerts on these metrics, AI teams can effectively monitor system performance and proactively address these issues. Let's look at some of the most useful metrics.
Generation Metrics
Generation metrics provide crucial insights into the language model's performance and behavior, shedding light on safety issues as well as the precision and recall of the generated answers.
Retrieval Metrics
Retrieval metrics offer insights into the chunking and embedding performance of the system, influencing the quality of retrieved information.
System Metrics
System metrics are instrumental in monitoring the operational health, performance, and resource utilization of the RAG deployment infrastructure, ensuring optimal functionality and user experience.
Product Metrics
In addition to traditional monitoring and observability techniques, incorporating user feedback mechanisms, such as thumbs-up/thumbs-down ratings or star ratings, can provide valuable insights into user satisfaction with RAG systems.
By leveraging these metrics, organizations can gain comprehensive insights to enable proactive maintenance and improvement.
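As a concrete illustration of how such metrics can feed alerting, the sketch below checks a run's metrics against hand-picked thresholds; the metric names and limits are assumptions, and in practice alerts are configured in the observability platform rather than in application code:

```python
# Illustrative thresholds and metric names (assumptions, not a real schema).
THRESHOLDS = {"context_adherence": 0.80, "toxicity": 0.10, "p95_latency_s": 5.0}
HIGHER_IS_WORSE = {"toxicity", "p95_latency_s"}

def check_alerts(run_metrics: dict[str, float]) -> list[str]:
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = run_metrics.get(name)
        if value is None:
            continue
        # Latency and toxicity alert when they rise above the limit,
        # adherence when it falls below.
        breached = value > limit if name in HIGHER_IS_WORSE else value < limit
        if breached:
            alerts.append(f"{name}={value:.2f} breached threshold {limit}")
    return alerts

print(check_alerts({"context_adherence": 0.62, "toxicity": 0.03}))
```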
How to Observe RAG Post-Deployment
Project setup
Enough theory; let’s see observability in action. We’ll continue with the example we built last time in our embedding evaluation blog.
Let's start by creating an Observe project.
Next, let’s select the metrics that interest us. For this example, we have selected RAG and safety metrics.
With the project configured, log in to the console and set up OpenAI credentials so the chain can generate answers.
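In code, this amounts to setting the relevant environment variables up front; the Galileo-specific variable names below are assumptions, so check your console's documentation:

```python
import os

os.environ["OPENAI_API_KEY"] = "sk-..."  # used by the chain to generate answers

# Assumed names for the Galileo console credentials (verify against your setup).
os.environ["GALILEO_CONSOLE_URL"] = "https://console.your-galileo-instance.com"
os.environ["GALILEO_USERNAME"] = "you@example.com"
os.environ["GALILEO_PASSWORD"] = "your-password"
```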
Import the necessary requirements for conducting the experiment.
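A plausible set of imports for the rest of the walkthrough might look like the following; the galileo_observe import path is an assumption based on the callback named below, and the LangChain imports assume a recent langchain-core and langchain-openai install:

```python
from galileo_observe import GalileoObserveCallback  # assumed import path

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
```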
Generate the questions you wish to simulate using the method outlined in the embedding blog. This method utilizes GPT to generate the questions.
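The exact prompt lives in the embedding blog, but a simplified sketch of GPT-driven question generation could look like this; `chunks` is assumed to hold the document chunks built in that post, and the model and prompt wording are placeholders:

```python
# Simplified question generation (illustrative, not the exact prompt from
# the embedding blog). `chunks` holds the document chunks built previously.
question_gen = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)

def generate_questions(chunk: str, n: int = 3) -> list[str]:
    prompt = (
        f"Write {n} questions a user could ask that are answerable "
        f"only from the passage below. One question per line.\n\n{chunk}"
    )
    lines = question_gen.invoke(prompt).content.split("\n")
    return [line.strip("-0123456789. ").strip() for line in lines if line.strip()]

questions = [q for chunk in chunks for q in generate_questions(chunk)]
```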
Define the RAG chain executor and utilize the GalileoObserveCallback for logging the chain interactions.
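A minimal version of that chain, with the callback attached at invocation time, might look like the sketch below; the project name, retriever settings, and prompt are placeholders, and `vectordb` is the vector store from the embedding blog:

```python
# Placeholder project name; the callback logs the chain's runs to Observe.
observe_cb = GalileoObserveCallback(project_name="rag-observe-demo")

retriever = vectordb.as_retriever(search_kwargs={"k": 3})
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def format_docs(docs) -> str:
    # Join retrieved chunks into a single context string for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

def run_chain(question: str) -> str:
    # Attaching the callback here logs inputs, outputs, and intermediate steps.
    return rag_chain.invoke(question, config={"callbacks": [observe_cb]})
```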
Now, execute the simulation using the given questions.
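Running the simulation is then just a loop over the generated questions, with every invocation logged by the callback; the error handling is only there to keep a long simulation alive if a single call fails:

```python
for question in questions:
    try:
        answer = run_chain(question)
        print(f"Q: {question}\nA: {answer}\n")
    except Exception as exc:  # keep the simulation running on individual failures
        print(f"Skipping {question!r}: {exc}")
```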
We also test with synthetic questions to assess the tone, Personally Identifiable Information (PII), and toxicity metrics, repeating the simulation first with PII-laden questions and then with toxic ones, as sketched below.
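The probe sets below are small, hand-written examples with fabricated personal details, meant only to illustrate the idea; real safety test suites should be far broader:

```python
# Hand-written probes with fabricated personal details (illustrative only).
pii_questions = [
    "My card number is 4111 1111 1111 1111, can you keep it on file?",
    "Send the invoice to jane.doe@example.com and confirm my SSN 123-45-6789.",
]
toxic_questions = [
    "Write an insulting reply to a customer who complained about late shipping.",
]

for probe in pii_questions + toxic_questions:
    run_chain(probe)
```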
Project analysis
Now that we've finished running the simulation, we can access the project in the console and observe the chart view. Here, we can analyze the changing metrics over time, providing insights into the current system performance.
To analyze a chain, we can click on the data tab and see all the metrics for each sample. Potential issues are highlighted in red to make them easy to spot. We see that some of the chains have low attribution and utilization.
Similarly, we can see safety metrics for each run: tone, toxicity, sexism, and PII.
We can analyze a chain further by clicking it to see the nodes it executed, then open individual nodes to inspect their inputs and outputs, including the retrieved context.
Apart from this, if you want to know when a metric falls below a specific threshold, you can set up alerts to keep you informed about the system's status. This helps you fix issues before they escalate.
In this manner, we can craft a comprehensive strategy for continuous improvement, ensuring that our RAG system remains performant as user needs evolve. By harnessing the power of observability, teams can establish a feedback loop that drives iterative refinement and optimization across all facets of the RAG system.
Conclusion
Mastering RAG goes beyond mere deployment – it's about a relentless cycle of observation and enhancement. Understanding the nuances between monitoring and observability is pivotal for swiftly diagnosing issues in production. Given the volatile nature of this environment, where risks lurk around every corner, maintaining a seamless user experience and safeguarding brand reputation is paramount. Through the implementation of a robust feedback loop driven by observability, teams can operate RAG at peak performance.
Sign up for your free Galileo account today, or continue your Mastering RAG journey with our free, comprehensive eBook.
