Oct 8, 2025
How to Continuously Improve Your LangGraph Multi-Agent System


We've all been trapped in a chatbot loop that keeps asking, "I'm sorry, I didn't understand that. Can you rephrase?" Or worse, a bot that cheerfully claims it can help with everything, then fails at everything. You ask about a billing error, but it suggests restarting your router.
Most chatbots attempt to be generalists, but they often end up being proficient in nothing. Real customer service works differently. When you call your telecom provider, you get routed to specialists who actually understand your specific problem.
That's exactly what we're building today! A multi-agent system for ConnectTel that creates delightful customer experiences through intelligent routing to specialized agents. We'll use LangGraph to build it and Galileo for real-time agent metrics, failure insights, and performance tracking so you can continuously improve what's actually working.
Welcome to the fifth installment of our Mastering Agent series on multi-agent systems. Don't miss the other pieces in the series.
Agent Architecture with Observability

Our system employs a supervisor pattern, where a main coordinator routes customer queries to specialized agents based on the customer's actual needs. We will leverage Chainlit for the conversational UI, Pinecone for vector-based knowledge retrieval, and GPT-4.1 as the underlying LLM.
The beauty of this design is that each agent can be developed, tested, and improved independently. When billing logic changes, you don't touch the technical support code. When you need to add a new capability, you add a new agent without rewriting everything.
However, we don't just want to build the agent; we also want to enable continuous improvement. That requires insight into the failures that happen when users ask questions we never anticipated and expose edge cases we never tested. That is why we are building observability into our system from day one.
Here is how our continuous improvement cycle looks:
Monitor: Track every agent decision, tool call, and routing choice. You need to see not just what agents answer, but how they arrive at those answers. Which agent handled the query? What tools did they use? How long did each step take?
Debug: When things go wrong (and they will), trace through the entire chain. Did the supervisor route to the wrong agent? Did the tool return unexpected data? Did the agent misinterpret the response? Real production failures teach you more than any synthetic test.
Improve: Make targeted fixes based on actual data. Perhaps your billing agent requires more effective prompts for international charges. Technical support may need a timeout for slow diagnostics. The improvements can be measured and you'll know immediately if it worked.
This isn't a one-time optimization. It's an ongoing practice that separates production-ready systems from demos. The architecture we're about to build supports this naturally, with clear separation between agents and comprehensive tracking at every step.
Quick Setup Guide
Let's get this running on your machine. You'll need Python 3.9+, an OpenAI API key, and Pinecone and Galileo accounts for the full experience. The complete source code is available in the repository, ready for you to adapt to your own use case.
Installation Steps
First, clone and navigate to the project:
git clone https://github.com/rungalileo/sdk-examples
cd sdk-examples
git checkout feature/langgraph-telecom-agent
cd python/agent/langgraph-telecom-agent
Configure your environment variables:
cp .env.example .env

# Edit .env with your keys:
OPENAI_API_KEY
PINECONE_API_KEY
GALILEO_API_KEY
GALILEO_PROJECT="Multi-Agent Telecom Chatbot"
GALILEO_LOG_STREAM="prod"
Install dependencies using uv (recommended) or pip:
# Using uv
uv sync --dev

# Or using pip
pip install -e .
For the agents to act in line with company guidelines and products, they need to be given the right context. For this, we create a document with the guidelines.
Here is a snippet from the troubleshooting guide, which will be chunked and indexed so the Technical Support agent can use it via retrieval.
# Network Troubleshooting Guide

## Common Network Issues and Solutions

### No Signal / No Service

#### Symptoms
- "No Service" or "Searching" message
- Unable to make calls or send texts
- No data connection

#### Solutions
1. **Toggle Airplane Mode**
   - Turn ON for 30 seconds
   - Turn OFF and wait for reconnection
2. **Restart Device**
   - Power off completely
   - Wait 30 seconds
   - Power on
3. **Check SIM Card**
   - Remove and reinsert SIM
   - Clean SIM contacts with soft cloth
   - Try SIM in another device
4. **Reset Network Settings**
   - iOS: Settings > General > Reset > Reset Network Settings
   - Android: Settings > System > Reset > Reset Network Settings
5. **Update Carrier Settings**
   - iOS: Settings > General > About (prompt appears if update available)
   - Android: Settings > System > Advanced > System Update

### Slow Data Speeds

#### Symptoms
- Web pages loading slowly
- Video buffering frequently
- Apps timing out

[continued]
Our Plan Advisor and Technical Support agents need these docs indexed in Pinecone, which is our vector DB.
python ./scripts/setup_pinecone.py
# or
uv run ./scripts/setup_pinecone.py
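If you're curious what an indexing script like this does under the hood, it usually boils down to chunking the markdown, embedding each chunk, and upserting the vectors. Here is a minimal sketch, assuming the Pinecone index already exists; the index name "telecom-docs", the docs folder, and the embedding model are illustrative assumptions, not necessarily what the repo's script uses:

import os
from pathlib import Path

from pinecone import Pinecone
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Assumes the Pinecone index already exists; names and paths are illustrative.
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("telecom-docs")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

for doc_path in Path("./data").glob("*.md"):
    # Chunk each guide, embed the chunks, and upsert them with their source metadata
    chunks = splitter.split_text(doc_path.read_text())
    vectors = embeddings.embed_documents(chunks)
    index.upsert(vectors=[
        {
            "id": f"{doc_path.stem}-{i}",
            "values": vec,
            "metadata": {"text": chunk, "source": doc_path.name},
        }
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ])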
Once the agent is defined, the Galileo callback can be integrated with LangGraph in just a few lines of code.
from galileo import galileo_context
from galileo.handlers.langchain import GalileoAsyncCallback

# Initialize Galileo context first
galileo_context.init()

# Start Galileo session with unique session name
galileo_context.start_session(name=session_name, external_id=cl.context.session.id)

# Create the callback. This needs to be created in the same thread as the session
# so that it uses the same session context.
galileo_callback = GalileoAsyncCallback()

# Pass the callback to the agent instance
supervisor_agent.astream(
    input=messages,
    stream_mode="updates",
    config=RunnableConfig(callbacks=galileo_callback, **config),
)
Launch the Chainlit interface with our LangGraph application:
chainlit run app.py -w
# or
uv run chainlit run app.py -w
The application will be available at http://localhost:8000. You'll see a chat interface where you can start asking questions about bills, technical issues, or plan recommendations.
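For reference, a stripped-down Chainlit handler that wires the UI to the graph looks roughly like this. This is a simplified sketch, not the repo's actual app.py (which also sets up the Galileo session and callback shown above), and the import path for create_supervisor_agent is an assumption:

import chainlit as cl
from langchain_core.messages import HumanMessage
from langchain_core.runnables import RunnableConfig

from agents.supervisor import create_supervisor_agent  # illustrative import path

supervisor_agent = create_supervisor_agent()


@cl.on_message
async def on_message(message: cl.Message):
    # Reuse the Chainlit session id as the LangGraph thread id for conversation memory
    config = RunnableConfig(configurable={"thread_id": cl.context.session.id})
    answer = ""
    async for update in supervisor_agent.astream(
        {"messages": [HumanMessage(content=message.content)]},
        stream_mode="updates",
        config=config,
    ):
        # Each update maps a node name to its output; keep the latest assistant message
        for node_output in update.values():
            if node_output and node_output.get("messages"):
                answer = node_output["messages"][-1].content
    await cl.Message(content=answer).send()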
Understanding the Core Components
The Supervisor Agent
The supervisor is the brain of our operation. It analyzes incoming queries and routes them to the right specialist. Here's how it works:
import os

from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph_supervisor import create_supervisor

# billing_account_agent, technical_support_agent, and plan_advisor_agent are the
# specialized worker agents, created in their own modules.


def create_supervisor_agent():
    """
    Create a supervisor agent that manages all the agents in the ConnectTel telecom application.
    """
    checkpointer = MemorySaver()
    telecom_supervisor_agent = create_supervisor(
        model=ChatOpenAI(model=os.environ["MODEL_NAME_SUPERVISOR"], name="Supervisor"),
        agents=[
            billing_account_agent,
            technical_support_agent,
            plan_advisor_agent,
        ],
        prompt=("""
        You are a supervisor managing a team of specialized telecom service agents at ConnectTel.
        Route customer queries to the appropriate agent based on their needs:
        - Billing Account Agent: Bill inquiries, payment issues, usage tracking
        - Technical Support Agent: Device troubleshooting, connectivity issues
        - Plan Advisor Agent: Plan recommendations, upgrades, comparing plans

        Guidelines:
        - Route queries to the most appropriate specialist agent
        - For complex issues spanning multiple areas, coordinate between agents
        - Be helpful and empathetic to customer concerns
        """),
        add_handoff_back_messages=True,
        output_mode="full_history",
        supervisor_name="connecttel-supervisor-agent",
    ).compile(checkpointer=checkpointer)
    return telecom_supervisor_agent
The supervisor uses LangGraph's built-in create_supervisor function, which handles the complex orchestration logic. This includes message routing based on query analysis, state management across agent interactions, and memory persistence for conversation continuity.
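Because the graph is compiled with a MemorySaver checkpointer, conversation state is keyed by the thread_id you pass at invocation time. A minimal usage sketch (the thread id and messages are illustrative):

from langchain_core.messages import HumanMessage

supervisor_agent = create_supervisor_agent()
config = {"configurable": {"thread_id": "customer-42"}}  # same id -> same conversation

first = supervisor_agent.invoke(
    {"messages": [HumanMessage(content="Why is my bill higher this month?")]},
    config=config,
)
followup = supervisor_agent.invoke(
    {"messages": [HumanMessage(content="And how much data have I used?")]},
    config=config,  # the checkpointer restores the earlier context
)
print(followup["messages"][-1].content)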
Notice the add_handoff_back_messages=True parameter. This allows agents to return control to the supervisor when they need help from another agent. It's like a customer service rep saying "Let me transfer you to billing," but happening seamlessly in the background.
Building Specialized Agents
Each specialized agent handles specific types of inquiries. They're built using LangGraph's create_react_agent pattern, which implements reasoning and action cycles. Let's look at the Billing Account Agent:
import os

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

from ..tools.billing_tool import BillingTool

# CompiledGraph (the return type) is imported from langgraph in the full module.


def create_billing_account_agent() -> CompiledGraph:
    """
    Create an agent that handles billing inquiries, usage tracking, and account management.
    """
    # Create the billing tool instance
    billing_tool = BillingTool()

    # Create a ReAct agent with specialized prompt
    agent = create_react_agent(
        model=ChatOpenAI(
            model=os.environ["MODEL_NAME_WORKER"], name="Billing Account Agent"
        ),
        tools=[billing_tool],
        prompt=("""
        You are a Billing and Account specialist for ConnectTel.
        You help customers with billing inquiries, usage tracking, plan details, and payment issues.

        Key responsibilities:
        - Check account balances and payment due dates
        - Track data, voice, and text usage
        - Explain charges and fees clearly
        - Suggest plan optimizations based on usage
        - Process payment-related inquiries
        - Review billing history

        When discussing charges:
        - Break down costs clearly
        - Highlight any unusual charges
        - Suggest ways to reduce bills if usage patterns show opportunity
        - Always mention auto-pay discounts if not enrolled

        Be empathetic about high bills and offer solutions.
        """),
        name="billing-account-agent",
    )
    return agent
The ReAct pattern means the agent reasons about when to use tools versus when to respond directly. The specialized prompt defines not just what the agent knows, but how it should behave. The empathy instructions are there for scenarios where people might be frustrated about bills. But how do we measure empathy? Galileo provides an out-of-the-box Sentiment metric for measuring tone.
Creating Effective Tools
Tools are where your agents interact with the real world. They're the bridge between conversation and action. We create a mock billing tool that responds based on the query_type and customer_id arguments. The responses are randomly generated but mimic real responses for different customers.
from typing import Optional
from langchain.tools import BaseTool
from datetime import datetime, timedelta
import random


class BillingTool(BaseTool):
    """Tool for retrieving customer billing and usage information."""

    name: str = "billing_account"
    description: str = "Check account balance, data usage, plan details, and billing history"

    def _run(self, customer_id: Optional[str] = None, query_type: str = "summary") -> str:
        """
        Get billing and usage information.

        Args:
            customer_id: Customer account number (uses default if not provided)
            query_type: Type of query (summary, usage, plan, history)
        """
        # Mock customer data - in production, this would query real databases
        customer = {
            "name": "John Doe",
            "account": customer_id or "ACC-2024-789456",
            "plan": "Premium Unlimited 5G",
            "monthly_charge": 85.00,
            "data_used": random.uniform(20, 80),
            "data_limit": "Unlimited",
            "due_date": (datetime.now() + timedelta(days=15)).strftime("%Y-%m-%d")
        }

        if query_type == "usage":
            return f"""
            Usage Summary for {customer['name']}:
            - Data: {customer['data_used']:.1f} GB used ({customer['data_limit']})
            - Minutes: {random.randint(300, 800)} (Unlimited)
            - Texts: {random.randint(500, 2000)} (Unlimited)
            - Average daily: {customer['data_used'] / 15:.2f} GB
            """
        elif query_type == "plan":
            return f"""
            Current Plan: {customer['plan']}
            - Monthly Cost: ${customer['monthly_charge']:.2f}
            - Data: {customer['data_limit']}
            - Talk & Text: Unlimited
            - 5G Access: Included

            Available Upgrades:
            - Business Elite ($120/month)
            - International Plus ($95/month)
            """
        elif query_type == "history":
            history = []
            for i in range(3):
                date = (datetime.now() - timedelta(days=30 * (i + 1))).strftime("%Y-%m-%d")
                amount = customer['monthly_charge'] + random.uniform(-5, 15)
                history.append(f"- {date}: ${amount:.2f} (Paid)")
            return f"""
            Billing History:
            {chr(10).join(history)}

            Auto-pay: Enabled
            """

        # Default summary response
        return f"""
        Account Summary for {customer['name']}:
        - Account: {customer['account']}
        - Plan: {customer['plan']}
        - Amount Due: ${customer['monthly_charge']:.2f}
        - Due Date: {customer['due_date']}
        - Data Used: {customer['data_used']:.1f} GB ({customer['data_limit']})
        """
The tool uses mock data for demonstration purposes, but the underlying structure is ready to be wired to real systems. The key is the clear description field, which helps agents understand when to use the tool. The flexible parameters support various query types, and the formatted output is easy for humans to read.
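Because the tool is an ordinary LangChain tool, you can sanity-check it in isolation before handing it to an agent. A quick, illustrative check (not part of the repo's test suite):

# Quick manual check of the mock tool outside the agent loop
tool = BillingTool()
print(tool.invoke({"query_type": "usage"}))
print(tool.invoke({"query_type": "plan", "customer_id": "ACC-2024-123456"}))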
How It Works in Practice
Let's walk through real conversations to see the system in action.
Example 1: Multi-Domain Query
When a user says, "My internet has been slow for the past week and I'm wondering if I'm being throttled because of my data usage. Can you check my current usage and bill?" here's what happens:

The supervisor first routes to the Billing Agent, which checks data usage and plan limits. Then it routes to the Technical Support Agent, which checks for connectivity issues. Finally, the supervisor combines both responses into a coherent answer.
Here is how it looks in the Galileo console, where we can see the traces, spans and metrics of the session.

Galileo automatically captures agent routing decisions, tool invocations with inputs and outputs, all LLM interactions and response times for each step.
Debugging Agents with Metrics
The real value of Galileo comes from understanding how your agents actually perform in production. Let's look at the key metrics.
Action Completion
Action Completion indicates whether your agents are actually assisting or merely responding. It's the difference between an agent that says "I'll check your bill" and one that actually retrieves and explains the charges.

Here's what makes an action complete in our telecom system. The agent must provide a complete answer (not just acknowledge the question), confirm successful actions when handling requests, maintain factual accuracy throughout, address every aspect of the user's query, avoid contradicting tool outputs, and properly summarize all relevant information from tools.

When a customer asks "Why is my bill so high this month?", a complete action retrieves the actual bill, identifies the specific charges causing the increase, compares it to previous months, and suggests ways to reduce it. That's what we're measuring.
A score below 80% usually indicates your agents are either not using their tools properly, providing generic responses instead of specific answers, or failing to follow through on multi-step processes. The beauty of tracking Action Completion is that it directly correlates with customer satisfaction.
Click "Configure Metrics" to open the Metrics Hub and enable the metrics.

Tool Selection Quality
Tool Selection Quality reveals whether your agents are choosing the right tools with the right parameters. It's like having a toolbox—knowing when to use a hammer versus a screwdriver makes all the difference.

If a customer asks "I want to upgrade my plan.", the Plan Advisor agent needs to select the right sequence of tools: first the billing tool to check current usage, then the plan comparison tool to find suitable upgrades, and finally the upgrade tool to process the change. If it skips straight to the upgrade tool without checking usage, it might recommend a plan that doesn't fit the customer's needs.
The metric evaluates two critical aspects: did the agent choose the correct tool, and did it use the correct parameters? In our system, we often see agents choose the right tool but set the parameters incorrectly.
A score below 80% is concerning. It means your agents are winging it instead of using their capabilities. The most common issue is that agents are either too eager or too reluctant to use tools. Too eager means calling tools for questions they can answer directly ("What are your business hours?"). Too reluctant means answering from general knowledge when they should check specific data.

Metric explanations help you quickly identify why your agent is underperforming. When you see low Tool Selection Quality scores, click into the generated reasoning to understand exactly what went wrong. Look for agents consistently choosing the wrong tool for certain query types. Then enhance your tool descriptions.
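One cheap way to enhance a tool description is to spell out when the tool should and shouldn't be used. A hedged sketch of a sharper description for the billing tool (the wording is illustrative, not from the repo):

from typing import Optional

from langchain.tools import BaseTool


class BillingTool(BaseTool):
    """Same tool as before; only the description is sharpened."""

    name: str = "billing_account"
    # Explicit "use for / don't use for" guidance helps the LLM pick the tool
    # and its parameters correctly.
    description: str = (
        "Look up a customer's billing and usage data. "
        "Use for questions about bills, charges, data/voice/text usage, plan cost, or payment history. "
        "Do NOT use for connectivity troubleshooting or plan recommendations. "
        "Set query_type to one of: summary, usage, plan, history."
    )

    def _run(self, customer_id: Optional[str] = None, query_type: str = "summary") -> str:
        ...  # body unchanged from the implementation shown earlier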
For example, when the explanation reveals that "transferring to the technical-support-agent here does not align well with the user's expressed needs," you've pinpointed a routing-logic issue. Use these insights to refine your routing criteria to better match user intent patterns, add conditional checks that evaluate conversation context before transfers, or create intermediate clarification steps for ambiguous queries.

The explanations also highlight limited capabilities. If the reasoning notes "tools available only allow transferring to one of the three agents," you know you need to make architectural changes to allow parallel processing.
Latency Breakdown

The latency trace graph reveals the execution timeline and performance characteristics of our system. Understanding where time is spent helps optimize the user experience by finding the bottlenecks. Galileo’s latency trace graph provides several valuable insights into system behavior and performance:
1. Agent Coordination Patterns: The trace reveals how agents collaborate in real time. We can see the Supervisor consistently orchestrating the conversation while specialized agents (Billing, Technical Support, Plan Advisor) activate only when their expertise is needed. This validates the efficiency of our multi-agent architecture and confirms that agents aren't running unnecessarily.
2. Bottleneck Identification: By examining the duration and frequency of each operation, we can pinpoint performance bottlenecks. The call_model operations exhibit the most significant latency contributions, indicating that LLM inference is the primary factor affecting response time, rather than transfer logic or retrieval operations.
3. Decision Flow Understanding: The should_continue markers illustrate how the system makes routing decisions throughout the conversation. Multiple checks ensure the conversation flows appropriately between agents, and we can trace exactly when and why transfers occur.
4. Retrieval Timing: The sparse appearance of pinecone_retrieval operations shows that knowledge base queries are triggered selectively rather than on every turn, indicating intelligent retrieval logic that balances accuracy with performance.
5. System Responsiveness: The overall timeline demonstrates that despite multiple agent handoffs and model calls, the system maintains reasonable end-to-end latency. This validates that our multi-agent approach doesn't introduce prohibitive overhead compared to a single-agent system.
Action items from the latency trace graph:
Implement prompt caching if model calling operations consistently show high latency
Switch to faster models for routing/decision logic while keeping powerful models for final responses (see the .env sketch below)
Parallelize retrieval with other operations instead of running sequentially
These insights help diagnose issues in production and validate architectural choices in our conversational AI system.
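The second action item is especially cheap to try here because model choice is already driven by environment variables. For example, in .env (the model names are illustrative; pick whatever tiers fit your latency and quality budget):

# .env - route with a smaller, faster model; answer with a stronger one
MODEL_NAME_SUPERVISOR="gpt-4.1-mini"
MODEL_NAME_WORKER="gpt-4.1"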
Performance Benchmarks for Production
Once your multi-agent system moves from development to production, establishing clear performance targets becomes critical for maintaining user satisfaction and operational excellence. These benchmarks serve multiple purposes: they provide guardrails for deployment decisions, enable objective evaluation of system changes, and help teams identify when performance degradation requires immediate attention.
The metrics below are derived from real-world production data across multiple deployments and correlate directly with user satisfaction scores. Systems that consistently hit "Excellent" targets see significantly higher user retention and completion rates, while those falling into "Needs Improvement" typically generate support tickets and abandoned sessions.
Based on production deployments, here are the targets you should aim for. These are based on what users actually tolerate and what drives satisfaction scores.
| Metric | Excellent | Good | Needs Improvement |
|---|---|---|---|
| Action Completion | > 95% | 85-95% | < 80% |
| Tool Selection Quality | > 90% | 85-90% | < 85% |
| Avg Response Time | < 2s | 2-4s | > 4s |
| Supervisor Routing Accuracy | > 95% | 90-95% | < 90% |

Continuous Improvement with Insights
The moment your multi-agent system hits production, it encounters edge cases you never tested. A user asks "Why is my bill higher this month?" and gets transferred between Billing and Technical Support three times. Your logs show 47 LLM calls, 12 tool invocations, and 8 agent handoffs—but which one failed?
Traditional debugging forces you to:
Manually inspect hundreds of traces to find patterns
Guess whether the issue is in your supervisor's routing prompt, the Billing Agent's tool descriptions, Pinecone retrieval returning irrelevant context, or the fundamental agent architecture
Reproduce failures locally, which often behave differently than production
Wait until multiple users complain before you even know there's a problem
The Insights Engine automates this investigation. Instead of spending hours hunting through traces, Galileo analyzes your entire log stream and surfaces exactly what's broken.


Let's look at the two insights generated in our example and understand how they can guide us in making the agent reliable.
Context Memory Loss Detection: When Galileo identifies agents re-asking for information already provided (like the Plan Advisor requesting usage details after the Billing Agent just discussed them), it pinpoints exactly where conversation context breaks down. The insight shows you the specific span where memory was lost and provides a concrete fix: implement state persistence across agent handoffs or add a shared memory layer. This prevents frustrating user experiences where customers must repeat themselves.
Inefficient Tool Usage Patterns: The Multiple Retrieval Calls insight reveals when agents make redundant API calls that could be batched. Instead of manually reviewing hundreds of traces to find this pattern, Galileo shows you the exact sessions where the Plan Advisor queried the retrieval tool three separate times for different plan categories. The suggested action is immediate: refactor your tool calling logic to accept multiple query parameters or combine similar requests into a single retrieval operation, cutting API costs and reducing latency.
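A hedged sketch of what that refactor might look like: a retrieval tool that accepts several queries in one call instead of being invoked once per plan category. The tool name, index wiring, and result formatting are assumptions for illustration, not the repo's actual implementation:

import os
from typing import List

from pinecone import Pinecone
from langchain.tools import BaseTool
from langchain_openai import OpenAIEmbeddings


class BatchPlanRetrievalTool(BaseTool):
    """Retrieve knowledge-base passages for several queries in a single call."""

    name: str = "plan_knowledge_batch"
    description: str = (
        "Retrieve plan documentation for one or MORE queries at once. "
        "Pass all related questions in a single call instead of calling this tool repeatedly."
    )

    def _run(self, queries: List[str], top_k: int = 3) -> str:
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        index = pc.Index("telecom-docs")  # illustrative index name
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

        # One embedding request for all questions, then one Pinecone query per question,
        # all inside a single tool call instead of three separate agent tool calls
        sections = []
        for query, vector in zip(queries, embeddings.embed_documents(queries)):
            results = index.query(vector=vector, top_k=top_k, include_metadata=True)
            passages = "\n".join(f"- {m.metadata.get('text', '')}" for m in results.matches)
            sections.append(f"Results for '{query}':\n{passages}")
        return "\n\n".join(sections)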
Each insight includes:
Timeline view showing when and how often the issue occurs
Example sessions with direct links to problematic traces
Impact analysis quantifying affected spans over the last two weeks
Use the "Refresh insights" button after deploying fixes to validate your improvements and track whether the issue frequency decreases.
Agent Observability in Production

Agent Quality
Agent Quality metrics capture the dimensions of agent behavior that directly impact user experience.
Action Completion tracks the percentage of user intents that successfully complete end-to-end. A 95% rate means that the vast majority of requests are fulfilled; however, the remaining 5% represents opportunities for improvement. Combined with Insights Engine, you can identify which specific actions fail most often and why.
Tool Selection Quality measures whether agents choose the right tools for each situation. The 98% score shown indicates highly accurate routing decisions like selecting the appropriate booking tool, knowledge base, or specialized agent for each user need. This is particularly critical for multi-agent systems, where incorrect routing can cascade into unnecessary handoffs and user frustration.
System Metrics
While agent quality measures "did it work correctly," system metrics track "did it work efficiently." These operational metrics ensure your system remains responsive and reliable as traffic scales.
Latency: Average response time reveals how long users wait. The 5.95 second average represents reasonable performance for complex multi-agent interactions. Context matters: simple greetings should respond in under 1 second, single-tool calls in 2-4 seconds, complex workflows in 8-10 seconds. Anything exceeding 15 seconds risks user abandonment.
API Failures: Tracks the reliability of external integrations. Zero failures indicate all API calls succeed. Even small increases deserve investigation as API failures cascade through workflows, causing incomplete actions. Use Insights Engine to identify which APIs fail and under what conditions.
Traces Count: Tracks conversation volume and usage patterns. Spikes indicate increased adoption; drops might signal access issues. If Action Completion drops when traces spike, you have scaling issues.
Custom Metrics for Business-Specific Insights
While Galileo provides comprehensive, out-of-the-box metrics, the real power lies in tracking what matters to your specific business. Let's create a custom metric to track something critical for telecom: unlimited plan recommendations.
In our system, we want to know when agents should be suggesting unlimited plans. Users approaching their data limits are prime candidates for upgrades, but are our agents catching these opportunities?

The LLM judge (GPT-4o in this case) assesses whether the agent correctly identified the opportunity and made an appropriate suggestion.
The evaluation criteria are straightforward and map directly onto the judge prompt (a sketch follows the list):
Successful: The user clearly requested an unlimited plan, and the agent accurately suggested one
Fail: The user requested an unlimited plan, but the agent didn't suggest it or suggested something incorrect
Unrelated: The user's request wasn't about unlimited plans, or the response was unrelated
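If you are wiring a judge like this up yourself, the prompt can simply restate those criteria. A minimal sketch, assuming the judge sees the user input and the agent output (the wording is illustrative, not Galileo's built-in template):

SUGGESTED_UNLIMITED_PLAN_PROMPT = """
You are evaluating a ConnectTel support agent's response.

User request:
{user_input}

Agent response:
{agent_output}

Classify the interaction with exactly one label:
- Successful: the user clearly asked about unlimited plans and the agent accurately suggested one
- Fail: the user asked about unlimited plans but the agent did not suggest one, or suggested something incorrect
- Unrelated: the request was not about unlimited plans, or the response is unrelated

Answer with only the label.
"""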
This metric becomes powerful when combined with business data. If we see high data usage customers not receiving unlimited plan suggestions (lots of "Fail" ratings), we know exactly where to improve our agent prompts. Maybe the billing agent needs clearer instructions about upgrade thresholds.
The beauty of custom metrics is they bridge the gap between generic performance tracking and specific business outcomes. You're not just measuring whether agents work but whether they drive the behaviors that matter to your business.

Looking at this real interaction, you can see the custom metric in action. When a user asks, "hi i am interested to know if you have unlimited plans" the system correctly routes through the Supervisor → Plan Advisor Agent.
Notice how our suggested_unlimited_plan metric evaluates each step: the Supervisor gets marked as "Successful" for correctly routing to the Plan Advisor, the Plan Advisor gets "Successful" for providing relevant unlimited plan information, and the final Supervisor step is "Successful" for delivering the complete response. The agent responds with specific unlimited plan options and even guides the user to connect with the Plan Advisor Agent for personalized recommendations.
The metric confirms that our agents are accurately capturing the intent and responding appropriately. Over time, this data helps us understand conversion opportunities.
Making Observability Actionable
Raw metrics only matter if they drive action. The Trends dashboard transforms observability from passive monitoring into an active improvement system. Here's exactly how to use it:
Week 1: Establish Your Baseline
Run your chatbot for a week and document normal performance. For the Telecom Chatbot, that's:
90% Tool Selection Quality
95% Action Completion
~6 second latency
These numbers serve as your quality floor; any drop below them should trigger an investigation.
Daily: Catch Regressions Immediately
Check the 12H or 1D view each morning. If you see Action Completion suddenly drop from 95% to 85%:
Click the degraded time period to filter traces
Open Log Stream Insights to see which specific agent is failing
Investigate immediately and don't wait for customer complaints
Weekly: Identify Patterns and Plan Fixes
Switch to the 1W view for sprint planning:
Look for recurring patterns
Click into those traces to discover why users behave differently
Add training examples to address the pattern
Track whether the fix works by comparing next week's metrics
Monthly: Validate Major Changes
Before deploying major updates, snapshot your 1M metrics:
Baseline before change: Action Completion 95%, Latency 5.95s
After deployment: Check if metrics stayed stable or degraded
Document the impact: "Added Plan Advisor agent → metrics remained stable → validates architecture"
Configure Alerts

Configure these alerts with specific investigation steps:
Critical Alert: Action Completion < 90%
When triggered, check in this order:
Filter traces to the alert time window and sort by Action Completion score
Identify the failing agent: Is it Billing, Technical Support, or Plan Advisor?
Review Log Stream Insights for automated root cause analysis
Check the latency trace graph: Are tool calls timing out before completion?
Inspect failed tool calls: Are Pinecone retrievals returning empty results? Are transfer functions throwing errors?
Verify external dependencies: Is your knowledge base or CRM API down?
Immediate mitigation: If one agent is broken, route all traffic to working agents while you fix it
Warning Alert: Tool Selection Quality < 85%
Investigation steps:
Click the degraded time period to filter affected sessions
Check which tools are being selected incorrectly: Are users being sent to Technical Support when they need Billing?
Review the supervisor's routing decisions: Open 5-10 failed traces and read the should_continue reasoning
Look for new query patterns: Has user language changed? Are they asking about a new product you haven't trained for?
Compare successful vs. failed routing: What keywords appear in correctly routed queries that are missing in failures?
Fix: Update supervisor prompt with explicit examples of the new query patterns, or enhance tool descriptions with the missing keywords
Performance Alert: Latency > 8 seconds
Debugging checklist:
Switch to latency trace graph to identify the bottleneck operation
Check if LLM operations spiked: Did the LLM provider have an outage or slowdown?
Inspect retrieval latency: Are Pinecone queries taking longer than usual? Check their status page
Count sequential operations: Did a recent change add extra model calls or tool invocations?
Review Traces Count metric: Is high traffic causing queuing delays?
Check for retry loops: Are agents getting stuck in should_continue cycles, repeatedly checking transfer conditions?
Quick fix: Implement timeout thresholds to fail fast instead of waiting for slow operations (see the sketch after this checklist)
Long-term fix: Add prompt caching, parallelize retrieval with routing, or upgrade to faster model tiers
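Here is a hedged sketch of that quick fix: wrap the end-to-end agent call in a hard timeout so the user gets a fast fallback rather than an open-ended wait (the timeout value and fallback message are illustrative):

import asyncio


async def answer_with_timeout(supervisor_agent, messages, config, timeout_s: float = 8.0):
    """Run the agent, but fail fast if the end-to-end call exceeds timeout_s seconds."""
    try:
        # ainvoke returns the final graph state; the same wrapper pattern works around astream consumers
        return await asyncio.wait_for(
            supervisor_agent.ainvoke({"messages": messages}, config=config),
            timeout=timeout_s,
        )
    except asyncio.TimeoutError:
        # Fallback path: apologize instead of leaving the customer hanging
        return {
            "messages": [{
                "role": "assistant",
                "content": "Sorry, this is taking longer than expected. "
                           "Please try again in a moment or ask to be connected to a human agent.",
            }]
        }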
Reliability Alert: API Failures > 5%
Immediate actions:
Check the System Metrics panel to see which time periods have failures
Filter traces with errors and group by failure type
Identify the failing service: Pinecone retrieval? LLM API? Transfer functions?
Review error messages in failed spans: Are they rate limits, server errors, or timeouts?
Check your external service status pages: Is Pinecone, your LLM provider, or your internal CRM experiencing issues?
Review recent code deployments: Did you introduce a bug in tool calling logic?
Emergency response: Implement circuit breakers to stop calling failing services, show graceful error messages to users
Post-incident: Add retries with exponential backoff, or switch to backup services
Volume Alert: Traces Count > 1M/day
Capacity check:
Verify quality metrics remained stable: Did the traffic spike degrade Action Completion or Tool Selection?
Check latency trends: Is response time increasing with load?
Review API Failures: Are you hitting rate limits on external services?
Identify traffic source: Filter by user metadata to see if it's organic growth or a bot attack
If quality stayed high: Document this as validation that your system scales well
If quality degraded: Implement request throttling, add caching layers, or provision more resources
Close the loop after fixing an issue:
Monitor the same metric for 24-48 hours to confirm the fix worked
Update your alert thresholds if you've improved baseline performance
Document the incident with the symptom (what alert fired), root cause (what you found), and solution (what you changed)
Share learnings with your team so the next person can debug faster
Production observability only matters if it guides what you build. Use this dashboard daily to catch failures, weekly to identify patterns, monthly to validate improvements, and yearly to demonstrate that your agent system is creating business impact.
Wrapping Up
Our agent system demonstrates how modern AI orchestration creates sophisticated customer service solutions. By combining LangGraph's agent management with monitoring, you build systems that are both powerful and observable.
The architecture's modularity makes it particularly suitable for enterprises where different teams manage different aspects of customer service. Yes, there's added complexity and some latency overhead. But for production systems handling diverse queries, the benefits of specialization, scalability, and comprehensive monitoring can make it worthwhile.
The complete source code is available in the repository, ready for you to adapt to your own use case. Start with two agents, add monitoring from day one, and scale based on what you learn from real usage. That's how you build AI systems that actually work in production.
You can try Galileo for free to optimize your LangGraph agents with ease.
Get more insights on building agents in our in-depth eBook:
Choose the right agentic framework for your use case
Evaluate and improve AI agent performance
Identify failure points and production issues

We've all been trapped in a chatbot loop that keeps asking, "I'm sorry, I didn't understand that. Can you rephrase?" Or worse, a bot that cheerfully claims it can help with everything, then fails at everything. You ask about a billing error, but it suggests restarting your router.
Most chatbots attempt to be generalists, but they often end up being proficient in nothing. Real customer service works differently. When you call your telecom provider, you get routed to specialists who actually understand your specific problem.
That's exactly what we're building today! A multi-agent system for ConnectTel that creates delightful customer experiences through intelligent routing to specialized agents. We'll use LangGraph to build it and Galileo for real-time agent metrics, failure insights, and performance tracking so you can continuously improve what's actually working.
Welcome to the fifth installment of the multi-agent series. Don't miss out on other pieces in the Mastering Agent series.
Agent Architecture with Observability

Our system employs a supervisor pattern, where a main coordinator routes customer queries to specialized agents based on the customer's actual needs. We will leverage Chainlit for the conversational UI, Pinecone for vector-based knowledge retrieval, and GPT-4.1 as the underlying LLM.
The beauty of this design is that each agent can be developed, tested, and improved independently. When billing logic changes, you don't touch the technical support code. When you need to add a new capability, you add a new agent without rewriting everything.
However, we don't just want to build the agent; we also want to enable continuous improvement. This will require insights into the failures that happen when our users ask questions we never anticipated and expose edge cases. That is why we are building observability into our system from day one.
Here is how our continuous improvement cycle looks:
Monitor: Track every agent decision, tool call, and routing choice. You need to see not just what agents answer, but how they arrive at those answers. Which agent handled the query? What tools did they use? How long did each step take?
Debug: When things go wrong (and they will), trace through the entire chain. Did the supervisor route to the wrong agent? Did the tool return unexpected data? Did the agent misinterpret the response? Real production failures teach you more than any synthetic test.
Improve: Make targeted fixes based on actual data. Perhaps your billing agent requires more effective prompts for international charges. Technical support may need a timeout for slow diagnostics. The improvements can be measured and you'll know immediately if it worked.
This isn't a one-time optimization. It's an ongoing practice that separates production-ready systems from demos. The architecture we're about to build supports this naturally, with clear separation between agents and comprehensive tracking at every step.
Quick Setup Guide
Let's get this running on your machine. You'll need Python 3.9+, an OpenAI API key, Pinecone and Galileo accounts for the full experience. The complete source code is available in the repository, ready for you to adapt to your own use case.
Installation Steps
First, clone and navigate to the project:
git clone https://github.com/rungalileo/sdk-examples git checkout feature/langgraph-telecom-agent cd sdk-examples/python/agent/langgraph-telecom-agent
Configure your environment variables:
cp .env.example .env # Edit .env with your keys: OPENAI_API_KEY PINECONE_API_KEY GALILEO_API_KEY GALILEO_PROJECT="Multi-Agent Telecom Chatbot" GALILEO_LOG_STREAM="prod"
Install dependencies using uv (recommended) or pip:
# Using uv uv sync --dev # Or using pip pip install -e
In order for agent to act in line with the company guidelines and product, they need to be made aware of the right context. For this, we create a document with the guidelines.
This is the snippet from the troubleshooting guide, which will be chunked and indexed to be used by the Technical Support agent using retrieval.
# Network Troubleshooting Guide ## Common Network Issues and Solutions ### No Signal / No Service #### Symptoms - "No Service" or "Searching" message - Unable to make calls or send texts - No data connection #### Solutions 1. **Toggle Airplane Mode** - Turn ON for 30 seconds - Turn OFF and wait for reconnection 2. **Restart Device** - Power off completely - Wait 30 seconds - Power on 3. **Check SIM Card** - Remove and reinsert SIM - Clean SIM contacts with soft cloth - Try SIM in another device 4. **Reset Network Settings** - iOS: Settings > General > Reset > Reset Network Settings - Android: Settings > System > Reset > Reset Network Settings 5. **Update Carrier Settings** - iOS: Settings > General > About (prompt appears if update available) - Android: Settings > System > Advanced > System Update ### Slow Data Speeds #### Symptoms - Web pages loading slowly - Video buffering frequently - Apps timing out [continued]
Our Plan Advisor and Technical Support agent need the docs to be indexed in Pinecone which is our vector DB.
python ./scripts/setup_pinecone.py or uv run ./scripts/setup_pinecone.py
Once the agent is defined, the Galileo callback can be integrated with Langgraph in just a few lines of code.
from galileo import galileo_context from galileo.handlers.langchain import GalileoAsyncCallback # Initialize Galileo context first galileo_context.init() # Start Galileo session with unique session name galileo_context.start_session(name=session_name, external_id=cl.context.session.id) # Create the callback. This needs to be created in the same thread as the session # so that it uses the same session context. galileo_callback = GalileoAsyncCallback() # Pass the callback to the agent instance supervisor_agent.astream(input=messages, stream_mode="updates", config=RunnableConfig(callbacks=galileo_callback, **config))
Launch the Chainlit interface with our Langgraph application:
chainlit run app.py -w or uv run chainlit run app.py -w
The application will be available at http://localhost:8000. You'll see a chat interface where you can start asking questions about bills, technical issues, or plan recommendations.
Understanding the Core Components
The Supervisor Agent
The supervisor is the brain of our operation. It analyzes incoming queries and routes them to the right specialist. Here's how it works:
def create_supervisor_agent(): """ Create a supervisor agent that manages all the agents in the ConnectTel telecom application. """ checkpointer = MemorySaver() telecom_supervisor_agent = create_supervisor( model=ChatOpenAI(model=os.environ["MODEL_NAME_SUPERVISOR"], name="Supervisor"), agents=[ billing_account_agent, technical_support_agent, plan_advisor_agent, ], prompt=(""" You are a supervisor managing a team of specialized telecom service agents at ConnectTel. Route customer queries to the appropriate agent based on their needs: - Billing Account Agent: Bill inquiries, payment issues, usage tracking - Technical Support Agent: Device troubleshooting, connectivity issues - Plan Advisor Agent: Plan recommendations, upgrades, comparing plans Guidelines: - Route queries to the most appropriate specialist agent - For complex issues spanning multiple areas, coordinate between agents - Be helpful and empathetic to customer concerns """), add_handoff_back_messages=True, output_mode="full_history", supervisor_name="connecttel-supervisor-agent", ).compile(checkpointer=checkpointer) return telecom_supervisor_agent
The supervisor uses LangGraph's built-in create_supervisor
function which handles the complex orchestration logic. This includes message routing based on query analysis, state management across agent interactions, and memory persistence for conversation continuity.
Notice the add_handoff_back_messages=True
parameter. This allows agents to return control to the supervisor when they need help from another agent. It's like a customer service rep saying "Let me transfer you to billing" but happening seamlessly in the background.
Building Specialized Agents
Each specialized agent handles specific types of inquiries. They're built using LangGraph's create_react_agent
pattern, which implements reasoning and action cycles. Let's look at the Billing Account Agent:
from langchain_openai import ChatOpenAI from langgraph.prebuilt import create_react_agent from ..tools.billing_tool import BillingTool def create_billing_account_agent() -> CompiledGraph: """ Create an agent that handles billing inquiries, usage tracking, and account management. """ # Create the billing tool instance billing_tool = BillingTool() # Create a ReAct agent with specialized prompt agent = create_react_agent( model=ChatOpenAI( model=os.environ["MODEL_NAME_WORKER"], name="Billing Account Agent" ), tools=[billing_tool], prompt=(""" You are a Billing and Account specialist for ConnectTel. You help customers with billing inquiries, usage tracking, plan details, and payment issues. Key responsibilities: - Check account balances and payment due dates - Track data, voice, and text usage - Explain charges and fees clearly - Suggest plan optimizations based on usage - Process payment-related inquiries - Review billing history When discussing charges: - Break down costs clearly - Highlight any unusual charges - Suggest ways to reduce bills if usage patterns show opportunity - Always mention auto-pay discounts if not enrolled Be empathetic about high bills and offer solutions. """), name="billing-account-agent", ) return agent
The ReAct pattern means the agent reasons about when to use tools versus when to respond directly. The specialized prompt defines not just what the agent knows, but how it should behave. The empathy instructions are needed for scenarios where people might be frustrated about bills. But how does we measure the empathy? Galileo provides out of the box sentiment metric for measuring the tone.
Creating Effective Tools
Tools are where your agents interact with the real world. They're the bridge between conversation and action. We create a mock billing tool that responds as per the arguments query_type
and customer_id.
The responses are randomly generated and act like real responses for different users.
from typing import Optional from langchain.tools import BaseTool from datetime import datetime, timedelta import random class BillingTool(BaseTool): """Tool for retrieving customer billing and usage information.""" name: str = "billing_account" description: str = "Check account balance, data usage, plan details, and billing history" def _run(self, customer_id: Optional[str] = None, query_type: str = "summary") -> str: """ Get billing and usage information. Args: customer_id: Customer account number (uses default if not provided) query_type: Type of query (summary, usage, plan, history) """ # Mock customer data - in production, this would query real databases customer = { "name": "John Doe", "account": customer_id or "ACC-2024-789456", "plan": "Premium Unlimited 5G", "monthly_charge": 85.00, "data_used": random.uniform(20, 80), "data_limit": "Unlimited", "due_date": (datetime.now() + timedelta(days=15)).strftime("%Y-%m-%d") } if query_type == "usage": return f""" Usage Summary for {customer['name']}: - Data: {customer['data_used']:.1f} GB used ({customer['data_limit']}) - Minutes: {random.randint(300, 800)} (Unlimited) - Texts: {random.randint(500, 2000)} (Unlimited) - Average daily: {customer['data_used'] / 15:.2f} GB """ elif query_type == "plan": return f""" Current Plan: {customer['plan']} - Monthly Cost: ${customer['monthly_charge']:.2f} - Data: {customer['data_limit']} - Talk & Text: Unlimited - 5G Access: Included Available Upgrades: - Business Elite ($120/month) - International Plus ($95/month) """ elif query_type == "history": history = [] for i in range(3): date = (datetime.now() - timedelta(days=30*(i+1))).strftime("%Y-%m-%d") amount = customer['monthly_charge'] + random.uniform(-5, 15) history.append(f"- {date}: ${amount:.2f} (Paid)") return f""" Billing History: {chr(10).join(history)} Auto-pay: Enabled """ # Default summary response return f""" Account Summary for {customer['name']}: - Account: {customer['account']} - Plan: {customer['plan']} - Amount Due: ${customer['monthly_charge']:.2f} - Due Date: {customer['due_date']} - Data Used: {customer['data_used']:.1f} GB ({customer['data_limit']}) """
The tool uses mock data for demonstration purposes, but the underlying structure is ready. The key is the clear description field as it helps agents understand when to use the tool. The flexible parameters support various query types, and the formatted output is easily readable by humans.
How It Works in Practice
Let's walk through real conversations to see the system in action.
Example 1: Multi-Domain Query
When a user says, "My internet has been slow for the past week and I'm wondering if I'm being throttled because of my data usage. Can you check my current usage and bill?" here's what happens:

The supervisor first routes to the Billing Agent, which checks data usage and plan limits. Then it routes to the Technical Support Agent, which checks for connectivity issues. Finally, the supervisor combines both responses into a coherent answer.
Here is how it looks in the Galileo console, where we can see the traces, spans and metrics of the session.

Galileo automatically captures agent routing decisions, tool invocations with inputs and outputs, all LLM interactions and response times for each step.
Debugging Agents with Metrics
The real value of Galileo comes from understanding how your agents actually perform in production. Let's look at the key metrics.
Action Completion
Action Completion indicates whether your agents are actually assisting or merely responding. It's the difference between an agent that says "I'll check your bill" and one that actually retrieves and explains the charges.

Here's what makes an action complete in our telecom system. The agent must provide a complete answer (not just acknowledge the question), confirm successful actions when handling requests, maintain factual accuracy throughout, address every aspect of the user's query, avoid contradicting tool outputs, and properly summarize all relevant information from tools.

When a customer asks "Why is my bill so high this month?", a complete action retrieves the actual bill, identifies the specific charges causing the increase, compares it to previous months, and suggests ways to reduce it. That's what we're measuring.
A score below 80% usually indicates your agents are either not using their tools properly, providing generic responses instead of specific answers, or failing to follow through on multi-step processes. The beauty of tracking Action Completion is that it directly correlates with customer satisfaction.
Click "Configure Metrics" to open the Metrics Hub and enable the metrics.

Tool Selection Quality
Tool Selection Quality reveals whether your agents are choosing the right tools with the right parameters. It's like having a toolbox—knowing when to use a hammer versus a screwdriver makes all the difference.

If a customer asks "I want to upgrade my plan.", the Plan Advisor agent needs to select the right sequence of tools: first the billing tool to check current usage, then the plan comparison tool to find suitable upgrades, and finally the upgrade tool to process the change. If it skips straight to the upgrade tool without checking usage, it might recommend a plan that doesn't fit the customer's needs.
The metric evaluates two critical aspects: did the agent choose the correct tool, and did it use the correct parameters? In our system, we often see agents choose the right tool but set the parameters incorrectly.
A score below 80% is concerning. It means your agents are winging it instead of using their capabilities. The most common issue is that agents are either too eager or too reluctant to use tools. Too eager means calling tools for questions they can answer directly ("What are your business hours?"). Too reluctant means answering from general knowledge when they should check specific data.

Metric explanations help you quickly identify why your agent is underperforming. When you see low Tool Selection Quality scores, click into the generated reasoning to understand exactly what went wrong. Look for agents consistently choosing the wrong tool for certain query types. Then enhance your tool descriptions.
For example, in this case, the explanation reveals "transferring to the technical-support-agent here does not align well with the user's expressed needs," you've pinpointed a routing logic issue. Use these insights to refine your routing criteria to better match user intent patterns, add conditional checks that evaluate conversation context before transfers, or create intermediate clarification steps for ambiguous queries.

The explanations also highlight limited capabilities. If the reasoning notes "tools available only allow transferring to one of the three agents," you know you need to make architectural changes to allow parallel processing.
Latency Breakdown

The latency trace graph reveals the execution timeline and performance characteristics of our system. Understanding where time is spent helps optimize the user experience by finding the bottlenecks. Galileo’s latency trace graph provides several valuable insights into system behavior and performance:
1. Agent Coordination Patterns The trace reveals how agents collaborate in real-time. We can see the Supervisor consistently orchestrating the conversation while specialized agents (Billing, Technical Support, Plan Advisor) activate only when their expertise is needed. This validates the efficiency of our multi-agent architecture and confirms that agents aren't running unnecessarily.
2. Bottleneck Identification By examining the duration and frequency of each operation, we can pinpoint performance bottlenecks. The call_model
operations exhibit the most significant latency contributions, indicating that LLM inference is the primary factor affecting response time, rather than transfer logic or retrieval operations.
3. Decision Flow Understanding The should_continue
markers illustrate how the system makes routing decisions throughout the conversation. Multiple checks ensure the conversation flows appropriately between agents, and we can trace exactly when and why transfers occur.
4. Retrieval Timing The sparse appearance of pinecone_retrieval
operations shows that knowledge base queries are triggered selectively rather than on every turn, indicating intelligent retrieval logic that balances accuracy with performance.
5. System Responsiveness The overall timeline demonstrates that despite multiple agent handoffs and model calls, the system maintains reasonable end-to-end latency. This validates that our multi-agent approach doesn't introduce prohibitive overhead compared to a single-agent system.
Action items from Latency Trace graph:
Implement prompt caching if model calling operations consistently show high latency
Switch to faster models for routing/decision logic while keeping powerful models for final responses
Parallelize retrieval with other operations instead of running sequentially
These insights help diagnose issues in production and validate architectural choices in our conversational AI system.
Performance Benchmarks for Production
Once your multi-agent system moves from development to production, establishing clear performance targets becomes critical for maintaining user satisfaction and operational excellence. These benchmarks serve multiple purposes: they provide guardrails for deployment decisions, enable objective evaluation of system changes, and help teams identify when performance degradation requires immediate attention.
The metrics below are derived from real-world production data across multiple deployments and correlate directly with user satisfaction scores. Systems that consistently hit "Excellent" targets see significantly higher user retention and completion rates, while those falling into "Needs Improvement" typically generate support tickets and abandoned sessions.
Based on production deployments, here are the targets you should aim for. These are based on what users actually tolerate and what drives satisfaction scores.
Metric | Excellent | Good | Needs Improvement |
---|---|---|---|
Action Completion | > 95% | 85-95% | < 80% |
Tool Selection Quality | > 90% | 85-90% | < 85% |
Avg Response Time | < 2s | 2-4s | > 4s |
Supervisor Routing Accuracy | > 95% | 90-95% | < 90% |

Continuous Improvement with Insights
The moment your multi-agent system hits production, it encounters edge cases you never tested. A user asks "Why is my bill higher this month?" and gets transferred between Billing and Technical Support three times. Your logs show 47 LLM calls, 12 tool invocations, and 8 agent handoffs—but which one failed?
Traditional debugging forces you to:
Manually inspect hundreds of traces to find patterns
Guess whether the issue is in your supervisor's routing prompt, the Billing Agent's tool descriptions, Pinecone retrieval returning irrelevant context, or the fundamental agent architecture
Reproduce failures locally, which often behave differently than production
Wait until multiple users complain before you even know there's a problem
The Insights Engine automates this investigation. Instead of spending hours hunting through traces, Galileo analyzes your entire log stream and surfaces exactly what's broken.


Let's look at the two insights generated in our example and understand how they can guide us in making the agent reliable.
Context Memory Loss Detection: When Galileo identifies agents re-asking for information already provided (like the Plan Advisor requesting usage details after the Billing Agent just discussed them), it pinpoints exactly where conversation context breaks down. The insight shows you the specific span where memory was lost and provides a concrete fix: implement state persistence across agent handoffs or add a shared memory layer. This prevents frustrating user experiences where customers must repeat themselves.
Inefficient Tool Usage Patterns: The Multiple Retrieval Calls insight reveals when agents make redundant API calls that could be batched. Instead of manually reviewing hundreds of traces to find this pattern, Galileo shows you the exact sessions where the Plan Advisor queried the retrieval tool three separate times for different plan categories. The suggested action is immediate: refactor your tool calling logic to accept multiple query parameters or combine similar requests into a single retrieval operation, cutting API costs and reducing latency.
Each insight includes:
Timeline view showing when and how often the issue occurs
Example sessions with direct links to problematic traces
Impact analysis quantifying affected spans over the last two weeks
Use the "Refresh insights" button after deploying fixes to validate your improvements and track whether the issue frequency decreases.
Agent Observability in Production

Agent Quality
Agent Quality metrics track the outcomes that directly impact user experience.
Action Completion tracks the percentage of user intents that successfully complete end-to-end. A 95% rate means the vast majority of requests are fulfilled; the remaining 5% represents opportunities for improvement. Paired with the Insights Engine, this metric shows you which specific actions fail most often and why.
Tool Selection Quality measures whether agents choose the right tools for each situation. The 98% score shown indicates highly accurate routing decisions, like selecting the appropriate billing tool, knowledge base, or specialized agent for each user need. This is particularly critical for multi-agent systems, where incorrect routing can cascade into unnecessary handoffs and user frustration.
System Metrics
While agent quality measures "did it work correctly," system metrics track "did it work efficiently." These operational metrics ensure your system remains responsive and reliable as traffic scales.
Latency: Average response time reveals how long users wait. The 5.95 second average represents reasonable performance for complex multi-agent interactions. Context matters: simple greetings should respond in under 1 second, single-tool calls in 2-4 seconds, complex workflows in 8-10 seconds. Anything exceeding 15 seconds risks user abandonment.
API Failures: Tracks the reliability of external integrations. Zero failures indicate all API calls succeed. Even small increases deserve investigation as API failures cascade through workflows, causing incomplete actions. Use Insights Engine to identify which APIs fail and under what conditions.
Traces Count: Tracks conversation volume and usage patterns. Spikes indicate increased adoption; drops might signal access issues. If Action Completion drops when traces spike, you have scaling issues.
Custom Metrics for Business-Specific Insights
While Galileo provides comprehensive, out-of-the-box metrics, the real power lies in tracking what matters to your specific business. Let's create a custom metric to track something critical for telecom: unlimited plan recommendations.
In our system, we want to know when agents should be suggesting unlimited plans. Users approaching their data limits are prime candidates for upgrades, but are our agents catching these opportunities?

The LLM judge (GPT-4o in this case) assesses whether the agent correctly identified the opportunity and made an appropriate suggestion.
The evaluation criteria are straightforward (a judge-prompt sketch follows the list):
Successful: The user clearly requested an unlimited plan, and the agent accurately suggested one
Fail: The user requested an unlimited plan, but the agent didn't suggest it or suggested something incorrect
Unrelated: The user's request wasn't about unlimited plans, or the response was unrelated
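The criteria above translate almost directly into the judge prompt. Here's a hedged sketch of how you might draft and dry-run the wording locally with the OpenAI SDK before registering it as a custom metric in Galileo's console; the prompt text and example strings are illustrative:

```python
from openai import OpenAI

JUDGE_PROMPT = """You are evaluating a telecom support agent.
Given the user's message and the agent's response, answer with exactly one label:
- Successful: the user clearly asked about unlimited plans and the agent suggested an appropriate one
- Fail: the user asked about unlimited plans but the agent did not suggest one, or suggested something incorrect
- Unrelated: the request was not about unlimited plans, or the response is unrelated

User message:
{user_message}

Agent response:
{agent_response}

Label:"""

client = OpenAI()  # uses OPENAI_API_KEY from the environment
result = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": JUDGE_PROMPT.format(
            user_message="hi i am interested to know if you have unlimited plans",
            agent_response="Yes! Our Premium Unlimited 5G plan is $85/month with unlimited data...",
        ),
    }],
)
print(result.choices[0].message.content)  # expected label: "Successful"
```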
This metric becomes powerful when combined with business data. If we see high data usage customers not receiving unlimited plan suggestions (lots of "Fail" ratings), we know exactly where to improve our agent prompts. Maybe the billing agent needs clearer instructions about upgrade thresholds.
The beauty of custom metrics is they bridge the gap between generic performance tracking and specific business outcomes. You're not just measuring whether agents work but whether they drive the behaviors that matter to your business.

Looking at this real interaction, you can see the custom metric in action. When a user asks, "hi i am interested to know if you have unlimited plans", the system correctly routes through the Supervisor → Plan Advisor Agent.
Notice how our suggested_unlimited_plan metric evaluates each step: the Supervisor gets marked as "Successful" for correctly routing to the Plan Advisor, the Plan Advisor gets "Successful" for providing relevant unlimited plan information, and the final Supervisor step is "Successful" for delivering the complete response. The agent responds with specific unlimited plan options and even guides the user to connect with the Plan Advisor Agent for personalized recommendations.
The metric confirms that our agents are accurately capturing the intent and responding appropriately. Over time, this data helps us understand conversion opportunities.
Making Observability Actionable
Raw metrics only matter if they drive action. The Trends dashboard transforms observability from passive monitoring into an active improvement system. Here's exactly how to use it:
Week 1: Establish Your Baseline
Run your chatbot for a week and document normal performance. For the Telecom Chatbot, that's:
90% Tool Selection Quality
95% Action Completion
~6 second latency
These numbers serve as your quality floor; any drop below them triggers an investigation.
Daily: Catch Regressions Immediately
Check the 12H or 1D view each morning. If you see Action Completion suddenly drop from 95% to 85%:
Click the degraded time period to filter traces
Open Log Stream Insights to see which specific agent is failing
Investigate immediately and don't wait for customer complaints
Weekly: Identify Patterns and Plan Fixes
Switch to the 1W view for sprint planning:
Look for recurring patterns
Click into those traces to discover why users behave differently
Add training examples to address the pattern
Track whether the fix works by comparing next week's metrics
Monthly: Validate Major Changes
Before deploying major updates, snapshot your 1M metrics:
Baseline before change: Action Completion 95%, Latency 5.95s
After deployment: Check if metrics stayed stable or degraded
Document the impact: "Added Plan Advisor agent → metrics remained stable → validates architecture"
Configure Alerts

Configure these alerts with specific investigation steps:
Critical Alert: Action Completion < 90%
When triggered, check in this order:
Filter traces to the alert time window and sort by Action Completion score
Identify the failing agent: Is it Billing, Technical Support, or Plan Advisor?
Review Log Stream Insights for automated root cause analysis
Check the latency trace graph: Are tool calls timing out before completion?
Inspect failed tool calls: Are Pinecone retrievals returning empty results? Are transfer functions throwing errors?
Verify external dependencies: Is your knowledge base or CRM API down?
Immediate mitigation: If one agent is broken, route all traffic to working agents while you fix it
Warning Alert: Tool Selection Quality < 85%
Investigation steps:
Click the degraded time period to filter affected sessions
Check which tools are being selected incorrectly: Are users being sent to Technical Support when they need Billing?
Review the supervisor's routing decisions: Open 5-10 failed traces and read the should_continue reasoning
Look for new query patterns: Has user language changed? Are they asking about a new product you haven't trained for?
Compare successful vs. failed routing: What keywords appear in correctly routed queries that are missing in failures?
Fix: Update supervisor prompt with explicit examples of the new query patterns, or enhance tool descriptions with the missing keywords
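One concrete way to apply that fix is to append explicit routing examples for the newly observed query patterns to the supervisor prompt. A sketch; the example queries and the `BASE_SUPERVISOR_PROMPT` name are assumptions:

```python
# Illustrative prompt addendum: give the supervisor explicit routing examples for
# the query patterns that showed up in the failed traces. Examples are made up.
ROUTING_EXAMPLES = """
Routing examples:
- "Why was I charged twice for roaming?" -> Billing Account Agent
- "My eSIM won't activate on my new phone" -> Technical Support Agent
- "Is there a cheaper plan with the same data?" -> Plan Advisor Agent
"""

# Appended to the existing supervisor prompt when creating the supervisor, e.g.:
# prompt = BASE_SUPERVISOR_PROMPT + ROUTING_EXAMPLES
```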
Performance Alert: Latency > 8 seconds
Debugging checklist:
Switch to latency trace graph to identify the bottleneck operation
Check if LLM operations spiked: Did the LLM provider have an outage or slowdown?
Inspect retrieval latency: Are Pinecone queries taking longer than usual? Check their status page
Count sequential operations: Did a recent change add extra model calls or tool invocations?
Review Traces Count metric: Is high traffic causing queuing delays?
Check for retry loops: Are agents getting stuck in should_continue cycles, repeatedly checking transfer conditions?
Quick fix: Implement timeout thresholds to fail fast instead of waiting for slow operations (see the sketch after this checklist)
Long-term fix: Add prompt caching, parallelize retrieval with routing, or upgrade to faster model tiers
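For the quick fix, a timeout wrapper around the graph call is usually enough to fail fast. A minimal sketch, assuming an async entry point like `supervisor_agent.ainvoke(...)`; the 8-second budget and the fallback message are illustrative:

```python
import asyncio

async def answer_with_timeout(supervisor_agent, messages, config, timeout_s: float = 8.0):
    """Run one graph turn, but fail fast with a graceful message if it exceeds the budget."""
    try:
        return await asyncio.wait_for(
            supervisor_agent.ainvoke({"messages": messages}, config=config),
            timeout=timeout_s,
        )
    except asyncio.TimeoutError:
        # Surface a graceful fallback instead of leaving the user waiting on a slow operation.
        return {"messages": [{
            "role": "assistant",
            "content": "This is taking longer than expected. Please try again in a moment.",
        }]}
```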
Reliability Alert: API Failures > 5%
Immediate actions:
Check the System Metrics panel to see which time periods have failures
Filter traces with errors and group by failure type
Identify the failing service: Pinecone retrieval? LLM API? Transfer functions?
Review error messages in failed spans: Are they rate limits, server errors, or timeouts?
Check your external service status pages: Is Pinecone, your LLM provider, or your internal CRM experiencing issues?
Review recent code deployments: Did you introduce a bug in tool calling logic?
Emergency response: Implement circuit breakers to stop calling failing services, show graceful error messages to users
Post-incident: Add retries with exponential backoff, or switch to backup services
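For the post-incident follow-up, a small retry helper with exponential backoff covers most transient failures from external services (Pinecone queries, LLM requests, CRM lookups). A sketch; attempt counts and delays are illustrative:

```python
import random
import time

def with_backoff(call, max_attempts: int = 4, base_delay_s: float = 0.5):
    """Run `call()` and retry on exceptions, doubling the wait each attempt with a little jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt and let the caller handle it
            delay = base_delay_s * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage (vector_store assumed configured elsewhere):
# docs = with_backoff(lambda: vector_store.similarity_search("unlimited plans", k=3))
```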
Volume Alert: Traces Count > 1M/day
Capacity check:
Verify quality metrics remained stable: Did the traffic spike degrade Action Completion or Tool Selection?
Check latency trends: Is response time increasing with load?
Review API Failures: Are you hitting rate limits on external services?
Identify traffic source: Filter by user metadata to see if it's organic growth or a bot attack
If quality stayed high: Document this as validation that your system scales well
If quality degraded: Implement request throttling, add caching layers, or provision more resources
Close the loop after fixing an issue:
Monitor the same metric for 24-48 hours to confirm the fix worked
Update your alert thresholds if you've improved baseline performance
Document the incident with the symptom (what alert fired), root cause (what you found), and solution (what you changed)
Share learnings with your team so the next person can debug faster
Production observability only matters if it guides what you build. Use this dashboard daily to catch failures, weekly to identify patterns, monthly to validate improvements, and yearly to demonstrate that your agent system is creating business impact.
Wrapping Up
Our agent system demonstrates how modern AI orchestration creates sophisticated customer service solutions. By combining LangGraph's agent management with monitoring, you build systems that are both powerful and observable.
The architecture's modularity makes it particularly suitable for enterprises where different teams manage different aspects of customer service. Yes, there's added complexity and some latency overhead. But for production systems handling diverse queries, the benefits of specialization, scalability, and comprehensive monitoring can make it worthwhile.
The complete source code is available in the repository, ready for you to adapt to your own use case. Start with two agents, add monitoring from day one, and scale based on what you learn from real usage. That's how you build AI systems that actually work in production.
You can try Galileo for free to optimize your LangGraph agents with ease.
Get more insights on building agents in our in-depth eBook:
Choose the right agentic framework for your use case
Evaluate and improve AI agent performance
Identify failure points and production issues

We've all been trapped in a chatbot loop that keeps asking, "I'm sorry, I didn't understand that. Can you rephrase?" Or worse, a bot that cheerfully claims it can help with everything, then fails at everything. You ask about a billing error, but it suggests restarting your router.
Most chatbots attempt to be generalists, but they often end up being proficient in nothing. Real customer service works differently. When you call your telecom provider, you get routed to specialists who actually understand your specific problem.
That's exactly what we're building today! A multi-agent system for ConnectTel that creates delightful customer experiences through intelligent routing to specialized agents. We'll use LangGraph to build it and Galileo for real-time agent metrics, failure insights, and performance tracking so you can continuously improve what's actually working.
Welcome to the fifth installment of the multi-agent series. Don't miss out on other pieces in the Mastering Agent series.
Agent Architecture with Observability

Our system employs a supervisor pattern, where a main coordinator routes customer queries to specialized agents based on the customer's actual needs. We will leverage Chainlit for the conversational UI, Pinecone for vector-based knowledge retrieval, and GPT-4.1 as the underlying LLM.
The beauty of this design is that each agent can be developed, tested, and improved independently. When billing logic changes, you don't touch the technical support code. When you need to add a new capability, you add a new agent without rewriting everything.
However, we don't just want to build the agent; we also want to enable continuous improvement. This will require insights into the failures that happen when our users ask questions we never anticipated and expose edge cases. That is why we are building observability into our system from day one.
Here is how our continuous improvement cycle looks:
Monitor: Track every agent decision, tool call, and routing choice. You need to see not just what agents answer, but how they arrive at those answers. Which agent handled the query? What tools did they use? How long did each step take?
Debug: When things go wrong (and they will), trace through the entire chain. Did the supervisor route to the wrong agent? Did the tool return unexpected data? Did the agent misinterpret the response? Real production failures teach you more than any synthetic test.
Improve: Make targeted fixes based on actual data. Perhaps your billing agent requires more effective prompts for international charges. Technical support may need a timeout for slow diagnostics. The improvements can be measured and you'll know immediately if it worked.
This isn't a one-time optimization. It's an ongoing practice that separates production-ready systems from demos. The architecture we're about to build supports this naturally, with clear separation between agents and comprehensive tracking at every step.
Quick Setup Guide
Let's get this running on your machine. You'll need Python 3.9+, an OpenAI API key, Pinecone and Galileo accounts for the full experience. The complete source code is available in the repository, ready for you to adapt to your own use case.
Installation Steps
First, clone and navigate to the project:
git clone https://github.com/rungalileo/sdk-examples git checkout feature/langgraph-telecom-agent cd sdk-examples/python/agent/langgraph-telecom-agent
Configure your environment variables:
cp .env.example .env # Edit .env with your keys: OPENAI_API_KEY PINECONE_API_KEY GALILEO_API_KEY GALILEO_PROJECT="Multi-Agent Telecom Chatbot" GALILEO_LOG_STREAM="prod"
Install dependencies using uv (recommended) or pip:
# Using uv uv sync --dev # Or using pip pip install -e
In order for agent to act in line with the company guidelines and product, they need to be made aware of the right context. For this, we create a document with the guidelines.
This is the snippet from the troubleshooting guide, which will be chunked and indexed to be used by the Technical Support agent using retrieval.
# Network Troubleshooting Guide ## Common Network Issues and Solutions ### No Signal / No Service #### Symptoms - "No Service" or "Searching" message - Unable to make calls or send texts - No data connection #### Solutions 1. **Toggle Airplane Mode** - Turn ON for 30 seconds - Turn OFF and wait for reconnection 2. **Restart Device** - Power off completely - Wait 30 seconds - Power on 3. **Check SIM Card** - Remove and reinsert SIM - Clean SIM contacts with soft cloth - Try SIM in another device 4. **Reset Network Settings** - iOS: Settings > General > Reset > Reset Network Settings - Android: Settings > System > Reset > Reset Network Settings 5. **Update Carrier Settings** - iOS: Settings > General > About (prompt appears if update available) - Android: Settings > System > Advanced > System Update ### Slow Data Speeds #### Symptoms - Web pages loading slowly - Video buffering frequently - Apps timing out [continued]
Our Plan Advisor and Technical Support agent need the docs to be indexed in Pinecone which is our vector DB.
python ./scripts/setup_pinecone.py or uv run ./scripts/setup_pinecone.py
Once the agent is defined, the Galileo callback can be integrated with Langgraph in just a few lines of code.
from galileo import galileo_context from galileo.handlers.langchain import GalileoAsyncCallback # Initialize Galileo context first galileo_context.init() # Start Galileo session with unique session name galileo_context.start_session(name=session_name, external_id=cl.context.session.id) # Create the callback. This needs to be created in the same thread as the session # so that it uses the same session context. galileo_callback = GalileoAsyncCallback() # Pass the callback to the agent instance supervisor_agent.astream(input=messages, stream_mode="updates", config=RunnableConfig(callbacks=galileo_callback, **config))
Launch the Chainlit interface with our Langgraph application:
chainlit run app.py -w or uv run chainlit run app.py -w
The application will be available at http://localhost:8000. You'll see a chat interface where you can start asking questions about bills, technical issues, or plan recommendations.
Understanding the Core Components
The Supervisor Agent
The supervisor is the brain of our operation. It analyzes incoming queries and routes them to the right specialist. Here's how it works:
def create_supervisor_agent(): """ Create a supervisor agent that manages all the agents in the ConnectTel telecom application. """ checkpointer = MemorySaver() telecom_supervisor_agent = create_supervisor( model=ChatOpenAI(model=os.environ["MODEL_NAME_SUPERVISOR"], name="Supervisor"), agents=[ billing_account_agent, technical_support_agent, plan_advisor_agent, ], prompt=(""" You are a supervisor managing a team of specialized telecom service agents at ConnectTel. Route customer queries to the appropriate agent based on their needs: - Billing Account Agent: Bill inquiries, payment issues, usage tracking - Technical Support Agent: Device troubleshooting, connectivity issues - Plan Advisor Agent: Plan recommendations, upgrades, comparing plans Guidelines: - Route queries to the most appropriate specialist agent - For complex issues spanning multiple areas, coordinate between agents - Be helpful and empathetic to customer concerns """), add_handoff_back_messages=True, output_mode="full_history", supervisor_name="connecttel-supervisor-agent", ).compile(checkpointer=checkpointer) return telecom_supervisor_agent
The supervisor uses LangGraph's built-in create_supervisor
function which handles the complex orchestration logic. This includes message routing based on query analysis, state management across agent interactions, and memory persistence for conversation continuity.
Notice the add_handoff_back_messages=True
parameter. This allows agents to return control to the supervisor when they need help from another agent. It's like a customer service rep saying "Let me transfer you to billing" but happening seamlessly in the background.
Building Specialized Agents
Each specialized agent handles specific types of inquiries. They're built using LangGraph's create_react_agent
pattern, which implements reasoning and action cycles. Let's look at the Billing Account Agent:
from langchain_openai import ChatOpenAI from langgraph.prebuilt import create_react_agent from ..tools.billing_tool import BillingTool def create_billing_account_agent() -> CompiledGraph: """ Create an agent that handles billing inquiries, usage tracking, and account management. """ # Create the billing tool instance billing_tool = BillingTool() # Create a ReAct agent with specialized prompt agent = create_react_agent( model=ChatOpenAI( model=os.environ["MODEL_NAME_WORKER"], name="Billing Account Agent" ), tools=[billing_tool], prompt=(""" You are a Billing and Account specialist for ConnectTel. You help customers with billing inquiries, usage tracking, plan details, and payment issues. Key responsibilities: - Check account balances and payment due dates - Track data, voice, and text usage - Explain charges and fees clearly - Suggest plan optimizations based on usage - Process payment-related inquiries - Review billing history When discussing charges: - Break down costs clearly - Highlight any unusual charges - Suggest ways to reduce bills if usage patterns show opportunity - Always mention auto-pay discounts if not enrolled Be empathetic about high bills and offer solutions. """), name="billing-account-agent", ) return agent
The ReAct pattern means the agent reasons about when to use tools versus when to respond directly. The specialized prompt defines not just what the agent knows, but how it should behave. The empathy instructions are needed for scenarios where people might be frustrated about bills. But how does we measure the empathy? Galileo provides out of the box sentiment metric for measuring the tone.
Creating Effective Tools
Tools are where your agents interact with the real world. They're the bridge between conversation and action. We create a mock billing tool that responds as per the arguments query_type
and customer_id.
The responses are randomly generated and act like real responses for different users.
from typing import Optional from langchain.tools import BaseTool from datetime import datetime, timedelta import random class BillingTool(BaseTool): """Tool for retrieving customer billing and usage information.""" name: str = "billing_account" description: str = "Check account balance, data usage, plan details, and billing history" def _run(self, customer_id: Optional[str] = None, query_type: str = "summary") -> str: """ Get billing and usage information. Args: customer_id: Customer account number (uses default if not provided) query_type: Type of query (summary, usage, plan, history) """ # Mock customer data - in production, this would query real databases customer = { "name": "John Doe", "account": customer_id or "ACC-2024-789456", "plan": "Premium Unlimited 5G", "monthly_charge": 85.00, "data_used": random.uniform(20, 80), "data_limit": "Unlimited", "due_date": (datetime.now() + timedelta(days=15)).strftime("%Y-%m-%d") } if query_type == "usage": return f""" Usage Summary for {customer['name']}: - Data: {customer['data_used']:.1f} GB used ({customer['data_limit']}) - Minutes: {random.randint(300, 800)} (Unlimited) - Texts: {random.randint(500, 2000)} (Unlimited) - Average daily: {customer['data_used'] / 15:.2f} GB """ elif query_type == "plan": return f""" Current Plan: {customer['plan']} - Monthly Cost: ${customer['monthly_charge']:.2f} - Data: {customer['data_limit']} - Talk & Text: Unlimited - 5G Access: Included Available Upgrades: - Business Elite ($120/month) - International Plus ($95/month) """ elif query_type == "history": history = [] for i in range(3): date = (datetime.now() - timedelta(days=30*(i+1))).strftime("%Y-%m-%d") amount = customer['monthly_charge'] + random.uniform(-5, 15) history.append(f"- {date}: ${amount:.2f} (Paid)") return f""" Billing History: {chr(10).join(history)} Auto-pay: Enabled """ # Default summary response return f""" Account Summary for {customer['name']}: - Account: {customer['account']} - Plan: {customer['plan']} - Amount Due: ${customer['monthly_charge']:.2f} - Due Date: {customer['due_date']} - Data Used: {customer['data_used']:.1f} GB ({customer['data_limit']}) """
The tool uses mock data for demonstration purposes, but the underlying structure is ready. The key is the clear description field as it helps agents understand when to use the tool. The flexible parameters support various query types, and the formatted output is easily readable by humans.
How It Works in Practice
Let's walk through real conversations to see the system in action.
Example 1: Multi-Domain Query
When a user says, "My internet has been slow for the past week and I'm wondering if I'm being throttled because of my data usage. Can you check my current usage and bill?" here's what happens:

The supervisor first routes to the Billing Agent, which checks data usage and plan limits. Then it routes to the Technical Support Agent, which checks for connectivity issues. Finally, the supervisor combines both responses into a coherent answer.
Here is how it looks in the Galileo console, where we can see the traces, spans and metrics of the session.

Galileo automatically captures agent routing decisions, tool invocations with inputs and outputs, all LLM interactions and response times for each step.
Debugging Agents with Metrics
The real value of Galileo comes from understanding how your agents actually perform in production. Let's look at the key metrics.
Action Completion
Action Completion indicates whether your agents are actually assisting or merely responding. It's the difference between an agent that says "I'll check your bill" and one that actually retrieves and explains the charges.

Here's what makes an action complete in our telecom system. The agent must provide a complete answer (not just acknowledge the question), confirm successful actions when handling requests, maintain factual accuracy throughout, address every aspect of the user's query, avoid contradicting tool outputs, and properly summarize all relevant information from tools.

When a customer asks "Why is my bill so high this month?", a complete action retrieves the actual bill, identifies the specific charges causing the increase, compares it to previous months, and suggests ways to reduce it. That's what we're measuring.
A score below 80% usually indicates your agents are either not using their tools properly, providing generic responses instead of specific answers, or failing to follow through on multi-step processes. The beauty of tracking Action Completion is that it directly correlates with customer satisfaction.
Click "Configure Metrics" to open the Metrics Hub and enable the metrics.

Tool Selection Quality
Tool Selection Quality reveals whether your agents are choosing the right tools with the right parameters. It's like having a toolbox—knowing when to use a hammer versus a screwdriver makes all the difference.

If a customer asks "I want to upgrade my plan.", the Plan Advisor agent needs to select the right sequence of tools: first the billing tool to check current usage, then the plan comparison tool to find suitable upgrades, and finally the upgrade tool to process the change. If it skips straight to the upgrade tool without checking usage, it might recommend a plan that doesn't fit the customer's needs.
The metric evaluates two critical aspects: did the agent choose the correct tool, and did it use the correct parameters? In our system, we often see agents choose the right tool but set the parameters incorrectly.
A score below 80% is concerning. It means your agents are winging it instead of using their capabilities. The most common issue is that agents are either too eager or too reluctant to use tools. Too eager means calling tools for questions they can answer directly ("What are your business hours?"). Too reluctant means answering from general knowledge when they should check specific data.

Metric explanations help you quickly identify why your agent is underperforming. When you see low Tool Selection Quality scores, click into the generated reasoning to understand exactly what went wrong. Look for agents consistently choosing the wrong tool for certain query types. Then enhance your tool descriptions.
For example, in this case, the explanation reveals "transferring to the technical-support-agent here does not align well with the user's expressed needs," you've pinpointed a routing logic issue. Use these insights to refine your routing criteria to better match user intent patterns, add conditional checks that evaluate conversation context before transfers, or create intermediate clarification steps for ambiguous queries.

The explanations also highlight limited capabilities. If the reasoning notes "tools available only allow transferring to one of the three agents," you know you need to make architectural changes to allow parallel processing.
Latency Breakdown

The latency trace graph reveals the execution timeline and performance characteristics of our system. Understanding where time is spent helps optimize the user experience by finding the bottlenecks. Galileo’s latency trace graph provides several valuable insights into system behavior and performance:
1. Agent Coordination Patterns The trace reveals how agents collaborate in real-time. We can see the Supervisor consistently orchestrating the conversation while specialized agents (Billing, Technical Support, Plan Advisor) activate only when their expertise is needed. This validates the efficiency of our multi-agent architecture and confirms that agents aren't running unnecessarily.
2. Bottleneck Identification By examining the duration and frequency of each operation, we can pinpoint performance bottlenecks. The call_model
operations exhibit the most significant latency contributions, indicating that LLM inference is the primary factor affecting response time, rather than transfer logic or retrieval operations.
3. Decision Flow Understanding The should_continue
markers illustrate how the system makes routing decisions throughout the conversation. Multiple checks ensure the conversation flows appropriately between agents, and we can trace exactly when and why transfers occur.
4. Retrieval Timing The sparse appearance of pinecone_retrieval
operations shows that knowledge base queries are triggered selectively rather than on every turn, indicating intelligent retrieval logic that balances accuracy with performance.
5. System Responsiveness The overall timeline demonstrates that despite multiple agent handoffs and model calls, the system maintains reasonable end-to-end latency. This validates that our multi-agent approach doesn't introduce prohibitive overhead compared to a single-agent system.
Action items from Latency Trace graph:
Implement prompt caching if model calling operations consistently show high latency
Switch to faster models for routing/decision logic while keeping powerful models for final responses
Parallelize retrieval with other operations instead of running sequentially
These insights help diagnose issues in production and validate architectural choices in our conversational AI system.
Performance Benchmarks for Production
Once your multi-agent system moves from development to production, establishing clear performance targets becomes critical for maintaining user satisfaction and operational excellence. These benchmarks serve multiple purposes: they provide guardrails for deployment decisions, enable objective evaluation of system changes, and help teams identify when performance degradation requires immediate attention.
The metrics below are derived from real-world production data across multiple deployments and correlate directly with user satisfaction scores. Systems that consistently hit "Excellent" targets see significantly higher user retention and completion rates, while those falling into "Needs Improvement" typically generate support tickets and abandoned sessions.
Based on production deployments, here are the targets you should aim for. These are based on what users actually tolerate and what drives satisfaction scores.
Metric | Excellent | Good | Needs Improvement |
---|---|---|---|
Action Completion | > 95% | 85-95% | < 80% |
Tool Selection Quality | > 90% | 85-90% | < 85% |
Avg Response Time | < 2s | 2-4s | > 4s |
Supervisor Routing Accuracy | > 95% | 90-95% | < 90% |

Continuous Improvement with Insights
The moment your multi-agent system hits production, it encounters edge cases you never tested. A user asks "Why is my bill higher this month?" and gets transferred between Billing and Technical Support three times. Your logs show 47 LLM calls, 12 tool invocations, and 8 agent handoffs—but which one failed?
Traditional debugging forces you to:
Manually inspect hundreds of traces to find patterns
Guess whether the issue is in your supervisor's routing prompt, the Billing Agent's tool descriptions, Pinecone retrieval returning irrelevant context, or the fundamental agent architecture
Reproduce failures locally, which often behave differently than production
Wait until multiple users complain before you even know there's a problem
The Insights Engine automates this investigation. Instead of spending hours hunting through traces, Galileo analyzes your entire log stream and surfaces exactly what's broken.


Let's look at the two insights generated in our example and understand how they can guide us in making the agent reliable.
Context Memory Loss Detection: When Galileo identifies agents re-asking for information already provided (like the Plan Advisor requesting usage details after the Billing Agent just discussed them), it pinpoints exactly where conversation context breaks down. The insight shows you the specific span where memory was lost and provides a concrete fix: implement state persistence across agent handoffs or add a shared memory layer. This prevents frustrating user experiences where customers must repeat themselves.
Inefficient Tool Usage Patterns: The Multiple Retrieval Calls insight reveals when agents make redundant API calls that could be batched. Instead of manually reviewing hundreds of traces to find this pattern, Galileo shows you the exact sessions where the Plan Advisor queried the retrieval tool three separate times for different plan categories. The suggested action is immediate: refactor your tool calling logic to accept multiple query parameters or combine similar requests into a single retrieval operation, cutting API costs and reducing latency.
Each insight includes:
Timeline view showing when and how often the issue occurs
Example sessions with direct links to problematic traces
Impact analysis quantifying affected spans over the last two weeks
Use the "Refresh insights" button after deploying fixes to validate your improvements and track whether the issue frequency decreases.
Agent Observability in Production

Agent Quality
Agent Quality metrics measure metrics that directly impact user experience.
Action Completion tracks the percentage of user intents that successfully complete end-to-end. A 95% rate means that the vast majority of requests are fulfilled; however, the remaining 5% represents opportunities for improvement. Combined with Insights Engine, you can identify which specific actions fail most often and why.
Tool Selection Quality measures whether agents choose the right tools for each situation. The 98% score shown indicates highly accurate routing decisions like selecting the appropriate booking tool, knowledge base, or specialized agent for each user need. This is particularly critical for multi-agent systems, where incorrect routing can cascade into unnecessary handoffs and user frustration.
System Metrics
While agent quality measures "did it work correctly," system metrics track "did it work efficiently." These operational metrics ensure your system remains responsive and reliable as traffic scales.
Latency: Average response time reveals how long users wait. The 5.95 second average represents reasonable performance for complex multi-agent interactions. Context matters: simple greetings should respond in under 1 second, single-tool calls in 2-4 seconds, complex workflows in 8-10 seconds. Anything exceeding 15 seconds risks user abandonment.
API Failures: Tracks the reliability of external integrations. Zero failures indicate all API calls succeed. Even small increases deserve investigation as API failures cascade through workflows, causing incomplete actions. Use Insights Engine to identify which APIs fail and under what conditions.
Traces Count: Tracks conversation volume and usage patterns. Spikes indicate increased adoption; drops might signal access issues. If Action Completion drops when traces spike, you have scaling issues.
Custom Metrics for Business-Specific Insights
While Galileo provides comprehensive, out-of-the-box metrics, the real power lies in tracking what matters to your specific business. Let's create a custom metric to track something critical for telecom: unlimited plan recommendations.
In our system, we want to know when agents should be suggesting unlimited plans. Users approaching their data limits are prime candidates for upgrades, but are our agents catching these opportunities?

The LLM judge (GPT-4o in this case) assesses whether the agent correctly identified the opportunity and made an appropriate suggestion.
The evaluation criteria are straightforward:
Successful: The user clearly requested an unlimited plan, and the agent accurately suggested one
Fail: The user requested an unlimited plan, but the agent didn't suggest it or suggested something incorrect
Unrelated: The user's request wasn't about unlimited plans, or the response was unrelated
This metric becomes powerful when combined with business data. If we see high data usage customers not receiving unlimited plan suggestions (lots of "Fail" ratings), we know exactly where to improve our agent prompts. Maybe the billing agent needs clearer instructions about upgrade thresholds.
The beauty of custom metrics is they bridge the gap between generic performance tracking and specific business outcomes. You're not just measuring whether agents work but whether they drive the behaviors that matter to your business.

Looking at this real interaction, you can see the custom metric in action. When a user asks, "hi i am interested to know if you have unlimited plans" the system correctly routes through the Supervisor → Plan Advisor Agent.
Notice how our suggested_unlimited_plan
metric evaluates each step: the Supervisor gets marked as "Successful" for correctly routing to the Plan Advisor, the Plan Advisor gets "Successful" for providing relevant unlimited plan information, and the final Supervisor step is "Successful" for delivering the complete response. The agent responds with specific unlimited plan options and even guides the user to connect with the Plan Advisor Agent for personalized recommendations.
The metric confirms that our agents are accurately capturing the intent and responding appropriately. Over time, this data helps us understand conversion opportunities.
Making Observability Actionable
Raw metrics only matter if they drive action. The Trends dashboard transforms observability from passive monitoring into an active improvement system. Here's exactly how to use it:
Week 1: Establish Your Baseline
Run your chatbot for a week and document normal performance. For the Telecom Chatbot, that's:
90% Tool Selection Quality
95% Action Completion
~6 second latency
These numbers serve as your quality floor and any drop below this threshold triggers an investigation.
Daily: Catch Regressions Immediately
Check the 12H or 1D view each morning. If you see Action Completion suddenly drop from 95% to 85%:
Click the degraded time period to filter traces
Open Log Stream Insights to see which specific agent is failing
Investigate immediately and don't wait for customer complaints
Weekly: Identify Patterns and Plan Fixes
Switch to the 1W view for sprint planning:
Look for recurring patterns
Click into those traces to discover why users behave differently
Add training examples to address the pattern
Track whether the fix works by comparing next week's metrics
Monthly: Validate Major Changes
Before deploying major updates, snapshot your 1M metrics:
Baseline before change: Action Completion 95%, Latency 5.95s
After deployment: Check if metrics stayed stable or degraded
Document the impact: "Added Plan Advisor agent → metrics remained stable → validates architecture"
Configure Alerts

Configure these alerts with specific investigation steps:
Critical Alert: Action Completion < 90%
When triggered, check in this order:
Filter traces to the alert time window and sort by Action Completion score
Identify the failing agent: Is it Billing, Technical Support, or Plan Advisor?
Review Log Stream Insights for automated root cause analysis
Check the latency trace graph: Are tool calls timing out before completion?
Inspect failed tool calls: Are Pinecone retrievals returning empty results? Are transfer functions throwing errors?
Verify external dependencies: Is your knowledge base or CRM API down?
Immediate mitigation: If one agent is broken, route all traffic to working agents while you fix it
Warning Alert: Tool Selection Quality < 85%
Investigation steps:
Click the degraded time period to filter affected sessions
Check which tools are being selected incorrectly: Are users being sent to Technical Support when they need Billing?
Review the supervisor's routing decisions: Open 5-10 failed traces and read the should_continue reasoning
Look for new query patterns: Has user language changed? Are they asking about a new product you haven't trained for?
Compare successful vs. failed routing: What keywords appear in correctly routed queries that are missing in failures?
Fix: Update supervisor prompt with explicit examples of the new query patterns, or enhance tool descriptions with the missing keywords
Performance Alert: Latency > 8 seconds
Debugging checklist:
Switch to latency trace graph to identify the bottleneck operation
Check if LLM operations spiked: Did the LLM provider have an outage or slowdown?
Inspect retrieval latency: Are Pinecone queries taking longer than usual? Check their status page
Count sequential operations: Did a recent change add extra model calls or tool invocations?
Review Traces Count metric: Is high traffic causing queuing delays?
Check for retry loops: Are agents getting stuck in should_continue cycles, repeatedly checking transfer conditions?
Quick fix: Implement timeout thresholds to fail fast instead of waiting for slow operations
Long-term fix: Add prompt caching, parallelize retrieval with routing, or upgrade to faster model tiers
Reliability Alert: API Failures > 5%
Immediate actions:
Check the System Metrics panel to see which time periods have failures
Filter traces with errors and group by failure type
Identify the failing service: Pinecone retrieval? LLM API? Transfer functions?
Review error messages in failed spans: Are they rate limits, server errors, or timeouts?
Check your external service status pages: Is Pinecone, your LLM provider, or your internal CRM experiencing issues?
Review recent code deployments: Did you introduce a bug in tool calling logic?
Emergency response: Implement circuit breakers to stop calling failing services, show graceful error messages to users
Post-incident: Add retries with exponential backoff, or switch to backup services
Volume Alert: Traces Count > 1M/day
Capacity check:
Verify quality metrics remained stable: Did the traffic spike degrade Action Completion or Tool Selection?
Check latency trends: Is response time increasing with load?
Review API Failures: Are you hitting rate limits on external services?
Identify traffic source: Filter by user metadata to see if it's organic growth or a bot attack
If quality stayed high: Document this as validation that your system scales well
If quality degraded: Implement request throttling, add caching layers, or provision more resources
Close the loop after fixing an issue:
Monitor the same metric for 24-48 hours to confirm the fix worked
Update your alert thresholds if you've improved baseline performance
Document the incident with the symptom (what alert fired), root cause (what you found), and solution (what you changed)
Share learnings with your team so the next person can debug faster
Production observability only matters if it guides what you build. Use this dashboard daily to catch failures, weekly to identify patterns, monthly to validate improvements, and yearly to demonstrate that your agent system is creating business impact.
Wrapping Up
Our agent system demonstrates how modern AI orchestration creates sophisticated customer service solutions. By combining LangGraph's agent management with monitoring, you build systems that are both powerful and observable.
The architecture's modularity makes it particularly suitable for enterprises where different teams manage different aspects of customer service. Yes, there's added complexity and some latency overhead. But for production systems handling diverse queries, the benefits of specialization, scalability, and comprehensive monitoring can make it worthwhile.
The complete source code is available in the repository, ready for you to adapt to your own use case. Start with two agents, add monitoring from day one, and scale based on what you learn from real usage. That's how you build AI systems that actually work in production.
You can try Galileo for free to optimise your Langgraph with ease.
Get more insights on building agents in our in-depth eBook:
Choose the right agentic framework for your use case
Evaluate and improve AI agent performance
Identify failure points and production issues

We've all been trapped in a chatbot loop that keeps asking, "I'm sorry, I didn't understand that. Can you rephrase?" Or worse, a bot that cheerfully claims it can help with everything, then fails at everything. You ask about a billing error, but it suggests restarting your router.
Most chatbots attempt to be generalists, but they often end up being proficient in nothing. Real customer service works differently. When you call your telecom provider, you get routed to specialists who actually understand your specific problem.
That's exactly what we're building today! A multi-agent system for ConnectTel that creates delightful customer experiences through intelligent routing to specialized agents. We'll use LangGraph to build it and Galileo for real-time agent metrics, failure insights, and performance tracking so you can continuously improve what's actually working.
Welcome to the fifth installment of the multi-agent series. Don't miss out on other pieces in the Mastering Agent series.
Agent Architecture with Observability

Our system employs a supervisor pattern, where a main coordinator routes customer queries to specialized agents based on the customer's actual needs. We will leverage Chainlit for the conversational UI, Pinecone for vector-based knowledge retrieval, and GPT-4.1 as the underlying LLM.
The beauty of this design is that each agent can be developed, tested, and improved independently. When billing logic changes, you don't touch the technical support code. When you need to add a new capability, you add a new agent without rewriting everything.
However, we don't just want to build the agent; we also want to enable continuous improvement. This will require insights into the failures that happen when our users ask questions we never anticipated and expose edge cases. That is why we are building observability into our system from day one.
Here is how our continuous improvement cycle looks:
Monitor: Track every agent decision, tool call, and routing choice. You need to see not just what agents answer, but how they arrive at those answers. Which agent handled the query? What tools did they use? How long did each step take?
Debug: When things go wrong (and they will), trace through the entire chain. Did the supervisor route to the wrong agent? Did the tool return unexpected data? Did the agent misinterpret the response? Real production failures teach you more than any synthetic test.
Improve: Make targeted fixes based on actual data. Perhaps your billing agent requires more effective prompts for international charges. Technical support may need a timeout for slow diagnostics. The improvements can be measured and you'll know immediately if it worked.
This isn't a one-time optimization. It's an ongoing practice that separates production-ready systems from demos. The architecture we're about to build supports this naturally, with clear separation between agents and comprehensive tracking at every step.
Quick Setup Guide
Let's get this running on your machine. You'll need Python 3.9+, an OpenAI API key, Pinecone and Galileo accounts for the full experience. The complete source code is available in the repository, ready for you to adapt to your own use case.
Installation Steps
First, clone and navigate to the project:
git clone https://github.com/rungalileo/sdk-examples git checkout feature/langgraph-telecom-agent cd sdk-examples/python/agent/langgraph-telecom-agent
Configure your environment variables:
cp .env.example .env # Edit .env with your keys: OPENAI_API_KEY PINECONE_API_KEY GALILEO_API_KEY GALILEO_PROJECT="Multi-Agent Telecom Chatbot" GALILEO_LOG_STREAM="prod"
Install dependencies using uv (recommended) or pip:
# Using uv uv sync --dev # Or using pip pip install -e
In order for agent to act in line with the company guidelines and product, they need to be made aware of the right context. For this, we create a document with the guidelines.
This is the snippet from the troubleshooting guide, which will be chunked and indexed to be used by the Technical Support agent using retrieval.
# Network Troubleshooting Guide ## Common Network Issues and Solutions ### No Signal / No Service #### Symptoms - "No Service" or "Searching" message - Unable to make calls or send texts - No data connection #### Solutions 1. **Toggle Airplane Mode** - Turn ON for 30 seconds - Turn OFF and wait for reconnection 2. **Restart Device** - Power off completely - Wait 30 seconds - Power on 3. **Check SIM Card** - Remove and reinsert SIM - Clean SIM contacts with soft cloth - Try SIM in another device 4. **Reset Network Settings** - iOS: Settings > General > Reset > Reset Network Settings - Android: Settings > System > Reset > Reset Network Settings 5. **Update Carrier Settings** - iOS: Settings > General > About (prompt appears if update available) - Android: Settings > System > Advanced > System Update ### Slow Data Speeds #### Symptoms - Web pages loading slowly - Video buffering frequently - Apps timing out [continued]
Our Plan Advisor and Technical Support agent need the docs to be indexed in Pinecone which is our vector DB.
python ./scripts/setup_pinecone.py or uv run ./scripts/setup_pinecone.py
Once the agent is defined, the Galileo callback can be integrated with Langgraph in just a few lines of code.
from galileo import galileo_context from galileo.handlers.langchain import GalileoAsyncCallback # Initialize Galileo context first galileo_context.init() # Start Galileo session with unique session name galileo_context.start_session(name=session_name, external_id=cl.context.session.id) # Create the callback. This needs to be created in the same thread as the session # so that it uses the same session context. galileo_callback = GalileoAsyncCallback() # Pass the callback to the agent instance supervisor_agent.astream(input=messages, stream_mode="updates", config=RunnableConfig(callbacks=galileo_callback, **config))
Launch the Chainlit interface with our Langgraph application:
chainlit run app.py -w or uv run chainlit run app.py -w
The application will be available at http://localhost:8000. You'll see a chat interface where you can start asking questions about bills, technical issues, or plan recommendations.
Understanding the Core Components
The Supervisor Agent
The supervisor is the brain of our operation. It analyzes incoming queries and routes them to the right specialist. Here's how it works:
def create_supervisor_agent(): """ Create a supervisor agent that manages all the agents in the ConnectTel telecom application. """ checkpointer = MemorySaver() telecom_supervisor_agent = create_supervisor( model=ChatOpenAI(model=os.environ["MODEL_NAME_SUPERVISOR"], name="Supervisor"), agents=[ billing_account_agent, technical_support_agent, plan_advisor_agent, ], prompt=(""" You are a supervisor managing a team of specialized telecom service agents at ConnectTel. Route customer queries to the appropriate agent based on their needs: - Billing Account Agent: Bill inquiries, payment issues, usage tracking - Technical Support Agent: Device troubleshooting, connectivity issues - Plan Advisor Agent: Plan recommendations, upgrades, comparing plans Guidelines: - Route queries to the most appropriate specialist agent - For complex issues spanning multiple areas, coordinate between agents - Be helpful and empathetic to customer concerns """), add_handoff_back_messages=True, output_mode="full_history", supervisor_name="connecttel-supervisor-agent", ).compile(checkpointer=checkpointer) return telecom_supervisor_agent
The supervisor uses LangGraph's built-in create_supervisor
function which handles the complex orchestration logic. This includes message routing based on query analysis, state management across agent interactions, and memory persistence for conversation continuity.
Notice the add_handoff_back_messages=True
parameter. This allows agents to return control to the supervisor when they need help from another agent. It's like a customer service rep saying "Let me transfer you to billing" but happening seamlessly in the background.
Building Specialized Agents
Each specialized agent handles specific types of inquiries. They're built using LangGraph's create_react_agent
pattern, which implements reasoning and action cycles. Let's look at the Billing Account Agent:
from langchain_openai import ChatOpenAI from langgraph.prebuilt import create_react_agent from ..tools.billing_tool import BillingTool def create_billing_account_agent() -> CompiledGraph: """ Create an agent that handles billing inquiries, usage tracking, and account management. """ # Create the billing tool instance billing_tool = BillingTool() # Create a ReAct agent with specialized prompt agent = create_react_agent( model=ChatOpenAI( model=os.environ["MODEL_NAME_WORKER"], name="Billing Account Agent" ), tools=[billing_tool], prompt=(""" You are a Billing and Account specialist for ConnectTel. You help customers with billing inquiries, usage tracking, plan details, and payment issues. Key responsibilities: - Check account balances and payment due dates - Track data, voice, and text usage - Explain charges and fees clearly - Suggest plan optimizations based on usage - Process payment-related inquiries - Review billing history When discussing charges: - Break down costs clearly - Highlight any unusual charges - Suggest ways to reduce bills if usage patterns show opportunity - Always mention auto-pay discounts if not enrolled Be empathetic about high bills and offer solutions. """), name="billing-account-agent", ) return agent
The ReAct pattern means the agent reasons about when to use tools versus when to respond directly. The specialized prompt defines not just what the agent knows, but how it should behave. The empathy instructions are needed for scenarios where people might be frustrated about bills. But how does we measure the empathy? Galileo provides out of the box sentiment metric for measuring the tone.
Creating Effective Tools
Tools are where your agents interact with the real world. They're the bridge between conversation and action. We create a mock billing tool that responds as per the arguments query_type
and customer_id.
The responses are randomly generated and act like real responses for different users.
from typing import Optional from langchain.tools import BaseTool from datetime import datetime, timedelta import random class BillingTool(BaseTool): """Tool for retrieving customer billing and usage information.""" name: str = "billing_account" description: str = "Check account balance, data usage, plan details, and billing history" def _run(self, customer_id: Optional[str] = None, query_type: str = "summary") -> str: """ Get billing and usage information. Args: customer_id: Customer account number (uses default if not provided) query_type: Type of query (summary, usage, plan, history) """ # Mock customer data - in production, this would query real databases customer = { "name": "John Doe", "account": customer_id or "ACC-2024-789456", "plan": "Premium Unlimited 5G", "monthly_charge": 85.00, "data_used": random.uniform(20, 80), "data_limit": "Unlimited", "due_date": (datetime.now() + timedelta(days=15)).strftime("%Y-%m-%d") } if query_type == "usage": return f""" Usage Summary for {customer['name']}: - Data: {customer['data_used']:.1f} GB used ({customer['data_limit']}) - Minutes: {random.randint(300, 800)} (Unlimited) - Texts: {random.randint(500, 2000)} (Unlimited) - Average daily: {customer['data_used'] / 15:.2f} GB """ elif query_type == "plan": return f""" Current Plan: {customer['plan']} - Monthly Cost: ${customer['monthly_charge']:.2f} - Data: {customer['data_limit']} - Talk & Text: Unlimited - 5G Access: Included Available Upgrades: - Business Elite ($120/month) - International Plus ($95/month) """ elif query_type == "history": history = [] for i in range(3): date = (datetime.now() - timedelta(days=30*(i+1))).strftime("%Y-%m-%d") amount = customer['monthly_charge'] + random.uniform(-5, 15) history.append(f"- {date}: ${amount:.2f} (Paid)") return f""" Billing History: {chr(10).join(history)} Auto-pay: Enabled """ # Default summary response return f""" Account Summary for {customer['name']}: - Account: {customer['account']} - Plan: {customer['plan']} - Amount Due: ${customer['monthly_charge']:.2f} - Due Date: {customer['due_date']} - Data Used: {customer['data_used']:.1f} GB ({customer['data_limit']}) """
The tool uses mock data for demonstration purposes, but the underlying structure is production-ready. The key is the clear description field, which helps agents understand when to use the tool. The flexible parameters support various query types, and the formatted output is easily readable by humans.
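For a quick sanity check during development, you can call the tool outside the agent graph. This is purely illustrative; the account ID below is made up:

# Smoke-test the mock tool directly, bypassing the LLM.
tool = BillingTool()
print(tool._run(query_type="usage"))                                 # usage breakdown for the default account
print(tool._run(customer_id="ACC-2024-001122", query_type="plan"))   # hypothetical account ID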
How It Works in Practice
Let's walk through real conversations to see the system in action.
Example 1: Multi-Domain Query
When a user says, "My internet has been slow for the past week and I'm wondering if I'm being throttled because of my data usage. Can you check my current usage and bill?" here's what happens:

The supervisor first routes to the Billing Agent, which checks data usage and plan limits. Then it routes to the Technical Support Agent, which checks for connectivity issues. Finally, the supervisor combines both responses into a coherent answer.
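The exact supervisor wiring lives elsewhere in the repo, but as a rough illustration of the pattern, a supervisor over these worker agents could be assembled with the langgraph-supervisor package. The factory functions for the other agents and the MODEL_NAME_SUPERVISOR variable are assumptions here:

import os

from langchain_openai import ChatOpenAI
from langgraph_supervisor import create_supervisor

# Worker agents; the technical support and plan advisor factories mirror the billing one.
workers = [
    create_billing_account_agent(),
    create_technical_support_agent(),  # hypothetical factory
    create_plan_advisor_agent(),       # hypothetical factory
]

workflow = create_supervisor(
    workers,
    model=ChatOpenAI(model=os.environ["MODEL_NAME_SUPERVISOR"]),  # assumed env var
    prompt=(
        "You are the ConnectTel supervisor. Route billing and usage questions to "
        "billing-account-agent, connectivity issues to technical-support-agent, and "
        "plan questions to plan-advisor-agent. Combine their answers for the user."
    ),
)
app = workflow.compile()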
Here is how it looks in the Galileo console, where we can see the traces, spans and metrics of the session.

Galileo automatically captures agent routing decisions, tool invocations with inputs and outputs, all LLM interactions and response times for each step.
Debugging Agents with Metrics
The real value of Galileo comes from understanding how your agents actually perform in production. Let's look at the key metrics.
Action Completion
Action Completion indicates whether your agents are actually assisting or merely responding. It's the difference between an agent that says "I'll check your bill" and one that actually retrieves and explains the charges.

Here's what makes an action complete in our telecom system. The agent must provide a complete answer (not just acknowledge the question), confirm successful actions when handling requests, maintain factual accuracy throughout, address every aspect of the user's query, avoid contradicting tool outputs, and properly summarize all relevant information from tools.

When a customer asks "Why is my bill so high this month?", a complete action retrieves the actual bill, identifies the specific charges causing the increase, compares it to previous months, and suggests ways to reduce it. That's what we're measuring.
A score below 80% usually indicates your agents are either not using their tools properly, providing generic responses instead of specific answers, or failing to follow through on multi-step processes. The beauty of tracking Action Completion is that it directly correlates with customer satisfaction.
Click "Configure Metrics" to open the Metrics Hub and enable the metrics.

Tool Selection Quality
Tool Selection Quality reveals whether your agents are choosing the right tools with the right parameters. It's like having a toolbox—knowing when to use a hammer versus a screwdriver makes all the difference.

If a customer asks, "I want to upgrade my plan," the Plan Advisor agent needs to select the right sequence of tools: first the billing tool to check current usage, then the plan comparison tool to find suitable upgrades, and finally the upgrade tool to process the change. If it skips straight to the upgrade tool without checking usage, it might recommend a plan that doesn't fit the customer's needs.
The metric evaluates two critical aspects: did the agent choose the correct tool, and did it use the correct parameters? In our system, we often see agents choose the right tool but set the parameters incorrectly.
A score below 80% is concerning. It means your agents are winging it instead of using their capabilities. The most common issue is that agents are either too eager or too reluctant to use tools. Too eager means calling tools for questions they can answer directly ("What are your business hours?"). Too reluctant means answering from general knowledge when they should check specific data.

Metric explanations help you quickly identify why your agent is underperforming. When you see low Tool Selection Quality scores, click into the generated reasoning to understand exactly what went wrong. Look for agents consistently choosing the wrong tool for certain query types. Then enhance your tool descriptions.
For example, in this case the explanation reveals that "transferring to the technical-support-agent here does not align well with the user's expressed needs." With that, you've pinpointed a routing logic issue. Use these insights to refine your routing criteria to better match user intent patterns, add conditional checks that evaluate conversation context before transfers, or create intermediate clarification steps for ambiguous queries.
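One low-effort fix is to give the supervisor explicit routing examples. The snippet below is illustrative, not the repo's actual prompt; you would append it to the supervisor's existing instructions:

# Hypothetical prompt fragment with explicit routing examples for ambiguous queries.
ROUTING_EXAMPLES = """
Routing examples:
- "Why is my bill so high this month?"            -> billing-account-agent
- "My internet keeps dropping at night."          -> technical-support-agent
- "Am I being throttled because of my data use?"  -> billing-account-agent first
  (check usage), then technical-support-agent if usage is within plan limits
- "Do you have unlimited plans?"                  -> plan-advisor-agent
"""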

The explanations also highlight limited capabilities. If the reasoning notes "tools available only allow transferring to one of the three agents," you know you need to make architectural changes to allow parallel processing.
Latency Breakdown

The latency trace graph reveals the execution timeline and performance characteristics of our system. Understanding where time is spent helps optimize the user experience by finding the bottlenecks. Galileo’s latency trace graph provides several valuable insights into system behavior and performance:
1. Agent Coordination Patterns: The trace reveals how agents collaborate in real time. We can see the Supervisor consistently orchestrating the conversation while specialized agents (Billing, Technical Support, Plan Advisor) activate only when their expertise is needed. This validates the efficiency of our multi-agent architecture and confirms that agents aren't running unnecessarily.
2. Bottleneck Identification: By examining the duration and frequency of each operation, we can pinpoint performance bottlenecks. The call_model operations exhibit the most significant latency contributions, indicating that LLM inference is the primary factor affecting response time, rather than transfer logic or retrieval operations.
3. Decision Flow Understanding: The should_continue markers illustrate how the system makes routing decisions throughout the conversation. Multiple checks ensure the conversation flows appropriately between agents, and we can trace exactly when and why transfers occur.
4. Retrieval Timing: The sparse appearance of pinecone_retrieval operations shows that knowledge base queries are triggered selectively rather than on every turn, indicating intelligent retrieval logic that balances accuracy with performance.
5. System Responsiveness: The overall timeline demonstrates that despite multiple agent handoffs and model calls, the system maintains reasonable end-to-end latency. This validates that our multi-agent approach doesn't introduce prohibitive overhead compared to a single-agent system.
Action items from the latency trace graph:
Implement prompt caching if model calling operations consistently show high latency
Switch to faster models for routing/decision logic while keeping powerful models for final responses
Parallelize retrieval with other operations instead of running sequentially (see the sketch below)
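A minimal sketch of that last item, assuming the retrieval and routing steps are exposed as async helpers (retrieve_context, route_query, and run_agent are hypothetical names):

import asyncio

async def handle_turn(query: str) -> str:
    # Kick off the Pinecone lookup and the routing decision concurrently
    # instead of waiting for one before starting the other.
    context_task = asyncio.create_task(retrieve_context(query))  # hypothetical async retrieval
    routing_task = asyncio.create_task(route_query(query))       # hypothetical async routing call
    context, route = await asyncio.gather(context_task, routing_task)
    # Dispatch to the chosen specialist with the retrieved context already in hand.
    return await run_agent(route, query, context)                # hypothetical dispatcher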
These insights help diagnose issues in production and validate architectural choices in our conversational AI system.
Performance Benchmarks for Production
Once your multi-agent system moves from development to production, establishing clear performance targets becomes critical for maintaining user satisfaction and operational excellence. These benchmarks serve multiple purposes: they provide guardrails for deployment decisions, enable objective evaluation of system changes, and help teams identify when performance degradation requires immediate attention.
The metrics below are derived from real-world production data across multiple deployments and correlate directly with user satisfaction scores. Systems that consistently hit "Excellent" targets see significantly higher user retention and completion rates, while those falling into "Needs Improvement" typically generate support tickets and abandoned sessions.
Here are the targets to aim for, grounded in what users actually tolerate and what drives satisfaction scores.
| Metric | Excellent | Good | Needs Improvement |
| --- | --- | --- | --- |
| Action Completion | > 95% | 85-95% | < 80% |
| Tool Selection Quality | > 90% | 85-90% | < 85% |
| Avg Response Time | < 2s | 2-4s | > 4s |
| Supervisor Routing Accuracy | > 95% | 90-95% | < 90% |

Continuous Improvement with Insights
The moment your multi-agent system hits production, it encounters edge cases you never tested. A user asks "Why is my bill higher this month?" and gets transferred between Billing and Technical Support three times. Your logs show 47 LLM calls, 12 tool invocations, and 8 agent handoffs—but which one failed?
Traditional debugging forces you to:
Manually inspect hundreds of traces to find patterns
Guess whether the issue is in your supervisor's routing prompt, the Billing Agent's tool descriptions, Pinecone retrieval returning irrelevant context, or the fundamental agent architecture
Reproduce failures locally, which often behave differently than production
Wait until multiple users complain before you even know there's a problem
The Insights Engine automates this investigation. Instead of spending hours hunting through traces, Galileo analyzes your entire log stream and surfaces exactly what's broken.


Let's look at the two insights generated in our example and understand how they can guide us in making the agent reliable.
Context Memory Loss Detection: When Galileo identifies agents re-asking for information already provided (like the Plan Advisor requesting usage details after the Billing Agent just discussed them), it pinpoints exactly where conversation context breaks down. The insight shows you the specific span where memory was lost and provides a concrete fix: implement state persistence across agent handoffs or add a shared memory layer. This prevents frustrating user experiences where customers must repeat themselves.
Inefficient Tool Usage Patterns: The Multiple Retrieval Calls insight reveals when agents make redundant API calls that could be batched. Instead of manually reviewing hundreds of traces to find this pattern, Galileo shows you the exact sessions where the Plan Advisor queried the retrieval tool three separate times for different plan categories. The suggested action is immediate: refactor your tool calling logic to accept multiple query parameters or combine similar requests into a single retrieval operation, cutting API costs and reducing latency.
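The batching fix can be small. As a rough sketch (not the repo's actual tool), a retrieval tool that accepts several plan categories per call replaces three separate tool calls with one; query_knowledge_base is a hypothetical stand-in for the real Pinecone lookup:

from typing import List

from langchain.tools import BaseTool

class PlanRetrievalTool(BaseTool):
    """Retrieve plan details for one or more plan categories in a single tool call."""

    name: str = "plan_retrieval"
    description: str = "Look up plan details; pass a list of categories to fetch them all at once"

    def _run(self, categories: List[str]) -> str:
        # One tool invocation covers all requested categories; the lookups themselves
        # can additionally be batched or parallelized inside this method.
        results = [query_knowledge_base(f"{category} plans") for category in categories]  # hypothetical helper
        return "\n\n".join(results)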
Each insight includes:
Timeline view showing when and how often the issue occurs
Example sessions with direct links to problematic traces
Impact analysis quantifying affected spans over the last two weeks
Use the "Refresh insights" button after deploying fixes to validate your improvements and track whether the issue frequency decreases.
Agent Observability in Production

Agent Quality
Agent Quality metrics capture the dimensions that directly impact user experience.
Action Completion tracks the percentage of user intents that successfully complete end-to-end. A 95% rate means that the vast majority of requests are fulfilled; however, the remaining 5% represents opportunities for improvement. Combined with Insights Engine, you can identify which specific actions fail most often and why.
Tool Selection Quality measures whether agents choose the right tools for each situation. The 98% score shown indicates highly accurate routing decisions like selecting the appropriate booking tool, knowledge base, or specialized agent for each user need. This is particularly critical for multi-agent systems, where incorrect routing can cascade into unnecessary handoffs and user frustration.
System Metrics
While agent quality measures "did it work correctly," system metrics track "did it work efficiently." These operational metrics ensure your system remains responsive and reliable as traffic scales.
Latency: Average response time reveals how long users wait. The 5.95 second average represents reasonable performance for complex multi-agent interactions. Context matters: simple greetings should respond in under 1 second, single-tool calls in 2-4 seconds, complex workflows in 8-10 seconds. Anything exceeding 15 seconds risks user abandonment.
API Failures: Tracks the reliability of external integrations. Zero failures indicate all API calls succeed. Even small increases deserve investigation as API failures cascade through workflows, causing incomplete actions. Use Insights Engine to identify which APIs fail and under what conditions.
Traces Count: Tracks conversation volume and usage patterns. Spikes indicate increased adoption; drops might signal access issues. If Action Completion drops when traces spike, you have scaling issues.
Custom Metrics for Business-Specific Insights
While Galileo provides comprehensive, out-of-the-box metrics, the real power lies in tracking what matters to your specific business. Let's create a custom metric to track something critical for telecom: unlimited plan recommendations.
In our system, we want to know when agents should be suggesting unlimited plans. Users approaching their data limits are prime candidates for upgrades, but are our agents catching these opportunities?

The LLM judge (GPT-4o in this case) assesses whether the agent correctly identified the opportunity and made an appropriate suggestion.
The evaluation criteria are straightforward:
Successful: The user clearly requested an unlimited plan, and the agent accurately suggested one
Fail: The user requested an unlimited plan, but the agent didn't suggest it or suggested something incorrect
Unrelated: The user's request wasn't about unlimited plans, or the response was unrelated
This metric becomes powerful when combined with business data. If we see high data usage customers not receiving unlimited plan suggestions (lots of "Fail" ratings), we know exactly where to improve our agent prompts. Maybe the billing agent needs clearer instructions about upgrade thresholds.
The beauty of custom metrics is they bridge the gap between generic performance tracking and specific business outcomes. You're not just measuring whether agents work but whether they drive the behaviors that matter to your business.
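To make the criteria concrete, here is an illustrative sketch of the kind of LLM judge that sits behind a metric like this. The prompt and helper are assumptions for illustration, not Galileo's actual template; in practice you would register the metric through the Galileo console:

from langchain_openai import ChatOpenAI

# Hypothetical judge prompt mirroring the three labels described above.
JUDGE_PROMPT = """You are grading a telecom support exchange.
User message: {user_message}
Agent response: {agent_response}

Reply with exactly one word:
- Successful: the user asked about unlimited plans and the agent accurately suggested one
- Fail: the user asked about unlimited plans but the agent did not suggest one, or suggested the wrong thing
- Unrelated: the request was not about unlimited plans
"""

def judge_unlimited_plan(user_message: str, agent_response: str) -> str:
    judge = ChatOpenAI(model="gpt-4o", temperature=0)
    reply = judge.invoke(
        JUDGE_PROMPT.format(user_message=user_message, agent_response=agent_response)
    )
    return reply.content.strip()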

Looking at this real interaction, you can see the custom metric in action. When a user asks, "hi i am interested to know if you have unlimited plans," the system correctly routes through the Supervisor → Plan Advisor Agent.
Notice how our suggested_unlimited_plan metric evaluates each step: the Supervisor gets marked as "Successful" for correctly routing to the Plan Advisor, the Plan Advisor gets "Successful" for providing relevant unlimited plan information, and the final Supervisor step is "Successful" for delivering the complete response. The agent responds with specific unlimited plan options and even guides the user to connect with the Plan Advisor Agent for personalized recommendations.
The metric confirms that our agents are accurately capturing the intent and responding appropriately. Over time, this data helps us understand conversion opportunities.
Making Observability Actionable
Raw metrics only matter if they drive action. The Trends dashboard transforms observability from passive monitoring into an active improvement system. Here's exactly how to use it:
Week 1: Establish Your Baseline
Run your chatbot for a week and document normal performance. For the Telecom Chatbot, that's:
90% Tool Selection Quality
95% Action Completion
~6 second latency
These numbers serve as your quality floor; any drop below them should trigger an investigation.
Daily: Catch Regressions Immediately
Check the 12H or 1D view each morning. If you see Action Completion suddenly drop from 95% to 85%:
Click the degraded time period to filter traces
Open Log Stream Insights to see which specific agent is failing
Investigate immediately and don't wait for customer complaints
Weekly: Identify Patterns and Plan Fixes
Switch to the 1W view for sprint planning:
Look for recurring patterns
Click into those traces to discover why users behave differently
Add training examples to address the pattern
Track whether the fix works by comparing next week's metrics
Monthly: Validate Major Changes
Before deploying major updates, snapshot your 1M metrics:
Baseline before change: Action Completion 95%, Latency 5.95s
After deployment: Check if metrics stayed stable or degraded
Document the impact: "Added Plan Advisor agent → metrics remained stable → validates architecture"
Configure Alerts

Configure these alerts with specific investigation steps:
Critical Alert: Action Completion < 90%
When triggered, check in this order:
Filter traces to the alert time window and sort by Action Completion score
Identify the failing agent: Is it Billing, Technical Support, or Plan Advisor?
Review Log Stream Insights for automated root cause analysis
Check the latency trace graph: Are tool calls timing out before completion?
Inspect failed tool calls: Are Pinecone retrievals returning empty results? Are transfer functions throwing errors?
Verify external dependencies: Is your knowledge base or CRM API down?
Immediate mitigation: If one agent is broken, route all traffic to working agents while you fix it
Warning Alert: Tool Selection Quality < 85%
Investigation steps:
Click the degraded time period to filter affected sessions
Check which tools are being selected incorrectly: Are users being sent to Technical Support when they need Billing?
Review the supervisor's routing decisions: Open 5-10 failed traces and read the should_continue reasoning
Look for new query patterns: Has user language changed? Are they asking about a new product you haven't trained for?
Compare successful vs. failed routing: What keywords appear in correctly routed queries that are missing in failures?
Fix: Update supervisor prompt with explicit examples of the new query patterns, or enhance tool descriptions with the missing keywords
Performance Alert: Latency > 8 seconds
Debugging checklist:
Switch to latency trace graph to identify the bottleneck operation
Check if LLM operations spiked: Did the LLM provider have an outage or slowdown?
Inspect retrieval latency: Are Pinecone queries taking longer than usual? Check their status page
Count sequential operations: Did a recent change add extra model calls or tool invocations?
Review Traces Count metric: Is high traffic causing queuing delays?
Check for retry loops: Are agents getting stuck in should_continue cycles, repeatedly checking transfer conditions?
Quick fix: Implement timeout thresholds to fail fast instead of waiting for slow operations (a sketch follows this checklist)
Long-term fix: Add prompt caching, parallelize retrieval with routing, or upgrade to faster model tiers
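A minimal sketch of the fail-fast idea, assuming the slow operation is wrapped in an async helper (slow_model_call is a hypothetical placeholder for the model or tool invocation):

import asyncio

async def call_with_timeout(query: str, timeout_s: float = 8.0) -> str:
    try:
        # Cap how long we are willing to wait for the slow operation.
        return await asyncio.wait_for(slow_model_call(query), timeout=timeout_s)  # hypothetical async call
    except asyncio.TimeoutError:
        # Fail fast with a graceful message instead of leaving the user hanging.
        return "Sorry, that is taking longer than expected. Please try again in a moment."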
Reliability Alert: API Failures > 5%
Immediate actions:
Check the System Metrics panel to see which time periods have failures
Filter traces with errors and group by failure type
Identify the failing service: Pinecone retrieval? LLM API? Transfer functions?
Review error messages in failed spans: Are they rate limits, server errors, or timeouts?
Check your external service status pages: Is Pinecone, your LLM provider, or your internal CRM experiencing issues?
Review recent code deployments: Did you introduce a bug in tool calling logic?
Emergency response: Implement circuit breakers to stop calling failing services, show graceful error messages to users
Post-incident: Add retries with exponential backoff (see the sketch below), or switch to backup services
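A generic backoff sketch, not tied to the repo; call_service stands in for any flaky external call such as a Pinecone query or CRM lookup:

import random
import time

def call_with_backoff(call_service, max_attempts: int = 4):
    for attempt in range(max_attempts):
        try:
            return call_service()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff with jitter: roughly 1s, 2s, 4s, ...
            time.sleep((2 ** attempt) + random.random())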
Volume Alert: Traces Count > 1M/day
Capacity check:
Verify quality metrics remained stable: Did the traffic spike degrade Action Completion or Tool Selection?
Check latency trends: Is response time increasing with load?
Review API Failures: Are you hitting rate limits on external services?
Identify traffic source: Filter by user metadata to see if it's organic growth or a bot attack
If quality stayed high: Document this as validation that your system scales well
If quality degraded: Implement request throttling, add caching layers, or provision more resources
Close the loop after fixing an issue:
Monitor the same metric for 24-48 hours to confirm the fix worked
Update your alert thresholds if you've improved baseline performance
Document the incident with the symptom (what alert fired), root cause (what you found), and solution (what you changed)
Share learnings with your team so the next person can debug faster
Production observability only matters if it guides what you build. Use this dashboard daily to catch failures, weekly to identify patterns, monthly to validate improvements, and yearly to demonstrate that your agent system is creating business impact.
Wrapping Up
Our agent system demonstrates how modern AI orchestration creates sophisticated customer service solutions. By combining LangGraph's agent management with monitoring, you build systems that are both powerful and observable.
The architecture's modularity makes it particularly suitable for enterprises where different teams manage different aspects of customer service. Yes, there's added complexity and some latency overhead. But for production systems handling diverse queries, the benefits of specialization, scalability, and comprehensive monitoring can make it worthwhile.
The complete source code is available in the repository, ready for you to adapt to your own use case. Start with two agents, add monitoring from day one, and scale based on what you learn from real usage. That's how you build AI systems that actually work in production.
You can try Galileo for free to optimize your LangGraph agents with ease.
Get more insights on building agents in our in-depth eBook:
Choose the right agentic framework for your use case
Evaluate and improve AI agent performance
Identify failure points and production issues



Pratik Bhavsar