Jan 21, 2026
Context Engineering at Scale: How We Built Galileo Signals


Bipin Shetty
AI/Software Engineer


What if your evaluation system got smarter after every failure?
We built something that shouldn't work: an AI system that maintains perfect memory of every issue it's ever detected across your entire agent infrastructure—then uses that knowledge to spot patterns you'd never think to look for.
Why shouldn't it work? Three fundamental constraints made this nearly impossible. First, LLM context windows are limited, and our test dataset was a 25MB file that overwhelmed every one of them. Second, simple memory solutions like RAG didn't capture the nuance of AI failure modes—you need to understand not just what happened, but how patterns evolve over time. Third, the cost at scale would explode across an enterprise customer base if we weren't careful about how we processed data.
This is why most LLM features are stateless. You ask, they answer, they forget. Galileo Signals takes a different approach, maintaining condensed institutional knowledge across weeks and months. Each new analysis builds on every previous finding. It's like having a senior engineer who's reviewed every trace your system has ever produced and can instantly recognize when a new problem matches an old pattern.
The core challenge we set out to solve was this: How do you detect "unknown unknowns" in agentic systems at production scale without exploding costs? Traditional observability is reactive—you write evals for what you know can fail. But agents fail in ways too subtle for human-defined metrics. For example, an agent might leak data between customers with similar names across multi-turn conversations. No metric catches this because you didn't know to look for it. No search query finds it because you don't know what to search for. But Signals does, because it's designed to find problems you don't know exist.
Why This Is Actually Hard
When we started building this, the naive approach seemed tempting: just send your logs to GPT and ask it what went wrong. This fails immediately for several reasons that aren't obvious until you try it.
First, stateless analysis means you get the same observations repeated every run. The LLM tells you about a tool error on Monday, then tells you about the same tool error on Friday, with no understanding that it has seen this pattern before. Second, there's no pattern recognition across time windows—you can't detect that issues from Week 1 and Week 4 are manifestations of the same underlying problem. Third, running LLM inference per-trace results in a cost explosion, making the approach economically unviable at scale. And finally, you simply can't handle the data volume—25MB+ of trace data per run is too large to process in a single shot.
Our team quickly realized we were facing what became the trendiest problem of 2025: context engineering. We needed near-perfect compression of previous context while limiting the current batch to a representative sample over a variable time period. The challenge wasn't just making things smaller; it was preserving exactly the right information while discarding everything else.
The Architecture: Solving Compression at Three Levels
We solved this challenge with a multi-stage pipeline, where each stage addresses a specific constraint in the system.
Step 1.1 applies lossless programmatic compression: whitelisting relevant fields, deduplicating tool schemas, compressing repeated messages. This reduces the raw spans without sacrificing relevant information for pattern detection.
Step 1.2 uses an advanced reasoning model to distill each session into structured notes that capture "everything noteworthy" in dramatically less space (~500KB total). These notes preserve what matters—which patterns occurred and in which spans—while discarding verbosity.
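To make this concrete, here's a minimal sketch of what a per-session note-taking call might look like. The llm client, the prompt wording, and the note format are illustrative assumptions rather than the production implementation; the point is that each session is distilled independently into a small, structured note.

# Hypothetical sketch of per-session note-taking (Step 1.2).
# `llm.analyze`, the prompt, and the note format are illustrative only.

NOTE_PROMPT = """You are taking notes on a single agent session.
Record everything noteworthy: tool errors, retries, policy deviations,
unusual outputs, and the span IDs where each occurred.
Keep the notes terse and structured.

Session (compressed spans):
{session}
"""

def distill_session(session_id, compressed_spans):
    # One call per session keeps each request well within the
    # reasoning model's context window.
    note = llm.analyze(NOTE_PROMPT.format(session=compressed_spans),
                       model="claude-sonnet-4")
    return {"session_id": session_id, "note": note}

def distill_all(sessions):
    # The combined notes are tiny relative to raw spans (~500KB total),
    # so Step 2 can read all of them in a single context window.
    return [distill_session(sid, spans) for sid, spans in sessions.items()]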
Finally, Step 2 ingests all notes together (now small enough to fit in one context window) along with a historical summary of previously identified signals, and uses an LLM to perform cross-session pattern detection, generating up to 5 priority-ranked signal cards. This architecture maintains the critical "see everything at once" property needed for detecting systemic issues while working within the practical constraints of context windows and API costs.
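For concreteness, you can think of a signal card as a small structured record. The fields below are an assumption inferred from what's described in this post (a priority, a description, supporting traces), not the exact production schema.

from dataclasses import dataclass, field

@dataclass
class SignalCard:
    # Hypothetical shape of a signal card, inferred from this post.
    signal_id: int
    title: str                     # e.g. "Cross-customer data leakage"
    description: str               # what the pattern is and why it matters
    priority: int                  # 1 (info) .. 10 (critical); see the triage sidebar
    trace_ids: list[str] = field(default_factory=list)  # supporting evidence
    suggested_fix: str | None = None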
The Full Pipeline
Raw Spans (25MB)
  ↓ [Step 1.1] Programmatic Compression
Compressed Spans
  ↓ [Step 1.2] LLM Note-Taking
Distilled Notes
  ↓ [Step 2] Signal Generation
Signal Cards (5 max, priority-ranked)
The design choice here is deliberate: we use two-stage LLM processing only because context windows aren't infinite. Everything else is single-pass to maximize information flow between stages and ensure the system can draw connections across the entire dataset.
But compression alone doesn't solve the hardest part of the problem. The real breakthrough came from how we handle memory and pattern recognition over time.
Context Engineering: The Secret Sauce
To understand why our approach works, it helps to see what doesn't work. Here's how traditional "chat with logs" features operate:
User Query → LLM (fresh logs) → Answer → Forget Everything
Every interaction is independent. The LLM has no memory of what it told you yesterday, what patterns it's seen before, or how current behavior compares to historical baselines. You're starting from scratch every time.
Here's how Signals works differently:
def generate_insights(new_traces, historical_signals):
    # Load condensed knowledge from all previous runs
    context = compress_historical_patterns(historical_signals)

    # LLM gets BOTH new data AND institutional memory
    prompt = f"""
    Historical patterns detected:
    {context}

    New traces to analyze:
    {new_traces}

    Task:
    1. If new traces match existing patterns, add to that signal
    2. If genuinely novel pattern, create new signal
    3. Do NOT duplicate existing signals
    """

    return llm.analyze(prompt, model="claude-sonnet-4")
The key innovation is that the system maintains institutional memory. New traces automatically group with existing patterns. Signals evolve and compound over time. The LLM sees the full picture, not just today's data. This creates a fundamentally different capability—the system gets smarter with every run instead of starting fresh each time.
This approach required us to solve several subtle technical challenges that weren't obvious at the start.
Preventing Duplicate Detection
Having institutional memory creates a new problem: without careful prompt engineering, an LLM will generate "New signal: Tool call errors in checkout flow" on every run, even if it found this exact pattern 3 weeks ago. The LLM doesn't naturally understand that finding the same pattern again isn't news—it's confirmation.
Our solution has three parts. First, we pass ALL existing signal IDs and descriptions to the LLM in every request. Second, we give explicit instructions: "These patterns already exist - do not recreate them." Third, and most importantly, we ask the LLM to perform structured reasoning: "Is this trace a new manifestation of signal #47 or genuinely novel?"
This turns out to be non-trivial. The LLM must reason about similarity across different failure manifestations, since the same root cause can manifest differently across contexts. This requires balancing granularity (creating too many narrowly-defined signals) against aggregation (grouping too many distinct issues together).
Here's what the LLM's reasoning process looks like (from extended thinking):
New trace 045 shows agent calling get_weather with invalid format. This matches the pattern in existing Signal #7. The error message is identical to examples already in Signal #7. Conclusion: NOT a new pattern. Add trace 045 to Signal #7.
versus:
New trace 051 shows agent exposing customer A's order to customer B. Existing signals: tool errors (#7), hallucinations (#12), latency (#3). None describe cross-customer data leakage. Conclusion: NEW pattern. Create signal about privacy violation.
The system has learned to distinguish between "I've seen this before" and "this is genuinely new." That distinction is what makes the institutional memory useful rather than just noisy.
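One way to make this judgment auditable is to ask the model for a structured verdict rather than free-form prose. The sketch below illustrates that idea; the prompt wording, the decision format, and the llm.analyze call are hypothetical, not Galileo's production code.

# Hypothetical sketch of the duplicate-prevention step. The prompt and the
# structured decision it asks for are illustrative, not the production code.

DEDUP_PROMPT = """Existing signals (these patterns already exist - do NOT recreate them):
{existing}

New session notes:
{notes}

For each pattern you observe, return one decision:
- matches_existing: the signal id it belongs to, with a one-sentence justification
- new_signal: a short title and description, only if no existing signal fits
"""

def triage_against_memory(existing_signals, session_notes):
    # Every existing signal's ID and description is passed on every request.
    existing = "\n".join(
        f"#{s.signal_id}: {s.title} - {s.description}" for s in existing_signals
    )
    return llm.analyze(DEDUP_PROMPT.format(existing=existing, notes=session_notes),
                       model="claude-sonnet-4")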
The First Stage: Programmatic Compression
Before any LLM even sees the data, we apply lossless compression that reduces size without losing information relevant to pattern detection. This stage is deterministic and fast.
Field whitelisting drops project metadata, user IDs, and timestamps while keeping only behavior-relevant fields. Tool definition deduplication replaces repeated tool schemas with "<same as previous>". Message accumulation detection recognizes that in stateful LLM interactions, messages accumulate predictably, so we compress repeated context into references.
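Here's a rough sketch of what this pass could look like in practice. The span field names and whitelist are assumptions about a generic span schema, not Galileo's actual one, and the message-accumulation step is omitted for brevity.

import json

# Hypothetical sketch of the lossless compression pass (Step 1.1).
# Field names are illustrative; message-accumulation compression is omitted.

BEHAVIOR_FIELDS = {"role", "content", "tool_name", "tool_input",
                   "tool_output", "tool_schema", "error"}

def compress_spans(spans):
    compressed, seen_schemas = [], set()
    for span in spans:
        # Field whitelisting: keep only behavior-relevant fields.
        slim = {k: v for k, v in span.items() if k in BEHAVIOR_FIELDS}

        # Tool definition deduplication: keep the first copy of a schema,
        # replace identical later copies with a short reference.
        schema = slim.get("tool_schema")
        if schema is not None:
            key = json.dumps(schema, sort_keys=True)
            if key in seen_schemas:
                slim["tool_schema"] = "<same as previous>"
            else:
                seen_schemas.add(key)

        compressed.append(slim)
    return compressed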
The result: 25MB → ~7MB with zero information loss for pattern detection purposes. A roughly 3x reduction still leaves far too much data for single-pass analysis, but it's enough to make the LLM note-taking stage tractable.
Real-World Validation
Building a system that detects unknown problems creates an obvious challenge: how do you validate that it works? You can't create a labeled test set of "unknown unknowns" because by definition, you don't know what they are. We needed a different approach.
The NVIDIA Stress Test
NVIDIA provided us with 25MB of production traces containing known issues, but critically, they didn't tell us what those issues were. The challenge was simple: Can your system surface our problems without any hints?
This became our primary validation dataset for v2.0 development. We iterated on prompt engineering, compression strategies, and pattern recognition until Signals reliably identified their problematic patterns autonomously. The test proved the system could work at enterprise data scales and find real issues that engineers cared about.
Production Deployments
Beyond controlled testing, we needed to see how Signals performed in actual production environments with real users. A large agent platform customer runs weekly automated batches, generating 200-300 signals per week. This proves Signals works as an operational practice, not just a one-off analysis tool.
Across our early access customers, we see hundreds of signals generated daily, spanning priority 1 (interesting but low-urgency patterns) to priority 10 (critical failures requiring immediate attention). The distribution tells us something important: most AI systems have a handful of critical issues and a long tail of optimization opportunities.
Patterns We've Actually Detected
The real validation comes from the kinds of patterns Signals surfaces in production. These are issues that wouldn't be caught by traditional metrics or manual inspection.
The Customer Data Leak (Priority 10): In a multi-turn airline agent conversation, the agent matched customers by name only rather than unique ID. This caused the agent to retrieve Customer A's booking history while responding to Customer B. The pattern appeared across three sessions before Signals caught it. This is what we call a hybrid failure—simultaneously a security vulnerability and a hallucination, which makes it particularly insidious.
Tool Call Cascade (Priority 8): Each tool in a 5-step workflow had an acceptable ~5% failure rate when viewed in isolation. But across 100 workflows, 23 ended with incorrect output due to unrecovered tool failures. The pattern was invisible at the span level—each individual tool call looked fine—but obvious at the session level when you could see the cascade effect.
Policy Drift (Priority 7): A customer service agent gradually stopped following refund policies over a period of weeks. Week 1 showed strict adherence. Week 2 began approving gray areas. Week 3 approved an explicitly prohibited case. A metric failure did fire in Week 3, but as a one-off it could easily have been dismissed as an outlier. Instead, Signals identified the drift over time and suggested a fix to restore compliance.
These examples illustrate what makes unknown unknowns hard to catch: they're either too subtle to trigger simple thresholds, too distributed to see in local views, or too gradual to notice in real-time monitoring.
TECHNICAL SIDEBAR: The Priority Triage System
Signals are priority-ranked 1-10 using a decision tree that considers multiple factors:
Priority 8-10 ("error"): Failures in application output
Privacy/compliance risks → 10
Monetary/reputational risks → 9
User frustration → 8
Priority 4-7 ("warning"): Recoverable issues, inefficiencies
Tool errors with recovery
Unnecessary tool usage
Inconsistent outputs
Priority 1-3 ("info"): Notable patterns, correct edge case handling
The scoring considers severity of impact, confidence in assessment, difficulty to discover manually, and actionability of the signal. The system fundamentally answers: "If the user never saw this, how bad would that be?" The highest priority signals represent the biggest loss from not knowing.
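As a rough illustration, the decision tree might reduce to something like the sketch below. The attribute names and the exact placement within each band are assumptions; the real scoring also weighs confidence, discoverability, and actionability.

# Hypothetical sketch of the triage decision tree. Attribute names and exact
# placement within each band are assumptions.

def triage_priority(signal):
    # Priority 8-10 ("error"): failures in application output
    if signal.category == "error":
        if signal.privacy_or_compliance_risk:
            return 10
        if signal.monetary_or_reputational_risk:
            return 9
        return 8  # user frustration
    # Priority 4-7 ("warning"): recoverable issues, inefficiencies
    if signal.category == "warning":
        return 5  # exact position in the band depends on severity
    # Priority 1-3 ("info"): notable patterns, correct edge-case handling
    return 2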
Closing the Loop: From Signals to Guardrails
Detecting unknown problems is valuable, but the real power comes from converting discoveries into ongoing monitoring. After identifying an important signal, users typically want to create an eval metric to track it over time. This used to require manually writing evaluation logic, which was tedious and error-prone.
We realized the same context used to identify the signal could be used to generate the eval automatically. This closes what we call the eval engineering loop: your observability system teaches itself what to watch for.
Here's how it works in practice:
Signals detects an unknown pattern: "LLM ignores error messages in tool calls"
User reviews the Signal and clicks "Create Metric" from the Signal detail page
System generates custom eval checking for repeated tool calls with the same error
Today's unknown unknown becomes tomorrow's known guardrail
The system goes from discovering a problem to preventing future occurrences of it, with minimal manual effort. This is particularly powerful because it means your evaluation suite evolves with your system—as new failure modes emerge, each can be promoted to a monitored metric with a single click.
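To make the loop concrete, here's what a generated eval for the "repeated tool calls with the same error" pattern might boil down to. The trace and span attributes are hypothetical, and in Signals this checker is drafted by an LLM from the signal's context rather than written by hand.

# Hypothetical sketch of a generated eval for the "agent ignores tool errors"
# signal. Trace/span attributes are illustrative.

def repeated_tool_error_metric(trace):
    """Fail a trace if the agent repeats a tool call after getting the same error."""
    seen_errors = set()
    for span in trace.tool_calls:
        if span.error:
            key = (span.tool_name, span.error)
            if key in seen_errors:
                return "fail"  # same tool, same error, behavior unchanged
            seen_errors.add(key)
    return "pass"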
Conclusion
Agentic systems are too complex for human supervision alone. You can write evals for what you know can fail, but the real risk is what you don't know—the subtle cascading failures, the edge case interactions, the security vulnerabilities that only emerge across hundreds of conversations.
Context engineering solves this by building systems that learn what to watch for, remember what they've seen, and get smarter with every analysis. It's not about replacing human judgment; it's about augmenting it with a system that can process scale and temporal patterns that humans simply can't track manually.
As agents become more complex and autonomous, quality assurance must become more autonomous as well. Signals represents our first step toward AI systems that can reliably supervise other AI systems at production scale. The techniques we've described—multi-stage compression, institutional memory, duplicate prevention, priority triage—form a foundation that will need to evolve as the systems we're monitoring become more sophisticated.
We'd love your feedback: Try Signals on your agent traces and let us know what patterns it surfaces. The more developers use it, the more we learn about what "unknown unknowns" actually look like in the wild, and the better we can make these systems at finding them.
Galileo Signals launches January 21, 2026.