Jan 21, 2026

Context Engineering at Scale: How We Built Galileo Signals

Bipin Shetty

AI/Software Engineer

What if your evaluation system got smarter after every failure?

We built something that shouldn't work: an AI system that maintains perfect memory of every issue it's ever detected across your entire agent infrastructure—then uses that knowledge to spot patterns you'd never think to look for.

Why shouldn't it work? Three fundamental constraints made this nearly impossible. First, LLM context windows are limited, and our test dataset was a 25MB file that crushed all of them. Second, simple memory solutions like RAG didn't capture the nuance of AI failure modes: you need to understand not just what happened, but how patterns evolve over time. Third, the cost of naive per-trace LLM analysis would balloon across an enterprise customer base if we weren't careful about how we processed data.

This is why most LLM features are stateless. You ask, they answer, they forget. Galileo Signals takes a different approach, maintaining condensed institutional knowledge across weeks and months. Each new analysis builds on every previous finding. It's like having a senior engineer who's reviewed every trace your system has ever produced and can instantly recognize when a new problem matches an old pattern.

The core challenge we set out to solve was this: How do you detect "unknown unknowns" in agentic systems at production scale without exploding costs? Traditional observability is reactive—you write evals for what you know can fail. But agents fail in ways too subtle for human-defined metrics. For example, an agent might leak data between customers with similar names across multi-turn conversations. No metric catches this because you didn't know to look for it. No search query finds it because you don't know what to search for. But Signals does, because it's designed to find problems you don't know exist.

Why This Is Actually Hard

When we started building this, the naive approach seemed tempting: just send your logs to GPT and ask it what went wrong. This fails immediately for several reasons that aren't obvious until you try it.

First, stateless analysis means you get the same observations repeated every run. The LLM tells you about a tool error on Monday, then tells you about the same tool error on Friday, with no understanding that it has seen this pattern before. Second, there's no pattern recognition across time windows: you can't detect that issues from Week 1 and Week 4 are manifestations of the same underlying problem. Third, running LLM inference per trace results in a cost explosion, making the approach economically unviable at scale. And finally, you simply can't handle the data volume: 25MB+ of trace data per run is too large to process in a single shot.

Our team quickly realized we were facing what became the trendiest problem of 2025: context engineering. We needed near-perfect compression of previous context while limiting the current batch to a representative sample over a variable time period. The challenge wasn't just making things smaller; it was preserving exactly the right information while discarding everything else.

The Architecture: Solving Compression at Three Levels

We solved this challenge with a multi-stage pipeline, where each stage addresses a specific constraint in the system.

Step 1.1 applies lossless programmatic compression: whitelisting relevant fields, deduplicating tool schemas, compressing repeated messages. This reduces the raw spans without sacrificing relevant information for pattern detection. 

Step 1.2 uses an advanced reasoning model to distill each session into structured notes that capture "everything noteworthy" in dramatically less space (~500KB total). These notes preserve what matters—which patterns occurred and in which spans—while discarding verbosity.
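
To make the note-taking step concrete, here is a rough sketch of what one per-session call could look like; the prompt, the JSON field names, and the llm helper are illustrative assumptions, not the production implementation:

def take_session_notes(session_spans, llm):
    # One call per session: distill the compressed spans into a compact structured note.
    prompt = f"""
    You are reviewing one agent session. Record everything noteworthy:
    errors, retries, policy violations, unusual tool usage, and the span IDs involved.

    Session spans:
    {session_spans}

    Return JSON with fields: session_id, noteworthy_events, suspect_span_ids, summary.
    """
    # `llm.analyze` stands in for whatever client wrapper the pipeline uses.
    return llm.analyze(prompt, model="claude-sonnet-4")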

Finally, Step 2 ingests all notes together (now small enough to fit in one context window) along with a historical summary of previously identified signals, and uses an LLM to perform cross-session pattern detection, generating up to 5 priority-ranked signal cards. This architecture maintains the critical "see everything at once" property needed for detecting systemic issues while working within the practical constraints of context windows and API costs.
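
For concreteness, a signal card might be modeled roughly like this; the field names below are assumptions for illustration, not Galileo's actual schema:

from dataclasses import dataclass, field

@dataclass
class SignalCard:
    signal_id: int
    title: str                       # e.g. "Cross-customer data leakage in booking lookups"
    priority: int                    # 1-10; see the triage sidebar below
    description: str                 # what the pattern is and why it matters
    example_trace_ids: list[str] = field(default_factory=list)   # evidence from the batch
    suggested_action: str = ""       # optional remediation hint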

The Full Pipeline

Raw Spans (25MB)
        ↓
[Step 1.1] Programmatic Compression
        ↓
Compressed Spans (~7MB)
        ↓
[Step 1.2] LLM Note-Taking
        ↓
Distilled Notes (~500KB)
        ↓
[Step 2] Signal Generation
        ↓
Signal Cards (5 max, priority-ranked)

The design choice here is deliberate: we use two-stage LLM processing only because context windows aren't infinite. Everything else is single-pass to maximize information flow between stages and ensure the system can draw connections across the entire dataset.
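
Putting the stages together, a batch run is roughly the following sketch. The helper names are placeholders: compress_spans and take_session_notes are the illustrative sketches elsewhere in this post, split_into_sessions is assumed, and generate_insights is the function shown a little later.

def run_signals_batch(raw_spans, historical_signals, llm):
    # Step 1.1: deterministic, lossless compression (no LLM involved).
    compressed = compress_spans(raw_spans)

    # Step 1.2: one note-taking call per session.
    notes = [take_session_notes(session, llm) for session in split_into_sessions(compressed)]

    # Step 2: a single cross-session pass over all notes plus institutional memory.
    return generate_insights(notes, historical_signals)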

But compression alone doesn't solve the hardest part of the problem. The real breakthrough came from how we handle memory and pattern recognition over time.

Context Engineering: The Secret Sauce

To understand why our approach works, it helps to see what doesn't work. Here's how traditional "chat with logs" features operate:

User Query → LLM (fresh logs) → Answer → Forget Everything

Every interaction is independent. The LLM has no memory of what it told you yesterday, what patterns it's seen before, or how current behavior compares to historical baselines. You're starting from scratch every time.

Here's how Signals works differently:

# Simplified for illustration: `llm` and `compress_historical_patterns` are internal helpers.
def generate_insights(new_traces, historical_signals):
    # Load condensed knowledge from all previous runs
    context = compress_historical_patterns(historical_signals)
    
    # LLM gets BOTH new data AND institutional memory
    prompt = f"""
    Historical patterns detected:
    {context}
    
    New traces to analyze:
    {new_traces}
    
    Task: 
    1. If new traces match existing patterns, add to that signal
    2. If genuinely novel pattern, create new signal
    3. Do NOT duplicate existing signals
    """
    
    return llm.analyze(prompt, model="claude-sonnet-4")

The key innovation is that the system maintains institutional memory. New traces automatically group with existing patterns. Signals evolve and compound over time. The LLM sees the full picture, not just today's data. This creates a fundamentally different capability—the system gets smarter with every run instead of starting fresh each time.

This approach required us to solve several subtle technical challenges that weren't obvious at the start.

Preventing Duplicate Detection

Having institutional memory creates a new problem: without careful prompt engineering, an LLM will generate "New signal: Tool call errors in checkout flow" on every run, even if it found this exact pattern 3 weeks ago. The LLM doesn't naturally understand that finding the same pattern again isn't news—it's confirmation.

Our solution has three parts. First, we pass ALL existing signal IDs and descriptions to the LLM in every request. Second, we give explicit instructions: "These patterns already exist - do not recreate them." Third, and most importantly, we ask the LLM to perform structured reasoning: "Is this trace a new manifestation of signal #47 or genuinely novel?"

This turns out to be non-trivial. The LLM must reason about similarity across different failure manifestations, since the same root cause can manifest differently across contexts. This requires balancing granularity (creating too many narrowly-defined signals) against aggregation (grouping too many distinct issues together).

Here's what the LLM's reasoning process looks like (from extended thinking):

New trace 045 shows agent calling get_weather with invalid format.
This matches the pattern in existing Signal #7.
The error message is identical to examples already in Signal #7.
Conclusion: NOT a new pattern. Add trace 045 to Signal #7.

versus:

New trace 051 shows agent exposing customer A's order to customer B.
Existing signals: tool errors (#7), hallucinations (#12), latency (#3).
None describe cross-customer data leakage.
Conclusion: NEW pattern. Create signal about privacy violation.

The system has learned to distinguish between "I've seen this before" and "this is genuinely new." That distinction is what makes the institutional memory useful rather than just noisy.
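
One way to make that distinction auditable is to request an explicit match decision per observation rather than free-form prose. The helper and JSON shape below are a sketch, not the production prompt:

def classify_against_existing(trace_note, existing_signals, llm):
    # Force an explicit, auditable match decision for each new observation.
    prompt = f"""
    Existing signals (do NOT recreate these):
    {existing_signals}

    New observation:
    {trace_note}

    Answer in JSON:
    {{"matches_existing": true or false, "matched_signal_id": <id or null>, "reasoning": "..."}}
    """
    return llm.analyze(prompt, model="claude-sonnet-4")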

The First Stage: Programmatic Compression

Before any LLM even sees the data, we apply lossless compression that reduces size without losing information relevant to pattern detection. This stage is deterministic and fast.

Field whitelisting drops project metadata, user IDs, and timestamps while keeping only behavior-relevant fields. Tool definition deduplication replaces repeated tool schemas with "<same as previous>". Message accumulation detection recognizes that in stateful LLM interactions, messages accumulate predictably, so we compress repeated context into references.
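
A minimal sketch of this deterministic pass, assuming a simple dict-per-span format and hypothetical field names (message-accumulation handling omitted for brevity):

RELEVANT_FIELDS = {"span_type", "input", "output", "tool_name", "tool_args", "error"}

def compress_spans(spans):
    # Deterministic pass: no LLM involved, nothing behavior-relevant is dropped.
    compressed, seen_schemas = [], set()
    for span in spans:
        # Field whitelisting: keep only behavior-relevant fields.
        slim = {k: v for k, v in span.items() if k in RELEVANT_FIELDS}
        # Tool definition deduplication: repeated schemas become a short marker.
        schema = span.get("tool_schema")
        if schema is not None:
            key = str(schema)
            slim["tool_schema"] = "<same as previous>" if key in seen_schemas else schema
            seen_schemas.add(key)
        compressed.append(slim)
    return compressed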

The result: 25MB → ~7MB with zero information loss for pattern detection purposes. This doesn't sound like much, but it's enough to make the next stages tractable.

Real-World Validation

Building a system that detects unknown problems creates an obvious challenge: how do you validate that it works? You can't create a labeled test set of "unknown unknowns" because by definition, you don't know what they are. We needed a different approach.

The NVIDIA Stress Test

NVIDIA provided us with 25MB of production traces containing known issues, but critically, they didn't tell us what those issues were. The challenge was simple: Can your system surface our problems without any hints?

This became our primary validation dataset for v2.0 development. We iterated on prompt engineering, compression strategies, and pattern recognition until Signals reliably identified their problematic patterns autonomously. The test proved the system could work at enterprise data scales and find real issues that engineers cared about.

Production Deployments

Beyond controlled testing, we needed to see how Signals performed in actual production environments with real users. A large agent platform customer runs weekly automated batches, generating 200-300 signals per week, which shows that Signals works as an operational practice, not just a one-off analysis tool.

Across our early access customers, we see hundreds of signals generated daily, spanning priority 1 (interesting but low-urgency patterns) to priority 10 (critical failures requiring immediate attention). The distribution tells us something important: most AI systems have a handful of critical issues and a long tail of optimization opportunities.

Patterns We've Actually Detected

The real validation comes from the kinds of patterns Signals surfaces in production. These are issues that wouldn't be caught by traditional metrics or manual inspection.

The Customer Data Leak (Priority 10): In a multi-turn airline agent conversation, the agent matched customers by name only rather than by unique ID. This caused the agent to retrieve Customer A's booking history while responding to Customer B. The pattern appeared across three sessions before Signals caught it. This is what we call a hybrid failure: simultaneously a security vulnerability and a hallucination, which makes it particularly insidious.

Tool Call Cascade (Priority 8): Each tool in a 5-step workflow had an acceptable ~5% failure rate when viewed in isolation. But across 100 workflows, 23 ended with incorrect output due to unrecovered tool failures. The pattern was invisible at the span level—each individual tool call looked fine—but obvious at the session level when you could see the cascade effect.
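
The arithmetic behind the cascade, assuming roughly independent failures per step:

per_step_failure = 0.05
steps = 5
p_at_least_one_failure = 1 - (1 - per_step_failure) ** steps
# ≈ 0.226, i.e. roughly 23 of 100 workflows hit at least one unrecovered failure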

Policy Drift (Priority 7): A customer service agent gradually stopped following refund policies over a two-week period. Week 1 showed strict adherence. Week 2 began approving gray areas. Week 3 approved an explicitly prohibited case. The metric failure only appeared in week 3, and as a one-off it could easily have been dismissed as an outlier. Instead, Signals identified the drift over time and suggested a fix to restore policy compliance.

These examples illustrate what makes unknown unknowns hard to catch: they're either too subtle to trigger simple thresholds, too distributed to see in local views, or too gradual to notice in real-time monitoring.

TECHNICAL SIDEBAR: The Priority Triage System

Signals are priority-ranked 1-10 using a decision tree that considers multiple factors:

Priority 8-10 ("error"): Failures in application output

  • Privacy/compliance risks → 10

  • Monetary/reputational risks → 9

  • User frustration → 8

Priority 4-7 ("warning"): Recoverable issues, inefficiencies

  • Tool errors with recovery

  • Unnecessary tool usage

  • Inconsistent outputs

Priority 1-3 ("info"): Notable patterns, correct edge case handling

The scoring considers severity of impact, confidence in assessment, difficulty to discover manually, and actionability of the signal. The system fundamentally answers: "If the user never saw this, how bad would that be?" The highest priority signals represent the biggest loss from not knowing.
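
A simplified sketch of that triage logic, following the bands in the sidebar; the field names and the mid-band score are illustrative, not the actual decision tree:

def triage_priority(signal: dict) -> int:
    # Priority 8-10 ("error"): failures visible in application output.
    if signal.get("privacy_or_compliance_risk"):
        return 10
    if signal.get("monetary_or_reputational_risk"):
        return 9
    if signal.get("user_facing_failure"):
        return 8
    # Priority 4-7 ("warning"): recoverable issues and inefficiencies.
    if signal.get("recoverable_issue") or signal.get("inefficiency"):
        return 5
    # Priority 1-3 ("info"): notable but benign patterns.
    return 2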

Closing the Loop: From Signals to Guardrails

Detecting unknown problems is valuable, but the real power comes from converting discoveries into ongoing monitoring. After identifying an important signal, users typically want to create an eval metric to track it over time. This used to require manually writing evaluation logic, which was tedious and error-prone.

We realized the same context used to identify the signal could be used to generate the eval automatically. This closes what we call the eval engineering loop: your observability system teaches itself what to watch for.

Here's how it works in practice:

  1. Signals detects an unknown pattern: "LLM ignores error messages in tool calls"

  2. User reviews the Signal and clicks "Create Metric" from the Signal detail page

  3. System generates a custom eval that checks for repeated tool calls with the same error

  4. Today's unknown unknown becomes tomorrow's known guardrail

The system goes from discovering a problem to preventing future occurrences of that problem, with minimal manual effort. This is particularly powerful because it means your evaluation suite evolves with your system—as new failure modes emerge, they are automatically monitored.
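
As a rough sketch of the generation step (the helper and prompt are illustrative assumptions; in the product this happens behind the "Create Metric" button):

def generate_eval_from_signal(signal, llm):
    # Reuse the signal's own context to draft a narrow, trace-level check.
    prompt = f"""
    A recurring failure pattern was detected in our agent:
    {signal['description']}

    Evidence from these traces:
    {signal['example_trace_ids']}

    Write an evaluation rubric that scores a single trace pass/fail for whether
    this pattern occurs (for example, repeated tool calls that return the same error).
    """
    return llm.analyze(prompt, model="claude-sonnet-4")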

Conclusion

Agentic systems are too complex for human supervision alone. You can write evals for what you know can fail, but the real risk is what you don't know—the subtle cascading failures, the edge case interactions, the security vulnerabilities that only emerge across hundreds of conversations.

Context engineering solves this by building systems that learn what to watch for, remember what they've seen, and get smarter with every analysis. It's not about replacing human judgment; it's about augmenting it with a system that can process scale and temporal patterns that humans simply can't track manually.

As agents become more complex and autonomous, quality assurance must become more autonomous as well. Signals represents our first step toward AI systems that can reliably supervise other AI systems at production scale. The techniques we've described (multi-stage compression, institutional memory, duplicate prevention, priority triage) form a foundation that will need to evolve as the systems we're monitoring become more sophisticated.

We'd love your feedback: Try Signals on your agent traces and let us know what patterns it surfaces. The more developers use it, the more we learn about what "unknown unknowns" actually look like in the wild, and the better we can make these systems at finding them.

Galileo Signals launches January 21, 2026.

If you found this helpful and interesting, I'd love to hear from you.

Bipin Shetty