Why AI Brittleness Is the Real Reliability Challenge Beyond Non-Determinism

Pratik Bhavsar

Evals & Leaderboards @ Galileo Labs

Your customer support production agent answered the refund policy question correctly 100 times in a row. Temperature was pinned to zero, seeds were fixed, and your eval suite showed 99% accuracy. Then someone typed "please" before the same question. The production agent invented a refund policy that never existed. The response was confident, fluent, and completely wrong. 

This gap between what you may think AI reliability means and what it actually requires defines a central production challenge of 2026. You may equate reliability with deterministic outputs. The real requirement is stable behavior under realistic input variation. Non-determinism is a largely controllable engineering problem. Brittleness is harder, and conflating the two costs you real money every quarter.

TLDR:

  • Non-determinism for identical inputs is controllable with temperature settings.

  • Brittleness across paraphrased inputs cannot be fixed with temperature.

  • You can fix non-determinism and still miss the bigger reliability risk.

  • Detecting brittleness requires paraphrase tests, edge case libraries, and dedicated metrics.

  • Production-ready means stable behavior across realistic input variation.

Distinguishing Non-Determinism From Brittleness In AI Systems

This definitional confusion is expensive in practice. When your production agent gives different answers to what seems like the same question, two different failure modes could be responsible. Non-determinism means the model produced different outputs from identical input. Brittleness means the model produced different outputs because the same question was phrased slightly differently. 

From the outside, both look like inconsistency. Under the hood, they have different root causes and different fixes. If you treat one as the other, you waste engineering effort and leave the harder problem unaddressed.

Recognizing Why Non-Determinism Is The Controllable Problem

Non-determinism occurs when the exact same input produces different output tokens across runs. The root cause is stochastic sampling. Temperature scaling, top-p nucleus sampling, and top-k truncation introduce randomness into token selection. Beyond sampling, floating-point non-associativity in batched GPU inference shifts logit values depending on batch composition. This can produce different argmax selections even at temperature=0.

The standard engineering controls are well-documented. Setting temperature to zero enables greedy decoding. Fixed random seeds and deterministic decoding modes reduce variance further. OpenAI's guidance describes outputs as "mostly deterministic" when you fix a seed and set temperature to 0. The system_fingerprint field helps you detect when the provider changes its backend configuration between runs. 
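As a minimal sketch of these controls, assuming the OpenAI Python SDK (the model name and the question are placeholders, not part of the original text), you pin temperature to zero, pass a fixed seed, and compare system_fingerprint across runs to catch backend changes:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(question: str):
    """Request a (mostly) deterministic completion and return the answer plus fingerprint."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0,  # greedy decoding: removes sampling randomness
        seed=42,        # fixed seed: best-effort reproducibility
    )
    return response.choices[0].message.content, response.system_fingerprint

answer_1, fp_1 = ask("What is our refund window for annual plans?")
answer_2, fp_2 = ask("What is our refund window for annual plans?")

# If the fingerprints differ, the provider changed its backend configuration,
# so output drift between runs may not be caused by your own stack.
if fp_1 != fp_2:
    print("Backend configuration changed between runs:", fp_1, "->", fp_2)
print("Outputs identical:", answer_1 == answer_2)
```

Even with this configuration, expect "mostly deterministic" rather than bit-identical outputs, for the floating-point reasons described above.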

Thinking Machines Lab demonstrated that batch-invariant kernels reduced 80 unique outputs across 1,000 runs down to one. You will usually accept some residual variance. The problem is mechanistically understood, has known engineering controls, and can be driven to negligible levels through a combination of sampling configuration, deterministic kernels, and careful version pinning. None of this addresses the harder failure mode discussed next.

Recognizing Why Brittleness Is The Harder AI Reliability Challenge

When someone adds "please" to an otherwise identical query and gets a completely different answer, that is brittleness. It is a fundamentally different failure mode from non-determinism. Small, semantically equivalent changes in input produce wildly different outputs even at temperature zero. Added punctuation, a paraphrased question, or a polite versus curt framing can flip the model's answer entirely. 

The research on meaning-preserving perturbations found answer-flip rates of 28.8% to 45.1% at greedy decoding across semantically equivalent math problems. Temperature controls cannot help here. The model treats semantically identical inputs as meaningfully different.

Formatting changes alone, including different delimiters, capitalization patterns, or whitespace, also produce significant performance gaps even with no semantic modifications. The sensitivity persists when you scale up model size or apply instruction tuning. That is why mitigation cannot rely on bigger models alone.

The compounding risk is what should worry you most. In multi-step workflows, even small per-step errors compound into irreversible failures over extended task horizons, which is one of the core reasons most AI agents fail in production. Brittleness is especially difficult to predict because it emerges from the model's internal representations rather than from any configurable parameter. That is why it requires different detection approaches than non-determinism.
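To make the compounding concrete, here is a back-of-the-envelope sketch. The per-step success rates are illustrative assumptions, not figures from the research cited above, and the model assumes steps fail independently.

```python
# Illustrative compounding: if each step succeeds independently with probability p,
# the chance an n-step workflow completes without any failure is p ** n.
for p in (0.99, 0.95):
    for n in (10, 50, 100):
        print(f"per-step {p:.0%}, {n:3d} steps -> end-to-end {p ** n:.1%}")

# per-step 99%,  10 steps -> end-to-end 90.4%
# per-step 99%,  50 steps -> end-to-end 60.5%
# per-step 99%, 100 steps -> end-to-end 36.6%
# per-step 95%,  10 steps -> end-to-end 59.9%
# per-step 95%,  50 steps -> end-to-end 7.7%
# per-step 95%, 100 steps -> end-to-end 0.6%
```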

Counting The Cost Of Conflating Non-Determinism With Brittleness

If you pin temperature and declare reliability fixed, you ship with a dangerous blind spot. Your pre-production evals typically replay identical inputs. They detect non-determinism but never probe realistic phrasing variation. The result is false confidence. Evals report high accuracy on clean test sets while production traffic exposes brittleness through messy, diverse phrasing.

The costs compound in specific ways. When you tune prompts, small wording changes can shift accuracy unpredictably. Regression testing on fixed inputs misses these shifts entirely. When incidents occur, brittleness often gets misdiagnosed as randomness. Tightening temperature settings changes nothing about the underlying sensitivity. 

A Microsoft practitioner study found that 76.6% of survey respondents report manual effort as a major factor in their LLM evaluation work. You may still lack proper eval mechanisms and the infrastructure to detect brittleness systematically.

The executive risk is quantifiable. Recent agent reliability research shows that autonomous agents achieving 60% pass@1 on benchmarks may exhibit only 25% consistency across multiple trials. That is a 35-percentage-point gap between what evals measure and what production demands. Studies on class-level code generation in real-world projects have documented that LLMs scoring 84% to 89% on synthetic benchmarks drop to 25% to 34% on real-world tasks, confirming that strong benchmark performance often does not translate to comparable production success. 

When Air Canada's chatbot fabricated a refund policy contradicting its own company's documented rules, the British Columbia Civil Resolution Tribunal ruled the airline liable. The incident, a textbook example of how language models hallucinate under realistic input variation, underscored the risks you face when relying on AI-generated customer-facing content.

Detecting Brittleness In Production AI Agents

Standard accuracy metrics cannot surface brittleness. Brittleness is a property of the input distribution, not of any single response. A production agent scoring 95% accuracy on a clean test set can still fail catastrophically when customers rephrase those same questions. Detection requires a layered strategy that you run continuously. 

Adversarial input variation tests probe known weaknesses. Paraphrase stability testing measures consistency across semantically equivalent inputs. Edge case regression libraries capture institutional memory from past failures.

Running Adversarial Input Variation Tests

Adversarial variation means programmatically perturbing known-good inputs and measuring output changes. A useful perturbation taxonomy covers three levels. Character-level perturbations include typos and capitalization changes. Word-level changes include synonym substitution and dropped context words. Sentence-level reframings cover polite versus curt register and added filler phrases.

Say you are running a financial production agent. It answers "What was Q3 revenue?" correctly with $4.2 billion. When someone asks "what was the revenue in Q3", it returns a different number. Research shows that best-case reward can be twice the worst-case reward for semantically equivalent prompts. 

The same intent expressed differently produces radically different quality. Run these tests as automated pre-deployment checks tied to a brittleness threshold. If the consistency rate across perturbation sets drops below 85%, the result warrants investigation before deployment.
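Here is a minimal sketch of such a check. It assumes a generic agent(prompt) callable that wraps your production agent and a simple normalized string comparison for answers; both are placeholders for your own interfaces, and the perturbation list mirrors the taxonomy above.

```python
import random
import string

def perturb(prompt: str) -> list[str]:
    """Generate character-, word-, and sentence-level variants of a known-good prompt."""
    i = random.randrange(len(prompt))
    return [
        prompt.lower(),                                                    # character level: capitalization
        prompt[:i] + random.choice(string.ascii_lowercase) + prompt[i:],   # character level: injected typo
        prompt.rstrip("?") + " ?",                                         # character level: punctuation/whitespace
        prompt.replace("What was", "What's"),                              # word level: contraction/synonym swap
        "please " + prompt[0].lower() + prompt[1:],                        # sentence level: polite register
        prompt + " Thanks in advance.",                                    # sentence level: filler phrase
    ]

def _normalize(answer: str) -> str:
    return " ".join(answer.lower().split())

def consistency_rate(agent, prompt: str, reference_answer: str) -> float:
    """Fraction of perturbed prompts whose answer still matches the reference answer."""
    answers = [agent(variant) for variant in perturb(prompt)]
    matches = sum(_normalize(a) == _normalize(reference_answer) for a in answers)
    return matches / len(answers)

# Pre-deployment gate, where agent() wraps your production agent:
# rate = consistency_rate(agent, "What was Q3 revenue?", "$4.2 billion")
# assert rate >= 0.85, f"consistency {rate:.0%} is below the 85% brittleness threshold"
```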

Leading AI teams use platforms like Galileo's Signals to detect failure patterns across production traces and surface unknown unknowns without requiring you to know what to search for first.

Implementing Paraphrase Testing In Evals

Paraphrase testing goes beyond simple re-runs. Generate 5 to 10 semantically equivalent paraphrases of each eval prompt. Then measure output stability across the set. Re-running the same input only tests non-determinism. Paraphrase testing reveals whether your production agent stays consistent when people express the same intent differently.

Research across multiple benchmarks found that 15% to 30% of questions elicit at least two different answers when paraphrase variants are tested. Twenty-eight of 34 models tested on MMLU performed worse under paraphrased inputs. 

Semantic equivalence judgments need their own evaluator because small phrasing changes can shift intent in subtle ways. LLM-generated paraphrases may also inflate robustness scores by standardizing language or exploiting model preference for model-generated text. 

Prefer rule-based or human-verified paraphrases where possible, especially for domain-specific terminology where wording carries technical meaning. Track paraphrase stability alongside traditional accuracy and flag degradation between deployment cycles. A drop in stability without a drop in headline accuracy is the early warning that brittleness is creeping into your system.
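A sketch of the stability computation is below, assuming human-verified paraphrase sets and the same agent(prompt) callable as above. The exact-match comparison is deliberately naive; it is the piece to swap for your own semantic-equivalence evaluator.

```python
from collections import Counter

def paraphrase_stability(agent, paraphrases: list[str]) -> float:
    """Share of paraphrases whose answer agrees with the majority answer for the set."""
    answers = [" ".join(agent(p).lower().split()) for p in paraphrases]
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)

# 5 to 10 human-verified paraphrases of a single eval prompt:
refund_paraphrases = [
    "What is the refund window for annual plans?",
    "How long do customers have to request a refund on an annual plan?",
    "please tell me the annual plan refund window",
    "Annual plan: how many days do I have to get a refund?",
    "What's the time limit for refunds on yearly subscriptions?",
]
# stability = paraphrase_stability(agent, refund_paraphrases)
# Flag any drop in stability between deployment cycles, even when accuracy holds steady.
```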

Maintaining Edge Case Libraries For Regression Testing

Every brittleness incident should become a permanent test case. Every prompt change should trigger a re-run of the full library. Turning production failures into regression tests creates a feedback loop between what happens in production and what your evals cover.

What belongs in the library includes minority phrasings representing how real people talk, multilingual variations, role-play and jailbreak attempts, ambiguous pronouns, inputs that triggered past failures, and inputs with incorrect premises embedded in normal-sounding questions. UK financial chatbots failed to catch a wrong ISA allowance figure when a question embedded a deliberate factual error. 

Collaborate with domain experts to identify edge cases and nuanced failure modes. Synthetic data can expand coverage for rare conditions. This is the AI era equivalent of a regression test suite. It takes effort to build, but it remains one of your most durable defenses against brittleness creep as prompts evolve and models get updated.
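One lightweight way to make the library executable is a plain JSON file of past failures that gets replayed on every prompt change. The file format, field names, and agent interface below are assumptions, not a prescribed schema; the ISA entry simply mirrors the embedded-error failure described above, with naive substring checks standing in for a real evaluator.

```python
import json

# edge_cases.json holds one entry per past incident, appended whenever a failure is triaged:
# [
#   {"id": "isa-allowance-2024",
#    "input": "My ISA allowance is £25,000, right? How should I split it across accounts?",
#    "must_contain": ["20,000"],
#    "must_not_contain": ["25,000 is correct"]}
# ]

def run_edge_case_library(agent, path: str = "edge_cases.json") -> list[str]:
    """Replay every recorded edge case and return the ids of any that regressed."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    regressions = []
    for case in cases:
        answer = agent(case["input"]).lower()
        passed = all(s.lower() in answer for s in case.get("must_contain", []))
        passed = passed and not any(s.lower() in answer for s in case.get("must_not_contain", []))
        if not passed:
            regressions.append(case["id"])
    return regressions

# Wire this into CI so every prompt change re-runs the full library:
# assert not run_edge_case_library(agent), "edge case regressions detected"
```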

Hardening AI Systems Against Input Variation

No single intervention eliminates brittleness. If you want stable production behavior, you need layered defenses across prompt design, retrieval strategy, model selection, and runtime guardrails. Each layer addresses a different variation surface. The combination provides resilience that no individual technique can match. Detection tells you where brittleness exists. Mitigation reduces its impact on your customers.

Designing Systems For Robustness Over Raw Accuracy

Your production agents should optimize for stability across input variation, not peak accuracy on a clean test set. Research confirms that higher accuracy or narrow accuracy ranges do not always guarantee better consistency. Model size alone is not a reliable indicator of robustness. Two agents with identical accuracy scores can have very different stability profiles when you stress them with realistic input variation.

Concrete tactics can reduce brittleness at the system level. Input normalization layers, such as Unicode normalization and whitespace canonicalization, eliminate surface-level variation before it reaches the model. Semantic and hybrid retrieval tend to be more robust than pure keyword search across differently phrased queries. 

Structured intermediate representations follow the principle of separating interpretation from computation, so errors in either stage are easier to localize. Ensemble strategies use disagreement to trigger human escalation rather than letting uncertain outputs reach customers silently. Prompt templates with explicit refusal paths reduce the chance that an unfamiliar phrasing leads to a confident hallucination instead of a graceful fallback.
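As a concrete example of the first tactic, here is a sketch of an input normalization layer built on Python's standard library. Which characters to fold or strip is a product decision, so treat these choices as assumptions rather than a recommended policy.

```python
import re
import unicodedata

def normalize_input(text: str) -> str:
    """Remove surface-level variation before the text reaches the model."""
    text = unicodedata.normalize("NFKC", text)  # fold Unicode variants (non-breaking and thin spaces, full-width chars)
    text = text.replace("\u200b", "")           # strip zero-width spaces
    text = re.sub(r"\s+", " ", text).strip()    # canonicalize whitespace
    return text

assert normalize_input("What  was\u00a0Q3   revenue\u2009?") == "What was Q3 revenue ?"
```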

Evolving Eval Frameworks To Surface Brittleness Specifically

Standard accuracy and groundedness metrics evaluate single responses against single references. They do not measure stability across input variants, which is the defining property of brittleness. Research on chain-of-thought prompting under perturbation shows mixed effects on accuracy. Accuracy metrics miss reasoning-level brittleness entirely.

Brittleness-aware metrics worth tracking include response stability across paraphrase sets, agreement rate across structured input perturbations at character, word, and sentence levels, and tool-selection consistency across semantically equivalent queries for autonomous agents. 
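The tool-selection metric is straightforward to compute once your agent exposes which tool it picked for a query. In the sketch below, select_tool is a placeholder for that hook, and the example queries are illustrative.

```python
from collections import Counter

def tool_selection_consistency(select_tool, equivalent_queries: list[str]) -> float:
    """How often semantically equivalent queries route to the same (majority) tool."""
    tools = [select_tool(q) for q in equivalent_queries]
    _, majority_count = Counter(tools).most_common(1)[0]
    return majority_count / len(tools)

# All of these should route to the same revenue-lookup tool:
queries = [
    "What was Q3 revenue?",
    "what was the revenue in Q3",
    "please pull up third-quarter revenue",
    "Q3 revenue figure?",
]
# score = tool_selection_consistency(agent.select_tool, queries)  # 1.0 means fully consistent
```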

These complement the core LLM performance metrics that already anchor your reliability stack. Purpose-built eval models like Luna-2, paired with Autotune, make it economically feasible to run stability checks at production scale with sub-second latency and at a fraction of the cost of GPT-style judges. 

Stability evals only deliver value if they run on every production trace, not on a 10% sample, and small models built for evaluation are what makes that practical. Without that scale, brittleness will remain something you discover through customer complaints rather than something you measure proactively in your own dashboards.

Setting The Bar For Production-Ready AI Reliability

Reliable enough for production cannot be a single threshold applied universally. It depends on failure cost, recoverability, and how brittleness shows up in your domain. A customer support production agent that occasionally varies its tone carries different risk than a financial production agent returning inconsistent revenue numbers.

A working definition you can apply is this. A production agent is ready when its measured failure rate under realistic input variation stays within the cost envelope you can absorb. Failures must be detectable and recoverable before they compound. 

The NIST AI RMF requires that performance criteria be measured under conditions similar to deployment settings, not just clean test inputs. Generalizability limitations must be documented explicitly. NIST does not prescribe threshold values. You have to determine them based on context and use case. But the mandate is clear. Measure under realistic stress, document the gap between benchmark and production, and establish minimum thresholds before deployment.
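What "establish minimum thresholds before deployment" can look like in code is sketched below. The metric names and threshold values are illustrative assumptions you would derive from your own failure-cost analysis, not values prescribed by NIST.

```python
# Illustrative release gate: thresholds come from your own cost-of-failure analysis.
THRESHOLDS = {
    "clean_accuracy": 0.95,            # traditional eval on the clean test set
    "paraphrase_stability": 0.90,      # consistency across human-verified paraphrase sets
    "perturbation_consistency": 0.85,  # agreement across character/word/sentence perturbations
}

def release_gate(metrics: dict[str, float]) -> list[str]:
    """Return the names of any metrics below their minimum threshold."""
    return [name for name, minimum in THRESHOLDS.items()
            if metrics.get(name, 0.0) < minimum]

failures = release_gate({
    "clean_accuracy": 0.96,
    "paraphrase_stability": 0.81,      # high accuracy, low stability: brittleness
    "perturbation_consistency": 0.88,
})
print("Blocked on:", failures)  # Blocked on: ['paraphrase_stability']
```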

Some analysts note that multi-agentic workflows can create compounded risk. You may still lack true trust in autonomous agents' ability to operate without human oversight, which is exactly the gap that strong LLM observability practices are meant to close. 

The implication is direct. Reliability metrics that only measure clean-input accuracy systematically overstate readiness. If you ship on those metrics, customers discover the gap before your dashboards do. The real production bar is stable behavior under inputs you have not seen yet. You validate that bar through adversarial testing, paraphrase eval, and continuous regression, not deterministic replays of clean test cases.

Building An AI Reliability Strategy Around Brittleness

If you want reliable AI behavior in production, deterministic replays are only the starting point. Non-determinism is a sampling problem with well-understood controls such as temperature settings, fixed seeds, and deterministic decoding modes. Brittleness is a behavioral stability problem that requires systematic eval engineering, paraphrase testing, and continuous regression libraries. When you conflate the two, you overestimate readiness and absorb the cost later through incident response, customer churn, and legal exposure.

A stronger reliability strategy treats brittleness as a first-class failure mode. That means detecting it early, measuring it continuously, and putting guardrails around the failure patterns that matter most in production. For teams that need agent observability and guardrails tied directly to this workflow, Galileo is the natural next step.

  • Signals: Automatically surfaces failure patterns across production traces so you can detect brittleness before it turns into a repeated incident.

  • Luna-2: Runs purpose-built evals at production scale so you can measure stability across paraphrases and perturbations without GPT-style costs.

  • Autotune: Improves LLM-powered metrics with feedback so your brittleness checks get more accurate over time.

  • Runtime Protection: Turns eval insights into runtime guardrails so brittle outputs can be blocked or redirected before they reach customers.

  • Agent observability: Gives you visibility, evaluation, and control across production behavior so debugging does not stop at clean-input accuracy.

Book a demo to see how your team can detect, measure, and reduce brittleness across production AI agents.

FAQs

What's The Difference Between Non-Determinism And Brittleness In LLMs?

Non-determinism means the same exact input produces different outputs across runs. It is caused by stochastic sampling and GPU-level floating-point variance. Brittleness means semantically equivalent but differently phrased inputs produce different outputs, even at temperature=0. Non-determinism can be reduced with temperature controls and fixed seeds, but not fully eliminated. Brittleness persists under deterministic decoding because the model is sensitive to input variation, not just output randomness.

Does Setting Temperature To 0 Make An AI Agent Reliable?

No. Temperature controls only address output variance for identical inputs. Research documents answer-flip rates of 28.8% to 45.1% at temperature=0 when semantically equivalent input variants are tested. True agent reliability requires stable behavior across the range of phrasings people actually produce. Temperature settings do not address that. You still need adversarial testing, paraphrase evals, and runtime guardrails.

How Do You Test An AI Agent For Brittleness?

Use a three-layer detection approach. First, run adversarial input variation tests that perturb known-good inputs with typos, capitalization changes, and polite or curt reframings. Second, implement paraphrase stability testing with 5 to 10 semantically equivalent versions of each eval prompt. Third, maintain an edge case regression library that converts every production failure into a permanent test case. Brittleness testing needs its own metrics, such as consistency rate and perturbation agreement rate, not just standard accuracy.

What Metrics Indicate AI Brittleness In Production?

Track response stability across paraphrase sets, tool-selection consistency for autonomous agents, and agreement rates across character, word, and sentence-level perturbations. Research shows that higher accuracy does not always guarantee better consistency. These metrics need to be tracked independently from standard groundedness and accuracy scores. Clusters of semantically related production failures are another useful indicator.

Do I Need Different Evals For Brittleness And Non-Determinism?

Yes. Replaying the same prompt tests output variance, which helps you measure non-determinism. Testing paraphrases, perturbations, and edge cases helps you measure brittleness under realistic input variation. If you only run deterministic replays, you miss the failure mode that customers are more likely to trigger in production.

Pratik Bhavsar