Sep 8, 2025

Understanding Why Language Models Hallucinate

Pratik Bhavsar

Galileo Labs


A recent paper from OpenAI, "Why Language Models Hallucinate," claims to prove mathematically why language models hallucinate. Their argument is counterintuitive: even with perfect training data and infinite compute, models will confidently state falsehoods. The culprit isn't bad engineering—it's information theory.

The paper presents an elegant framework showing that hallucinations arise from how we train and evaluate models. But here's the thing: while the math is solid, the practical implications are more nuanced than the authors suggest. Modern techniques like retrieval-augmented generation and chain-of-thought prompting are already working around these theoretical limits.

Let's dig into what the paper actually proves, where the framework breaks down, and what it means for building AI systems.

The Core Insight: Generation Is Harder Than Verification

The paper's central contribution is proving that generating correct text is fundamentally harder than verifying whether text is correct. This seems obvious, but the mathematical formulation reveals surprising implications.

Consider asking a model about someone's birthday:

The Verification Task:

"Is Adam Kalai's birthday March 7th?"

The model just needs to output yes or no. Even random guessing gets 50% accuracy.

The Generation Task:

"What is Adam Kalai's birthday?"

The model must select from 365 possible dates. Random guessing gets about 0.27% accuracy, which makes the task roughly 180 times harder than verification.

The paper formalizes this into the Generation-Classification Inequality: generation error rate is always at least twice the classification error rate, minus a calibration term. When classification is imperfect (which it always is for unseen facts), generation must also be imperfect.

This inequality holds even with perfect training data. The model can memorize every fact it sees, but it still has to assign probabilities to facts it hasn't seen. And some of those assignments will be wrong.
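
To make the guessing math and the inequality concrete, here is a minimal numerical sketch in Python. The function name and example error rates are illustrative assumptions, and the inequality is written in the informal form stated above rather than the paper's exact theorem.

# Baseline accuracy of random guessing for the two tasks above.
verification_baseline = 1 / 2      # yes/no question
generation_baseline = 1 / 365      # pick one date out of 365
print(verification_baseline / generation_baseline)   # ~182x gap

# Informal form of the Generation-Classification Inequality described above:
#   generation_error >= 2 * classification_error - calibration_term
def generation_error_floor(classification_error: float, calibration_term: float) -> float:
    # Lower bound on the generation error rate implied by the inequality.
    return max(0.0, 2 * classification_error - calibration_term)

print(generation_error_floor(classification_error=0.10, calibration_term=0.02))   # 0.18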

Why Perfect Training Data Doesn't Solve Hallucinations

Here's where the paper gets interesting. Even if every single fact in your training data is true, models will still hallucinate about things not in that data.

Consider this training scenario:

  • Training data contains birthdays for 1,000 scientists

  • All dates are 100% accurate

  • Model achieves perfect accuracy on training data

Now ask: "When was obscure researcher Jane Smith born?"

The model has never seen Jane Smith's birthday. But it learned that scientists have birthdays distributed across the year. So it assigns probabilities to different dates based on patterns from the scientists it has seen. Whatever date it generates is essentially an educated guess based on statistical patterns—not a retrieved fact.

The paper proves this creates a hallucination floor: for arbitrary facts, the hallucination rate is at least the fraction of facts that appear exactly once in the training data (the singleton rate). Empirical measurements show that 20-30% of biographical facts are singletons in typical training corpora, predicting a minimum 20-30% hallucination rate for rare facts.

This matches what we observe in practice. Models are highly accurate on well-represented facts but struggle with tail knowledge.
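
To see how a singleton rate translates into a floor, here is a minimal counting sketch. The toy corpus and the simple counting scheme are illustrative assumptions, not the paper's estimator.

from collections import Counter

def singleton_rate(fact_mentions: list[str]) -> float:
    # Fraction of distinct facts that appear exactly once in the corpus.
    counts = Counter(fact_mentions)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus: each string stands for one mention of a biographical fact.
corpus = ["einstein_birthday"] * 50 + ["curie_birthday"] * 12 + ["smith_birthday", "jones_birthday"]
print(f"singleton rate ~ {singleton_rate(corpus):.0%}")   # predicted hallucination floor for these facts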

The Evaluation Problem Nobody Talks About

The paper's most actionable insight has nothing to do with theoretical limits. It's about how we score models.

Look at how major benchmarks handle uncertainty:

  • MMLU: Multiple choice, no credit for abstaining

  • GPQA: Multiple choice, no credit for abstaining

  • SWE-bench: Binary pass/fail on unit tests

Every major benchmark uses binary scoring that punishes abstention. This creates a perverse incentive.

Consider a model that's 60% confident in an answer:

Under Current Benchmarks:

  • Guess: 0.6 × 1 point + 0.4 × 0 points = 0.6 expected score

  • Say "I don't know": 0 points

  • Optimal strategy: Always guess, even when uncertain

With Confidence-Aware Scoring:

  • Guess: 0.6 × 1 point + 0.4 × (-3 points) = -0.6 expected score

  • Say "I don't know": 0 points

  • Optimal strategy: Only answer when confident

The paper shows that simply adding confidence thresholds to evaluation prompts—"Only answer if you're more than 75% confident"—could fundamentally change model behavior. Yet no major benchmark does this.
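
The incentive arithmetic above generalizes neatly: with a reward of 1 for a correct answer, 0 for abstaining, and a penalty for a wrong answer, guessing beats abstaining only when confidence exceeds penalty / (1 + penalty), which is exactly 0.75 for a penalty of 3. Here is a minimal sketch; the scoring values mirror the illustrative example above, not any specific benchmark's rules.

def expected_score(confidence: float, wrong_penalty: float) -> float:
    # Expected score of answering, given the chance of being right and the penalty for being wrong.
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

def break_even_confidence(wrong_penalty: float) -> float:
    # Confidence at which answering and abstaining (score 0) have equal expected value.
    return wrong_penalty / (1.0 + wrong_penalty)

print(expected_score(0.6, wrong_penalty=0.0))   #  0.6 -> binary scoring rewards guessing
print(expected_score(0.6, wrong_penalty=3.0))   # -0.6 -> penalized scoring punishes the same guess
print(break_even_confidence(3.0))               #  0.75 -> only answer above 75% confidence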

Where the Framework Falls Apart

While the theoretical insights are valuable, the paper's assumptions don't match how modern systems actually work.

Chain-of-Thought Breaks the Bounds

The paper assumes models map inputs directly to outputs. But prompting techniques transform the computation:

Take this letter-counting example:

Direct Approach: "How many D's are in DEEPSEEK?" → "3" (wrong, ~70% error rate)

Chain-of-Thought Approach: "Let me spell it out: D-E-E-P-S-E-E-K. Now counting D's: I see one D at the beginning. That's 1 D total." (correct, ~5% error rate)

This isn't just improvement—it's changing the problem from parallel pattern matching to sequential analysis. The theoretical bounds assume the former, but systems use the latter.
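
For illustration, the two prompting styles differ only in what they ask the model to write down before answering. A rough sketch, where call_model is a hypothetical placeholder for whatever LLM client you use:

def direct_prompt(word: str, letter: str) -> str:
    return f"How many {letter}'s are in {word}? Answer with a single number."

def chain_of_thought_prompt(word: str, letter: str) -> str:
    return (
        f"Spell out {word} one letter at a time, then count how many times "
        f"the letter {letter} appears, showing each step before the final number."
    )

# answer = call_model(chain_of_thought_prompt("DEEPSEEK", "D"))   # hypothetical client call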

Retrieval Changes Everything

The framework analyzes models in isolation, but production systems use retrieval-augmented generation (RAG).

Without RAG:

  • Model must have memorized the fact

  • Accuracy for rare facts: ~3%

With RAG:

  • Model searches documents for the fact

  • Accuracy for rare facts: ~85%

The theoretical framework does not capture this 28x improvement. RAG systems operate in a different problem space where memorization limits don't apply.
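
For context, a RAG pipeline changes the task from recall to reading comprehension. A minimal sketch, where the retriever and LLM client are passed in as placeholders rather than any specific library's API:

from typing import Callable

def answer_with_rag(
    question: str,
    search_documents: Callable[[str, int], list[str]],   # your retriever (hypothetical signature)
    call_model: Callable[[str], str],                     # your LLM client (hypothetical signature)
    top_k: int = 3,
) -> str:
    # Retrieve evidence first, then generate an answer grounded in it.
    passages = search_documents(question, top_k)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say 'I don't know'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)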

Creative Tasks Have No Ground Truth

The paper's framework requires a clear set of "valid" outputs. But consider:

"Write a poem about butterflies"

There's no hallucination here—there's creation. The model isn't retrieving facts; it's generating novel content. The framework's bounds are meaningless when there's no objective truth to violate.

What Actually Reduces Hallucinations

Despite theoretical inevitability, practical techniques dramatically reduce hallucinations:

1. Frequency matters more than theory suggests

Models almost never hallucinate facts that appear frequently in training:

  • "Capital of France" (appears >10,000 times): ~100% accuracy

  • "CEO of obscure startup" (appears once): ~20% accuracy

The theoretical bounds apply strongly only to rare facts.

2. Calibration can be improved

The paper treats calibration error as a minor constant. But recent models have reduced calibration error from 30% to 5% through better training. This dramatically tightens the theoretical bounds.
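
Calibration error in this sense can be measured with something like expected calibration error (ECE). The paper treats the calibration term more abstractly, so treat this as one common, simplified way to quantify it:

def expected_calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    # Binned ECE: average |accuracy - confidence| per bin, weighted by bin size.
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [True, False, True, True]))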

3. Problem Reformulation Works

Instead of asking "What year was X born?" (hard generation), ask a series of easier questions:

  • "Was X born before 1950?" (easy classification)

  • "Was X born in the 1960s?" (narrowing down)

  • "Was X born in 1963 or 1964?" (final precision)

Each step has much lower error rates than direct generation.
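
One way to implement this reformulation is a binary search over years driven by yes/no questions. A minimal sketch, where ask_yes_no is a hypothetical oracle (for example, a verification prompt to the model):

from typing import Callable

def narrow_birth_year(
    ask_yes_no: Callable[[str], bool],   # hypothetical yes/no oracle
    person: str,
    lo: int = 1900,
    hi: int = 2000,
) -> int:
    # Replace one hard generation with a series of easy classifications (binary search on the year).
    while lo < hi:
        mid = (lo + hi) // 2
        if ask_yes_no(f"Was {person} born after {mid}?"):
            lo = mid + 1
        else:
            hi = mid
    return lo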

Practical Implications for AI Development

The paper's insights suggest three actionable changes:

For Evaluation Design

Stop using pure binary scoring. Implement confidence-aware metrics that:

  • Reward appropriate uncertainty

  • Penalize confident errors more than uncertain ones

  • Give partial credit for partially correct answers

For Model Training

Focus on calibration, not just accuracy. A model that's right 70% of the time but knows when it's uncertain is more useful than one that's right 75% of the time but always confident.

For System Architecture

Design around the limits; a minimal routing sketch follows this list:

  • Use RAG for factual queries

  • Use chain-of-thought for reasoning

  • Use confidence thresholds for high-stakes decisions

  • Build in explicit "I don't know" paths
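
Here is that routing sketch, tying confidence thresholds and abstention together; the threshold values and stake labels are illustrative assumptions:

def route_answer(answer: str, confidence: float, stakes: str) -> str:
    # Route a model answer by confidence: answer, escalate, or take the explicit "I don't know" path.
    thresholds = {"high": 0.95, "medium": 0.75, "low": 0.5}   # illustrative values
    if confidence >= thresholds[stakes]:
        return answer
    if stakes == "high":
        return "Escalating to human review."   # high-stakes decisions need oversight
    return "I don't know."                     # explicit abstention path

print(route_answer("Paris", confidence=0.99, stakes="low"))
print(route_answer("1963", confidence=0.60, stakes="high"))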

When Hallucinations Actually Matter

Not all hallucinations are equal. The impact depends entirely on context:

  • High stakes (medical diagnosis, legal advice, financial decisions): require high confidence thresholds and human oversight

  • Medium stakes (educational content, general Q&A): include uncertainty indicators and cite sources

  • Low stakes (creative writing, brainstorming, entertainment): hallucinations might actually be features, not bugs

The Reality Check

The paper proves something important but narrower than claimed. Yes, hallucinations are theoretically inevitable for certain types of queries under certain constraints. But those constraints aren't fixed laws of nature—they're engineering challenges.

The key takeaways are:

  • Our benchmarks actively encourage hallucination through poor incentive design

  • Different problem types have different fundamental limits

  • Many "impossible" problems become tractable through clever reformulation

The history of computing is littered with proofs of theoretical impossibility that became engineering footnotes. Under the paper's assumptions, hallucinations may be mathematically inevitable, but we're already building systems that sidestep those assumptions entirely.

Focus on what's actionable: fix evaluation metrics, improve calibration, and design systems that know what they don't know. The math is interesting, but the engineering is what matters.

Try Galileo to find ways to break these theoretical limits, and read our in-depth eBook on how to:

  • Choose the right agentic framework for your use case

  • Evaluate and improve AI agent performance

  • Identify failure points and production issues

If you find this helpful and interesting,

Pratik Bhavsar