Aug 1, 2025

Why AI Still Scores Low on Humanity's Last Exam

Conor Bronsdon

Head of Developer Awareness

Think today's AI models are as smart as humans? Think again. On Humanity's Last Exam, the best AI systems barely score 30% while human grad students hit nearly 90%. This isn't just academic trivia—it directly impacts how much you can trust AI in critical situations, shapes regulatory discussions, and exposes where AI still falls short of real reasoning.

In this article, you'll discover the complete landscape of Humanity's Last Exam—from its innovative design and methodology to the surprising performance gaps between AI and humans.

We'll walk you through the benchmark's creation process, explain its unique value compared to older evaluations, and provide actionable insights about what these results mean for AI development and deployment. 

By the end, you'll understand exactly why this benchmark matters and how to use these findings to make smarter decisions about AI implementation.

We recently explored this topic on our Chain of Thought podcast, where industry experts shared practical insights and real-world implementation strategies.

What Is the "Humanity's Last Exam" LLM Benchmark?

Humanity's Last Exam (HLE) is a rigorous AI benchmark consisting of 2,500–3,000 questions across more than 100 academic disciplines. Developed by the Center for AI Safety in collaboration with hundreds of subject-matter experts, HLE features graduate-level problems designed to evaluate genuine reasoning capabilities rather than simple pattern recognition or factual recall.

Each question has a clear-cut answer—multiple-choice or exact-match short answer—the model either gets it right or wrong. Since answers aren't available online, AI can't just search its way to success; it must genuinely reason.

This makes HLE a true "stress test" that stays challenging even as AI improves.
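
If you want to poke at the public portion of the dataset yourself, the questions are distributed through Hugging Face. Here is a minimal sketch, assuming the dataset identifier is `cais/hle` with a `test` split and that the field names (`question`, `answer_type`, `category`) match the dataset card; the set is gated, so you may need to request access and authenticate before downloading:

```python
# Minimal sketch: inspect the public HLE questions with Hugging Face `datasets`.
# Assumptions: the dataset lives at "cais/hle" with a "test" split and exposes
# "question", "answer_type", and "category" columns -- check the dataset card
# and adjust these names if the schema differs.
from collections import Counter

from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")

print(f"{len(hle)} questions")
print(Counter(hle["answer_type"]))  # multiple-choice vs. exact-match items
print(Counter(hle["category"]))     # rough subject distribution

sample = hle[0]
print(sample["question"][:500])     # prompt text; some items also carry an image
```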

The creators call it the "final closed-ended academic benchmark," backed by real results: the best AI models score below 30%, while human experts reach nearly 90% on the same questions, according to the benchmark's research findings.

Money helped ensure quality. A global prize pool rewarded people who submitted the toughest questions and paid bounties for finding ambiguities. The result is an evolving dataset that resists memorization, rewards true reasoning, and remains relevant for years. 

For you, it provides a clear signal of whether an AI truly thinks—or just remembers.

Why Compare AI to Humans? Benchmark Motivation and Goals

Early AI benchmarks like the Turing Test asked if machines could pass for human. That worked fine until AI scores on standard tests shot past 90%, hiding real weaknesses. HLE exists because you need a measuring stick humans haven't already outgrown. By comparing AI to expert performance, the test reveals the gap between fluent-sounding outputs and actual understanding.

This human comparison serves three key purposes. It measures real reasoning, not pattern matching, showing whether an AI can handle problems that challenge actual scholars. Poor scores push researchers toward better approaches like tool use, multi-agent systems, and improved self-checking. 

The benchmark also gives you clear data when deciding how much to trust these systems in critical settings.

HLE doesn't replace real-world testing. You still have to track hallucinations, bias, and safety problems in live systems, while HLE gives you a standardized test to run before deployment. 

Using both approaches provides you with the complete picture: how your model compares to human experts and how it behaves in real situations.

Inside the HLE Dataset: Subjects, Question Types and Difficulty Design

Look inside Humanity's Last Exam and you'll find nearly three thousand questions from over one hundred academic fields—far beyond what older benchmarks like MMLU covered. An international network of experts crafted every question, ensuring graduate-level difficulty while keeping answers objectively gradable. This breadth challenges even seasoned professionals.

Rather than open-ended essays, HLE uses focused formats. You'll encounter multiple-choice questions with carefully designed wrong answers and short-answer problems requiring exact responses. This structure allows automatic scoring of thousands of answers in seconds, eliminating the subjectivity that often affects human grading.

Quality control begins far upstream. The process, run with Scale AI, offered a $500,000 prize pool for outstanding submissions and included a bug-bounty program for spotting errors, which has since ended. 

Each potential question first faces testing against top AI models. Questions answered correctly get cut. Surviving questions then undergo expert review for clarity, fairness, and relevance.

This two-stage filter maintains the benchmark's difficulty, and keeping a portion of the questions private adds a further layer of protection. This prevents gaming the system and ensures future improvements reflect real progress, not memorization. When you test your own model on HLE, expect a blind-test experience similar to taking an actual final exam.

Multi-Modal & Adversarial Elements

Text doesn't capture everything. Some HLE problems include diagrams, charts, or images, requiring you to connect visual information with textual reasoning. This multi-modal aspect tests skills that text-only benchmarks miss but real applications demand.

Beyond multiple formats, question writers deliberately set traps. Throughout the exam, you'll encounter misleading patterns, tempting numerical coincidences, and plausible but wrong answer choices. Since every question is new and unavailable online, shortcuts like memorization or web searches fail, as shown in comprehensive HLE analyses.

Your model must truly understand concepts, follow complex logic, and work through calculations to succeed.

The result is a benchmark that punishes shallow pattern-matching and rewards deep understanding. If you're trying to close the gap between AI and expert human reasoning, HLE offers the toughest—and most informative—testing ground available.

Scoring and Evaluation Methodology

When you submit to Humanity's Last Exam, your model faces a straightforward evaluation: multiple-choice answers must match exactly, and short answers must be logically or mathematically equivalent to the reference. There's no partial credit, which keeps scores objective and comparable across different systems.

Automatic grading powers the entire process. Since every question has a clear answer, simple scripts can evaluate thousands of responses in seconds, eliminating the subtle biases found in human-rated tests.
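
To make that concrete, here is a simplified sketch of what such a grading script can look like. It is an illustration, not the official HLE grader: real pipelines also normalize formats and verify mathematical or logical equivalence for short answers, which plain string comparison can't capture.

```python
# Minimal sketch of automatic grading: exact match after light normalization.
# Not the official HLE grader -- equivalence checks for short answers are
# deliberately glossed over here.
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences don't count as wrong answers."""
    return re.sub(r"\s+", " ", text.strip().lower())


def grade(predictions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return accuracy over all question ids in the answer key.
    Unanswered questions count as incorrect -- there is no partial credit."""
    correct = sum(
        1
        for qid, gold in answer_key.items()
        if normalize(predictions.get(qid, "")) == normalize(gold)
    )
    return correct / len(answer_key)


# Example usage with made-up ids and answers:
key = {"q1": "B", "q2": "2.718"}
preds = {"q1": "b", "q2": "e"}  # "e" is equivalent but fails exact match
print(grade(preds, key))        # 0.5
```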

Early testing showed a dramatic performance gap: domain experts achieved high accuracy, while top AI models scored mostly below 10%. Your system will face questions specifically designed to resist internet lookups, and the zero-shot protocol allows no fine-tuning tricks or hints. 

You provide your prompt template, receive unseen test items, and that's it.
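
In practice, "your prompt template" is usually just a fixed system message that pins down the answer format, with the question dropped into the user turn. A hedged sketch of what that might look like (the wording approximates the published protocol, which asks for an explanation, an exact answer, and a confidence score; the official harness text may differ):

```python
# Sketch of a zero-shot prompt template in the spirit of the HLE protocol:
# no few-shot examples, no hints, just the question and a fixed answer format.
# The exact wording below is an approximation, not the official harness text.
SYSTEM_PROMPT = (
    "Your response should be in the following format:\n"
    "Explanation: {your explanation for your final answer}\n"
    "Exact Answer: {your succinct, final answer}\n"
    "Confidence: {your confidence score between 0% and 100% for your answer}"
)


def build_messages(question: str) -> list[dict]:
    """Assemble a chat-completions style message list for one question."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


messages = build_messages("Which of the following ... (A) ... (B) ... (C) ...")
# Pass `messages` to whatever chat API or local model you are evaluating.
```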

Results go to a private leaderboard with periodic public updates via GitHub. The maintainers deliberately hide portions of the dataset and release rankings in batches to prevent the gradual overfitting that corrupts many benchmarks. 

When scores improve, you can trust they reflect genuine advances rather than clever memorization.

Results to Date: AI Models vs. Human Baseline

The first leaderboard reveals a clear performance gap. Expert humans average nearly 90% accuracy, while frontier models like GPT-4 and Claude initially scored in the single digits on identical questions, and even the best current systems remain below 30%. 

This gap persists despite trying different prompts and settings, pointing to fundamental limitations that clever prompting can't fix.

Scores vary significantly by question type. On text-only questions, top models reach the upper 20s in accuracy, but drop several points when diagrams or data tables are involved. This confirms multi-modal reasoning still trails behind text processing. 
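
If you log per-question results when evaluating your own model, the same breakdown takes a few lines to reproduce. A small sketch with illustrative field names (`modality`, `correct`) that you would swap for whatever your harness actually records:

```python
# Sketch: break accuracy down by question modality. Field names here
# ("modality", "correct") are illustrative placeholders.
from collections import defaultdict

results = [
    {"modality": "text", "correct": True},
    {"modality": "text", "correct": False},
    {"modality": "image", "correct": False},
    # ... one record per graded question
]

totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [correct, seen]
for r in results:
    totals[r["modality"]][0] += int(r["correct"])
    totals[r["modality"]][1] += 1

for modality, (correct, seen) in totals.items():
    print(f"{modality}: {correct / seen:.1%} ({seen} questions)")
```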

Confidence issues make things worse—you'll find models often express high certainty about wrong answers, turning small factual errors into potentially harmful misinformation.

Specialized domains widen the gap further. In areas like advanced chemical kinetics, medieval philology, and graduate-level mathematics, the best AI models barely outperform random guessing, while domain experts comfortably score in the 80s and 90s. 

These same AI systems now reach 85–89% on the MMLU benchmark, showing why HLE was created—older tests no longer challenge cutting-edge AI.

The rankings keep changing. Since the research paper came out, developers have released updates that increase scores by a point or two, but no system has reached even half the human baseline. Progress continues, but the fundamental gap remains.

What HLE Scores Reveal About Modern AI Systems

These numbers tell a story you can't ignore. AI models excel at breadth—recalling facts across hundreds of subjects and creating plausible answers instantly. This generalist ability explains why they perform well on familiar, pattern-heavy tests. HLE shows how shallow that knowledge often is.

While leading models beat humans on older benchmarks, they generally score under 30% on this new test.

Complex reasoning chains remain the key weakness. When you give AI multi-step proofs, hidden assumptions, and tricky wording, it stumbles because it relies on statistical patterns rather than strict logic.

The adversarial design also highlights calibration problems—you might hear confident answers that are completely wrong, creating risks that could become serious mistakes in real-world applications.
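
Calibration is measurable, not just anecdotal: the benchmark's authors report a calibration-error metric alongside accuracy for exactly this reason. If your evaluation records each answer's self-reported confidence and whether it was correct, a rough expected-calibration-error computation looks like the sketch below (a simplified illustration, not the official metric).

```python
# Sketch: expected calibration error (ECE) from self-reported confidences.
# `confidences` are the model's stated probabilities of being right (0-1),
# `correct` marks whether each answer actually was. A well-calibrated model
# that says "90% confident" should be right about 90% of the time.
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between average confidence and actual accuracy in this bin,
            # weighted by the fraction of samples that fall in the bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# Toy example: a model that is confidently wrong shows a large gap.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.85], [0, 1, 0, 0]))
```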

Multi-modal questions add another challenge. To interpret chemical diagrams or geological cross-sections, you need both visual understanding and textual analysis—skills humans develop through years of specialized training. Current AI-vision combinations still struggle with this integration, proving that "seeing" differs from understanding.

As a practitioner, the message is clear. These results warn you against assuming benchmark success equals real-world readiness. You'll need additional safeguards—robust monitoring, quick feedback loops, and domain-specific validation—before trusting AI with decisions that require expert judgment.

The benchmark doesn't just track progress; it pinpoints exactly where you need improvements for responsible AI deployment.

Addressing Common Misconceptions About AI–Human Benchmarks

Even as an experienced ML engineer, you might misinterpret benchmark results. Humanity's Last Exam clears up several persistent myths and changes how you should view any AI benchmark.

Myth 1: "AI beat humans on MMLU, so the problem is solved"

A high MMLU score just means an AI memorized or pattern-matched its way through an outdated dataset. Models that score 90%+ on MMLU still struggle to clear 30% on this evaluation. That persistent gap is the difference between pattern-matching familiar material and genuine reasoning on novel problems.

Myth 2: "Benchmark results mirror real-world performance"

Static leaderboards miss real-life complications you'll face: shifting user inputs, prompt hijacking, unexpected edge cases. Real-time monitoring regularly catches hallucinations and unsafe outputs that never show up in test sets. Good benchmark scores don't eliminate your need for ongoing vigilance.

Myth 3: "Bigger models always win"

Size helps, but benefits decrease when questions require synthesis instead of recall. Massive models often plateau well below human performance. Better architectures or tool-assisted approaches may deliver more value than simply adding more parameters.

Myth 4: "AI reasoning already matches human reasoning"

HLE's retrieval-resistant, multi-modal questions force genuine thinking. You'll notice models regularly make logical errors, misread diagrams, or express unwarranted confidence—mistakes rarely made by expert humans. Until AI can catch these errors itself, treat its outputs as drafts, not final answers.

Myth 5: "Current evaluation methods fully capture AI capabilities"

Closed-ended tests reveal important flaws but miss creativity, long-term planning, and open-ended problem-solving. HLE calls itself the "final" closed-ended academic benchmark precisely because future progress needs richer, more dynamic testing. For real deployments, you should combine standardized tests with scenario-based evaluation and live monitoring to see the complete performance picture.

By dropping these misconceptions, you gain a clearer view of where today's AI truly shines, where it still stumbles, and how to use it responsibly.

Evaluate Your LLM Against HLE and Other Benchmarks with Galileo

Running your model through Humanity's Last Exam shows you the gaps. Galileo is purpose-built to rigorously evaluate and monitor large language models at scale, processing your test results—from HLE, MMLU, or your own datasets—and delivering detailed analytics without weeks of manual work.

  • Automates LLM evaluation and eliminates manual review, saving you up to 80% of your evaluation time.

  • Rapidly identifies LLM issues like hallucinations, unsafe outputs, or prompt injection, with real-time metrics and robust guardrails for production reliability.

  • Empowers developers with flexible, out-of-the-box or custom metrics—enabling offline and live testing, root cause analysis, and seamless CI/CD integration.

  • Scales effortlessly to enterprise workloads, supporting cloud, on-premises, or hybrid deployments—while giving full visibility and control over your LLM applications.

Start exploring Galileo today to unlock the full potential of your models and ensure trustworthy AI at every stage.
