F1 Score in AI Evaluation: How to Balance Precision and Recall

Your toxicity classifier reports 95% accuracy. It looks solid in the dashboard. But when you dig into the results, it is missing 40% of the genuinely toxic outputs, which then reach your customers. Accuracy hides failures on imbalanced data by rewarding correct predictions on the majority class while obscuring catastrophic misses on the class you care about most.
The F1 Score forces precision and recall into balance, exposing where your model succeeds and where it breaks down. This article covers what F1 is, how to calculate it, which variant fits your use case, where it falls short for modern LLM evals, and how to build an eval strategy that goes beyond any single metric.
TLDR:
F1 is the harmonic mean of precision and recall, penalizing lopsided models.
Use F1 over accuracy when your data is imbalanced.
Choose the right variant (Macro, Micro, Weighted, Fβ) for your use case.
F1 suits classification-based safety metrics but not generative quality evals.
Production AI requires multi-metric evals combining safety, quality, and observability.
What Is the F1 Score?
The F1 Score is a metric that evaluates the balance between precision (how many of your predicted positives are actually correct) and recall (how many actual positives your model successfully identifies). It is the harmonic mean of these two values, producing a score between 0 and 1.
Unlike accuracy, F1 excludes true negatives from its calculation. This makes it far more reliable when one class heavily outnumbers the other, which is common in fraud detection, medical diagnosis, and security threat classification, and can also arise in some LLM safety eval settings. If either precision or recall drops significantly, the F1 Score drops with it, ensuring you cannot hide poor minority-class performance behind majority-class dominance.
How to Calculate the F1 Score
Computing the F1 Score requires three steps: calculate precision, calculate recall, then combine them using the harmonic mean formula.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean is critical here. Unlike the arithmetic mean, it penalizes extreme imbalances between precision and recall. If your precision is 0.95 but recall is 0.20, the arithmetic mean would be 0.575, suggesting decent performance. The harmonic mean gives you approximately 0.33, correctly flagging that your model is failing to catch most positive cases.
Worked example: Suppose your safety classifier flags 100 outputs as toxic. Of those, 80 are genuinely toxic and 20 are false positives. Meanwhile, there were 120 actually toxic outputs in total, meaning the model missed 40.
Precision = 80 / (80 + 20) = 0.80
Recall = 80 / (80 + 40) = 0.67
F1 Score = 2 × (0.80 × 0.67) / (0.80 + 0.67) = 0.73
An F1 of 0.73 tells you the model maintains a reasonable balance, but that recall gap (missing one-third of toxic outputs) demands attention. Without F1, you might see 80% precision and call it a success, while toxic content reaches your users undetected.
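The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration working directly from the counts in the worked example (80 true positives, 20 false positives, 40 false negatives), not a production implementation:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts from the worked example in the text.
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)   # 0.80
recall = tp / (tp + fn)      # ~0.67
print(round(precision, 2), round(recall, 2), round(f1_score(tp, fp, fn), 2))
# → 0.8 0.67 0.73
```

Note how the harmonic mean (0.73) sits closer to the weaker of the two components (recall at 0.67) than an arithmetic mean would.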
When to Use the F1 Score in AI Evaluation
You should reach for F1 when accuracy would blur the signal you actually need. That usually happens when one class is rare, when false positives and false negatives both matter, or when you need a clearer view of minority-class behavior.
Imbalanced Classification Tasks
When one class significantly outnumbers another, accuracy becomes unreliable. A fraud detection model processing transactions where only 0.1% are fraudulent can achieve 99.9% accuracy by predicting "legitimate" every time, while catching zero fraud. As Google's ML guide states, F1 is the preferred metric for class-imbalanced datasets because it forces the score to reflect minority-class performance.
This applies equally to LLM safety metrics. Prompt injection attacks typically represent less than 1% of production traffic. PII detection events are rare but high-impact. Toxicity in user-facing outputs may appear in a small fraction of responses. In each case, you need a metric that evaluates how well your model handles the minority class, not how well it predicts the majority.
When your production traffic is dominated by safe inputs and only a thin tail contains the harmful cases, accuracy will tell you everything is fine, while the failures that matter most go unmeasured. The practical implication is straightforward: if you are deploying any binary classifier where one class is significantly rarer, F1 should replace accuracy as your primary eval metric.
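The fraud-detection trap is easy to reproduce. The sketch below uses hypothetical data (1,000 transactions, 1 fraudulent) and a classifier that always predicts the majority class, to show accuracy and F1 diverging:

```python
# Toy stream: 1 = fraud, 0 = legitimate. Only 0.1% of traffic is fraud.
y_true = [1] + [0] * 999
y_pred = [0] * 1000               # majority-class predictor: always "legitimate"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

# With zero positive predictions, precision is undefined; the usual
# convention (e.g. scikit-learn's zero_division=0) scores F1 as 0.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
print(accuracy, f1)   # → 0.999 0.0
```

99.9% accuracy, 0.0 F1: the same model, two completely different stories.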
Multi-Class And Multi-Label Evaluation
When your classification task spans multiple categories, such as sentiment analysis with five classes, intent classification across dozens of intents, or tone detection across nine emotional categories, F1 variants provide more granular visibility into per-class and aggregate performance. That matters when some classes are underrepresented but still carry high business significance.
Consider a scenario where your intent classifier handles 50 intents, but three of them (account cancellation, billing dispute, data deletion) drive most of your escalation volume. Standard accuracy treats all intents equally, so strong performance on common intents can mask failures on the high-stakes ones.
F1 variants give you the tools to weight, isolate, or average class performance based on what your team actually needs to optimize. In practice, the right variant depends on what you want the score to emphasize, which is why choosing the variant matters as much as deciding to use F1 in the first place.
F1 Score Variants And How to Choose
Standard F1 is built for binary classification, but many real-world workloads are not binary. If you are evaluating multiple classes, uneven class frequencies, or cases where recall matters more than precision, you need a different aggregation or weighting scheme.
The variants below solve different decision problems. Some help you compare class performance fairly. Others help you match the score to the operational cost of false positives or false negatives.
Macro, Micro And Weighted F1
These three variants address multi-class evals differently based on whether you prioritize equal class treatment or aggregate behavior.
Macro F1 calculates F1 separately for each class and averages without weighting. Every class contributes equally regardless of size. Use this when all classes matter equally, such as multi-category sentiment analysis.
Micro F1 aggregates all true positives, false positives, and false negatives across classes before computing a single F1. This is ideal when you care about aggregate correctness more than per-class fairness.
Weighted F1 extends Macro F1 by weighting each class by its frequency. More common classes contribute proportionally more. This fits imbalanced multi-class datasets where rare classes should not be ignored but common classes still need accurate representation.
| Scenario | Best Variant | Why |
| --- | --- | --- |
| Equal class importance | Macro F1 | Treats all classes equally |
| Large datasets, overall performance | Micro F1 | Evaluates the model as a whole |
| Imbalanced datasets, frequency matters | Weighted F1 | Ensures fair representation |
| Precision is more important | F0.5 Score | Reduces false positives |
| Recall is more important | F2 Score | Reduces false negatives |
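The three averaging schemes can be implemented by hand, which makes their differences concrete. This sketch uses toy three-class data (class "c" deliberately rare) and is equivalent in spirit to scikit-learn's `average="macro"/"micro"/"weighted"` options:

```python
from collections import Counter

def per_class_counts(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

# Toy labels: class "c" is rare and the model never predicts it.
y_true = ["a", "a", "a", "a", "b", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "b", "a"]

classes = sorted(set(y_true))
per_class = {c: f1_from_counts(*per_class_counts(y_true, y_pred, c)) for c in classes}
support = Counter(y_true)

# Macro: unweighted mean of per-class F1; the rare class drags it down.
macro = sum(per_class.values()) / len(classes)

# Weighted: per-class F1 weighted by class frequency.
weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)

# Micro: pool all TP/FP/FN across classes, then compute one F1.
totals = [sum(per_class_counts(y_true, y_pred, c)[i] for c in classes) for i in range(3)]
micro = f1_from_counts(*totals)

print(round(macro, 3), round(micro, 3), round(weighted, 3))
# → 0.536 0.75 0.696
```

Macro (0.536) is punished hardest by the missed rare class, micro (0.75) matches overall per-prediction correctness, and weighted (0.696) sits in between. Which one you report depends on whether the rare class should dominate the signal.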
Fβ Score for Precision-Recall Prioritization
The Fβ Score modifies the standard F1 by shifting emphasis toward precision or recall based on your application's risk profile. The β parameter controls the trade-off: values below 1 favor precision, values above 1 favor recall.
F0.5 (precision-heavy): Weights precision higher than recall. Use this in spam detection or content moderation where false positives, such as blocking legitimate content, directly degrade your user experience. Overzealous filtering frustrates your users more than the occasional spam message slipping through, so you want the model to be confident before it acts.
F2 (recall-heavy): Weights recall higher than precision. Use this in medical screening or security threat detection where missing a true positive is more costly than generating false alarms.
In prompt injection detection, for example, a missed attack can lead to data exfiltration or unauthorized actions, making F2 a more appropriate target than standard F1. The cost asymmetry between a false alarm (minor operational friction) and a missed injection (potential data breach) justifies the recall bias.
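The Fβ formula is (1 + β²) × P × R / (β² × P + R). A small sketch shows how β shifts the score for the classifier from the earlier worked example (precision 0.80, recall ~0.67); the numbers here are illustrative, not a benchmark:

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.80, 2 / 3          # precision 0.80, recall ~0.67

f1 = fbeta(p, r, 1.0)       # standard F1, ~0.727
f05 = fbeta(p, r, 0.5)      # precision-heavy, pulled toward 0.80
f2 = fbeta(p, r, 2.0)       # recall-heavy, pulled toward 0.67
print(round(f1, 3), round(f05, 3), round(f2, 3))
```

F0.5 (~0.77) rewards this model's strong precision, while F2 (~0.69) penalizes it for the weaker recall, which is exactly the behavior you want when missed positives are the expensive error.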
F1 Score Limitations You Should Know
F1 is useful because it makes an important trade-off visible, but it is still only one lens on model behavior. If you rely on it alone, you can miss important differences in cost, confidence, and threshold behavior.
These limitations do not make F1 less valuable. They define where it works well and where you need additional metrics to make sound deployment decisions.
Equal Weighting May Not Match Your Risk Profile
The standard F1 Score assumes precision and recall are equally important. In practice, they rarely are, and defaulting to equal weighting can lead you to optimize for the wrong outcome.
In fraud detection, missing a fraudulent transaction (false negative) is typically more costly than flagging a legitimate one (false positive). Optimizing for equal weighting may leave you under-indexed on recall.
Conversely, in medical diagnosis, an incorrect positive diagnosis can trigger unnecessary treatments and patient anxiety, making precision the priority. In autonomous agent workflows, a missed safety violation could cascade into real-world harm, while a false positive only triggers an unnecessary human review.
The Fβ Score addresses this partially, but you still need to explicitly define your cost structure rather than defaulting to equal weighting. Before selecting any F1 variant, map out the business cost of each error type. That cost analysis, not the metric default, should drive your threshold decisions.
Single Threshold Blindness And Confidence Gaps
F1 is calculated at a single decision threshold. It tells you nothing about how confident your model is in its predictions. Two models can produce identical F1 scores while having very different confidence distributions: one might be highly certain about its predictions while the other hovers near the decision boundary. In production, that difference matters significantly because even borderline predictions are more likely to flip under distribution shift.
This limitation is especially problematic for LLM evals, where uncertainty matters. Recent perplexity research argues that metrics focused mainly on the chosen token's probability can miss important information about the broader output distribution, making perplexity an unreliable indicator of true model confidence.
For risk-based AI systems, such as financial risk assessment or autonomous agent decision-making, understanding model confidence is essential for responsible deployment. Pairing F1 with confidence calibration tools gives you a more complete picture of model reliability than either metric alone.
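The threshold-blindness point can be made concrete with two hypothetical models that agree on every hard prediction at a 0.5 threshold (so their F1 scores are identical) but differ sharply in confidence. The Brier score, the mean squared error of the predicted probabilities, separates them:

```python
# Toy labels and two hypothetical probability profiles (illustrative only).
y_true = [1, 1, 0, 0, 1, 0]
probs_confident = [0.95, 0.90, 0.05, 0.10, 0.92, 0.08]
probs_borderline = [0.55, 0.52, 0.45, 0.48, 0.51, 0.49]

def hard(probs, threshold=0.5):
    return [1 if p >= threshold else 0 for p in probs]

# Identical hard predictions, therefore identical F1 at this threshold.
assert hard(probs_confident) == hard(probs_borderline)

def brier(y, probs):
    """Brier score: mean squared error of probabilities (lower is better)."""
    return sum((p - t) ** 2 for t, p in zip(y, probs)) / len(y)

print(round(brier(y_true, probs_confident), 3))    # → 0.006
print(round(brier(y_true, probs_borderline), 3))   # → 0.224
```

Same F1, radically different calibration: the borderline model is one small distribution shift away from flipping its predictions, and F1 alone cannot see that.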
F1 Score Vs. Other Evaluation Metrics
You do not choose evaluation metrics in isolation. F1, accuracy, and AUC-ROC answer different questions, and looking at them side by side makes it easier to see where each one helps and where it can mislead.
F1 Score Vs. Accuracy
Accuracy measures the percentage of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN). The inclusion of true negatives in the numerator is precisely what makes it misleading on imbalanced data. As this analysis explains, a model predicting the majority class for all examples on a dataset with 1:100 class imbalance achieves 99% accuracy while learning nothing useful about the minority class.
F1 excludes true negatives entirely. Per scikit-learn's documentation, F1 = 2×TP / (2×TP + FP + FN), meaning a model cannot inflate its score by correctly classifying the majority class. Use accuracy only when your classes are reasonably balanced. For imbalanced datasets, F1 is often a more informative choice because it forces you to confront how well the model handles the cases that actually matter.
When you are evaluating AI model accuracy across your pipeline, treating F1 and accuracy as complementary rather than interchangeable gives you a more complete picture.
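Both formulas can be checked side by side. The confusion-matrix counts below are hypothetical, chosen so true negatives dominate; the code also verifies that the count form of F1 from scikit-learn's documentation equals the harmonic-mean form:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_counts(tp, fp, fn):
    """Count form: F1 = 2*TP / (2*TP + FP + FN). No TN anywhere."""
    return 2 * tp / (2 * tp + fp + fn)

def f1_harmonic(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# Hypothetical imbalanced confusion matrix: TN swamps everything else.
tp, tn, fp, fn = 10, 10_000, 5, 90
print(round(accuracy(tp, tn, fp, fn), 3))   # → 0.991, looks excellent
print(round(f1_counts(tp, fp, fn), 3))      # → 0.174, exposes the missed 90

# The two F1 forms are algebraically identical.
assert abs(f1_harmonic(tp, fp, fn) - f1_counts(tp, fp, fn)) < 1e-12
```

The same confusion matrix yields 99.1% accuracy and 0.174 F1, because F1 never sees the 10,000 true negatives that inflate accuracy.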
F1 Score Vs. AUC-ROC
AUC-ROC evaluates model performance across all possible decision thresholds, summarizing ranking quality into a single number. F1 evaluates at a specific operating threshold, making it a deployment-ready metric. Each answers a different question: AUC-ROC tells you how well the model separates classes overall, while F1 tells you how well it performs at the threshold you actually ship to production.
The critical distinction is that on severely imbalanced datasets, AUC-ROC can be misleading because the false positive rate denominator includes the large number of true negatives, compressing FPR even when false positives are high in absolute terms. As precision-recall analysis demonstrates, Precision-Recall curves and PR-AUC are more appropriate than ROC curves for highly imbalanced datasets. This is especially relevant for safety classifiers where the harmful class is rare.
Use AUC-ROC evaluation during model development for threshold-agnostic comparison. Switch to F1 (or PR-AUC) when selecting your production threshold and evaluating deployment-ready performance.
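The switch from threshold-agnostic to deployment-ready evaluation often comes down to a threshold sweep: score every candidate threshold by F1 and ship the best one. A minimal sketch, using toy scores and labels purely for illustration:

```python
# Toy model scores and ground-truth labels (hypothetical).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

def f1_at(threshold):
    """F1 of the hard predictions produced at a given decision threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t and p for t, p in zip(y_true, pred))
    fp = sum((not t) and p for t, p in zip(y_true, pred))
    fn = sum(t and (not p) for t, p in zip(y_true, pred))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

# Sweep candidate thresholds (here, the observed scores) and keep the best.
best = max(scores, key=f1_at)
print(best, round(f1_at(best), 3))   # → 0.35 0.8
```

In practice you would sweep thresholds on a validation set (scikit-learn's `precision_recall_curve` gives you the candidate points directly) and then report F1 at the chosen operating point, which is the number that actually describes production behavior.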
How F1 Score Applies to LLM And Agent Evaluation
Most explanations of F1 stay in classical ML, but production LLM systems create a more specific set of needs. You can use F1 confidently for safety and compliance classifiers, but you should not expect it to tell you whether generated answers are factual, well-reasoned, or useful.
Classification-Based Safety And Compliance Metrics
Many LLM safety evals are fundamentally classification tasks where F1 is the standard performance measure. Toxicity detection, PII detection, prompt injection identification, and bias or sexism classification all produce binary or multi-class outputs where precision and recall directly map to operational risk. Your safety pipeline needs to balance catching harmful content against over-blocking legitimate user interactions.
Published benchmarks show wide F1 variation across safety categories. Prompt injection detectors range from 0.86 to 0.97 F1 depending on architecture and dataset, as recent benchmarking research demonstrates. PII classifiers show similar variance; Roblox's production model achieves 94% F1 on real chat data while simpler NER approaches score much lower. Purpose-built eval models such as Luna-2 achieve a 0.95 F1 score with average latency around 152ms at a fraction of the cost of general-purpose LLM judges.
For these classification-based safety checks, F1 remains the right metric. It directly captures the trade-off between over-blocking legitimate content (precision) and missing harmful outputs (recall), which is exactly the decision your safety and compliance pipeline needs to optimize.
Moving Beyond F1 for Generative AI Quality
F1 works for classification subtasks within LLM pipelines, but it cannot evaluate the core generative capabilities you care about most. Automatically assessing the quality of generated text remains one of the harder open problems in NLP, because the criteria are subjective, context-dependent, and difficult to reduce to binary labels. You cannot score a hallucination the same way you score a toxicity flag.
Hallucination detection requires checking factual accuracy against external knowledge, not token overlap. Instruction adherence demands structural and semantic evaluation of complex multi-step requirements. Reasoning coherence, as recent research demonstrates, requires assessing logical consistency across reasoning steps rather than surface-level fluency.
In practice, you will combine F1-based safety metrics with LLM judge metrics for subjective quality dimensions and custom domain-specific metrics for business requirements. Production AI evals work best as layered systems where each metric type covers a different failure mode, and no single score carries the full picture.
Choosing F1 as Part of a Stronger Eval Stack
F1 is one of the clearest ways to measure whether your classifier balances precision and recall, especially when class imbalance makes accuracy misleading. It works well for safety classifiers, threshold selection, and model comparison when both error types matter. At the same time, it does not capture confidence, generative quality, or the broader behavior of autonomous agents in production. That is why leading AI teams pair F1 with complementary metrics, broader observability, and guardrails that can act on failures in real time.
Galileo is the agent observability and guardrails platform that helps engineers ship reliable AI agents with visibility, evaluation, and control:
Metrics Engine: Run safety, quality, and agentic metrics alongside custom LLM-as-judge and code-based evaluators in a single workflow.
Luna-2 models: Purpose-built evaluation SLMs with strong F1 performance, 98% lower cost than LLM-based evaluation, and sub-200ms latency.
Runtime Protection: Turn safety checks into real-time guardrails that intercept unsafe outputs, PII leakage, and prompt injections before they reach your users.
CLHF: Improve metric accuracy by up to 30% with as few as 2-5 annotated examples through Continuous Learning via Human Feedback.
Signals: Surface failure patterns automatically by analyzing production traces without requiring manual search.
Book a demo to see how Galileo can help you turn F1-based evals into a production-scale reliability workflow.
FAQ
What is a good F1 score for AI models? There is no universal threshold. Prompt injection detectors range from 0.86 to 0.97 F1 depending on architecture, while production PII classifiers report 0.81 to 0.94 F1 depending on model complexity. Define "good" based on published benchmarks for your specific task and the operational cost of false positives versus false negatives.
How do I choose between Macro F1 and Weighted F1? Use Macro F1 when every class matters equally to your eval, regardless of how frequently each class appears. Use Weighted F1 when class frequency should influence the score, such as in imbalanced datasets where rare classes need representation but common classes still drive most of your production traffic.
When should I use F1 score vs. AUC-ROC? Use AUC-ROC during model development for threshold-agnostic comparison across model architectures. Switch to F1 when you need a deployment-ready metric at a specific decision threshold. For severely imbalanced datasets, consider F1 or PR-AUC, as AUC-ROC can make minority-class performance look better than it is.
How does the F1 score apply to LLM evaluation? F1 applies directly to classification-based LLM eval tasks: toxicity detection, PII identification, prompt injection detection, and bias classification. It does not apply to generative quality dimensions like hallucination, instruction adherence, or reasoning coherence, which require LLM-as-judge evaluators or embedding-based metrics.
How does Galileo use F1 in its evaluation platform? Galileo's Luna-2 small language models deliver strong F1 performance for AI eval and guardrailing tasks at 98% lower cost than LLM-based evaluation. The platform's Metrics Engine supports safety and compliance evals including prompt injection, PII, toxicity, and sexism detection, alongside response quality and agentic performance metrics.

Pratik Bhavsar