Feb 2, 2026

What Is BERTScore and How Does It Work for NLP Evaluation?

Jackson Wells

Integrated Marketing

Evaluating the nuanced outputs delivered by large language models (LLMs) has long been a significant challenge in artificial intelligence. Traditional n-gram-based metrics such as BLEU and ROUGE often struggle to capture the semantic depth and context of human language, which is essential for tasks like machine translation and text summarization.

Metrics like BLEU and ROUGE rely on n-gram matching, which frequently fails to align with human judgments. BERTScore addresses this limitation by leveraging pre-trained contextual embeddings from transformer models (such as BERT, RoBERTa, and DeBERTa) to enable context-aware semantic similarity evaluation.

This article explores BERTScore's methodology, its advantages over conventional metrics, and how to implement it effectively in production AI systems.

TLDR:

  • BERTScore evaluates text similarity using contextual embeddings from transformer models like BERT, RoBERTa, and DeBERTa, not word matching.

  • It calculates precision, recall, and F1 using cosine similarity between contextual embeddings, with optional IDF weighting and baseline rescaling for improved interpretability.

  • BERTScore demonstrates superior correlation with human judgment in semantic tasks like personalized text generation (59% vs 47-50% for BLEU/ROUGE), but performance is highly domain-dependent—BLEU remains competitive in machine translation (0.78-0.91 correlation), and correlations are weak in medical NLP contexts.

  • Use it for open-ended generation, summarization, and text generation evaluation, but always combine with traditional metrics and human evaluation in production systems.

What is BERTScore?

BERTScore is a semantic evaluation metric for natural language processing (NLP) that transcends surface-level word overlap by leveraging contextual embeddings generated by transformer models. This metric computes token-level cosine similarities between contextual embeddings and aggregates them into precision, recall, and F1 scores.

Unlike traditional n-gram-based metrics such as BLEU or ROUGE, BERTScore leverages pre-trained transformer models (BERT, RoBERTa, XLNet) to generate context-dependent token representations. 

This means it can recognize that "The cat sits on the mat" and "A feline rests upon a rug" convey the same meaning—even though they share no common words—because it understands semantic similarity rather than relying on exact word matches. This shift from quantity-based matching to quality-based understanding represents a fundamental advancement in how we evaluate AI-generated text.

The metric supports multiple pre-trained models including BERT, RoBERTa, and DeBERTa. The official repository now recommends microsoft/deberta-xlarge-mnli for highest accuracy or microsoft/deberta-large-mnli for speed-optimized deployments, while roberta-large remains the library's default for English.

How BERTScore Works: Step-by-Step

Understanding BERTScore's calculation process helps practitioners configure it effectively and interpret results accurately.

1. Tokenization and Embedding Generation

Both candidate and reference texts are tokenized and passed through a pre-trained transformer model (such as BERT, RoBERTa, or XLNet). Each token receives a contextual embedding based on its surrounding context rather than a static word embedding—meaning the word "bank" gets different representations in "river bank" versus "bank account." 

This context-dependent approach represents the fundamental innovation that distinguishes BERTScore from traditional n-gram-based metrics like BLEU or ROUGE, enabling the metric to recognize semantic equivalence between paraphrases and synonyms that would be missed by exact string matching.

According to the official implementation, practitioners can specify which model to use via the model_type parameter and select specific transformer layers using num_layers.
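
As a minimal sketch of that configuration (assuming the bert-score PyPI package; the sentences are illustrative):

from bert_score import score

candidates = ["The weather is lovely today."]
references = ["It is a beautiful day outside."]

# model_type selects the backbone transformer; if num_layers is omitted,
# bert-score falls back to its tuned per-model default layer.
P, R, F1 = score(
    candidates,
    references,
    model_type="microsoft/deberta-xlarge-mnli",
)
print(F1.mean().item())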

2. Cosine Similarity Calculation

BERTScore computes pairwise cosine similarities between all tokens in the candidate sentence and all tokens in the reference sentence. This produces a similarity matrix where each element represents the contextual similarity between a token pair.
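
A minimal sketch of this step, using random tensors purely as stand-ins for real contextual embeddings:

import torch

cand_emb = torch.randn(7, 768)   # 7 candidate tokens (toy contextual embeddings)
ref_emb = torch.randn(9, 768)    # 9 reference tokens

# L2-normalize so dot products become cosine similarities.
cand_emb = torch.nn.functional.normalize(cand_emb, dim=-1)
ref_emb = torch.nn.functional.normalize(ref_emb, dim=-1)

# Entry [i, j] is the similarity between candidate token i and reference token j.
sim = cand_emb @ ref_emb.T       # shape (7, 9)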

3. Precision, Recall, and F1 Scores

The metric then derives three values through greedy matching over the similarity matrix (a worked sketch follows the list below):

  1. Precision: For each token in the candidate sentence, identify the token in the reference sentence with the maximum cosine similarity. Precision is the average of these maximum similarity scores across all candidate tokens.

  2. Recall: For each token in the reference sentence, find the most similar token in the candidate sentence. Recall is the average of these maximum similarities across all reference tokens.

  3. F1 Score: Computed as the harmonic mean of precision and recall, providing a balanced measure of overall semantic similarity.
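
Continuing the toy similarity matrix from the previous sketch, greedy matching reduces to row-wise and column-wise maxima:

# Best reference match for each candidate token, averaged -> precision.
precision = sim.max(dim=1).values.mean()
# Best candidate match for each reference token, averaged -> recall.
recall = sim.max(dim=0).values.mean()
# Harmonic mean of the two.
f1 = 2 * precision * recall / (precision + recall)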

4. IDF Weighting and Baseline Rescaling

Two optional enhancements improve BERTScore's effectiveness: IDF weighting and baseline rescaling.

IDF Weighting: Inverse document frequency weighting emphasizes rare, informative tokens over common words. Enable this by setting idf=True in the Python API.

Baseline Rescaling: Raw BERTScore values can be difficult to interpret. Rescaling normalizes scores relative to baseline performance, making them more comparable and "human-readable." The rescaling feature does not affect BERTScore's correlation with human judgment, as measured by Pearson's r and Kendall's τ coefficients. For production deployments, set rescale_with_baseline=True by default.
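
A minimal sketch enabling both options through the functional API (the example sentences are illustrative):

from bert_score import score

candidates = ["The team shipped a patch for the regression."]
references = ["A fix for the regression was released by the team."]

# idf=True computes IDF weights from the references;
# rescale_with_baseline=True maps raw scores onto a more readable range.
P, R, F1 = score(
    candidates,
    references,
    lang="en",
    idf=True,
    rescale_with_baseline=True,
)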

BERTScore vs BLEU vs ROUGE: When to Use Each

Choosing the right evaluation metric depends on your specific task and requirements. Here's how these metrics compare based on recent research:

| Metric | Measures | Best For | Understands Meaning? | Speed |
| --- | --- | --- | --- | --- |
| BLEU | N-gram precision | Machine translation | ❌ No | ✅ Fast |
| ROUGE | N-gram recall/overlap | Text summarization | ❌ No | ✅ Fast |
| BERTScore | Contextual semantic similarity | Text generation, paraphrase | ✅ Yes | ⚠️ Slower |

Human Judgment Correlation

BERTScore demonstrates significantly stronger alignment with human evaluation than traditional metrics in specific task domains, though correlation varies substantially with application context. Recent studies report quantitative human-judgment correlations across multiple NLP tasks:

  • Personalized Long-Form Text Generation: BERTScore achieved 59% alignment with human majority voting in abstract generation, review writing, and topic writing tasks within the LongLaMP benchmark, outperforming ROUGE-L (50%), BLEU (47%), and METEOR (47%) by 9 to 12 percentage points.

  • Medical Text Summarization: According to the NPJ Digital Medicine systematic review, BERTScore demonstrated mixed correlation results across clinical tasks—Completeness metrics ranged from Pearson r = 0.28-0.44 and Spearman ρ = 0.15-0.645, while Correctness metrics showed Pearson r = 0.23-0.52 and Spearman ρ = 0.15-0.530. Notably, ROUGE exhibited catastrophic failure modes in certain medical contexts, with negative Spearman correlations (ρ = -0.66 to -0.77) indicating its scores moved in the opposite direction from human quality judgments.

  • Machine Translation Quality Estimation: The WMT 2025 Shared Task evaluated metrics across 8 language pairs, reporting system-level correlations of 0.759-0.904 for BERTScore and 0.781-0.908 for BLEU, with segment-level correlations of 0.579-0.728 for BERTScore and 0.575-0.720 for BLEU, demonstrating that BLEU remains competitive in translation evaluation despite BERTScore's theoretical semantic advantages.

The research consensus is that automated metrics should serve as preliminary screening tools paired with human evaluation, not standalone validity measures; human evaluation remains mandatory for medical and clinical applications, where automated metric correlations prove insufficient.

According to the ExPerT study published in ACL 2025 Findings, researchers conducted large-scale human evaluation (100 examples, 3 annotators per example, inter-annotator agreement of 0.823 Fleiss' kappa) measuring alignment between automated metrics and human majority voting:

  • BERTScore: 59% alignment with human judgments

  • ROUGE-L: 50% alignment

  • BLEU: 47% alignment

This 9-to-12-percentage-point improvement demonstrates BERTScore's advantage when paraphrasing and semantic equivalence matter more than exact wording, as shown in personalized long-form text generation tasks like abstract and review writing.

When to Use Each Metric

Use BERTScore when:

  • Semantic accuracy matters more than exact wording, as demonstrated by 59% human alignment compared to 47-50% for traditional metrics in personalized text generation tasks

  • Evaluating paraphrases or creative text where multiple valid phrasings exist for semantically equivalent content

  • BLEU/ROUGE scores seem unfairly low for good outputs, particularly in open-ended generation scenarios

  • You have computational resources available, acknowledging the 5-10x performance overhead compared to traditional metrics

Use BLEU/ROUGE instead when:

  • Speed is critical (BERTScore requires transformer inference, with 5-10x computational overhead compared to n-gram metrics)

  • Exact terminology matters (legal, medical contexts where traditional metrics show stronger correlation in specific domains)

  • You need quick regression testing in CI/CD pipelines (BLEU/ROUGE enable millisecond-level latency vs transformer inference requirements)

  • Translation benchmarking in established workflows (BLEU achieves system-level correlations of 0.78-0.91, comparable to BERTScore)

The WMT 2025 benchmarks showed that BLEU achieved system-level correlations of 0.781-0.908 in machine translation—comparable to BERTScore (0.759-0.904) in this established domain.
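
To make the trade-off concrete, here is a small comparison sketch on the paraphrase pair from earlier (assuming the sacrebleu, rouge-score, and bert-score packages are installed):

import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score

candidate = "A feline rests upon a rug."
reference = "The cat sits on the mat."

# Surface-overlap metrics: near zero because almost no words match.
bleu = sacrebleu.sentence_bleu(candidate, [reference]).score
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Semantic metric: rewards the paraphrase despite the lack of lexical overlap.
_, _, F1 = score([candidate], [reference], lang="en", rescale_with_baseline=True)

print(f"BLEU: {bleu:.1f}  ROUGE-L: {rouge_l:.2f}  BERTScore F1: {F1.item():.2f}")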

Best Practices for Implementing BERTScore in AI Evals

To get the most out of BERTScore, follow practices that are well supported by research and successful implementations, from setting up evaluation pipelines to calibrating baselines.

Setting Up Evaluation Pipelines

Setting up a robust evals pipeline is always recommended when implementing BERTScore. Begin by installing the necessary packages: bert-score, transformers, and torch. Run evaluations on a CUDA-enabled GPU where possible, as transformer inference is computationally intensive. Here is a minimal Python example to start with:

from bert_score import score

# Candidates and references are parallel lists of strings.
candidates = ["The quick brown fox jumps over the lazy dog"]
references = ["A brown fox quickly jumps over a lazy dog"]

# Returns per-sentence precision, recall, and F1 as torch tensors.
P, R, F1 = score(candidates, references, lang="en")

Configure pipelines to support batch processing and custom model configurations for enhanced scalability and efficiency. Tools like Dask or Spark can be integrated for parallel processing of large datasets, ensuring that the evaluation remains accurate without compromising computation speed.
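
A minimal batching sketch with the library's BERTScorer object (the repeated sentences simply stand in for a larger dataset):

from bert_score import BERTScorer

# Load the model once and reuse it; batch_size controls how many pairs
# are embedded per forward pass.
scorer = BERTScorer(lang="en", rescale_with_baseline=True, batch_size=64)

candidates = ["The quick brown fox jumps over the lazy dog"] * 256
references = ["A brown fox quickly jumps over a lazy dog"] * 256
P, R, F1 = scorer.score(candidates, references)
print(F1.mean().item())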

Framework Integration and Baseline Establishment

Seamlessly integrating BERTScore into existing frameworks requires careful attention to both technical dependencies and evaluation standards:

  • Technical Integration: Cache model weights to avoid repeated downloads, and align batch sizes with your system's memory capabilities.

  • Data Preprocessing: Consistent preprocessing of input texts is vital to maintaining data integrity across evaluations, while avoiding mixed language models in the same evaluation process ensures coherent results.

  • Baseline Configuration: As highlighted in Lukas Heller's article on BERTScore, setting the rescale_with_baseline parameter to True normalizes scores and provides consistency with traditional metrics like BLEU and ROUGE, producing more interpretable scores across different evaluation scenarios.

  • IDF Weighting Implementation: Incorporating IDF weighting enhances BERTScore's evaluation precision by appropriately weighting rare words that often carry significant semantic value (see the configuration sketch below).
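
As a minimal configuration sketch consolidating these points (assuming the bert-score BERTScorer API; the sentences are illustrative):

from bert_score import BERTScorer

references = [
    "A fix for the regression was released by the team.",
    "The quarterly report showed strong revenue growth.",
]
candidates = [
    "The team shipped a patch for the regression.",
    "Revenue grew strongly according to the quarterly report.",
]

# idf=True with idf_sents builds corpus-level IDF weights at construction time;
# rescale_with_baseline=True keeps scores comparable across runs.
scorer = BERTScorer(
    lang="en",
    idf=True,
    idf_sents=references,
    rescale_with_baseline=True,
)
P, R, F1 = scorer.score(candidates, references)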

This systematic approach to framework integration and baseline establishment ensures that your BERTScore implementations function efficiently and provide evaluations that closely align with human interpretations of text quality. Teams can confidently deploy BERTScore across their evaluation pipelines while maintaining consistent and reliable assessment standards.

Monitoring and Maintenance

To maintain the efficacy of BERTScore, ongoing monitoring is essential. Identifying and addressing potential data errors, as measured by the Data Error Potential metric, helps ensure consistent and accurate evaluations. Establish logging mechanisms to track evaluation progress and manage errors effectively:

import logging
from bert_score import BERTScorer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Reuse one scorer so the model loads only once.
scorer = BERTScorer(lang="en", rescale_with_baseline=True)
try:
    # candidates and references are the lists built earlier in the pipeline
    P, R, F1 = scorer.score(candidates, references)
except Exception as e:
    logger.error(f"Scoring failed: {e}")

Keep your libraries and model versions updated regularly to mitigate inconsistencies in scoring.

Common BERTScore Use Cases

BERTScore's semantic evaluation capabilities make it valuable across a wide range of NLP applications. Here are the most common production use cases where teams leverage BERTScore for quality assessment.

Machine Translation Evaluation

BERTScore has been adopted as an official evaluation metric in SemEval-2025 Task 2: Entity-Aware Machine Translation, representing current academic best practices for translation quality assessment. Research published in Nature Scientific Reports demonstrates hybrid approaches combining BERTScore with SVM classifiers for domain-specific translation assessment.

Text Summarization Assessment

BERTScore excels in evaluating text summarization tasks where capturing semantic fidelity matters more than exact word overlap. Traditional metrics like ROUGE often penalize summaries that use different phrasing to convey the same meaning, while BERTScore recognizes that "The company reported strong quarterly earnings" and "The firm announced robust financial results for the quarter" express equivalent information. 

This semantic understanding makes BERTScore particularly valuable for abstractive summarization, where models generate novel phrasing rather than extracting verbatim text. However, practitioners should note that domain-specific applications require careful validation—medical summarization contexts show weaker correlations with human judgment, emphasizing the need to combine BERTScore with domain-expert evaluation in specialized fields.

Read our case study with Magid to learn more about how Galileo empowers newsrooms and other content-forward organizations worldwide. 

LLM Content Quality Measurement

Research published in Nature Scientific Reports evaluated five state-of-the-art LLMs (LLaMA-3 70B, Gemini 2.5 Pro, DeepSeek R1, GPT Omni 3.0, Grok 3) for generating software task flows, using BERTScore alongside SBERT, USE, and hybrid metrics for measuring semantic similarity between human-annotated and AI-generated content. 

This implementation demonstrates how production systems combine BERTScore with complementary semantic and domain-specific metrics within comprehensive evaluation frameworks rather than deploying it in isolation.

Implementing BERTScore for Production AI Evaluation

BERTScore represents a significant advancement over traditional metrics by capturing semantic similarity rather than surface-level word matching. For teams evaluating text generation, translation, or summarization systems, it provides scores that align more closely with human judgment, particularly in personalized content generation and creative writing where multiple valid phrasings exist. 

Galileo is an evaluation and observability platform that helps developers improve their AI applications and works with all major LLM providers. While BERTScore isn't a built-in metric, Galileo provides several ways to integrate it into your evaluation workflows:

  • Custom code-based metrics: Galileo supports custom code-based metrics that let you implement BERTScore as a scorer function. You can define specific evaluation criteria for your LLM applications using either registered custom metrics (shared organization-wide) or local metrics (running in your notebook environment). 

  • Flexible execution environment: For registered custom metrics, Galileo runs them in a sandbox Python 3.10 environment, where you can install PyPI packages like bert-score using the uv script dependency format, giving you access to the BERTScore library directly in Galileo's platform. 

  • Local metrics for full library access: Local metrics let you use any library or custom Python code, including calling out to LLMs or other APIs, making it easy to run computationally intensive BERTScore evaluations using your local GPU resources. 

  • Integration with experiments and log streams: Custom BERTScore metrics can be used with experiments and Log streams, allowing you to track semantic similarity scores alongside Galileo's out-of-the-box metrics like hallucination detection, context adherence, and response quality. 

  • Aggregation and reporting: Galileo's custom metrics framework includes an aggregator function that rolls individual scores up into summary metrics across experiments, so you can compute average BERTScore precision, recall, and F1 across your entire evaluation dataset. 

Book a demo to see how Galileo's custom metrics framework can help you implement BERTScore and other semantic evaluation metrics at scale.

FAQs

What is BERTScore and how does it work?

BERTScore is an NLP evaluation metric that uses BERT's contextual embeddings to measure semantic similarity between generated and reference texts. Unlike BLEU or ROUGE, it captures meaning rather than exact word matches, calculating precision, recall, and F1 scores based on cosine similarity between token embeddings. According to the ExPerT study published in ACL 2025 Findings, BERTScore achieves 59% alignment with human judgments compared to 47% for BLEU.

How do I calculate BERTScore in Python?

Install the bert-score library with pip install bert-score, then use from bert_score import score followed by P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True). For production systems processing multiple evaluations, use the BERTScorer object to cache the model and improve efficiency. According to the official documentation, this caching is critical for computational efficiency when performing multiple evaluations in production environments. Run on a CUDA-enabled GPU for faster computation, though be aware of the computational overhead—BERTScore requires 5-10x more processing time compared to traditional metrics like BLEU and ROUGE.

What is a good BERTScore F1 value?

Authoritative documentation does not publish specific F1 ranges or universal thresholds for different NLP tasks because performance varies significantly by domain, task, and model. The official baseline rescaling documentation focuses on making scores more readable, not on absolute performance thresholds. Always use rescale_with_baseline=True for more interpretable scores, then establish task-specific baselines empirically through validation sets and human evaluation correlation studies, and track relative improvements across model iterations rather than targeting absolute values.

How do IDF weighting and baseline rescaling enhance BERTScore results?

IDF weighting prioritizes rare and informative tokens in the text, reducing the influence of common words that provide minimal semantic value. By emphasizing these rare words, the evaluation captures more meaningful context, improving overall precision and recall scores. Enable IDF weighting by setting idf=True in the Python API. Baseline rescaling normalizes raw BERTScore values, making them more comparable across different evaluation scenarios and adjusting scores to a more interpretable scale. This is particularly useful in production environments where clear, comparable metrics are essential for decision-making—use rescale_with_baseline=True by default.

Evaluating the nuanced outputs delivered by large language models (LLMs) has long been a significant challenge in artificial intelligence. Traditional n-gram-based metrics such as BLEU and ROUGE often struggle to capture the semantic depth and context of human language, which is essential for tasks like machine translation and text summarization.

Models like BLEU and ROUGE rely on n-gram matching, which frequently fails to align with human judgments. BERTScore addresses this limitation by leveraging pre-trained contextual embeddings from transformer models (such as BERT, RoBERTa, and DeBERTa) to enable context-aware semantic similarity evaluation. 

This article explores BERTScore's methodology, its advantages over conventional metrics, and how to implement it effectively in production AI systems.

TLDR:

  • BERTScore evaluates text similarity using contextual embeddings from transformer models like BERT, RoBERTa, and DeBERTa, not word matching.

  • It calculates precision, recall, and F1 using cosine similarity between contextual embeddings, with optional IDF weighting and baseline rescaling for improved interpretability.

  • BERTScore demonstrates superior correlation with human judgment in semantic tasks like personalized text generation (59% vs 47-50% for BLEU/ROUGE), but performance is highly domain-dependent—BLEU remains competitive in machine translation (0.78-0.91 correlation), and correlations are weak in medical NLP contexts.

  • Use it for open-ended generation, summarization, and text generation evaluation, but always combine with traditional metrics and human evaluation in production systems.

What is BERTScore?

BERTScore is a semantic evaluation metric for natural language processing (NLP) that transcends surface-level word overlap by leveraging contextual embeddings generated by transformer models. This metric computes token-level cosine similarities between contextual embeddings and aggregates them into precision, recall, and F1 scores.

Unlike traditional n-gram-based metrics such as BLEU or ROUGE, BERTScore leverages pre-trained transformer models (BERT, RoBERTa, XLNet) to generate context-dependent token representations. 

This means it can recognize that "The cat sits on the mat" and "A feline rests upon a rug" convey the same meaning—even though they share no common words—because it understands semantic similarity rather than relying on exact word matches. This shift from quantity-based matching to quality-based understanding represents a fundamental advancement in how we evaluate AI-generated text.

The metric supports multiple pre-trained models including BERT, RoBERTa, and DeBERTa, with the official repository now recommending microsoft/deberta-xlarge-mnli for highest accuracy or microsoft/deberta-large-mnli for speed-optimized deployments, replacing the older roberta-large default.

Master LLM-as-a-Judge evaluation to ensure quality, catch failures, and build reliable AI apps

How BERTScore Works: Step-by-Step

Understanding BERTScore's calculation process helps practitioners configure it effectively and interpret results accurately.

1. Tokenization and Embedding Generation

Both candidate and reference texts are tokenized and passed through a pre-trained transformer model (such as BERT, RoBERTa, or XLNet). Each token receives a contextual embedding based on its surrounding context rather than a static word embedding—meaning the word "bank" gets different representations in "river bank" versus "bank account." 

This context-dependent approach represents the fundamental innovation that distinguishes BERTScore from traditional n-gram-based metrics like BLEU or ROUGE, enabling the metric to recognize semantic equivalence between paraphrases and synonyms that would be missed by exact string matching.

According to the official implementation, practitioners can specify which model to use via the model_type parameter and select specific transformer layers using num_layers.

2. Cosine Similarity Calculation

BERTScore computes pairwise cosine similarities between all tokens in the candidate sentence and all tokens in the reference sentence. This produces a similarity matrix where each element represents the contextual similarity between a token pair.

3. Precision, Recall, and F1 Scores

The metric calculates three values through greedy matching: precision (the average of maximum cosine similarities between candidate tokens and reference tokens), recall (the average of maximum similarities between reference tokens and candidate tokens), and F1 score (the harmonic mean of precision and recall).

  1. Precision: For each token in the candidate sentence, identify the token in the reference sentence with the maximum cosine similarity. Precision is the average of these maximum similarity scores across all candidate tokens.

  2. Recall: For each token in the reference sentence, find the most similar token in the candidate sentence. Recall is the average of these maximum similarities across all reference tokens.

  3. F1 Score: Computed as the harmonic mean of precision and recall, providing a balanced measure of overall semantic similarity.

4. IDF Weighting and Baseline Rescaling

Two optional enhancements improve BERTScore's effectiveness: IDF weighting, which emphasizes rare, informative tokens by weighting each token's cosine similarity contribution by its inverse document frequency value, and baseline rescaling for interpretability, which normalizes scores relative to baseline performance to make them more comparable across different evaluation scenarios.

IDF Weighting: Inverse document frequency weighting emphasizes rare, informative tokens over common words. Enable this by setting idf=True in the Python API.

Baseline Rescaling: Raw BERTScore values can be difficult to interpret. Rescaling normalizes scores relative to baseline performance, making them more comparable and "human-readable." The rescaling feature does not affect BERTScore's correlation with human judgment, as measured by Pearson's r and Kendall's τ coefficients. For production deployments, set rescale_with_baseline=True by default.

BERTScore vs BLEU vs ROUGE: When to Use Each

Choosing the right evaluation metric depends on your specific task and requirements. Here's how these metrics compare based on recent research:

Metric

Measures

Best For

Understands Meaning?

Speed

BLEU

N-gram precision

Machine translation

❌ No

✅ Fast

ROUGE

N-gram recall/overlap

Text summarization

❌ No

✅ Fast

BERTScore

Contextual semantic similarity

Text generation, paraphrase

✅ Yes

⚠️ Slower

Human Judgment Correlation

BERTScore demonstrates significantly stronger alignment with human evaluation compared to traditional metrics in specific task domains, though correlation varies substantially based on the application context. Research published in the ACL 2025 Findings documents quantitative human judgment correlation data across multiple NLP tasks through large-scale human evaluation with rigorous inter-annotator agreement protocols (0.823 Fleiss' kappa).

  • Personalized Long-Form Text Generation: BERTScore achieved 59% alignment with human majority voting in abstract generation, review writing, and topic writing tasks within the LongLaMP benchmark, outperforming BLEU (47%), ROUGE-L (50%), and METEOR (47%) by 12 percentage points.

  • Medical Text Summarization: According to the NPJ Digital Medicine systematic review, BERTScore demonstrated mixed correlation results across clinical tasks—Completeness metrics ranged from Pearson r = 0.28-0.44 and Spearman ρ = 0.15-0.645, while Correctness metrics showed Pearson r = 0.23-0.52 and Spearman ρ = 0.15-0.530. Notably, ROUGE exhibited catastrophic failure modes in certain medical contexts with negative Spearman correlations (ρ = -0.66 to -0.77), demonstrating the metric measured inverse quality.

  • Machine Translation Quality Estimation: The WMT 2025 Shared Task evaluated metrics across 8 language pairs, reporting system-level correlations of 0.759-0.904 for BERTScore and 0.781-0.908 for BLEU, with segment-level correlations of 0.579-0.728 for BERTScore and 0.575-0.720 for BLEU, demonstrating that BLEU remains competitive in translation evaluation despite BERTScore's theoretical semantic advantages.

The research consensus emphasizes that automated metrics should serve as preliminary screening tools paired with human evaluation, not standalone validity measures, with mandatory human evaluation required for medical and clinical applications where automated metric correlations prove insufficient.

According to the ExPerT study published in ACL 2025 Findings, researchers conducted large-scale human evaluation (100 examples, 3 annotators per example, inter-annotator agreement of 0.823 Fleiss' kappa) measuring alignment between automated metrics and human majority voting:

  • BERTScore: 59% alignment with human judgments

  • ROUGE-L: 50% alignment

  • BLEU: 47% alignment

This 12-percentage-point improvement demonstrates BERTScore's advantage when paraphrasing and semantic equivalence matter more than exact wording, as shown in personalized long-form text generation tasks like abstract and review writing.

When to Use Each Metric

Use BERTScore when:

  • Semantic accuracy matters more than exact wording, as demonstrated by 59% human alignment compared to 47-50% for traditional metrics in personalized text generation tasks

  • Evaluating paraphrases or creative text where multiple valid phrasings exist for semantically equivalent content

  • BLEU/ROUGE scores seem unfairly low for good outputs, particularly in open-ended generation scenarios

  • You have computational resources available, acknowledging the 5-10x performance overhead compared to traditional metrics

Use BLEU/ROUGE instead when:

  • Speed is critical (BERTScore requires transformer inference, with 5-10x computational overhead compared to n-gram metrics)

  • Exact terminology matters (legal, medical contexts where traditional metrics show stronger correlation in specific domains)

  • You need quick regression testing in CI/CD pipelines (BLEU/ROUGE enable millisecond-level latency vs transformer inference requirements)

  • Translation benchmarking in established workflows (BLEU achieves system-level correlations of 0.78-0.91, comparable to BERTScore)

The WMT 2025 benchmarks showed that BLEU achieved system-level correlations of 0.781-0.908 in machine translation—comparable to BERTScore (0.759-0.904) in this established domain.

Best Practices for Implementing BERTScore in AI Evals

To maximize BERTScore in your evaluations, follow specific best practices that are well-supported by research and successful implementations. From setting up evaluation pipelines to baseline calibration, let's explore some research-backed best practices.

Setting Up Evaluation Pipelines

Setting up a robust evals pipeline is always recommended when implementing BERTScore. Begin by installing necessary packages such as bert-score, transformers, and torch. These should be run on a CUDA-enabled GPU, as BERT models are computationally intensive. Here is a Python implementation to start with:

from bert_score import score
candidate = ["The quick brown fox jumps over the lazy dog"]
reference = ["A brown fox quickly jumps over a lazy dog"]
P, R, F1 = score(candidate, reference, lang="en")

Configure pipelines to support batch processing and custom model configurations for enhanced scalability and efficiency. Tools like Dask or Spark can be integrated for parallel processing of large datasets, ensuring that the evaluation remains accurate without compromising computation speed.

Framework Integration and Baseline Establishment

Seamlessly integrating BERTScore into existing frameworks requires careful attention to both technical dependencies and evaluation standards:

  • Technical Integration: Cache model weights to avoid repeated downloads, and align batch sizes with your system's memory capabilities.

  • Data Preprocessing: Consistent preprocessing of input texts is vital to maintaining data integrity across evaluations, while avoiding mixed language models in the same evaluation process ensures coherent results.

  • Baseline Configuration: As highlighted in Lukas Heller's article on BERTScore, setting the rescale_with_baseline parameter to True normalizes scores and provides consistency with traditional metrics like BLEU and ROUGE, producing more truthful precision scores across different evaluation scenarios.

  • IDF Weighting Implementation: Incorporating IDF weighting enhances BERTScore's evaluation precision by appropriately weighting rare words that often carry significant semantic value.

This systematic approach to framework integration and baseline establishment ensures that your BERTScore implementations function efficiently and provide evaluations that closely align with human interpretations of text quality. Teams can confidently deploy BERTScore across their evaluation pipelines while maintaining consistent and reliable assessment standards.

Monitoring and Maintenance

To maintain the efficacy of BERTScore, ongoing monitoring is essential. Identifying and addressing potential data errors, as measured by the Data Error Potential metric, helps ensure consistent and accurate evaluations. Establish logging mechanisms to track evaluation progress and manage errors effectively:

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
try:
    scores = scorer.score(candidates, references)
except Exception as e:
    logger.error(f"Scoring failed: {e}")

Keep your libraries and model versions updated regularly to mitigate inconsistencies in scoring.

Common BERTScore Use Cases

BERTScore's semantic evaluation capabilities make it valuable across a wide range of NLP applications. Here are the most common production use cases where teams leverage BERTScore for quality assessment.

Machine Translation Evaluation

BERTScore has been adopted as an official evaluation metric in SemEval-2025 Task 2: Entity-Aware Machine Translation, representing current academic best practices for translation quality assessment. Research published in Nature Scientific Reports demonstrates hybrid approaches combining BERTScore with SVM classifiers for domain-specific translation assessment.

Text Summarization Assessment

BERTScore excels in evaluating text summarization tasks where capturing semantic fidelity matters more than exact word overlap. Traditional metrics like ROUGE often penalize summaries that use different phrasing to convey the same meaning, while BERTScore recognizes that "The company reported strong quarterly earnings" and "The firm announced robust financial results for the quarter" express equivalent information. 

This semantic understanding makes BERTScore particularly valuable for abstractive summarization, where models generate novel phrasing rather than extracting verbatim text. However, practitioners should note that domain-specific applications require careful validation—medical summarization contexts show weaker correlations with human judgment, emphasizing the need to combine BERTScore with domain-expert evaluation in specialized fields.

Read our case study with Magid to learn more about how Galileo empowers newsrooms and other content-forward organizations worldwide. 

LLM Content Quality Measurement

Research published in Nature Scientific Reports evaluated five state-of-the-art LLMs (LLaMA-3 70B, Gemini 2.5 Pro, DeepSeek R1, GPT Omni 3.0, Grok 3) for generating software task flows, using BERTScore alongside SBERT, USE, and hybrid metrics for measuring semantic similarity between human-annotated and AI-generated content. 

This implementation demonstrates how production systems combine BERTScore with complementary semantic and domain-specific metrics within comprehensive evaluation frameworks rather than deploying it in isolation.

Implementing BERTScore for Production AI Evaluation

BERTScore represents a significant advancement over traditional metrics by capturing semantic similarity rather than surface-level word matching. For teams evaluating text generation, translation, or summarization systems, it provides scores that align more closely with human judgment, particularly in personalized content generation and creative writing where multiple valid phrasings exist. 

Galileo is a cutting-edge evaluation and observability platform designed to empower developers to improve their AI apps that works with all major LLM providers. While BERTScore isn't a built-in metric, Galileo provides several offerings to integrate it into your evaluation workflows:

  • Custom code-based metrics: Galileo supports custom code-based metrics that let you implement BERTScore as a scorer function. You can define specific evaluation criteria for your LLM applications using either registered custom metrics (shared organization-wide) or local metrics (running in your notebook environment). 

  • Flexible execution environment: For registered custom metrics, Galileo runs them in a sandbox Python 3.10 environment, where you can install PyPI packages like bert-score using the uv script dependency format, giving you access to the BERTScore library directly in Galileo's platform. 

  • Local metrics for full library access: Local metrics let you use any library or custom Python code, including calling out to LLMs or other APIs, making it easy to run computationally intensive BERTScore evaluations using your local GPU resources. 

  • Integration with experiments and log streams: Custom BERTScore metrics can be used with experiments and Log streams, allowing you to track semantic similarity scores alongside Galileo's out-of-the-box metrics like hallucination detection, context adherence, and response quality. 

  • Aggregation and reporting: Galileo's custom metrics framework includes an aggregator function that aggregates individual scores into summary metrics across experiments, so you can compute average BERTScore, precision, recall, and F1 across your entire evaluation dataset. 

Book a demo to see how Galileo's custom metrics framework can help you implement BERTScore and other semantic evaluation metrics at scale.

FAQs

What is BERTScore and how does it work?

BERTScore is an NLP evaluation metric that uses BERT's contextual embeddings to measure semantic similarity between generated and reference texts. Unlike BLEU or ROUGE, it captures meaning rather than exact word matches, calculating precision, recall, and F1 scores based on cosine similarity between token embeddings. According to the ExPerT study published in ACL 2025 Findings, BERTScore achieves 59% alignment with human judgments compared to 47% for BLEU.

How do I calculate BERTScore in Python?

Install the bert-score library with pip install bert-score, then use from bert_score import score followed by P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True). For production systems processing multiple evaluations, use the BERTScorer object to cache the model and improve efficiency. According to the official documentation, this caching is critical for computational efficiency when performing multiple evaluations in production environments. Run on a CUDA-enabled GPU for faster computation, though be aware of the computational overhead—BERTScore requires 5-10x more processing time compared to traditional metrics like BLEU and ROUGE.

What is a good BERTScore F1 value?

According to authoritative documentation, specific F1 score ranges for different NLP tasks are not published in official BERTScore resources or peer-reviewed research. The official baseline rescaling documentation focuses on improving score "readability" through rescaling but does not establish absolute performance thresholds. Instead, AI/ML engineers should establish task-specific baselines empirically through validation sets and human evaluation correlation studies rather than relying on universal score benchmarks that do not exist in authoritative sources.

Authoritative documentation does not provide specific F1 score ranges or universal thresholds for different NLP tasks because performance varies significantly by domain and task. Always use rescale_with_baseline=True for more interpretable scores. Rather than targeting absolute values, establish empirical baselines through validation sets with human evaluation correlation, then track relative improvements across model iterations.

How do IDF weighting and baseline rescaling enhance BERTScore results?

IDF weighting prioritizes rare and informative tokens in the text, reducing the influence of common words that provide minimal semantic value. By emphasizing these rare words, the evaluation captures more meaningful context, improving overall precision and recall scores. Enable IDF weighting by setting idf=True in the Python API. Baseline rescaling normalizes raw BERTScore values, making them more comparable across different evaluation scenarios and adjusting scores to a more interpretable scale. This is particularly useful in production environments where clear, comparable metrics are essential for decision-making—use rescale_with_baseline=True by default.

Evaluating the nuanced outputs delivered by large language models (LLMs) has long been a significant challenge in artificial intelligence. Traditional n-gram-based metrics such as BLEU and ROUGE often struggle to capture the semantic depth and context of human language, which is essential for tasks like machine translation and text summarization.

Models like BLEU and ROUGE rely on n-gram matching, which frequently fails to align with human judgments. BERTScore addresses this limitation by leveraging pre-trained contextual embeddings from transformer models (such as BERT, RoBERTa, and DeBERTa) to enable context-aware semantic similarity evaluation. 

This article explores BERTScore's methodology, its advantages over conventional metrics, and how to implement it effectively in production AI systems.

TLDR:

  • BERTScore evaluates text similarity using contextual embeddings from transformer models like BERT, RoBERTa, and DeBERTa, not word matching.

  • It calculates precision, recall, and F1 using cosine similarity between contextual embeddings, with optional IDF weighting and baseline rescaling for improved interpretability.

  • BERTScore demonstrates superior correlation with human judgment in semantic tasks like personalized text generation (59% vs 47-50% for BLEU/ROUGE), but performance is highly domain-dependent—BLEU remains competitive in machine translation (0.78-0.91 correlation), and correlations are weak in medical NLP contexts.

  • Use it for open-ended generation, summarization, and text generation evaluation, but always combine with traditional metrics and human evaluation in production systems.

What is BERTScore?

BERTScore is a semantic evaluation metric for natural language processing (NLP) that transcends surface-level word overlap by leveraging contextual embeddings generated by transformer models. This metric computes token-level cosine similarities between contextual embeddings and aggregates them into precision, recall, and F1 scores.

Unlike traditional n-gram-based metrics such as BLEU or ROUGE, BERTScore leverages pre-trained transformer models (BERT, RoBERTa, XLNet) to generate context-dependent token representations. 

This means it can recognize that "The cat sits on the mat" and "A feline rests upon a rug" convey the same meaning—even though they share no common words—because it understands semantic similarity rather than relying on exact word matches. This shift from quantity-based matching to quality-based understanding represents a fundamental advancement in how we evaluate AI-generated text.

The metric supports multiple pre-trained models including BERT, RoBERTa, and DeBERTa, with the official repository now recommending microsoft/deberta-xlarge-mnli for highest accuracy or microsoft/deberta-large-mnli for speed-optimized deployments, replacing the older roberta-large default.

Master LLM-as-a-Judge evaluation to ensure quality, catch failures, and build reliable AI apps

How BERTScore Works: Step-by-Step

Understanding BERTScore's calculation process helps practitioners configure it effectively and interpret results accurately.

1. Tokenization and Embedding Generation

Both candidate and reference texts are tokenized and passed through a pre-trained transformer model (such as BERT, RoBERTa, or XLNet). Each token receives a contextual embedding based on its surrounding context rather than a static word embedding—meaning the word "bank" gets different representations in "river bank" versus "bank account." 

This context-dependent approach represents the fundamental innovation that distinguishes BERTScore from traditional n-gram-based metrics like BLEU or ROUGE, enabling the metric to recognize semantic equivalence between paraphrases and synonyms that would be missed by exact string matching.

According to the official implementation, practitioners can specify which model to use via the model_type parameter and select specific transformer layers using num_layers.

2. Cosine Similarity Calculation

BERTScore computes pairwise cosine similarities between all tokens in the candidate sentence and all tokens in the reference sentence. This produces a similarity matrix where each element represents the contextual similarity between a token pair.

3. Precision, Recall, and F1 Scores

The metric calculates three values through greedy matching: precision (the average of maximum cosine similarities between candidate tokens and reference tokens), recall (the average of maximum similarities between reference tokens and candidate tokens), and F1 score (the harmonic mean of precision and recall).

  1. Precision: For each token in the candidate sentence, identify the token in the reference sentence with the maximum cosine similarity. Precision is the average of these maximum similarity scores across all candidate tokens.

  2. Recall: For each token in the reference sentence, find the most similar token in the candidate sentence. Recall is the average of these maximum similarities across all reference tokens.

  3. F1 Score: Computed as the harmonic mean of precision and recall, providing a balanced measure of overall semantic similarity.

4. IDF Weighting and Baseline Rescaling

Two optional enhancements improve BERTScore's effectiveness: IDF weighting, which emphasizes rare, informative tokens by weighting each token's cosine similarity contribution by its inverse document frequency value, and baseline rescaling for interpretability, which normalizes scores relative to baseline performance to make them more comparable across different evaluation scenarios.

IDF Weighting: Inverse document frequency weighting emphasizes rare, informative tokens over common words. Enable this by setting idf=True in the Python API.

Baseline Rescaling: Raw BERTScore values can be difficult to interpret. Rescaling normalizes scores relative to baseline performance, making them more comparable and "human-readable." The rescaling feature does not affect BERTScore's correlation with human judgment, as measured by Pearson's r and Kendall's τ coefficients. For production deployments, set rescale_with_baseline=True by default.

BERTScore vs BLEU vs ROUGE: When to Use Each

Choosing the right evaluation metric depends on your specific task and requirements. Here's how these metrics compare based on recent research:

Metric

Measures

Best For

Understands Meaning?

Speed

BLEU

N-gram precision

Machine translation

❌ No

✅ Fast

ROUGE

N-gram recall/overlap

Text summarization

❌ No

✅ Fast

BERTScore

Contextual semantic similarity

Text generation, paraphrase

✅ Yes

⚠️ Slower

Human Judgment Correlation

BERTScore demonstrates significantly stronger alignment with human evaluation compared to traditional metrics in specific task domains, though correlation varies substantially based on the application context. Research published in the ACL 2025 Findings documents quantitative human judgment correlation data across multiple NLP tasks through large-scale human evaluation with rigorous inter-annotator agreement protocols (0.823 Fleiss' kappa).

  • Personalized Long-Form Text Generation: BERTScore achieved 59% alignment with human majority voting in abstract generation, review writing, and topic writing tasks within the LongLaMP benchmark, outperforming BLEU (47%), ROUGE-L (50%), and METEOR (47%) by 12 percentage points.

  • Medical Text Summarization: According to the NPJ Digital Medicine systematic review, BERTScore demonstrated mixed correlation results across clinical tasks—Completeness metrics ranged from Pearson r = 0.28-0.44 and Spearman ρ = 0.15-0.645, while Correctness metrics showed Pearson r = 0.23-0.52 and Spearman ρ = 0.15-0.530. Notably, ROUGE exhibited catastrophic failure modes in certain medical contexts with negative Spearman correlations (ρ = -0.66 to -0.77), demonstrating the metric measured inverse quality.

  • Machine Translation Quality Estimation: The WMT 2025 Shared Task evaluated metrics across 8 language pairs, reporting system-level correlations of 0.759-0.904 for BERTScore and 0.781-0.908 for BLEU, with segment-level correlations of 0.579-0.728 for BERTScore and 0.575-0.720 for BLEU, demonstrating that BLEU remains competitive in translation evaluation despite BERTScore's theoretical semantic advantages.

The research consensus emphasizes that automated metrics should serve as preliminary screening tools paired with human evaluation, not standalone validity measures, with mandatory human evaluation required for medical and clinical applications where automated metric correlations prove insufficient.

According to the ExPerT study published in ACL 2025 Findings, researchers conducted large-scale human evaluation (100 examples, 3 annotators per example, inter-annotator agreement of 0.823 Fleiss' kappa) measuring alignment between automated metrics and human majority voting:

  • BERTScore: 59% alignment with human judgments

  • ROUGE-L: 50% alignment

  • BLEU: 47% alignment

This 12-percentage-point improvement demonstrates BERTScore's advantage when paraphrasing and semantic equivalence matter more than exact wording, as shown in personalized long-form text generation tasks like abstract and review writing.

When to Use Each Metric

Use BERTScore when:

  • Semantic accuracy matters more than exact wording, as demonstrated by 59% human alignment compared to 47-50% for traditional metrics in personalized text generation tasks

  • Evaluating paraphrases or creative text where multiple valid phrasings exist for semantically equivalent content

  • BLEU/ROUGE scores seem unfairly low for good outputs, particularly in open-ended generation scenarios

  • You have computational resources available, acknowledging the 5-10x performance overhead compared to traditional metrics

Use BLEU/ROUGE instead when:

  • Speed is critical (BERTScore requires transformer inference, with 5-10x computational overhead compared to n-gram metrics)

  • Exact terminology matters (legal, medical contexts where traditional metrics show stronger correlation in specific domains)

  • You need quick regression testing in CI/CD pipelines (BLEU/ROUGE enable millisecond-level latency vs transformer inference requirements)

  • Translation benchmarking in established workflows (BLEU achieves system-level correlations of 0.78-0.91, comparable to BERTScore)

The WMT 2025 benchmarks showed that BLEU achieved system-level correlations of 0.781-0.908 in machine translation—comparable to BERTScore (0.759-0.904) in this established domain.

Best Practices for Implementing BERTScore in AI Evals

To maximize BERTScore in your evaluations, follow specific best practices that are well-supported by research and successful implementations. From setting up evaluation pipelines to baseline calibration, let's explore some research-backed best practices.

Setting Up Evaluation Pipelines

Setting up a robust evals pipeline is always recommended when implementing BERTScore. Begin by installing necessary packages such as bert-score, transformers, and torch. These should be run on a CUDA-enabled GPU, as BERT models are computationally intensive. Here is a Python implementation to start with:

from bert_score import score
candidate = ["The quick brown fox jumps over the lazy dog"]
reference = ["A brown fox quickly jumps over a lazy dog"]
P, R, F1 = score(candidate, reference, lang="en")

Configure pipelines to support batch processing and custom model configurations for enhanced scalability and efficiency. Tools like Dask or Spark can be integrated for parallel processing of large datasets, ensuring that the evaluation remains accurate without compromising computation speed.

Framework Integration and Baseline Establishment

Seamlessly integrating BERTScore into existing frameworks requires careful attention to both technical dependencies and evaluation standards:

  • Technical Integration: Cache model weights to avoid repeated downloads, and align batch sizes with your system's memory capabilities.

  • Data Preprocessing: Consistent preprocessing of input texts is vital to maintaining data integrity across evaluations, while avoiding mixed language models in the same evaluation process ensures coherent results.

  • Baseline Configuration: As highlighted in Lukas Heller's article on BERTScore, setting the rescale_with_baseline parameter to True normalizes scores and provides consistency with traditional metrics like BLEU and ROUGE, producing more truthful precision scores across different evaluation scenarios.

  • IDF Weighting Implementation: Incorporating IDF weighting enhances BERTScore's evaluation precision by appropriately weighting rare words that often carry significant semantic value.

This systematic approach to framework integration and baseline establishment ensures that your BERTScore implementations function efficiently and provide evaluations that closely align with human interpretations of text quality. Teams can confidently deploy BERTScore across their evaluation pipelines while maintaining consistent and reliable assessment standards.

Monitoring and Maintenance

To maintain the efficacy of BERTScore, ongoing monitoring is essential. Identifying and addressing potential data errors, as measured by the Data Error Potential metric, helps ensure consistent and accurate evaluations. Establish logging mechanisms to track evaluation progress and manage errors effectively:

import logging
from bert_score import BERTScorer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Reuse a single scorer so model weights load only once
scorer = BERTScorer(lang="en", rescale_with_baseline=True)

try:
    # candidates and references are lists of strings prepared earlier in the pipeline
    P, R, F1 = scorer.score(candidates, references)
except Exception as e:
    logger.error(f"Scoring failed: {e}")

Keep your libraries and model versions updated regularly to mitigate inconsistencies in scoring.

Common BERTScore Use Cases

BERTScore's semantic evaluation capabilities make it valuable across a wide range of NLP applications. Here are the most common production use cases where teams leverage BERTScore for quality assessment.

Machine Translation Evaluation

BERTScore has been adopted as an official evaluation metric in SemEval-2025 Task 2: Entity-Aware Machine Translation, representing current academic best practices for translation quality assessment. Research published in Nature Scientific Reports demonstrates hybrid approaches combining BERTScore with SVM classifiers for domain-specific translation assessment.

Text Summarization Assessment

BERTScore excels in evaluating text summarization tasks where capturing semantic fidelity matters more than exact word overlap. Traditional metrics like ROUGE often penalize summaries that use different phrasing to convey the same meaning, while BERTScore recognizes that "The company reported strong quarterly earnings" and "The firm announced robust financial results for the quarter" express equivalent information. 

This semantic understanding makes BERTScore particularly valuable for abstractive summarization, where models generate novel phrasing rather than extracting verbatim text. However, practitioners should note that domain-specific applications require careful validation—medical summarization contexts show weaker correlations with human judgment, emphasizing the need to combine BERTScore with domain-expert evaluation in specialized fields.

Read our case study with Magid to learn more about how Galileo empowers newsrooms and other content-forward organizations worldwide. 

LLM Content Quality Measurement

Research published in Nature Scientific Reports evaluated five state-of-the-art LLMs (LLaMA-3 70B, Gemini 2.5 Pro, DeepSeek R1, GPT Omni 3.0, Grok 3) for generating software task flows, using BERTScore alongside SBERT, USE, and hybrid metrics for measuring semantic similarity between human-annotated and AI-generated content. 

This implementation demonstrates how production systems combine BERTScore with complementary semantic and domain-specific metrics within comprehensive evaluation frameworks rather than deploying it in isolation.

Implementing BERTScore for Production AI Evaluation

BERTScore represents a significant advancement over traditional metrics by capturing semantic similarity rather than surface-level word matching. For teams evaluating text generation, translation, or summarization systems, it provides scores that align more closely with human judgment, particularly in personalized content generation and creative writing where multiple valid phrasings exist. 

Galileo is a cutting-edge evaluation and observability platform that empowers developers to improve their AI applications and works with all major LLM providers. While BERTScore isn't a built-in metric, Galileo provides several ways to integrate it into your evaluation workflows:

  • Custom code-based metrics: Galileo supports custom code-based metrics that let you implement BERTScore as a scorer function (a rough sketch of this pattern follows this list). You can define specific evaluation criteria for your LLM applications using either registered custom metrics (shared organization-wide) or local metrics (running in your notebook environment). 

  • Flexible execution environment: For registered custom metrics, Galileo runs them in a sandbox Python 3.10 environment, where you can install PyPI packages like bert-score using the uv script dependency format, giving you access to the BERTScore library directly in Galileo's platform. 

  • Local metrics for full library access: Local metrics let you use any library or custom Python code, including calling out to LLMs or other APIs, making it easy to run computationally intensive BERTScore evaluations using your local GPU resources. 

  • Integration with experiments and log streams: Custom BERTScore metrics can be used with experiments and Log streams, allowing you to track semantic similarity scores alongside Galileo's out-of-the-box metrics like hallucination detection, context adherence, and response quality. 

  • Aggregation and reporting: Galileo's custom metrics framework includes an aggregator function that rolls individual scores up into summary metrics across experiments, so you can compute average BERTScore precision, recall, and F1 across your entire evaluation dataset. 
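As a rough, framework-agnostic sketch of that pattern, the functions below pair a per-response BERTScore F1 scorer with a simple aggregator. The function names and signatures are hypothetical and would need to be adapted to the actual custom-metric interface you register them with:

from statistics import mean

from bert_score import BERTScorer

# Hypothetical scorer/aggregator pair; names and signatures are illustrative only
_scorer = BERTScorer(lang="en", rescale_with_baseline=True)

def bertscore_f1(response: str, reference: str) -> float:
    """Score a single model response against its reference text."""
    _, _, f1 = _scorer.score([response], [reference])
    return f1.item()

def aggregate_bertscore(scores: list[float]) -> dict:
    """Roll per-example scores up into a run-level summary."""
    return {"mean_bertscore_f1": mean(scores)}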

Book a demo to see how Galileo's custom metrics framework can help you implement BERTScore and other semantic evaluation metrics at scale.

FAQs

What is BERTScore and how does it work?

BERTScore is an NLP evaluation metric that uses BERT's contextual embeddings to measure semantic similarity between generated and reference texts. Unlike BLEU or ROUGE, it captures meaning rather than exact word matches, calculating precision, recall, and F1 scores based on cosine similarity between token embeddings. According to the ExPerT study published in ACL 2025 Findings, BERTScore achieves 59% alignment with human judgments compared to 47% for BLEU.
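As a rough illustration of the underlying arithmetic (using random stand-in embeddings rather than real transformer outputs), greedy matching over a token-level cosine similarity matrix works like this:

import torch
import torch.nn.functional as F

# Stand-in contextual embeddings: 4 candidate tokens and 5 reference tokens,
# L2-normalized so dot products equal cosine similarities
cand = F.normalize(torch.randn(4, 768), dim=-1)
ref = F.normalize(torch.randn(5, 768), dim=-1)

sim = cand @ ref.T                          # pairwise cosine similarity matrix
precision = sim.max(dim=1).values.mean()    # best reference match per candidate token
recall = sim.max(dim=0).values.mean()       # best candidate match per reference token
f1 = 2 * precision * recall / (precision + recall)

In the real metric, the embeddings come from a pre-trained transformer, and each token's contribution can additionally be weighted by its IDF value.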

How do I calculate BERTScore in Python?

Install the bert-score library with pip install bert-score, then use from bert_score import score followed by P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True). For production systems processing multiple evaluations, use the BERTScorer object to cache the model and improve efficiency. According to the official documentation, this caching is critical for computational efficiency when performing multiple evaluations in production environments. Run on a CUDA-enabled GPU for faster computation, though be aware of the computational overhead—BERTScore requires 5-10x more processing time compared to traditional metrics like BLEU and ROUGE.

What is a good BERTScore F1 value?

Official BERTScore documentation and peer-reviewed research do not publish specific F1 ranges or universal thresholds for different NLP tasks, because scores vary significantly by domain, task, and underlying model. The official baseline rescaling documentation focuses on improving score readability, not on establishing absolute performance thresholds. Always use rescale_with_baseline=True for more interpretable scores, then establish task-specific baselines empirically through validation sets and human evaluation correlation studies, and track relative improvements across model iterations rather than targeting absolute values.

How do IDF weighting and baseline rescaling enhance BERTScore results?

IDF weighting prioritizes rare and informative tokens in the text, reducing the influence of common words that provide minimal semantic value. By emphasizing these rare words, the evaluation captures more meaningful context, improving overall precision and recall scores. Enable IDF weighting by setting idf=True in the Python API. Baseline rescaling normalizes raw BERTScore values, making them more comparable across different evaluation scenarios and adjusting scores to a more interpretable scale. This is particularly useful in production environments where clear, comparable metrics are essential for decision-making—use rescale_with_baseline=True by default.
