Evaluating AI Applications: Understanding the Semantic Textual Similarity (STS) Metric

Conor Bronsdon, Head of Developer Awareness
6 min read · March 26, 2025

Imagine a customer service AI that flags the messages "My account was compromised yesterday" and "Someone hacked into my account" as completely unrelated tickets, routing them to different departments despite their identical meaning.

Meanwhile, it incorrectly groups "I need to close my bank account" and "I need to close my account to stop fraudulent charges" as identical issues, missing the critical context of fraud that requires urgent attention. While basic algorithms can identify exact word matches, they miss the deeper semantic connections that humans naturally understand.

Enter Semantic Textual Similarity (STS), a metric that evaluates how closely two texts align in meaning rather than merely matching words. This article explores the mathematical foundations of STS, its implementation approaches, and how to select the right method for your specific AI application needs.

What is the Semantic Textual Similarity (STS) Metric?

The Semantic Textual Similarity (STS) metric measures how closely two texts match in meaning, playing a vital role in Natural Language Processing (NLP). Unlike traditional methods that focus on word overlap or syntax, the STS metric delves into the intent behind phrases and the nuanced concepts they convey. This allows it to capture subtle variations beyond simple word matching, leading to a richer understanding of text.

At its core, the STS metric relies on contextual interpretation. Rather than just comparing surface details, it evaluates whether two texts express the same ideas. For example, consider the sentences "A large cat sat on the mat" and "A big feline lounged on the carpet."

Although the words differ, the STS metric recognizes their similar meanings by understanding synonyms and shared contexts. Over the years, the STS metric has evolved from basic statistical methods to advanced deep learning techniques, reflecting the progression from traditional NLP models to LLMs.

The fundamental calculations behind STS metrics involve representing text as vectors and measuring their similarity through various mathematical methods. But what does "representing text as vectors" actually mean?

In simple terms, it's about converting words and sentences into numbers that computers can process. Imagine each word or text being transformed into a set of coordinates in a multi-dimensional space – similar to how we plot points on a graph, but with many more dimensions. In this space, texts with similar meanings end up positioned close together, while unrelated texts appear far apart.

When measuring similarity, the system calculates how close these points are to each other. The closer they are, the more similar their meanings. This is comparable to how cities geographically near each other on a map are likely to share similar weather patterns. The different STS methods we'll explore vary in how they create these text vectors and calculate the distances between them, each with its own balance of simplicity, speed, and accuracy.


STS Metric Implementation Method #1: Vector Space Models and Cosine Similarity

At its mathematical core, cosine similarity measures the angle between two vectors in a multi-dimensional space:

  • Cosine Similarity = (A · B) / (||A|| × ||B||)

Where:

  • A · B represents the dot product of vectors A and B
  • ||A|| and ||B|| are the Euclidean norms (magnitudes) of vectors A and B

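This calculation yields values between -1 and 1, where:

  • 1 indicates the vectors point in the same direction (maximum similarity)
  • 0 indicates the vectors are orthogonal (no similarity)
  • -1 indicates the vectors point in opposite directions (this cannot occur with non-negative representations such as TF-IDF, where scores fall between 0 and 1)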
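
The formula and its three reference values can be sketched in plain Python. This is a minimal illustration using only the standard library; the function name `cosine` and the toy vectors are assumptions made for this example, not part of any library discussed here.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined for a zero vector; returning 0.0 is a
        # convention chosen for this sketch.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors chosen to hit the three reference cases.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])        # 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # close to -1
print(same_direction, orthogonal, opposite)
```

Note that the first pair differs in length but not direction, which is why cosine similarity is insensitive to vector magnitude.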

For example, consider two simple sentences:

  • Sentence 1: "The movie was excellent"
  • Sentence 2: "The film was great"

If we represent these as TF-IDF vectors in a simplified 5-dimensional space:

  • Sentence 1 vector: [0.2, 0.5, 0, 0.8, 0]
  • Sentence 2 vector: [0.2, 0, 0.5, 0, 0.8]

Cosine similarity calculation:

  • Dot product = (0.2 × 0.2) + (0.5 × 0) + (0 × 0.5) + (0.8 × 0) + (0 × 0.8) = 0.04
  • ||Sentence 1|| = √(0.2² + 0.5² + 0² + 0.8² + 0²) = √(0.93) ≈ 0.964
  • ||Sentence 2|| = √(0.2² + 0² + 0.5² + 0² + 0.8²) = √(0.93) ≈ 0.964
  • Cosine Similarity = 0.04 / (0.964 × 0.964) = 0.04 / 0.93 ≈ 0.043

This low score reflects the limitation of simple vector models that don't capture the semantic relationship between "movie" and "film" or "excellent" and "great."

The vector space approach can be implemented using scikit-learn's TF-IDF vectorizer and cosine similarity functions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vector_space_similarity(text1, text2):
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Fit and transform texts
    tfidf_matrix = vectorizer.fit_transform([text1, text2])

    # Calculate cosine similarity
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
    return similarity

# Example usage
similarity_score = vector_space_similarity(
    "The movie was excellent",
    "The film was great"
)
print(f"Similarity score: {similarity_score:.4f}")

This implementation transforms text into sparse vectors where each dimension represents a term in the vocabulary. The approach is computationally efficient but limited in capturing semantic relationships between different words.

STS Metric Implementation Method #2: Word Embedding Aggregation

When using word embeddings like Word2Vec or GloVe, an effective approach is to:

  1. Convert each word to its corresponding vector (typically 100-300 dimensions)
  2. Aggregate these vectors (often by averaging) to create a single document vector
  3. Apply cosine similarity between document vectors

The mathematical formula for creating a document vector via averaging:

  • Doc_Vector = (1/n) × Σ(Word_Vector_i)

Where n is the number of words in the document.

For example, using the same sentences and pretrained 300-dimensional word embeddings:

  • "The" might be represented as [0.1, 0.2, ..., -0.3]
  • "movie" might be represented as [0.5, -0.1, ..., 0.7]

After averaging all word vectors for each sentence, we'd calculate cosine similarity between these aggregate vectors. Using word embeddings, our two sentences might yield a similarity score of 0.72, much higher than the previous method because these embeddings capture that "movie" and "film" are semantically related.
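
The averaging step can be sketched with a handful of made-up vectors. Everything in `TOY_EMBEDDINGS` is invented for illustration (real Word2Vec or GloVe vectors are learned from data and have hundreds of dimensions), so the resulting score only demonstrates the mechanics, not a realistic value. The helper names are likewise assumptions for this example.

```python
import math

# Made-up 3-dimensional vectors for illustration only.
TOY_EMBEDDINGS = {
    "the": [0.1, 0.1, 0.1],
    "was": [0.1, 0.2, 0.1],
    "movie": [0.9, 0.1, 0.2],
    "film": [0.7, 0.2, 0.3],
    "excellent": [0.1, 0.9, 0.3],
    "great": [0.2, 0.8, 0.3],
}

def document_vector(text):
    """Average the vectors of all known words: (1/n) * sum(word vectors)."""
    vectors = [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in text")
    n = len(vectors)
    # zip(*vectors) groups the values of each dimension together.
    return [sum(component) / n for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc1 = document_vector("The movie was excellent")
doc2 = document_vector("The film was great")
score = cosine(doc1, doc2)
print(f"Similarity score: {score:.4f}")
```

Because "movie"/"film" and "excellent"/"great" were given nearby toy vectors, the averaged document vectors end up close together even though the sentences share only two words.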

Word embedding aggregation can be implemented using pre-trained models from libraries like spaCy:

import spacy

# Load pre-trained word vectors once and reuse them across calls
# (install the model with: python -m spacy download en_core_web_md)
nlp = spacy.load('en_core_web_md')

def word_embedding_similarity(text1, text2):
    # Process texts
    doc1 = nlp(text1)
    doc2 = nlp(text2)

    # Empty documents have no vector to compare
    if len(doc1) == 0 or len(doc2) == 0:
        return 0.0

    # Use spaCy's built-in similarity, which computes cosine similarity
    # on document vectors (averages of word vectors)
    return doc1.similarity(doc2)

# Example usage
similarity_score = word_embedding_similarity(
    "The movie was excellent",
    "The film was great"
)
print(f"Similarity score: {similarity_score:.4f}")

This method leverages pre-trained word vectors to capture semantic relationships, offering a good balance between computational efficiency and semantic understanding. It works well for general-purpose text but may require domain adaptation for specialized content.

When selecting embedding models, it's important to consider factors such as vocabulary size, domain specificity, and computational resources.

STS Metric Implementation Method #3: Transformer-Based Similarity Calculations

The breakthrough in STS came with transformer models like BERT, which process text contextually:

  • Similarity_Score = CosineSimilarity(BERT_Encoder(Text_A), BERT_Encoder(Text_B))

BERT's contextual encoding captures nuanced meaning through two mechanisms.

First is the self-attention mechanism, which weighs the importance of each word relative to all others in the sentence.

  • Attention(Q, K, V) = softmax(QK^T/√d_k)V

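Where Q (query), K (key), and V (value) are learned linear projections of the input, and d_k is the dimensionality of the key vectors. Dividing by √d_k keeps the dot products from growing too large before the softmax is applied.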
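
To make the formula concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It covers a single attention head on tiny hand-written matrices; the Q, K, and V values are arbitrary numbers chosen for illustration, whereas in a real model they come from learned projections. The function names are assumptions for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small matrices given as lists of rows."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

# Three tokens with d_k = 2; the numbers are arbitrary.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output row is a blend of the value vectors, weighted by how strongly that token's query matches each key.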

Second is bidirectional context: BERT attends to the words on both sides of each token at once, allowing it to interpret every word in the context of the full sentence.

For idiomatic expressions like "burn the midnight oil" vs. "pull an all-nighter," traditional methods might miss the similarity, seeing only different words. BERT, however, encodes the contextual meaning and might give these phrases a similarity score of 0.85, recognizing they both refer to working late despite using completely different words.

Similarly, BERT would understand that "The bank approved my loan" and "The financial institution accepted my credit application" are highly similar (perhaps 0.88), even though they share few words, because it has learned the contextual relationship between "bank" and "financial institution," and between "approved loan" and "accepted credit application."

This contextual understanding gives transformer models a significant advantage when evaluating semantically complex text pairs, especially those containing:

  • Polysemy (words with multiple meanings)
  • Synonymy at the phrase level
  • Idiomatic expressions
  • Complex syntactic structures that alter meaning

Additionally, evaluation metrics like BERTScore leverage these transformer models to provide more accurate text similarity measures.

Transformer models can be implemented using the Sentence-Transformers library, which simplifies working with models like BERT:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained Sentence-BERT model once and reuse it across calls
model = SentenceTransformer('all-MiniLM-L6-v2')

def transformer_similarity(text1, text2):
    # Encode both sentences in a single batch to get their embeddings
    embeddings = model.encode([text1, text2])

    # Calculate cosine similarity
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return similarity

# Example usage
similarity_score = transformer_similarity(
    "The bank approved my loan application yesterday",
    "My credit request was accepted by the financial institution"
)
print(f"Similarity score: {similarity_score:.4f}")

While more resource-intensive, this approach provides state-of-the-art semantic understanding, particularly for complex language phenomena like idioms, context-dependent meanings, and nuanced semantic relationships. Innovations like Retrieval-Augmented Generation can further improve performance by combining transformer models with external knowledge sources.

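When comparing a text against large collections (e.g., millions of documents), computing every similarity one pair at a time is often too slow. Libraries like Facebook AI Similarity Search (FAISS) provide efficient similarity search. The example below uses IndexFlatIP, which performs an exact but heavily optimized brute-force search; for very large collections, FAISS also offers approximate indexes such as IndexIVFFlat and IndexHNSWFlat: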

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Placeholder corpus for illustration; replace with your own documents
documents = [
    "The bank approved my loan application yesterday",
    "My credit request was accepted by the financial institution",
    "The movie was excellent",
]

# Encode the corpus; FAISS expects float32 arrays of shape (num_docs, embedding_dim)
embeddings = np.asarray(model.encode(documents), dtype="float32")

# Normalize vectors in place so that inner product equals cosine similarity
faiss.normalize_L2(embeddings)

# Create inner product index (cosine similarity when vectors are normalized)
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)

# Encode and normalize the query the same way as the corpus
query_embedding = np.asarray(model.encode(["query text"]), dtype="float32")
faiss.normalize_L2(query_embedding)

# Search for the most similar documents (inner product scores, highest first)
k = min(5, len(documents))  # Number of similar documents to retrieve
scores, indices = index.search(query_embedding, k)

Choosing the Right STS Method: Vector Space vs Word Embeddings vs Transformers

When implementing semantic textual similarity for your specific application, choosing the appropriate method depends on several key factors:

| Factor | Method #1: Vector Space | Method #2: Word Embeddings | Method #3: Transformers |
|---|---|---|---|
| Accuracy | Low-Medium | Medium-High | High |
| Resources | Minimal | Moderate | High |
| Speed | Very Fast | Fast | Slow-Moderate |
| Implementation Complexity | Simple | Moderate | Complex |
| Best For | Quick prototyping, resource-constrained environments | Balanced accuracy and performance | Maximum accuracy, complex language understanding |

As a case study, consider a customer support application analyzing ticket similarity:

  • If handling thousands of tickets per minute with limited server resources → Method #2
  • If accuracy is paramount for complex technical content → Method #3
  • If implementing on edge devices with strict memory limitations → Method #1

This evaluation framework helps teams make informed decisions based on their specific constraints and requirements rather than simply defaulting to the newest or most complex approach.

Addressing Common STS Implementation Challenges in AI Evaluation

Let's examine technological solutions to some STS metric implementation challenges.

Ambiguity in Language and Context

Ambiguity arises when words or phrases have multiple meanings depending on the context. Traditional methods like TF-IDF or bag-of-words often overlook these nuances, resulting in inconsistent similarity assessments.

For example, the phrase "bank interest" could refer to financial interest rates or concerns about riverbanks in environmental contexts, each requiring different interpretations.
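
A small sketch shows why surface-level methods stumble here. The two sentences below are invented for illustration, and `bag_of_words` is a hypothetical helper using naive whitespace tokenization rather than any library's tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercased, whitespace-tokenized word counts; no notion of word sense."""
    return Counter(text.lower().split())

financial = bag_of_words("the bank raised its interest rate")
riverside = bag_of_words("the river bank raised flood concerns")

# Tokens that a surface-level method would count as matches.
shared_tokens = sorted(set(financial) & set(riverside))
print(shared_tokens)  # ['bank', 'raised', 'the']
```

A bag-of-words model treats "bank" as the same feature in both sentences, so the overlap inflates the similarity even though one sentence is about finance and the other about a river.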

Modern approaches such as Word Sense Disambiguation (WSD) and context-aware embeddings address this issue more effectively. Models like BERT utilize bidirectional training, considering surrounding words from both sides to enhance context understanding.

Additionally, in advanced AI systems, managing such ambiguities is crucial for accuracy and preventing errors like unintended outputs or hallucinations. Implementing techniques for detecting hallucinations is vital for ensuring the reliability of AI models.

Computational Efficiency in Large-Scale STS Metric Calculations

Computational efficiency is a significant concern when performing STS metric calculations on large-scale datasets. Traditional methods often involve pairwise comparisons between texts, which can become computationally infeasible as the dataset size increases.

To address this, modern approaches employ techniques like Approximate Nearest Neighbor (ANN) search, advanced chunking techniques, and sparse vector representations to reduce processing times while maintaining accuracy. By distributing tasks across multiple processors and optimizing hardware utilization, systems can remain responsive even when handling vast amounts of text data.
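
As one illustration of the ANN idea, the sketch below implements random-hyperplane locality-sensitive hashing in plain Python. This is a teaching example of a single ANN technique, not the algorithm used by FAISS or Galileo; the class name, parameters, and random test vectors are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class RandomHyperplaneIndex:
    """Buckets vectors by the sign pattern of their projections onto random hyperplanes."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = defaultdict(list)
        self.vectors = []

    def _signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in self.planes)

    def add(self, vec):
        self.vectors.append(vec)
        self.buckets[self._signature(vec)].append(len(self.vectors) - 1)

    def search(self, query, k=3):
        # Only vectors in the query's bucket are scored, not the whole collection.
        candidates = self.buckets.get(self._signature(query), [])
        ranked = sorted(candidates, key=lambda i: cosine(query, self.vectors[i]), reverse=True)
        return ranked[:k]

# Random stand-ins for document embeddings.
rng = random.Random(1)
docs = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
index = RandomHyperplaneIndex(dim=16)
for vec in docs:
    index.add(vec)

hits = index.search(docs[42], k=3)
print(hits)
```

Vectors separated by a small angle tend to fall on the same side of most hyperplanes and therefore share a bucket. The trade-off is that a true neighbor landing in a different bucket is missed, which is what makes the search approximate; production systems typically use several hash tables or probe neighboring buckets to reduce such misses.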

Galileo addresses large-scale STS metric challenges by integrating efficient computational strategies and semantic modeling. By leveraging distributed processing frameworks and optimizing resource allocation, Galileo minimizes the trade-off between performance and accuracy.

Enhancing Text Similarity Analysis With Galileo

To advance text similarity analysis, it's essential to utilize tools that extend beyond traditional approaches. Galileo provides a suite of capabilities designed to enhance Semantic Textual Similarity (STS) metric applications:

  • Data Enrichment: By integrating external data sources, Galileo boosts contextual understanding, leading to improved model accuracy in capturing semantic nuances.
  • Advanced Similarity Scoring: Galileo offers sophisticated algorithms that deliver precise similarity comparisons across diverse textual contexts.
  • Anomaly Detection: The platform identifies outliers in text data, ensuring your datasets remain clean and reliable for analysis.
  • Visual Analytics: Galileo provides compelling visualizations that make complex data more accessible, facilitating insightful decision-making.
  • Collaboration Tools: Teams can share data and collaborate in real time within Galileo, promoting efficiency and alignment across projects.

Get started with Galileo's Guardrail Metrics today to ensure your models maintain high-performance standards in production.