Meet Galileo Luna: Evaluation Foundation Models

Low latency, low cost, high accuracy GenAI Evaluation is finally here. No more ask-GPT and painstaking vibe checks.

Since our inception in early 2021, language model evaluations have been at Galileo’s core. Evaluations are a lynchpin to productionizing and ensuring the trustworthy enterprise-wide adoption of generative AI applications. Despite this, many teams still rely on vibe checks and manual ad-hoc human evaluations to evaluate model outputs.

More recently, using LLMs like GPT-4 to evaluate other LLM outputs has become common practice, with much research [1][2] being published on the subject. Last year Galileo published and launched Chainpoll, our own chain-of-thought powered method for LLM-based hallucination evaluations that proved to be more accurate than asking GPT models.

While Chainpoll helped our customers accurately detect hallucinations, we quickly realized that in production applications, evaluations need to be ultra-low-latency and cost-efficient. With LLM-powered evaluations taking multiple seconds to run and costing tens of thousands of dollars at production scale, we needed an entirely new approach for our customers to scale evaluations.

Nearly a year of painstaking R&D later, we are excited to share Galileo Luna – low latency, low cost, high accuracy models for GenAI evaluation. No more painstaking vibe checks and asking GPT. The 5 breakthroughs in GenAI Evaluations with Galileo Luna™

Leading Evaluation Accuracy Benchmarks
Ultra Low-Cost Evaluation
Ultra Low Latency Evaluation
Detect Hallucinations, Security and Data Privacy: No Ground Truth Required!
Built for Customizability: Make Luna your own!

Luna Leads Evaluation Accuracy Benchmarks

Out of the box, when it comes to evaluation accuracy, Galileo Luna has proven to outperform all popular evaluation techniques, including our own Chainpoll methodology. The above graph is for the Context Adherence Luna Evaluation Foundation Model (EFM) which detects hallucinations in RAG-based systems. During testing against popular publicly available datasets that cover multiple industry verticals, Luna proved to be 18% more accurate than evaluating with OpenAI’s GPT-3.5 at detecting hallucinations. We are seeing similar performance for evaluation tasks such as prompt injections, PII detection, and more.

For our customers focused on hallucination prevention at enterprise scale, Luna has become the ‘first line of defense.’ While we recommend humans still remain in the loop, Luna has helped these organizations dramatically improve evaluation speed and accuracy.

Ultra Low-Cost Evaluation

The next major hurdle we looked at was cost. Luna helps AI teams reduce evaluation cost in two ways. First, Luna replaces costly LLM-based evaluations, which for some customers, exceeds $1M per year at production scale. Second, Luna helps teams reduce their reliance on human-in-the-loop evaluations. In our testing, Galileo Luna proved 97% cheaper than OpenAI’s GPT-3.5 when evaluating production traffic!

Millisecond Latencies for Evaluation

The final hurdle we overcame was latency. Low latency is vital for a seamless user experience—imagine waiting 5-seconds for a response from a chatbot! We hold our evaluations to the same rigorous standards, particularly as they become integral to the end-user experience.

To this effect, Luna EFMs have been built to evaluate LLM responses in milliseconds without compromising accuracy. In our tests for hallucination detection, Galileo proved 11x faster than using GPT-3.5. Luna was critical to launching Galileo Protect in early May, which serves as a real-time GenAI firewall, intercepting inputs and responses as they occur.

Detect Hallucinations, Security, and Data Privacy - No Ground Truth Required!

A big bottleneck to evaluations is the need for a test set – a bunch of use case-specific human-generated request-response pairs where the responses can be used as the ‘ground truth’ to compare model responses against. However creating a high-quality test set is often labor-intensive and expensive, typically requiring the involvement of human experts or GPT models.

With Galileo Luna, we’re happy to share that we've eliminated the need for ground truth test sets, allowing users to instantly start evaluating their LLM responses! To do this, we’ve pre-trained our EFMs on evaluation-specific datasets across a variety of domains (more on this in the next section). This innovation not only saves teams considerable time and resources, but also reduces cost significantly.

The Luna EFMs power a host of evaluation tasks (Visit our docs to read more about each of these evaluation tasks), including:

Hallucinations

Context Adherence

RAG Analytics

Chunk Attribution
Chunk Utilization
Context Relevance
Completeness

Security and Privacy

Prompt Injection Attack Detection
PII Detection
Toxicity Detection
Bias Detection

Built for Customizability: Make Luna your own!

Certain industries need extremely precise auto evaluations. For instance, when working with an AI team at a large pharma organization leading new drug discovery, it was clear that hallucinations would be catastrophic for their chances of the drug passing the next phase of clinical trials. Moreover, it was extremely hard to get expert humans to evaluate the model response. By fine-tuning a Luna EFM to the client's specific needs, the Luna EFM was able to detect a class of hallucinations with 95%+ accuracy. Every Luna model can be quickly fine-tuned with customers’ data in their cloud using Galileo’s Fine Tune product to provide an ultra-high accuracy custom evaluation model within minutes!

Developing Luna: Innovations Under the Hood

To bring Luna to life, our Research Team dedicated a year to R&D, rethinking how evaluations are conducted on the Galileo platform. Below are some of the innovations they developed. For a deeper dive, read our paper!

Purpose-Built Small Language Models for Enterprise Evaluation Tasks

Problem: As the size of language models (LLMs) continues to increase, their deployment becomes more computationally and financially demanding, which is not always necessary for specific evaluation tasks.
Solution: We tailored multi-headed Small Language Models to precisely meet the needs of bespoke evaluation tasks, focusing on optimizing them for specific evaluation criteria rather than general application.
Benefit: This customization allows our Evaluation Foundation Models (EFMs) to excel in designated use cases, delivering evaluations that are not only faster and more accurate but also more cost-effective at scale.

Intelligent Chunking Approach for Better Context Length Adaptability

Problem: Traditional methods split long inputs into short pieces for processing, which can separate related content across different segments, hampering effective hallucination detection.
Solution: We employ a dynamic windowing technique that separately splits both the input context and the response into, allowing our model to process every pair and combination of context and response chunks.
Benefit: This method ensures comprehensive validation, as every part of the response is checked against the entire context, significantly improving hallucination detection accuracy.

Multi-task Training for Enhanced Model Calibration

Problem: Traditional single-task training methods can lead to models that excel in one evaluation task but underperform in others.
Solution: Luna EFMs can conduct multiple evaluations—adherence, utilization, and relevance—using a single input, thanks to multi-task training.
Benefit: By doing this, when evaluations are being generated, EFMs can “share” insights and predictions with one another, leading to more robust and accurate evaluations.

Data Augmentation for Support Across a Broad Swath of Use Cases

Problem: Evaluations are not one-size-fits-all. Evaluation requirements in financial services vary significantly from evaluation requirements in consumer goods. Limited data diversity can restrict a model’s ability to generalize across different use cases.
Solution: Each Luna EFM has been trained on large high-quality datasets that span industries and use cases. We enrich our training dataset with both synthetic data generated by LLMs for better domain coverage and data augmentations that mimic transformations used in computer vision.
Benefit: These strategies make each model more robust and flexible, making them more effective and reliable in real-world applications.

Token Level Evaluation for Enhanced Explainability

Problem: Standard approaches may not effectively pinpoint or explain where hallucinations occur within responses.
Solution: Our model classifies each sentence as adherent or non-adherent by comparing it against every piece of the context, ensuring a piece of context supports the response part.
Benefit: This granularity allows us to show users exactly which parts of the response are hallucinated, enhancing transparency and the utility of model outputs.

Galileo Luna in Action

Starting today, Luna is available by default to all Galileo customers at no additional cost! Here are just a few of the Galileo product experiences Luna is already powering.

Intercept harmful chatbot inputs and outputs in real-time

Luna was instrumental in our latest product release, Galileo Protect, which leverages our low-latency EFMs to intercept hallucinations, prompt attacks, security threats, and more in real time. Here, Luna’s ultra-low latency is critical to ensuring a positive end-user experience.

Evaluate your GenAI system in development and production

All automatic evaluation metrics within Galileo Evaluate and Observe are powered by Luna EFMs. Users can leverage Luna EFMs for rapid experimentation and evaluation during the development phase, and seamlessly transition to using these same models for continuous monitoring in production environments.

Improve explainability with evaluation explanations

When it comes to evaluations, it’s easy to wonder “Why did I get that score?” Luna solves this by providing users with evaluation explanations out of the box. This dramatically improves explainability and streamlines root-cause analysis and debugging.

Conclusion

This marks a significant milestone in Galileo’s commitment to helping enterprises productionize GenAI with accurate, efficient, and high-performance evaluations. We are already seeing customers use Luna to launch new product offerings and transform their GenAI operations.

Whether it’s enabling real-time evaluations for chatbots handling 1 million queries per day, preventing malicious prompt injections instantly, or reducing the costs associated with production-grade GenAI, we’re excited to see the innovative solutions our customers will build with Luna Evaluation Foundation Models.

To learn more about the Luna family of Evaluation Foundation Models and see them in action, join us on June 18th for a live webinar or contact us for further information.

Low latency, low cost, high accuracy GenAI Evaluation is finally here. No more ask-GPT and painstaking vibe checks.

Since our inception in early 2021, language model evaluations have been at Galileo’s core. Evaluations are a lynchpin to productionizing and ensuring the trustworthy enterprise-wide adoption of generative AI applications. Despite this, many teams still rely on vibe checks and manual ad-hoc human evaluations to evaluate model outputs.

More recently, using LLMs like GPT-4 to evaluate other LLM outputs has become common practice, with much research [1][2] being published on the subject. Last year Galileo published and launched Chainpoll, our own chain-of-thought powered method for LLM-based hallucination evaluations that proved to be more accurate than asking GPT models.

While Chainpoll helped our customers accurately detect hallucinations, we quickly realized that in production applications, evaluations need to be ultra-low-latency and cost-efficient. With LLM-powered evaluations taking multiple seconds to run and costing tens of thousands of dollars at production scale, we needed an entirely new approach for our customers to scale evaluations.

Nearly a year of painstaking R&D later, we are excited to share Galileo Luna – low latency, low cost, high accuracy models for GenAI evaluation. No more painstaking vibe checks and asking GPT. The 5 breakthroughs in GenAI Evaluations with Galileo Luna™

Leading Evaluation Accuracy Benchmarks
Ultra Low-Cost Evaluation
Ultra Low Latency Evaluation
Detect Hallucinations, Security and Data Privacy: No Ground Truth Required!
Built for Customizability: Make Luna your own!

Luna Leads Evaluation Accuracy Benchmarks

Out of the box, when it comes to evaluation accuracy, Galileo Luna has proven to outperform all popular evaluation techniques, including our own Chainpoll methodology. The above graph is for the Context Adherence Luna Evaluation Foundation Model (EFM) which detects hallucinations in RAG-based systems. During testing against popular publicly available datasets that cover multiple industry verticals, Luna proved to be 18% more accurate than evaluating with OpenAI’s GPT-3.5 at detecting hallucinations. We are seeing similar performance for evaluation tasks such as prompt injections, PII detection, and more.

For our customers focused on hallucination prevention at enterprise scale, Luna has become the ‘first line of defense.’ While we recommend humans still remain in the loop, Luna has helped these organizations dramatically improve evaluation speed and accuracy.

Ultra Low-Cost Evaluation

The next major hurdle we looked at was cost. Luna helps AI teams reduce evaluation cost in two ways. First, Luna replaces costly LLM-based evaluations, which for some customers, exceeds $1M per year at production scale. Second, Luna helps teams reduce their reliance on human-in-the-loop evaluations. In our testing, Galileo Luna proved 97% cheaper than OpenAI’s GPT-3.5 when evaluating production traffic!

Millisecond Latencies for Evaluation

The final hurdle we overcame was latency. Low latency is vital for a seamless user experience—imagine waiting 5-seconds for a response from a chatbot! We hold our evaluations to the same rigorous standards, particularly as they become integral to the end-user experience.

To this effect, Luna EFMs have been built to evaluate LLM responses in milliseconds without compromising accuracy. In our tests for hallucination detection, Galileo proved 11x faster than using GPT-3.5. Luna was critical to launching Galileo Protect in early May, which serves as a real-time GenAI firewall, intercepting inputs and responses as they occur.

Detect Hallucinations, Security, and Data Privacy - No Ground Truth Required!

A big bottleneck to evaluations is the need for a test set – a bunch of use case-specific human-generated request-response pairs where the responses can be used as the ‘ground truth’ to compare model responses against. However creating a high-quality test set is often labor-intensive and expensive, typically requiring the involvement of human experts or GPT models.

With Galileo Luna, we’re happy to share that we've eliminated the need for ground truth test sets, allowing users to instantly start evaluating their LLM responses! To do this, we’ve pre-trained our EFMs on evaluation-specific datasets across a variety of domains (more on this in the next section). This innovation not only saves teams considerable time and resources, but also reduces cost significantly.

The Luna EFMs power a host of evaluation tasks (Visit our docs to read more about each of these evaluation tasks), including:

Hallucinations

Context Adherence

RAG Analytics

Chunk Attribution
Chunk Utilization
Context Relevance
Completeness

Security and Privacy

Prompt Injection Attack Detection
PII Detection
Toxicity Detection
Bias Detection

Built for Customizability: Make Luna your own!

Certain industries need extremely precise auto evaluations. For instance, when working with an AI team at a large pharma organization leading new drug discovery, it was clear that hallucinations would be catastrophic for their chances of the drug passing the next phase of clinical trials. Moreover, it was extremely hard to get expert humans to evaluate the model response. By fine-tuning a Luna EFM to the client's specific needs, the Luna EFM was able to detect a class of hallucinations with 95%+ accuracy. Every Luna model can be quickly fine-tuned with customers’ data in their cloud using Galileo’s Fine Tune product to provide an ultra-high accuracy custom evaluation model within minutes!

Developing Luna: Innovations Under the Hood

To bring Luna to life, our Research Team dedicated a year to R&D, rethinking how evaluations are conducted on the Galileo platform. Below are some of the innovations they developed. For a deeper dive, read our paper!

Purpose-Built Small Language Models for Enterprise Evaluation Tasks

Problem: As the size of language models (LLMs) continues to increase, their deployment becomes more computationally and financially demanding, which is not always necessary for specific evaluation tasks.
Solution: We tailored multi-headed Small Language Models to precisely meet the needs of bespoke evaluation tasks, focusing on optimizing them for specific evaluation criteria rather than general application.
Benefit: This customization allows our Evaluation Foundation Models (EFMs) to excel in designated use cases, delivering evaluations that are not only faster and more accurate but also more cost-effective at scale.

Intelligent Chunking Approach for Better Context Length Adaptability

Problem: Traditional methods split long inputs into short pieces for processing, which can separate related content across different segments, hampering effective hallucination detection.
Solution: We employ a dynamic windowing technique that separately splits both the input context and the response into, allowing our model to process every pair and combination of context and response chunks.
Benefit: This method ensures comprehensive validation, as every part of the response is checked against the entire context, significantly improving hallucination detection accuracy.

Multi-task Training for Enhanced Model Calibration

Problem: Traditional single-task training methods can lead to models that excel in one evaluation task but underperform in others.
Solution: Luna EFMs can conduct multiple evaluations—adherence, utilization, and relevance—using a single input, thanks to multi-task training.
Benefit: By doing this, when evaluations are being generated, EFMs can “share” insights and predictions with one another, leading to more robust and accurate evaluations.

Data Augmentation for Support Across a Broad Swath of Use Cases

Problem: Evaluations are not one-size-fits-all. Evaluation requirements in financial services vary significantly from evaluation requirements in consumer goods. Limited data diversity can restrict a model’s ability to generalize across different use cases.
Solution: Each Luna EFM has been trained on large high-quality datasets that span industries and use cases. We enrich our training dataset with both synthetic data generated by LLMs for better domain coverage and data augmentations that mimic transformations used in computer vision.
Benefit: These strategies make each model more robust and flexible, making them more effective and reliable in real-world applications.

Token Level Evaluation for Enhanced Explainability

Problem: Standard approaches may not effectively pinpoint or explain where hallucinations occur within responses.
Solution: Our model classifies each sentence as adherent or non-adherent by comparing it against every piece of the context, ensuring a piece of context supports the response part.
Benefit: This granularity allows us to show users exactly which parts of the response are hallucinated, enhancing transparency and the utility of model outputs.