Introducing Galileo Luna™: A Family of Evaluation Foundation Models

Vikram Chatterji, CEO

June 06, 2024

Low-latency, low-cost, high-accuracy GenAI evaluation is finally here. No more ask-GPT prompts and painstaking vibe checks.

Since our inception in early 2021, language model evaluations have been at Galileo’s core. Evaluations are the linchpin of productionizing generative AI applications and ensuring their trustworthy enterprise-wide adoption. Despite this, many teams still rely on vibe checks and ad-hoc manual human review to evaluate model outputs.

More recently, using LLMs like GPT-4 to evaluate other LLM outputs has become common practice, with much research [1][2] being published on the subject. Last year, Galileo published and launched Chainpoll, our chain-of-thought-powered method for LLM-based hallucination evaluation, which proved more accurate than simply asking GPT models.

While Chainpoll helped our customers accurately detect hallucinations, we quickly realized that in production applications, evaluations need to be ultra-low-latency and cost-efficient. With LLM-powered evaluations taking multiple seconds to run and costing tens of thousands of dollars at production scale, we needed an entirely new approach for our customers to scale evaluations.

Nearly a year of painstaking R&D later, we are excited to share Galileo Luna: low-latency, low-cost, high-accuracy models for GenAI evaluation. No more vibe checks or ask-GPT evaluations.
The 5 breakthroughs in GenAI Evaluations with Galileo Luna

  1. Leading Evaluation Accuracy Benchmarks
  2. Ultra Low-Cost Evaluation
  3. Ultra Low Latency Evaluation
  4. Detect Hallucinations, Security Threats, and Privacy Risks: No Ground Truth Required!
  5. Built for Customizability: Make Luna your own!

Luna Leads Evaluation Accuracy Benchmarks

Luna is 18% more accurate than GPT-3.5

Out of the box, Galileo Luna outperforms all popular evaluation techniques on accuracy, including our own Chainpoll methodology. The graph above shows results for the Context Adherence Luna Evaluation Foundation Model (EFM), which detects hallucinations in RAG-based systems. In testing against popular publicly available datasets spanning multiple industry verticals, Luna proved 18% more accurate than OpenAI’s GPT-3.5 at detecting hallucinations. We are seeing similar performance on evaluation tasks such as prompt injection, PII detection, and more.

For our customers focused on hallucination prevention at enterprise scale, Luna has become the ‘first line of defense.’ While we recommend humans still remain in the loop, Luna has helped these organizations dramatically improve evaluation speed and accuracy.

Ultra Low-Cost Evaluation

Luna is 97% cheaper than GPT-3.5. (Experiment conditions: 1 QPS traffic volume; 4k input token length; using Nvidia L4 GPU)

The next major hurdle we tackled was cost. Luna helps AI teams reduce evaluation costs in two ways. First, Luna replaces costly LLM-based evaluations, which for some customers exceed $1M per year at production scale. Second, Luna reduces teams’ reliance on human-in-the-loop evaluations. In our testing, Galileo Luna proved 97% cheaper than OpenAI’s GPT-3.5 when evaluating production traffic!

Millisecond Latencies for Evaluation

Luna is 11x faster than GPT-3.5. (Experiment conditions: 4k input token length; Using Nvidia L4 GPU)

The final hurdle we overcame was latency. Low latency is vital for a seamless user experience: imagine waiting five seconds for a response from a chatbot! We hold our evaluations to the same rigorous standard, particularly as they become integral to the end-user experience.

To this end, Luna EFMs have been built to evaluate LLM responses in milliseconds without compromising accuracy. In our hallucination-detection tests, Luna proved 11x faster than GPT-3.5. Luna was also critical to launching Galileo Protect in early May, a real-time GenAI firewall that intercepts inputs and responses as they occur.

Detect Hallucinations, Security Threats, and Privacy Risks: No Ground Truth Required!

A big bottleneck in evaluations is the need for a test set: a collection of use-case-specific, human-generated request-response pairs whose responses serve as the ‘ground truth’ to compare model outputs against. However, creating a high-quality test set is labor-intensive and expensive, typically requiring the involvement of human experts or GPT models.

With Galileo Luna, we’re happy to share that we've eliminated the need for ground truth test sets, allowing users to instantly start evaluating their LLM responses! To do this, we’ve pre-trained our EFMs on evaluation-specific datasets across a variety of domains (more on this in the next section). This innovation not only saves teams considerable time and resources, but also reduces cost significantly.


The Luna EFMs power a host of evaluation tasks (Visit our docs to read more about each of these evaluation tasks), including:

Hallucinations

  • Context Adherence

RAG Analytics

  • Chunk Attribution
  • Chunk Utilization
  • Context Relevance
  • Completeness

Security and Privacy

  • Prompt Injection Attack Detection
  • PII Detection
  • Toxicity Detection
  • Bias Detection

Built for Customizability: Make Luna your own!

Certain industries need extremely precise automated evaluations. For instance, when we worked with an AI team at a large pharma organization leading new drug discovery, it was clear that hallucinations would be catastrophic for the drug’s chances of passing the next phase of clinical trials. Moreover, it was extremely hard to get human experts to evaluate the model’s responses. By fine-tuning a Luna EFM to the client’s specific needs, we enabled it to detect a class of hallucinations with 95%+ accuracy.

Every Luna model can be quickly fine-tuned with customers’ data in their cloud using Galileo’s Fine Tune product to provide an ultra-high accuracy custom evaluation model within minutes!

Developing Luna: Innovations Under the Hood

To bring Luna to life, our Research Team dedicated a year to R&D, rethinking how evaluations are conducted on the Galileo platform. Below are some of the innovations they developed. For a deeper dive, read our paper!

Purpose-Built Small Language Models for Enterprise Evaluation Tasks

  • Problem: As large language models (LLMs) continue to grow in size, their deployment becomes more computationally and financially demanding, which is not always necessary for specific evaluation tasks.
  • Solution: We tailored multi-headed Small Language Models to precisely meet the needs of bespoke evaluation tasks, focusing on optimizing them for specific evaluation criteria rather than general application.
  • Benefit: This customization allows our Evaluation Foundation Models (EFMs) to excel in designated use cases, delivering evaluations that are not only faster and more accurate but also more cost-effective at scale.

Intelligent Chunking Approach for Better Context Length Adaptability

  • Problem: Traditional methods split long inputs into short pieces for processing, which can separate related content across different segments, hampering effective hallucination detection.
  • Solution: We employ a dynamic windowing technique that separately splits both the input context and the response into chunks, allowing our model to process every combination of context and response chunks.
  • Benefit: This method ensures comprehensive validation, as every part of the response is checked against the entire context, significantly improving hallucination detection accuracy.
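The pair-wise chunking idea can be sketched in a few lines. This is an illustrative simplification, not Galileo's actual implementation; the fixed word window and function names are assumptions.

```python
def split_into_chunks(text, max_words=50):
    """Split text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def chunk_pairs(context, response, max_words=50):
    """Yield every (context_chunk, response_chunk) combination, so each
    part of the response can be checked against every part of the context."""
    context_chunks = split_into_chunks(context, max_words)
    response_chunks = split_into_chunks(response, max_words)
    return [(c, r) for r in response_chunks for c in context_chunks]

# 120 context words -> 3 chunks; 70 response words -> 2 chunks; 6 pairs total
pairs = chunk_pairs("ctx " * 120, "resp " * 70, max_words=50)
```

Because every response chunk is paired with every context chunk, related content split across window boundaries can still be matched, at the cost of quadratically more pairs to score.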

Multi-task Training for Enhanced Model Calibration

  • Problem: Traditional single-task training methods can lead to models that excel in one evaluation task but underperform in others.
  • Solution: Luna EFMs can conduct multiple evaluations—adherence, utilization, and relevance—using a single input, thanks to multi-task training.
  • Benefit: By doing this, when evaluations are being generated, EFMs can “share” insights and predictions with one another, leading to more robust and accurate evaluations.
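A toy model can illustrate the multi-headed structure: one shared encoder feeds a separate linear head per task, so a single forward pass produces every score. All names and the random weights here are illustrative assumptions; a real EFM is a trained transformer, not this sketch.

```python
import math
import random

class MultiHeadEvaluator:
    """Toy multi-headed evaluator: a shared encoder feeds one linear
    head per evaluation task, so one forward pass yields all scores."""

    def __init__(self, dim, tasks, seed=0):
        rng = random.Random(seed)
        # Shared encoder weights and one weight vector per task head
        self.encoder = [[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(dim)]
        self.heads = {t: [rng.gauss(0, 1) for _ in range(dim)]
                      for t in tasks}

    def evaluate(self, features):
        # Shared representation: tanh of a linear transform of the input
        shared = [math.tanh(sum(w * x for w, x in zip(row, features)))
                  for row in self.encoder]
        # One sigmoid score per task head, all computed from the same encoding
        return {t: 1 / (1 + math.exp(-sum(w * h for w, h in zip(head, shared))))
                for t, head in self.heads.items()}

model = MultiHeadEvaluator(8, ["adherence", "utilization", "relevance"])
scores = model.evaluate([0.5] * 8)  # one dict with all three scores
```

The design point is that the heads share an input representation, which is what lets the tasks inform one another during training and amortizes the cost of the forward pass across all evaluations.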

Data Augmentation for Support Across a Broad Swath of Use Cases

  • Problem: Evaluations are not one-size-fits-all. Evaluation requirements in financial services vary significantly from evaluation requirements in consumer goods. Limited data diversity can restrict a model’s ability to generalize across different use cases.
  • Solution: Each Luna EFM has been trained on large high-quality datasets that span industries and use cases. We enrich our training dataset with both synthetic data generated by LLMs for better domain coverage and data augmentations that mimic transformations used in computer vision.
  • Benefit: These strategies make each model more robust and flexible, making them more effective and reliable in real-world applications.
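As one hypothetical example of a vision-style augmentation applied to text, random word dropout plays a role analogous to masking or cropping an image. This is our own illustrative example, not a technique the post confirms Luna uses.

```python
import random

def word_dropout(text, p=0.15, seed=0):
    """Randomly drop words from a training example -- a text analogue of
    the masking/cropping augmentations used in computer vision."""
    rng = random.Random(seed)
    kept = [w for w in text.split() if rng.random() >= p]
    # Never return an empty example; fall back to the original text
    return " ".join(kept) if kept else text

augmented = word_dropout("the quick brown fox jumps over the lazy dog", p=0.3)
```

Seeding the generator keeps augmentations reproducible across training runs, which makes ablations of the augmentation itself easier to compare.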

Token-Level Evaluation for Enhanced Explainability

  • Problem: Standard approaches may not effectively pinpoint or explain where hallucinations occur within responses.
  • Solution: Our model classifies each sentence of the response as adherent or non-adherent by comparing it against every piece of the context, checking that some piece of context supports each part of the response.
  • Benefit: This granularity allows us to show users exactly which parts of the response are hallucinated, enhancing transparency and the utility of model outputs.
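A crude lexical-overlap stand-in for the learned classifier shows the shape of sentence-level adherence labeling. The overlap heuristic and threshold are assumptions for illustration only; Luna uses a trained model, not word matching.

```python
def sentence_adherence(context, response, threshold=0.5):
    """Label each response sentence adherent when enough of its words
    appear in a single context sentence (naive lexical overlap)."""
    ctx_sents = [s.strip().lower() for s in context.split(".") if s.strip()]
    results = []
    for sent in (s.strip() for s in response.split(".") if s.strip()):
        words = set(sent.lower().split())
        # Best overlap ratio of this sentence against any context sentence
        best = max((len(words & set(c.split())) / len(words)
                    for c in ctx_sents), default=0.0)
        results.append((sent, best >= threshold))
    return results

labels = sentence_adherence(
    "Paris is the capital of France. The Eiffel Tower is in Paris.",
    "Paris is the capital of France. Paris has ten million residents.",
)
# The unsupported second sentence is flagged as non-adherent
```

Returning a label per sentence, rather than one score per response, is what makes it possible to highlight exactly which spans are hallucinated.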

Galileo Luna in Action

Starting today, Luna is available by default to all Galileo customers at no additional cost! Here are just a few of the Galileo product experiences Luna is already powering.

Intercept harmful chatbot inputs and outputs in real time

Luna was instrumental in our latest product release, Galileo Protect, which leverages our low-latency EFMs to intercept hallucinations, prompt attacks, security threats, and more in real time. Here, Luna’s ultra-low latency is critical to ensuring a positive end-user experience.

Galileo Protect uses Luna to intercept hallucinations in real-time.

Evaluate your GenAI system in development and production

All automatic evaluation metrics within Galileo Evaluate and Observe are powered by Luna EFMs. Users can leverage Luna EFMs for rapid experimentation and evaluation during the development phase, and seamlessly transition to using these same models for continuous monitoring in production environments.

Users can leverage Luna for rapid experimentation and evaluation during the development phase, and seamlessly transition to using these same models for continuous monitoring in production environments.

Improve explainability with evaluation explanations

When it comes to evaluations, it’s easy to wonder “Why did I get that score?” Luna solves this by providing users with evaluation explanations out of the box. This dramatically improves explainability and streamlines root-cause analysis and debugging.

Users receive evaluation explanations out of the box, dramatically improving root-cause analysis and debugging.

Conclusion

This marks a significant milestone in Galileo’s commitment to helping enterprises productionize GenAI with accurate, efficient, and high-performance evaluations. We are already seeing customers use Luna to launch new product offerings and transform their GenAI operations.

Whether it’s enabling real-time evaluations for chatbots handling 1 million queries per day, preventing malicious prompt injections instantly, or reducing the costs associated with production-grade GenAI, we’re excited to see the innovative solutions our customers will build with Luna Evaluation Foundation Models.


To learn more about the Luna family of Evaluation Foundation Models and see them in action, join us on June 18th for a live webinar or contact us for further information.