Table of contents
Low latency, low cost, high accuracy GenAI Evaluation is finally here. No more ask-GPT and painstaking vibe checks.
Since our inception in early 2021, language model evaluations have been at Galileo’s core. Evaluations are a lynchpin to productionizing and ensuring the trustworthy enterprise-wide adoption of generative AI applications. Despite this, many teams still rely on vibe checks and manual ad-hoc human evaluations to evaluate model outputs.
More recently, using LLMs like GPT-4 to evaluate other LLM outputs has become common practice, with much research [1][2] being published on the subject. Last year Galileo published and launched Chainpoll, our own chain-of-thought powered method for LLM-based hallucination evaluations that proved to be more accurate than asking GPT models.
While Chainpoll helped our customers accurately detect hallucinations, we quickly realized that in production applications, evaluations need to be ultra-low-latency and cost-efficient. With LLM-powered evaluations taking multiple seconds to run and costing tens of thousands of dollars at production scale, we needed an entirely new approach for our customers to scale evaluations.
Nearly a year of painstaking R&D later, we are excited to share Galileo Luna – low latency, low cost, high accuracy models for GenAI evaluation. No more painstaking vibe checks and asking GPT.
The 5 breakthroughs in GenAI Evaluations with Galileo Luna™
Out of the box, when it comes to evaluation accuracy, Galileo Luna has proven to outperform all popular evaluation techniques, including our own Chainpoll methodology. The above graph is for the Context Adherence Luna Evaluation Foundation Model (EFM) which detects hallucinations in RAG-based systems. During testing against popular publicly available datasets that cover multiple industry verticals, Luna proved to be 18% more accurate than evaluating with OpenAI’s GPT-3.5 at detecting hallucinations. We are seeing similar performance for evaluation tasks such as prompt injections, PII detection, and more.
For our customers focused on hallucination prevention at enterprise scale, Luna has become the ‘first line of defense.’ While we recommend humans still remain in the loop, Luna has helped these organizations dramatically improve evaluation speed and accuracy.
The next major hurdle we looked at was cost. Luna helps AI teams reduce evaluation cost in two ways. First, Luna replaces costly LLM-based evaluations, which for some customers, exceeds $1M per year at production scale. Second, Luna helps teams reduce their reliance on human-in-the-loop evaluations. In our testing, Galileo Luna proved 97% cheaper than OpenAI’s GPT-3.5 when evaluating production traffic!
The final hurdle we overcame was latency. Low latency is vital for a seamless user experience—imagine waiting 5-seconds for a response from a chatbot! We hold our evaluations to the same rigorous standards, particularly as they become integral to the end-user experience.
To this effect, Luna EFMs have been built to evaluate LLM responses in milliseconds without compromising accuracy. In our tests for hallucination detection, Galileo proved 11x faster than using GPT-3.5. Luna was critical to launching Galileo Protect in early May, which serves as a real-time GenAI firewall, intercepting inputs and responses as they occur.
A big bottleneck to evaluations is the need for a test set – a bunch of use case-specific human-generated request-response pairs where the responses can be used as the ‘ground truth’ to compare model responses against. However creating a high-quality test set is often labor-intensive and expensive, typically requiring the involvement of human experts or GPT models.
With Galileo Luna, we’re happy to share that we've eliminated the need for ground truth test sets, allowing users to instantly start evaluating their LLM responses! To do this, we’ve pre-trained our EFMs on evaluation-specific datasets across a variety of domains (more on this in the next section). This innovation not only saves teams considerable time and resources, but also reduces cost significantly.
The Luna EFMs power a host of evaluation tasks (Visit our docs to read more about each of these evaluation tasks), including:
Hallucinations
RAG Analytics
Security and Privacy
Certain industries need extremely precise auto evaluations. For instance, when working with an AI team at a large pharma organization leading new drug discovery, it was clear that hallucinations would be catastrophic for their chances of the drug passing the next phase of clinical trials. Moreover, it was extremely hard to get expert humans to evaluate the model response. By fine-tuning a Luna EFM to the client's specific needs, the Luna EFM was able to detect a class of hallucinations with 95%+ accuracy.
Every Luna model can be quickly fine-tuned with customers’ data in their cloud using Galileo’s Fine Tune product to provide an ultra-high accuracy custom evaluation model within minutes!
To bring Luna to life, our Research Team dedicated a year to R&D, rethinking how evaluations are conducted on the Galileo platform. Below are some of the innovations they developed. For a deeper dive, read our paper!
Purpose-Built Small Language Models for Enterprise Evaluation Tasks
Intelligent Chunking Approach for Better Context Length Adaptability
Multi-task Training for Enhanced Model Calibration:
Data Augmentation for Support Across a Broad Swath of Use Cases:
Token Level Evaluation for Enhanced Explainability
Starting today, Luna is available by default to all Galileo customers at no additional cost! Here are just a few of the Galileo product experiences Luna is already powering.
Intercept harmful chat bot inputs and outputs in real-time
Luna was instrumental in our latest product release, Galileo Protect, which leverages our low-latency EFMs to intercept hallucinations, prompt attacks, security threats, and more in real time. Here, Luna’s ultra-low latency is critical to ensuring a positive end-user experience.
Evaluate your GenAI system in development and production
All automatic evaluation metrics within Galileo Evaluate and Observe are powered by Luna EFMs. Users can leverage Luna EFMs for rapid experimentation and evaluation during the development phase, and seamlessly transition to using these same models for continuous monitoring in production environments.
Improve explainability with evaluation explanations
When it comes to evaluations, it’s easy to wonder “Why did I get that score?” Luna solves this by providing users with evaluation explanations out of the box. This dramatically improves explainability and streamlines root-cause analysis and debugging.
This marks a significant milestone in Galileo’s commitment to helping enterprises productionize GenAI with accurate, efficient, and high-performance evaluations. We are already seeing customers use Luna to launch new product offerings and transform their GenAI operations.
Whether it’s enabling real-time evaluations for chatbots handling 1 million queries per day, preventing malicious prompt injections instantly, or reducing the costs associated with production-grade GenAI, we’re excited to see the innovative solutions our customers will build with Luna Evaluation Foundation Models.
To learn more about the Luna family of Evaluation Foundation Models and see them in action, join us on June 18th for a live webinar or contact us for further information.
Table of contents