Ensuring task-alignment of generative AI applications requires innovative evaluation methods. While automated metrics powered by LLM-as-a-Judge offer compelling cost and speed advantages, they often lack the nuanced domain expertise of human evaluation.
But what if there were a better way to combine the power of automation with domain expertise when evaluating generative AI applications?
This is why we’re so excited to announce the public release of Continuous Learning with Human Feedback (CLHF) on the Galileo Evaluation Platform—a breakthrough workflow that enables domain-specific tuning of generic LLM-as-a-Judge evaluation metrics with as few as five annotated records.
Our internal benchmarks show that CLHF increases the accuracy of both well-researched metrics and metrics generated from simple prompts without rigorous prompt engineering by upwards of 30%.
This reduces the time to build a custom metric from weeks to minutes, unlocking the ability for enterprises to rapidly build tailored metrics for their use cases. It also compounds the value of both human and automated evals, creating a positive flywheel effect that drives continuous improvement of metrics.
The Galileo Platform includes the Luna Evaluation Suite, a set of research-backed metrics powered by LLMs and SLMs, proven to show robust out-of-the-box performance. However, specialized tasks—like detecting bias in a customer-support assistant—can challenge even sophisticated foundation metrics.
One of the components guiding these metrics is a set of specialized prompts, refined to perform well for general-purpose detection of issues like factuality, contextual relevance, and bias. However, these prompts are not tailored to any specific organization.
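To make this concrete, here is a minimal sketch of what a generic LLM-as-a-Judge metric might look like under the hood: a general-purpose evaluation prompt applied to each record by a model. The prompt text, model choice, and the judge_factuality helper are illustrative assumptions, not Galileo's Luna implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative general-purpose judge prompt -- not Galileo's actual Luna prompt.
GENERIC_JUDGE_PROMPT = """You are an impartial evaluator.
Given a retrieved context and a model response, rate the factuality of the
response on a scale of 0 to 1 and briefly explain your reasoning.
Return JSON with keys "score" and "explanation"."""

def judge_factuality(context: str, response: str, model: str = "gpt-4o-mini") -> str:
    """Score a single record with a generic LLM-as-a-Judge prompt."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": GENERIC_JUDGE_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nResponse:\n{response}"},
        ],
    )
    return completion.choices[0].message.content

print(judge_factuality("Our refund window is 30 days.",
                       "You can get a refund within 90 days."))
```

A prompt like this works reasonably well across many applications, but it knows nothing about your organization's policies, terminology, or edge cases.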
Traditionally, refining these metrics meant hours of manual prompt engineering or longer fine-tuning efforts. Our breakthrough approach simplifies this: a short human annotation queue allows domain experts to quickly tune metric performance. By providing a few labeled records of qualitative feedback, organizations can automatically recalibrate evaluation prompts to their specific context.
The result is remarkable: typically 20-25% improvement in metric accuracy, achieved through an effortless, intelligent adaptation process that transforms generic metrics into precision instruments for your unique use case.
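To illustrate what a few labeled records could look like, here is a hypothetical shape for the annotated feedback. The field names are assumptions for illustration only, not the platform's actual schema.

```python
# Hypothetical shape of annotated feedback records -- field names are
# illustrative, not Galileo's actual schema.
feedback_records = [
    {
        "input": "How long do I have to return a purchase?",
        "response": "You can return items within 90 days.",
        "metric_score": 0.9,          # what the generic judge said
        "human_agrees": False,        # the domain expert disagrees
        "rationale": "Our policy is 30 days; the judge missed the policy doc in context.",
    },
    {
        "input": "Do you ship internationally?",
        "response": "Yes, we ship to most countries outside the US.",
        "metric_score": 0.8,
        "human_agrees": True,         # positive reinforcement matters too
        "rationale": "Correctly grounded in the shipping FAQ.",
    },
]
```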
CLHF is now generally available to all users of the Galileo Platform. For platform access, join the waitlist today. Once logged in, follow the steps below to experiment with CLHF.
1. Create an evaluate run using your preferred metric from the Galileo Platform.
2. Browse through the run's results to find rows where you strongly agree or disagree with the quantitative score and/or the qualitative explanation.
3. Click on the score explanation, then select ‘Auto-improve Metric’ to start providing feedback.
4. A feedback panel will pop up where you can indicate whether you agree or disagree with the record, provide ‘qualitative feedback’, and explain your reasoning. When you're done, click submit for auto-learning.
Note: positive reinforcement is often as important as negative reinforcement to ensure you are both correcting for bad habits and reinforcing correct behavior.
5. Now select Re-tune. Behind the scenes, your feedback is given to an LLM that automatically generates few-shot examples and rewrites the prompt. The newly generated prompt includes instructions tailored to your use case and specific examples drawn from your data. Both are proven to increase the accuracy of metrics. A conceptual sketch of this re-tuning loop follows after these steps.
6. Congrats! A new, improved metric is now available for future evaluation runs. If you choose to recalculate the new metric for your existing runs, you should see scores and explanations update shortly.
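Conceptually, the re-tune step (step 5 above) amounts to handing the current judge prompt and the human feedback to an LLM and asking it to rewrite the prompt with tailored instructions plus few-shot examples drawn from the feedback. Below is a minimal sketch of that loop, reusing the illustrative GENERIC_JUDGE_PROMPT and feedback_records from the earlier sketches; the rewrite instructions and the retune_prompt helper are assumptions, not Galileo's actual implementation.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative instructions for the prompt-rewriting LLM.
REWRITE_INSTRUCTIONS = """You improve evaluation prompts.
Given the current judge prompt and human feedback on its mistakes and successes,
rewrite the prompt so it follows the domain expert's reasoning, and append the
feedback records as few-shot examples. Return only the new prompt text."""

def retune_prompt(current_prompt: str, feedback: list[dict], model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to rewrite the judge prompt using human feedback (illustrative only)."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {
                "role": "user",
                "content": (
                    f"Current judge prompt:\n{current_prompt}\n\n"
                    f"Human feedback records:\n{json.dumps(feedback, indent=2)}"
                ),
            },
        ],
    )
    return completion.choices[0].message.content

# The re-tuned prompt replaces the generic one for future evaluation runs.
tuned_prompt = retune_prompt(GENERIC_JUDGE_PROMPT, feedback_records)
```

On the Galileo Platform all of this happens behind the scenes when you click Re-tune; the sketch only illustrates why a handful of well-chosen annotations can shift a metric's behavior so quickly.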
CLHF simplifies the process of generating custom metrics tailored to your organization. By unifying human and automated evaluations on a single platform, AI teams can fully unlock the potential of their AI applications.
Embark on your journey to better GenAI and get in touch with the Galileo team for platform access.