Generative AI has moved beyond crunching numbers and is creating and imagining things like never before. But with this power come new challenges in evaluating these systems.
On a recent "Chain of Thought" podcast episode, Conor Bronsdon, Head of Developer Awareness at Galileo, and Vikram Chatterji, Co-founder and CEO of Galileo, shared their insights on what makes assessing Generative AI so unique.
Evaluating Generative AI isn’t the same as testing your usual software. Because these systems are smarter and more complex, you need a different approach. It’s not just about whether something works, but how well it creates.
You need to use new effective AI evaluation methods that dive deeper into what the AI is producing.
With traditional software, it’s usually a clear-cut process. You have rules and outcomes that are easy to measure. Generative AI systems, especially large language models, don’t follow a set path. They’re unpredictable.
So the old metrics just don’t cut it. Chatterji points out, "you have to have a training set. You have to have a test set. You have to compare the two over time." It’s all about keeping up with what’s changing.
Plus, a lot of developers are jumping into AI without a deep data science background. With over 30 million developers using AI models through APIs, everyone needs to adopt a more data-focused approach in their evaluations.
Generative AI needs its own set of rules to be evaluated properly, and practical evaluation tips can help. We’re talking about things like how good the prompts are, which model you choose, and how you use vector stores—stuff that wasn’t on our radar with traditional software.
When you bring in multi-agent workflows and models that handle different types of data, evaluation gets even trickier. Bronsdon breaks it down: "Builders want to evaluate three parts of that agentic workflow... the right tool chosen and used correctly at each step."
At the end of the day, creating objective criteria for Generative AI is tough. The tasks these AIs perform can be wide-ranging and open-ended, making outcomes seem subjective, and setting standards for accuracy and effectiveness remains an ongoing challenge.
When evaluating AI agents, you have to check their work at multiple levels. Agents use LLMs to plan their actions and tackle tasks step by step, often through multi-turn or multi-agent workflows.
Evaluating this layered process requires checking each step, turn, and session to ensure everything runs smoothly.
In systems where AI acts as an agent, every stage of the workflow involves important decisions that need to be checked. Builders want to evaluate three parts: Was the right tool chosen and used correctly at each step? Were the steps performed in the correct order? Is the final result accurate?
Breaking it down like this helps make sure the AI is working right and producing reliable results. But even with all this potential, surveys show that less than 15% of AI builders are actually using and evaluating these technologies effectively.
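To make those three checks concrete, here is a minimal sketch of how a single agent trace could be scored. The `AgentStep` structure, the expected tool list, and the exact-match accuracy check are illustrative assumptions, not a prescribed framework.

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    tool: str    # the tool the agent actually invoked
    output: str  # the raw output that tool returned

def evaluate_trace(steps: list[AgentStep],
                   expected_tools: list[str],
                   final_answer: str,
                   reference_answer: str) -> dict:
    """Score one agent trace on the three checks described above."""
    tools_used = [s.tool for s in steps]
    return {
        # 1. Was the right tool chosen and used at each step?
        "tool_selection": sum(t == e for t, e in zip(tools_used, expected_tools))
                          / max(len(expected_tools), 1),
        # 2. Were the steps performed in the correct order?
        "correct_order": tools_used == expected_tools,
        # 3. Is the final result accurate? (naive exact-match placeholder)
        "final_accuracy": final_answer.strip().lower() == reference_answer.strip().lower(),
    }

# Hypothetical trace: the agent is expected to search, calculate, then summarize.
trace = [AgentStep("search", "found 3 documents"),
         AgentStep("calculator", "42"),
         AgentStep("summarize", "The answer is 42.")]
print(evaluate_trace(trace, ["search", "calculator", "summarize"],
                     final_answer="The answer is 42.",
                     reference_answer="The answer is 42."))
```

In practice the exact-match check would be swapped for whatever accuracy measure fits the task, but the three-part structure stays the same.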
Evaluating AI agents isn’t simple. It requires advanced metrics, AI evaluation tools, and logging systems that can handle different situations. You can't rely on traditional metrics such as cost and latency. New metrics and testing methodologies are needed.
AI builders should create their own evaluation frameworks tailored to the specific AI applications they’re working with. That means setting up clear guidelines and robust logging systems that track every step the agent takes—which API call was made, what the responses looked like, and more.
Such a tailored approach is key. It’s not about one-size-fits-all metrics, but about what fits best for each particular AI system. It requires constantly updating and refining how you evaluate things to keep up with the changing capabilities of AI.
Chatterji points out, "The part which becomes interesting... is for your specific agent, how do you figure out what the right kind of metrics are," encouraging a move towards evaluation-driven development to better integrate AI into businesses.
Generative AI is always changing, and evaluating these models is both important and tough. These models can be opaque, after all. So how do you assess something when you can’t predict what’s going to happen?
The big question is: how do you create objective criteria for Generative AI? Traditional software metrics like cost and speed aren’t enough, and beyond them there are no predefined quality metrics to fall back on. We need new ways to evaluate AI that are built just for it.
One approach is to build an effective LLM evaluation framework from scratch. You have to figure out what makes a good test set you can actually use and which metrics are right for your application.
It’s a new challenge, but necessary for getting accurate evaluations.
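Here is a minimal sketch of what "from scratch" can mean in practice: a small hand-built test set plus one custom metric. The `must_include` keyword check and the lambda standing in for a real model call are assumptions made purely for illustration.

```python
# A hand-built test set: each case pairs a prompt with facts the output must keep.
TEST_SET = [
    {"prompt": "Summarize: The meeting moved to Friday.", "must_include": ["Friday"]},
    {"prompt": "Summarize: Revenue grew 12% year over year.", "must_include": ["12%"]},
]

def keyword_coverage(output: str, must_include: list[str]) -> float:
    """Custom metric: fraction of required facts that appear in the output."""
    return sum(kw.lower() in output.lower() for kw in must_include) / len(must_include)

def run_eval(generate) -> float:
    """`generate` is whatever function calls your model; returns the mean score."""
    scores = [keyword_coverage(generate(case["prompt"]), case["must_include"])
              for case in TEST_SET]
    return sum(scores) / len(scores)

# Stand-in model so the harness runs end to end without any API key.
print(run_eval(lambda prompt: "The meeting is now on Friday."))
```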
Adding humans into the evaluation process is crucial. Human-in-the-loop (HITL) will never go away, especially because AI can produce unpredictable results. Humans are needed to understand subtle responses and handle unexpected cases.
For edge cases, somebody with semantic understanding of the topic is critical. With humans overseeing AI, we can fine-tune outputs to better fit specific needs and maintain compliance.
To wrap it up, evaluating AI systems is naturally subjective because these systems aren't always predictable. But it’s essential for building strong and reliable AI. By focusing on guidelines, specific use cases, and human oversight, we can bridge the gap between what humans need and what machines can do.
As AI continues to evolve, improving these evaluation methods will lead to better observability and security, sparking more innovation in AI development.
In AI, "hallucinations" happen when models create responses that don’t match the input data or real-world context. Such occurrences are common in generative AI, as seen in studies like the recent AI hallucinations survey, where the output can sound plausible but is actually wrong or irrelevant.
"A hallucination is basically something that the model is coming up with, and it might not be something that makes a lot of sense," explains Chatterji.
Hallucinations can occur in both closed and open-book AI systems. In closed-book systems, which don’t use external data, hallucinations might happen because the model tries to fill in gaps with its own ideas.
In open-book systems, which access outside information, hallucinations can occur if the needed data is missing or retrieved incorrectly, causing responses that don’t fit the context.
While some creative uses might embrace these unexpected outputs, in fields where accuracy matters—like healthcare or finance—hallucinations are a big problem.
To keep AI hallucinations in check, clear contexts and strong oversight are key. Implementing a framework for detecting hallucinations can help reduce mistakes, especially in open-book systems where the quality of retrieved data matters. Grounding AI models in accurate, domain-specific knowledge can lower the chances of generating irrelevant or wrong information.
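For open-book systems, one simple (and admittedly crude) way to flag ungrounded output is to check how much of a response is actually supported by the retrieved context. The lexical-overlap heuristic and the 0.5 threshold below are illustrative assumptions; production hallucination detection typically relies on stronger semantic methods.

```python
import re

def groundedness_score(response: str, retrieved_context: str) -> float:
    """Crude proxy: share of response sentences whose words overlap the context."""
    sentences = [s for s in re.split(r"[.!?]", response) if s.strip()]
    context_words = set(re.findall(r"\w+", retrieved_context.lower()))
    grounded = 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        overlap = len(words & context_words) / max(len(words), 1)
        if overlap >= 0.5:  # arbitrary threshold; tune for your domain
            grounded += 1
    return grounded / max(len(sentences), 1)

context = "Galileo Galilei was born in Pisa in 1564."
answer = "Galileo was born in Pisa in 1564. He also invented the telephone."
score = groundedness_score(answer, context)
if score < 1.0:
    print(f"Possible hallucination: only {score:.0%} of sentences are grounded.")
```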
Having humans oversee AI systems is also crucial. Human-in-the-loop approaches let experts check outputs, particularly in high-stakes areas.
This oversight ensures AI follows the right guidelines and meets AI compliance standards, keeping AI applications reliable across different industries.
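Here is a rough sketch of what that human-in-the-loop routing could look like: outputs that fall below a confidence or groundedness threshold are held for expert review instead of being returned automatically. The `ReviewQueue` class and the 0.8 threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """Route risky outputs to a human instead of shipping them automatically."""
    threshold: float = 0.8                      # assumed cutoff; set per use case
    pending: list[dict] = field(default_factory=list)

    def triage(self, output: str, score: float) -> str:
        if score >= self.threshold:
            return output                       # confident enough to return as-is
        # Otherwise park it for a domain expert to approve, correct, or reject.
        self.pending.append({"output": output, "score": score})
        return "Held for human review."

queue = ReviewQueue()
print(queue.triage("Dosage: 500 mg twice daily.", score=0.62))  # held for review
print(queue.triage("Store below 25 °C.", score=0.93))           # returned as-is
```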
In short, tackling AI hallucinations takes a layered approach: defining clear contexts, integrating quality data, and maintaining human oversight. This way, industries can make the most of AI while avoiding its potential downsides.
Human-in-the-loop (HITL) systems are crucial when evaluating AI. Even though AI is powerful, it often hits unknown edge cases where it falls short. In the debate between LLM evaluation vs. human evaluation, human intervention helps refine metrics and test sets, filling in the gaps where AI might struggle.
Generative AI models bring unique challenges compared to traditional software. Evaluation has traditionally been a data science discipline, but as AI becomes more widespread and easier to implement through APIs, there's a growing need for diverse expertise in the evaluation process.
Human evaluators play a key role in handling complex scenarios that AI might miss. Just like software quality testing relies on human insights to catch nuances and potential problems, AI evaluation benefits greatly from human input.
Chatterji points out that while software engineers focus on architecture and data use, AI’s unpredictable nature—the fact that there are no predefined metrics for quality beyond cost and latency—means humans need to step in to spot and fix errors.
In AI evaluation, humans add a level of scrutiny that algorithms alone can’t provide. They make sure AI systems perform as expected by checking real-world applicability and compliance with industry standards.
Such a partnership ensures that AI applications are reliable, user-friendly, and ethically sound.
As we navigate the complexities of evaluating Generative AI, observability tools like Galileo become essential. Galileo provides robust solutions for evaluating, monitoring, and protecting AI applications, streamlining workflows to ensure consistency and high-quality outputs.
Learn more about how Galileo can enhance your AI initiatives.
Also, listen to the entire episode on the "Chain of Thought" podcast where Galileo’s Co-founder and CTO Atin Sanyal joins Chip Huyen (Storyteller, Tép Studio) and Vivienne Zhang (Senior Product Manager, Generative AI Software, Nvidia) to dive deeper into the practical lessons learned from GenAI evaluations.