Generative AI has moved beyond crunching numbers and is creating and imagining things like never before. But with this power come new challenges in evaluating these systems.
On a recent "Chain of Thought" podcast episode, Conor Bronsdon, Head of Developer Awareness at Galileo, and Vikram Chatterji, Co-founder and CEO of Galileo, shared their insights on what makes assessing Generative AI so unique.
Evaluating Generative AI isn’t the same as testing your usual software. Because these systems are smarter and more complex, you need a different approach. It’s not just about whether something works, but how well it creates.
You need to use new effective AI evaluation methods that dive deeper into what the AI is producing.
With traditional software, it’s usually a clear-cut process. You have rules and outcomes that are easy to measure. Generative AI systems, especially large language models, don’t follow a set path. They’re unpredictable.
So the old metrics just don’t cut it. Chatterji points out, "you have to have a training set. You have to have a test set. You have to compare the two over time." It’s all about keeping up with what’s changing.
Plus, a lot of developers are jumping into AI without a deep data science background. With over 30 million developers using AI models through APIs, everyone needs to adopt a more data-focused approach in their evaluations.
Generative AI needs its own set of rules to be evaluated properly, and practical evaluation tips can help. We’re talking about things like how good the prompts are, which model you choose, and how you use vector stores—stuff that wasn’t on our radar with traditional software.
When you bring in multi-agent workflows and models that handle different types of data, evaluation gets even trickier. Bronsdon breaks it down: "Builders want to evaluate three parts of that agentic workflow... the right tool chosen and used correctly at each step."
At the end of the day, creating objective criteria for Generative AI is tough. The tasks these AIs perform can be wide-ranging and open-ended, making outcomes seem subjective, and setting standards for accuracy and effectiveness remains an ongoing challenge.
When evaluating AI agents, you have to check their work at multiple levels. Agents use LLMs to plan their actions and tackle tasks step by step, often through multi-turn or multi-agent workflows.
Evaluating this layered process requires checking each step, turn, and session to ensure everything runs smoothly.
In systems where AI acts as an agent, every stage of the workflow involves important decisions that need to be checked. Builders want to evaluate three parts: Was the right tool chosen and used correctly at each step? Were the steps performed in the correct order? Is the final result accurate?
Breaking it down like this helps make sure the AI is working right and producing reliable results. But even with all this potential, surveys show that less than 15% of AI builders are actually using and evaluating these technologies effectively.
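To make those three checks concrete, here is a minimal sketch of how a single agent trace could be scored. The `AgentStep` structure, the expected tool list, and the exact-match accuracy check are illustrative assumptions, not a prescribed framework.

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    tool: str    # the tool the agent actually invoked
    output: str  # the raw output that tool returned

def evaluate_trace(steps: list[AgentStep],
                   expected_tools: list[str],
                   final_answer: str,
                   reference_answer: str) -> dict:
    """Score one agent trace on the three checks described above."""
    tools_used = [s.tool for s in steps]
    return {
        # 1. Was the right tool chosen and used at each step?
        "tool_selection": sum(t == e for t, e in zip(tools_used, expected_tools))
                          / max(len(expected_tools), 1),
        # 2. Were the steps performed in the correct order?
        "correct_order": tools_used == expected_tools,
        # 3. Is the final result accurate? (naive exact-match placeholder)
        "final_accuracy": final_answer.strip().lower() == reference_answer.strip().lower(),
    }

# Hypothetical trace: the agent is expected to search, calculate, then summarize.
trace = [AgentStep("search", "found 3 documents"),
         AgentStep("calculator", "42"),
         AgentStep("summarize", "The answer is 42.")]
print(evaluate_trace(trace, ["search", "calculator", "summarize"],
                     final_answer="The answer is 42.",
                     reference_answer="The answer is 42."))
```

In practice the exact-match check would be swapped for whatever accuracy measure fits the task, but the three-part structure stays the same.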
Evaluating AI agents isn’t simple. It requires advanced metrics, AI evaluation tools, and logging systems that can handle different situations. You can't rely on traditional metrics such as cost and latency. New metrics and testing methodologies are needed.
AI builders should create their own evaluation frameworks tailored to the specific AI applications they’re working with. That means setting up clear guidelines and robust logging systems that track every step the agent takes—which API call was made, what the responses looked like, and more.
Such a tailored approach is key. It’s not about one-size-fits-all metrics, but about what fits best for each particular AI system. It requires constantly updating and refining how you evaluate things to keep up with the changing capabilities of AI.
Chatterji points out, "The part which becomes interesting... is for your specific agent, how do you figure out what the right kind of metrics are," encouraging a move towards evaluation-driven development to better integrate AI into businesses.
Generative AI is always changing, and evaluating these models is both important and tough. These models can be opaque, after all. So how do you assess something when you can’t predict what’s going to happen?
The big question is: how do you create objective criteria for Generative AI? Traditional software metrics like cost and speed aren’t enough, and beyond them there are no predefined quality metrics to fall back on. We need new ways to evaluate AI that are built just for it.
One approach is to build an effective LLM evaluation framework from scratch. You have to figure out what makes a good test set you can actually use and which metrics are right for your application.
It’s a new challenge, but necessary for getting accurate evaluations.
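Here is a minimal sketch of what "from scratch" can mean in practice: a small hand-built test set plus one custom metric. The `must_include` keyword check and the lambda standing in for a real model call are assumptions made purely for illustration.

```python
# A hand-built test set: each case pairs a prompt with facts the output must keep.
TEST_SET = [
    {"prompt": "Summarize: The meeting moved to Friday.", "must_include": ["Friday"]},
    {"prompt": "Summarize: Revenue grew 12% year over year.", "must_include": ["12%"]},
]

def keyword_coverage(output: str, must_include: list[str]) -> float:
    """Custom metric: fraction of required facts that appear in the output."""
    return sum(kw.lower() in output.lower() for kw in must_include) / len(must_include)

def run_eval(generate) -> float:
    """`generate` is whatever function calls your model; returns the mean score."""
    scores = [keyword_coverage(generate(case["prompt"]), case["must_include"])
              for case in TEST_SET]
    return sum(scores) / len(scores)

# Stand-in model so the harness runs end to end without any API key.
print(run_eval(lambda prompt: "The meeting is now on Friday."))
```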
Adding humans into the evaluation process is crucial. Human-in-the-loop (HITL) will never go away, especially because AI can produce unpredictable results. Humans are needed to understand subtle responses and handle unexpected cases.
For edge cases, somebody with semantic understanding of the topic is critical. With humans overseeing AI, we can fine-tune outputs to better fit specific needs and maintain compliance.
To wrap it up, evaluating AI systems is naturally subjective because these systems aren't always predictable. But it’s essential for building strong and reliable AI. By focusing on guidelines, specific use cases, and human oversight, we can bridge the gap between what humans need and what machines can do.
As AI continues to evolve, improving these evaluation methods will lead to better observability and security, sparking more innovation in AI development.
In AI, "hallucinations" happen when models create responses that don’t match the input data or real-world context. Such occurrences are common in generative AI, as seen in studies like the recent AI hallucinations survey, where the output can sound plausible but is actually wrong or irrelevant.
"A hallucination is basically something that the model is coming up with, and it might not be something that makes a lot of sense," explains Chatterji.
Hallucinations can occur in both closed and open-book AI systems. In closed-book systems, which don’t use external data, hallucinations might happen because the model tries to fill in gaps with its own ideas.
In open-book systems, which access outside information, hallucinations can occur if the needed data is missing or retrieved incorrectly, causing responses that don’t fit the context.
While some creative uses might embrace these unexpected outputs, in fields where accuracy matters—like healthcare or finance—hallucinations are a big problem.
To keep AI hallucinations in check, clear contexts and strong oversight are key. Implementing a framework for detecting hallucinations can help reduce mistakes, especially in open-book systems where the quality of retrieved data matters. Grounding AI models in accurate, domain-specific knowledge can lower the chances of generating irrelevant or wrong information.
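For open-book systems, one simple (and admittedly crude) way to flag ungrounded output is to check how much of a response is actually supported by the retrieved context. The lexical-overlap heuristic and the 0.5 threshold below are illustrative assumptions; production hallucination detection typically relies on stronger semantic methods.

```python
import re

def groundedness_score(response: str, retrieved_context: str) -> float:
    """Crude proxy: share of response sentences whose words overlap the context."""
    sentences = [s for s in re.split(r"[.!?]", response) if s.strip()]
    context_words = set(re.findall(r"\w+", retrieved_context.lower()))
    grounded = 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        overlap = len(words & context_words) / max(len(words), 1)
        if overlap >= 0.5:  # arbitrary threshold; tune for your domain
            grounded += 1
    return grounded / max(len(sentences), 1)

context = "Galileo Galilei was born in Pisa in 1564."
answer = "Galileo was born in Pisa in 1564. He also invented the telephone."
score = groundedness_score(answer, context)
if score < 1.0:
    print(f"Possible hallucination: only {score:.0%} of sentences are grounded.")
```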
Having humans oversee AI systems is also crucial. Human-in-the-loop approaches let experts check outputs, particularly in high-stakes areas.
This oversight ensures AI follows the right guidelines and meets AI compliance standards, keeping AI applications reliable across different industries.
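Here is a rough sketch of what that human-in-the-loop routing could look like: outputs that fall below a confidence or groundedness threshold are held for expert review instead of being returned automatically. The `ReviewQueue` class and the 0.8 threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """Route risky outputs to a human instead of shipping them automatically."""
    threshold: float = 0.8                      # assumed cutoff; set per use case
    pending: list[dict] = field(default_factory=list)

    def triage(self, output: str, score: float) -> str:
        if score >= self.threshold:
            return output                       # confident enough to return as-is
        # Otherwise park it for a domain expert to approve, correct, or reject.
        self.pending.append({"output": output, "score": score})
        return "Held for human review."

queue = ReviewQueue()
print(queue.triage("Dosage: 500 mg twice daily.", score=0.62))  # held for review
print(queue.triage("Store below 25 °C.", score=0.93))           # returned as-is
```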
In short, tackling AI hallucinations takes a layered approach: defining clear contexts, integrating quality data, and maintaining human oversight. This way, industries can make the most of AI while avoiding its potential downsides.
Human-in-the-loop (HITL) systems are crucial when evaluating AI. Even though AI is powerful, it often hits unknown edge cases where it falls short. In the debate between LLM evaluation vs. human evaluation, human intervention helps refine metrics and test sets, filling in the gaps where AI might struggle.
Generative AI models bring unique challenges compared to traditional software. Evaluation has traditionally been a data science discipline, but as AI becomes more widespread and easier to implement through APIs, there's a growing need for diverse expertise in the evaluation process.
Human evaluators play a key role in handling complex scenarios that AI might miss. Just like software quality testing relies on human insights to catch nuances and potential problems, AI evaluation benefits greatly from human input.
Chatterji points out that while software engineers focus on architecture and data use, AI’s unpredictable nature—the fact that there are no predefined metrics for quality beyond cost and latency—means humans need to step in to spot and fix errors.
In AI evaluation, humans add a level of scrutiny that algorithms alone can’t provide. They make sure AI systems perform as expected by checking real-world applicability and compliance with industry standards.
Such a partnership ensures that AI applications are reliable, user-friendly, and ethically sound.
As we navigate the complexities of evaluating Generative AI, observability tools like Galileo become essential. Galileo provides robust solutions for evaluating, monitoring, and protecting AI applications, streamlining workflows to ensure consistency and high-quality outputs.
Learn more about how Galileo can enhance your AI initiatives.
Also, listen to the entire episode on the "Chain of Thought" podcast where Galileo’s Co-founder and CTO Atin Sanyal joins Chip Huyen (Storyteller, Tép Studio) and Vivienne Zhang (Senior Product Manager, Generative AI Software, Nvidia) to dive deeper into the practical lessons learned from GenAI evaluations.