Aug 13, 2025

Best LLMs for AI Agents in Insurance

Pratik Bhavsar

Galileo Labs


Insights From Domain Specific Support Agent Evaluation

Generative AI and agent systems are driving massive change in the insurance industry. Success goes beyond policies and premiums: it is about giving fast, personal support amid rising risks and stronger regulatory oversight. Customers want smooth solutions for claims, adjustments, and assessments in compliance-heavy settings governed by frameworks such as NAIC guidelines.

AI agents powered by advanced LLMs automate underwriting, claims handling, and fraud detection while ensuring empathetic and accurate interactions. With 85% of CEOs expecting positive ROI from scaled AI investments by 2027, choosing the right LLM is key to avoiding problems like wrongful denials that damage trust and profits.

To help solve the LLM selection challenge, we built a realistic insurance dataset into our Agent Leaderboard v2. We evaluated top models for agents within the insurance sector, and found some intriguing insights which we’ll share throughout this piece. 

→ Explore the Live Leaderboard

→ Explore the Dataset

→ Explore the Code

Emerging AI Agents in Insurance

Recent deployments in 2024-2025 highlight generative AI's role in conversational tools and automation.

Lemonade's Maya and AI Jim: Lemonade's AI-driven platform has processed millions of claims since launch, with Maya handling customer queries and AI Jim automating claims via photo analysis. In 2025, they added real-time fraud alerts, saving users billions in prevented losses and boosting satisfaction through proactive notifications.

GEICO's Kate: GEICO's virtual assistant Kate now integrates generative AI for personalized quotes and coverage advice, handling over 100 million interactions annually. 2025 updates focus on multilingual support and integration with telematics for usage-based insurance.

Allstate's Virtual Assistant Suite: Allstate's AI tools, including chatbots for policy management and claims status, manage 60% of inquiries autonomously. Recent 2025 features incorporate predictive analytics for risk prevention, like alerting users to potential home hazards.

AXA's Clara: AXA's Clara provides 24/7 support for policy insights and claims, using AI to categorize risks and offer tailored recommendations. It has delivered billions of interactions, with 2025 updates emphasizing omnichannel engagement across apps and social media.

Takeaways:

  • These agents automate repetitive tasks, slashing error rates in claims and underwriting.

  • Adoption drives massive savings (e.g., billions in fraud prevention) and elevates customer loyalty through personalization.

  • Internal tools like copilots enhance productivity without amplifying risks.

Real World Failures of AI Support Agents

AI support agents have shown remarkable promise but also exposed critical weaknesses in real-world deployments. In insurance and beyond, we’ve seen AI replacements backfiring so badly that firms had to rehire humans. These failures underscore the importance of robust guardrails, human oversight, and thorough testing before entrusting AI with customer-facing workflows.

Biased Underwriting at Major Insurers: AI algorithms have been criticized for discriminatory pricing based on flawed data, leading to higher premiums for certain demographics and regulatory probes. For instance, systems trained on historical data perpetuated biases in medical histories, resulting in unfair denials.

Inaccurate Claim Denials: Health insurers using AI have faced backlash for wrongful denials due to algorithmic errors, flagging valid claims as fraudulent and delaying payouts. This eroded trust and sparked lawsuits over inaccurate assessments.

Chatbot Hallucinations and Errors: Similar to Air Canada's case, where a chatbot promised unhonored discounts, insurance bots have given misleading policy advice, leading to disputes. Virgin Money's chatbot inappropriately responded to queries, highlighting conversational risks.

Fraud Detection Overreach: AI systems wrongly flagged legitimate transactions, causing policy cancellations and customer churn. This mirrors broader failures where ML models malfunction in edge cases.

Klarna’s AI Replacement Reversal: Klarna replaced 700 customer-service employees with AI-powered systems only to admit that service quality plunged and customer satisfaction collapsed. Within months, the fintech giant announced it would rehire humans to patch the gaps left by its automated agents (Tech.co). In a separate analysis, testers found ways to confuse Klarna’s bot in unexpected scenarios, revealing brittle handling of edge-case queries.

These high-profile failures remind us that AI agents require rigorous evaluation, transparent error-handling strategies, and seamless human-in-the-loop integrations to prevent costly breakdowns in customer-facing applications.

The Role of AI in Insurance Agents

Since our goal was to evaluate insurance agents in real-world scenarios, we wanted to capture challenges as closely as possible to how humans actually interact. User interactions with AI agents often involve complex, multifaceted queries that bundle multiple tasks into a single conversation. These requests typically arise from urgent personal or business needs, such as managing the aftermath of an accident, renewing policies, or planning coverage. Here's a breakdown of common request types and the inherent challenges for AI agents.

Common request types

AI agents in insurance must handle a wide array of queries, often requiring precise data handling and quick resolutions. Key examples include:

  • Claims and incident actions: Filing claims for accidents, disputing denials, or setting up policy change alerts. These are common and detail-intensive, involving specifics like incident dates, damages, and witnesses.

  • Policy transfers: Updating beneficiaries, shifting coverage between vehicles or properties, or adjusting for international policies. Requests may include deadlines, recipient details, and verification steps.

  • Status checks: Verifying progress on claims, applications, payments, or renewals. Users demand accurate confirmations, often tied to reference IDs or dates.

  • Payment setups: Configuring automatic premium payments for auto, home, life, or health policies, with customizable amounts, frequencies, and start dates.

  • Information updates: Modifying contact details like phone numbers, addresses, or emails, including temporary changes for relocations.

  • Appointments and consultations: Scheduling sessions with agents for policy reviews, claims adjustments, or new coverage options, specifying preferred times or specialists.

  • Coverage and location services: Generating quotes, comparing plans, or locating adjusters/agents, particularly for multi-state or international scenarios.

  • Product research: Exploring options for auto, home, life, or health insurance, including premium comparisons, terms, and eligibility criteria.

Key challenges for AI agents

These interactions pose significant hurdles due to insurance's regulated, high-stakes nature. What makes them particularly difficult includes:

  • Multi-task bundling: Users often pack 4-7 unrelated goals into one message, forcing the agent to parse, prioritize, sequence actions, and retain context throughout.

  • Urgency and deadlines: Time-sensitive elements like filing claims before expiration or renewals by due dates amplify error risks under pressure.

  • Sensitive data handling: Queries involve personally identifiable information (PII), medical records, and financial data, necessitating strict adherence to regulations like GDPR, HIPAA, or PCI DSS, plus secure verification.

  • Growing regulatory pressure: Frameworks such as NAIC guidelines and the EU AI Act demand transparent, explainable systems. However, LLM-based agents are often "black-box" models, complicating interpretability, auditability, and risk management.

  • Ambiguity and edge cases: Incomplete details (e.g., vague incident reports) or unclear instructions require non-frustrating clarifications, while variations in policies, jurisdictions, and risks add layers of complexity.

  • Interdependencies: Tasks frequently rely on one another, such as validating a claim before coverage updates or eligibility checks prior to approvals, demanding advanced logical reasoning and tool orchestration.

  • Human-like interaction: Users anticipate empathetic, conversational responses in a compliance-heavy field, where mistakes could result in financial losses or legal violations.

AI agents must leverage LLMs that excel in tool selection, context retention, and robust error handling to deliver reliable and compliant performance and ensure seamless support in this demanding sector.

Understanding Agent Leaderboard

Now, let's look at how we baked these ideas into a robust dataset and methodology for evaluating insurance agents. Galileo’s Agent Leaderboard v2 is a publicly accessible ranking of 17 leading language models tested on realistic, multi-turn enterprise scenarios across five industries: banking, healthcare, investment, telecom, and insurance. Unlike academic benchmarks that focus on one-off tasks, we simulated back-and-forth dialogues with synthetic personas, reflecting the complexity of real customer interactions. Each agent has access to domain-specific tools and is scored on how effectively it navigates them to solve user problems. Results are updated monthly, so you can always compare the latest models and variants.

Example: Insurance scenario

"I need to file a claim for my car accident yesterday, verify my home policy renewal on the 15th, set up automatic premium payments, find an adjuster near my location in Paris, get travel insurance quotes, and configure fraud alerts before I leave for Europe Thursday."

This isn't just about calling the right API. The agent must:

  • Maintain context across 6+ distinct requests

  • Handle time-sensitive coordination

  • Navigate tool dependencies

  • Provide clear confirmations for each goal (a minimal goal-tracking sketch follows this list)
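To make the multi-goal requirement concrete, here is a minimal, hypothetical Python sketch of how an agent harness might track each bundled goal; the goal list, claim ID, and helper names are illustrative and not part of the benchmark code.

from dataclasses import dataclass, field

@dataclass
class Goal:
    description: str
    completed: bool = False
    confirmation: str = ""  # the agent's confirmation message for this goal

@dataclass
class SessionState:
    goals: list[Goal] = field(default_factory=list)

    def pending(self) -> list[Goal]:
        return [g for g in self.goals if not g.completed]

    def confirm(self, index: int, message: str) -> None:
        self.goals[index].completed = True
        self.goals[index].confirmation = message

# The six goals bundled into the example message above.
state = SessionState(goals=[
    Goal("File a claim for yesterday's car accident"),
    Goal("Verify home policy renewal on the 15th"),
    Goal("Set up automatic premium payments"),
    Goal("Find an adjuster near Paris"),
    Goal("Get travel insurance quotes"),
    Goal("Configure fraud alerts before Thursday"),
])
state.confirm(0, "Claim CLM-001 filed")  # illustrative confirmation only
print(f"{len(state.pending())} of {len(state.goals)} goals still open")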

https://galileo.ai/agent-leaderboard 

Challenges for Agents in Insurance 

Deploying an AI agent in insurance requires careful planning and robust technology. Here's a breakdown of the key challenges that must be addressed to ensure effective implementation.

Key challenges

These hurdles stem from the sector's emphasis on accuracy, security, and user satisfaction. They include:

  • Regulatory compliance: Models must respect strict data-privacy requirements and maintain audit trails for every action.

  • Multi-step transactions: A single conversation might involve filing a claim, querying a policy schedule, and booking a coverage adjustment.

  • Context preservation: Agents must carry context across turns, coordinating tool calls to internal APIs or external services without losing track of user goals.

  • Latency sensitivity: Customers expect near-real-time responses. Prolonged delays undermine trust and drive users to call human support.

  • Error handling: Edge cases and ambiguous requests require clear validation steps and fallback logic to avoid incomplete or incorrect operations.

Understanding these challenges helps us interpret agent performance metrics and choose models that balance capability, cost, and responsiveness.

Evaluation Criteria

Our analysis focuses on four core metrics drawn from the Agent Leaderboard v2 benchmark:

  • Action Completion (AC)
    Measures end-to-end task success. Did the agent fulfill every user goal in the scenario?

  • Tool Selection Quality (TSQ)
    Captures precision and recall in selecting and parameterizing the correct APIs. High TSQ means fewer unnecessary or erroneous calls.

  • Cost per Session
    Estimates the dollars spent per complete user interaction. Balancing budget constraints with performance needs is critical for high-volume deployments (a token-based cost estimation sketch follows this list).

  • Average Session Duration
    Reflects latency and turn count. Faster sessions improve user experience but may trade off thoroughness.
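For the cost metric in particular, a minimal sketch of the underlying arithmetic follows: cost per session is roughly input and output token counts multiplied by per-token prices. The prices below are placeholders, not any provider's actual rates.

# Rough cost-per-session estimate from token usage (prices are placeholders).
PRICE_PER_M_INPUT_TOKENS = 2.00   # USD per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT_TOKENS = 8.00  # USD per 1M output tokens (assumed)

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * PRICE_PER_M_INPUT_TOKENS \
         + (output_tokens / 1_000_000) * PRICE_PER_M_OUTPUT_TOKENS

# Example: a long multi-turn session that repeatedly resends tool schemas in context.
print(round(session_cost(input_tokens=55_000, output_tokens=4_000), 4))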

With these metrics in mind, let us explore the standout models for insurance agents.

Building a Multi-Domain Synthetic Dataset

Creating a benchmark that truly reflects the demands of enterprise AI requires not just scale, but depth and realism. For Agent Leaderboard v2, we built a multi-domain dataset from the ground up, focusing on five critical sectors: banking, investment, healthcare, telecom, and insurance. Each domain required a distinct set of tools, personas, and scenarios to capture the complexities unique to that sector. Here is how we constructed this dataset.

Step 1: Generating domain-specific tools

The foundation of our dataset is a suite of synthetic tools tailored to each domain. These tools represent the actions, services, or data operations an agent might need when assisting real users. We used Anthropic’s Claude, guided by a structured prompt, to generate each tool in strict JSON schema format. Every tool definition specifies its parameters, required fields, expected input types, and the structure of its response. 

We carefully validated each generated tool to ensure no duplicates and guarantee functional domain coverage. This step ensures the simulated environment is rich and realistic, giving agents a robust toolkit that mirrors actual APIs and services used in enterprise systems.

Example tool call definition:

{
    "description": "Retrieves detailed information about a customer's active insurance policies including coverage details, premium amounts, deductibles, and policy status.",
    "properties": {
      "customer_id": {
        "description": "Unique identifier for the customer whose policies need to be retrieved.",
        "type": "string",
        "title": "Customer_Id"
      },
      "policy_type": {
        "description": "Type of insurance policy to filter results.",
        "type": "string",
        "title": "Policy_Type",
        "enum": [
          "auto",
          "home",
          "life",
          "health",
          "renters",
          "umbrella",
          "business"
        ]
      },
      "policy_status": {
        "description": "Status of policies to include in the search results.",
        "type": "string",
        "title": "Policy_Status",
        "enum": [
          "active",
          "pending",
          "expired",
          "cancelled",
          "all"
        ]
      },
      "include_billing_info": {
        "description": "Whether to include billing and payment information in the response.",
        "type": "boolean",
        "title": "Include_Billing_Info"
      }
    },
    "required": [
      "customer_id",
      "policy_type"
    ],
    "title": "get_customer_policies",
    "type": "object"
  }
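The duplicate and coverage checks described above can be approximated with a few structural assertions. This is a simplified, stdlib-only sketch; the exact validation used in the benchmark pipeline may differ, and the input path is hypothetical.

import json

def validate_tools(tool_schemas: list[dict]) -> list[str]:
    """Return a list of structural problems found in generated tool definitions."""
    problems: list[str] = []
    seen_titles: set[str] = set()
    for schema in tool_schemas:
        title = schema.get("title", "<missing title>")
        if title in seen_titles:
            problems.append(f"duplicate tool: {title}")
        seen_titles.add(title)
        props = schema.get("properties", {})
        # Every required field must be defined as a parameter.
        for name in schema.get("required", []):
            if name not in props:
                problems.append(f"{title}: required field '{name}' is not defined")
        # Every parameter needs a type and a description.
        for name, spec in props.items():
            if "type" not in spec or "description" not in spec:
                problems.append(f"{title}: parameter '{name}' lacks a type or description")
    return problems

with open("insurance_tools.json") as f:  # hypothetical path to the generated tool list
    print(validate_tools(json.load(f)) or "all tools look structurally valid")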

Example tool response schema:

{
      "description": "List of customer policies with essential coverage and billing details.",
      "type": "object",
      "properties": {
        "policies": {
          "description": "Array of policy objects containing coverage details and current status.",
          "type": "array"
        },
        "total_monthly_premium": {
          "description": "Total monthly premium amount across all returned policies.",
          "type": "number"
        }
      },
      "required": [
        "policies",
        "total_monthly_premium"
      ]
}

Step 2: Designing synthetic personas

After establishing the available tools, we focused on the users themselves. We developed a diverse set of synthetic personas to reflect the range of customers or stakeholders that an enterprise might serve. Each persona is defined by their name, age, occupation, personality traits, tone, and preferred level of communication detail. We prompted Claude to create personas that differ in age group, profession, attitude, and comfort with technology. The validation process checks that each persona is unique and plausible.

This diversity is key for simulating authentic interactions and ensures that agents are evaluated not only on technical skills but also on adaptability and user-centric behavior.

Example personas:

[
  {
    "name": "Margaret Chen",
    "age": 58,
    "occupation": "High School English Teacher nearing retirement",
    "personality_traits": [
      "methodical",
      "skeptical",
      "patient"
    ],
    "tone": "formal",
    "detail_level": "comprehensive"
  },
  {
    "name": "Jamal Williams",
    "age": 32,
    "occupation": "Freelance software developer and tech startup founder",
    "personality_traits": [
      "tech-savvy",
      "impatient",
      "analytical"
    ],
    "tone": "casual",
    "detail_level": "comprehensive"
  }
]

Step 3: Crafting challenging scenarios

The final step is where the dataset comes to life. We generated chat scenarios for each domain that combine the available tools and personas. Each scenario is constructed to challenge the agent with 5 to 8 interconnected user goals that must be accomplished within a single conversation. Scenarios are carefully engineered to introduce real-world complexity: hidden parameters, time-sensitive requests, interdependent tasks, tool ambiguity, and potential contradictions. We target a range of failure modes, such as incomplete fulfillment, tool selection errors, and edge-case handling.

Each scenario also belongs to a specific stress-test category, such as adaptive tool use or scope management, to ensure coverage of different agent capabilities. Every scenario is validated for complexity and correctness before being included in the benchmark.
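Since the scenario JSON itself is not reproduced here, the dataclass below shows one plausible way such a scenario record could be structured. The field names and values are assumptions for illustration, not the benchmark's actual schema.

from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Illustrative scenario record; field names are assumed, not the actual schema."""
    domain: str                    # e.g. "insurance"
    persona_name: str              # links to one of the synthetic personas
    category: str                  # stress-test category, e.g. "adaptive_tool_use"
    goals: list[str]               # 5 to 8 interconnected user goals
    first_message: str             # opening message the user simulator sends
    available_tools: list[str] = field(default_factory=list)  # tool titles the agent may call

example = Scenario(
    domain="insurance",
    persona_name="Margaret Chen",
    category="adaptive_tool_use",
    goals=[
        "File an auto claim with incomplete incident details",
        "Verify home policy renewal before the 15th",
        "Set up automatic premium payments",
        "Locate an adjuster near Paris",
        "Get travel insurance quotes before Thursday",
    ],
    first_message="I was in an accident yesterday and I'm leaving for Europe on Thursday...",
    available_tools=["get_customer_policies"],  # other tool names omitted rather than invented
)
print(f"{len(example.goals)} goals in this {example.domain} scenario")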

Why a synthetic approach?

We chose a synthetic data approach for several vital reasons. First, generative AI allows us to create an unlimited variety of tools, personas, and scenarios without exposing any real customer data or proprietary information. This eliminates the risk of data leakage or privacy concerns. 

Second, a synthetic approach lets us precisely control each benchmark's difficulty, structure, and coverage. We can systematically probe for known model weaknesses, inject edge cases, and have confidence that no model has “seen” the data before. 

Finally, by designing every component—tools, personas, and scenarios—from scratch, we create a fully isolated, domain-specific testbed that offers fair, repeatable, and transparent evaluation across all models.

The Result

The result is a highly realistic, multi-domain dataset that reflects the challenges enterprise AI agents face in the real world. It lets us benchmark beyond basic tool use and probe an agent’s reliability, adaptability, and ability to accomplish real user goals under dynamic and complex conditions.

Simulation For Evaluating AI Agents

Once the tools, personas, and scenarios have been created for each domain, we use a robust simulation pipeline to evaluate how different AI models behave in realistic, multi-turn conversations. This simulation is the heart of the updated Agent Leaderboard, designed to mimic the challenges enterprise agents face in production environments.

Step 1: Experiment orchestration

The process begins by selecting which models, domains, and scenario categories to test. For each unique combination, the system spins up parallel experiments. This parallelization allows us to benchmark many models efficiently across hundreds of domain-specific scenarios.
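A rough sketch of that orchestration is shown below; the model identifiers, category names, and run_experiment function are placeholders rather than the leaderboard's actual code.

from concurrent.futures import ThreadPoolExecutor
from itertools import product

MODELS = ["model-a", "model-b"]  # placeholder model identifiers
DOMAINS = ["banking", "healthcare", "investment", "telecom", "insurance"]
CATEGORIES = ["adaptive_tool_use", "scope_management"]  # illustrative stress-test categories

def run_experiment(model: str, domain: str, category: str) -> dict:
    # Placeholder: in the real pipeline this would run every scenario in the
    # (domain, category) bucket against the model and return aggregate metrics.
    return {"model": model, "domain": domain, "category": category}

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(run_experiment, m, d, c)
               for m, d, c in product(MODELS, DOMAINS, CATEGORIES)]
    results = [f.result() for f in futures]

print(f"completed {len(results)} experiment buckets")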

Step 2: The simulation engine

Each experiment simulates a real chat session between three key components:

  • The AI Agent (LLM): This is the model being tested. It acts as an assistant that tries to understand the user’s requests, select the right tools, and complete every goal.

  • The User Simulator: A generative AI system roleplays as the user, using the previously created persona and scenario. It sends the initial message and continues the conversation, adapting based on the agent’s responses and tool outputs.

  • The Tool Simulator: This module responds to the agent’s tool calls, using predefined tool schemas to generate realistic outputs. The agent never interacts with real APIs or sensitive data. Everything is simulated to match the domain's specification.

The simulation loop begins with the user’s first message. The agent reads the conversation history, decides what to do next, and may call one or more tools. Each tool call is simulated, and the responses are fed back into the conversation. The user simulator continues the dialog, pushing the agent through all of the user’s goals and adapting its language and requests based on the agent’s performance.
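A stripped-down version of that loop might look like the sketch below, with the three components reduced to plain callables and the stopping condition simplified. It approximates the flow described above and is not the pipeline's actual implementation.

from typing import Callable

Message = dict  # e.g. {"role": "user" | "assistant" | "tool", "content": str, "tool_calls": [...]}

def simulate_session(
    agent: Callable[[list], Message],           # the LLM under test
    user_simulator: Callable[[list], Message],  # persona-driven user simulator
    tool_simulator: Callable[[dict], str],      # returns synthetic tool output
    max_steps: int = 40,
) -> list:
    history: list[Message] = [user_simulator([])]  # the user opens the conversation
    for _ in range(max_steps):
        reply = agent(history)
        history.append(reply)
        tool_calls = reply.get("tool_calls", [])
        if tool_calls:
            # Simulate each requested tool call and let the agent continue with the results.
            for call in tool_calls:
                history.append({"role": "tool", "content": tool_simulator(call)})
            continue
        # No tool calls: the assistant addressed the user, so the user simulator responds.
        user_msg = user_simulator(history)
        if user_msg.get("content") == "<END>":  # simplified end-of-conversation signal
            break
        history.append(user_msg)
    return history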

Step 3: Multi-turn, multi-goal evaluation

Each chat session continues for a fixed number of turns or until the user simulator determines the conversation is complete. The agent must navigate complex, interdependent goals—such as transferring funds, updating preferences, or resolving ambiguous requests—all while coordinating multiple tools and keeping context across turns. We record tool calls, arguments, responses, and conversation flow at every step for later evaluation.

Step 4: Metrics and logging

After each simulation run, we analyze the conversation using two primary quality metrics, Tool Selection Quality and Action Completion, along with cost and efficiency measures (a simplified aggregation sketch follows this list):

  • Tool Selection Quality: Did the agent pick the right tool and use it correctly at each turn?

  • Action Completion: Did the agent complete every user goal in the scenario, providing clear confirmation or a correct answer?

  • Cost per Session: Estimates the dollars spent per complete user interaction. 

  • Average Session Duration: Reflects latency and turn count. 

  • Average Turns: Measures the average number of conversational turns (back-and-forth exchanges) per session.
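The per-session aggregation behind the two quality metrics can be approximated as below. In the actual benchmark the judgements come from LLM-based evaluators, so treat this purely as a simplified illustration of how the scores roll up.

def tool_selection_quality(tool_call_judgements: list[bool]) -> float:
    """Fraction of tool calls judged correct (tool choice and arguments)."""
    return sum(tool_call_judgements) / len(tool_call_judgements) if tool_call_judgements else 0.0

def action_completion(goal_judgements: list[bool]) -> float:
    """Fraction of user goals judged fully completed and clearly confirmed."""
    return sum(goal_judgements) / len(goal_judgements) if goal_judgements else 0.0

# Example session: 7 of 8 tool calls correct, 5 of 6 goals completed.
print(tool_selection_quality([True] * 7 + [False]))  # 0.875
print(action_completion([True] * 5 + [False]))       # ~0.833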

These scores, along with full conversation logs and metadata, are optionally logged to Galileo for advanced tracking and visualization. Results are also saved for each model, domain, and scenario, allowing detailed comparison and reproducibility.

Step 5: Scaling and analysis

Thanks to parallel processing, we can evaluate multiple models across many domains and categories at once. This enables robust benchmarking at scale, with experiment results automatically saved and organized for further analysis.

Why this approach matters

Our simulation pipeline delivers far more than static evaluation. It recreates the back-and-forth, high-pressure conversations agents face in the real world, ensuring models are assessed on accuracy and their ability to adapt, reason, and coordinate actions over multiple turns. This method uncovers strengths and weaknesses that would be missed by simpler, one-shot benchmarks and provides teams with actionable insights into how well their chosen model will work when deployed with real users.

Top Model Picks for Insurance Agents

Here are the results (as of Aug 13) from our experiments, highlighting key metrics: Action Completion (AC), Tool Selection Quality (TSQ), average cost per session, average session duration, and average turns per session. We've selected the standout models based on their performance in insurance-specific scenarios.

1. Qwen-235b-a22b-instruct-2507 (Alibaba)

  • Action Completion (AC): 0.74

  • Tool Selection Quality (TSQ): 0.85

  • Cost/Session: $0.008

  • Average Duration: 233.5 s

Qwen-235b-a22b-instruct-2507 ranks #1 for insurance workflows, achieving the highest action completion rate among all models with strong tool selection quality. As an open-source reasoning model, it excels at intricate, multi-step processes and adaptive requirements, offering great value for cost-sensitive enterprises that prioritize performance and transparency. Its longer average duration suits compliance-driven, in-depth tasks, making it a strong choice for insurers that need reliability without proprietary constraints.

2. Mistral-medium-2508 (Mistral)

  • Action Completion (AC): 0.70

  • Tool Selection Quality (TSQ): 0.73

  • Cost/Session: $0.019

  • Average Duration: 37.0 s

Mistral-medium-2508 takes second place with solid action completion and balanced tool selection, making it a versatile proprietary option for day-to-day insurance operations. It delivers a great speed-effectiveness balance, ideal for real-time customer interactions and fast claims processing. With its low cost and short processing time, it’s a cost-effective pick for mid-sized insurers aiming for efficiency.

3. GPT-4.1-2025-04-14 (OpenAI)

  • Action Completion (AC): 0.66

  • Tool Selection Quality (TSQ): 0.68

  • Cost/Session: $0.063

  • Average Duration: 25.2 s

GPT-4.1 secures third place, offering strong action completion and reliable tool selection quality. This proprietary model handles complex requirements with ease, making it a dependable option for mission-critical insurance scenarios. Its fast execution and moderate cost make it ideal where speed and compliance are crucial, though it’s now slightly behind emerging leaders in completion rates.

4. Kimi-k2-instruct (Moonshot AI)

  • Action Completion (AC): 0.58

  • Tool Selection Quality (TSQ): 0.88

  • Cost/Session: $0.037

  • Average Duration: 164.9 s

Kimi-k2-instruct shines with excellent tool selection quality that rivals many proprietary models. While its action completion is moderate, it can perform well in complex reasoning tasks like underwriting and fraud detection, where auditability matters. Its longer runtime and affordable cost appeal to insurers who prioritize transparency and control over sheer speed.

Strategic Recommendations

To maximize the value of these LLMs in insurance AI agents, consider the following guidelines:

  • Choose models based on task complexity: For multi-step, compliance-sensitive workflows, favor high Action Completion models. 

  • Plan for error handling: Edge scenarios with low completion rates need fallback guards, such as validation layers or human-in-the-loop checks.

  • Optimize for cost and latency: Balance performance with budget and SLA constraints by benchmarking session costs and response times.

  • Implement safety controls: Enforce policies and restrict tool access for models with inconsistent tool detection to avoid harmful calls.

  • Open source vs. closed source: Open-weight models like Qwen can cover baseline operations at low cost, while closed-source frontrunners can be reserved for mission-critical tasks (see the toy scoring sketch below).
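To make the trade-offs concrete, here is a toy weighted-scoring helper over the insurance numbers reported above. The weights and normalization caps are illustrative; adjust them to your own AC, TSQ, cost, and latency priorities.

# Toy model-selection score using the Aug 13 insurance metrics reported above.
MODELS = {
    "qwen-235b-a22b-instruct-2507": {"ac": 0.74, "tsq": 0.85, "cost": 0.008, "duration_s": 233.5},
    "mistral-medium-2508":          {"ac": 0.70, "tsq": 0.73, "cost": 0.019, "duration_s": 37.0},
    "gpt-4.1-2025-04-14":           {"ac": 0.66, "tsq": 0.68, "cost": 0.063, "duration_s": 25.2},
    "kimi-k2-instruct":             {"ac": 0.58, "tsq": 0.88, "cost": 0.037, "duration_s": 164.9},
}

def score(m: dict, w_ac=0.5, w_tsq=0.3, w_cost=0.1, w_speed=0.1) -> float:
    cost_score = 1 - min(m["cost"] / 0.10, 1.0)        # cheaper is better, capped at $0.10/session
    speed_score = 1 - min(m["duration_s"] / 300, 1.0)  # faster is better, capped at 300 s
    return w_ac * m["ac"] + w_tsq * m["tsq"] + w_cost * cost_score + w_speed * speed_score

for name, metrics in sorted(MODELS.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{name}: {score(metrics):.3f}")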

What Next?

AI agents are reshaping insurance from claims processing to customer service and internal compliance. The Agent Leaderboard v2 offers actionable insights into which LLMs best fit your needs, whether you prioritize completion rates, cost-efficiency, or tool precision. Read the launch blog to dive deeper into our methodology and full benchmark results.

To learn more, read our in-depth eBook, which covers how to:

  • Choose the right agentic framework for your use case

  • Evaluate and improve AI agent performance

  • Identify failure points and production issues

Generative AI and agent systems are driving massive change in the insurance industry. Success goes beyond policies and premiums. It is about giving fast, personal support during rising risks and stronger regulatory oversight. Customers want smooth solutions for claims, adjustments, and assessments in settings with heavy compliance like NAIC guidelines. 

AI agents powered by advanced LLMs automate underwriting, claims handling, and fraud detection processes while ensuring kind and accurate interactions. With 85% of CEOs expecting positive ROI from scaled AI investments by 2027, choosing the right LLM is key to avoiding problems like wrong denials that damage trust and profits. 

To help solve the LLM selection challenge, we built a realistic insurance dataset into our Agent Leaderboard v2. We evaluated top models for agents within the insurance sector, and found some intriguing insights which we’ll share throughout this piece. 

→ Explore the Live Leaderboard

→ Explore the Dataset

→ Explore the Code

Emerging AI Agents in Insurance

Recent deployments in 2024-2025 highlight generative AI's role in conversational tools and automation.

Lemonade's Maya and AI Jim: Lemonade's AI-driven platform has processed millions of claims since launch, with Maya handling customer queries and AI Jim automating claims via photo analysis. In 2025, they added real-time fraud alerts, saving users billions in prevented losses and boosting satisfaction through proactive notifications.

GEICO's Kate: GEICO's virtual assistant Kate now integrates generative AI for personalized quotes and coverage advice, handling over 100 million interactions annually. 2025 updates focus on multilingual support and integration with telematics for usage-based insurance.

Allstate's Virtual Assistant Suite: Allstate's AI tools, including chatbots for policy management and claims status, manage 60% of inquiries autonomously. Recent 2025 features incorporate predictive analytics for risk prevention, like alerting users to potential home hazards.

AXA's Clara: AXA's Clara provides 24/7 support for policy insights and claims, using AI to categorize risks and offer tailored recommendations. It has delivered billions of interactions, with 2025 updates emphasizing omnichannel engagement across apps and social media.

Takeaways:

  • These agents automate repetitive tasks, slashing error rates in claims and underwriting.

  • Adoption drives massive savings e.g., billions in fraud prevention and elevates customer loyalty via personalization.

  • Internal tools like copilots enhance productivity without amplifying risks.

Real World Failures of AI Support Agents

AI support agents have shown remarkable promise but also exposed critical weaknesses in real-world deployments. In insurance and beyond, we’ve seen AI replacements backfiring so badly that firms had to rehire humans. These failures underscore the importance of robust guardrails, human oversight, and thorough testing before entrusting AI with customer-facing workflows.

Biased Underwriting at Major Insurers: AI algorithms have been criticized for discriminatory pricing based on flawed data, leading to higher premiums for certain demographics and regulatory probes. For instance, systems trained on historical data perpetuated biases in medical histories, resulting in unfair denials.

Inaccurate Claim Denials: Health insurers using AI have faced backlash for wrongful denials due to algorithmic errors, flagging valid claims as fraudulent and delaying payouts. This eroded trust and sparked lawsuits over inaccurate assessments.

Chatbot Hallucinations and Errors: Similar to Air Canada's case, where a chatbot promised unhonored discounts, insurance bots have given misleading policy advice, leading to disputes. Virgin Money's chatbot inappropriately responded to queries, highlighting conversational risks.

Fraud Detection Overreach: AI systems wrongly flagged legitimate transactions, causing policy cancellations and customer churn. This mirrors broader failures where ML models malfunction in edge cases.

Klarna’s AI Replacement Reversal: Klarna replaced 700 customer-service employees with AI-powered systems only to admit that service quality plunged and customer satisfaction collapsed. Within months, the fintech giant announced it would rehire humans to patch the gaps left by its automated agents (Tech.co). In a separate analysis, testers found ways to confuse Klarna’s bot in unexpected scenarios, revealing brittle handling of edge-case queries.

These high-profile failures remind us that AI agents require rigorous evaluation, transparent error-handling strategies, and seamless human-in-the-loop integrations to prevent costly breakdowns in customer-facing applications.

The Role of AI in Insurance Agents

Since our goal was to evaluate insurance agents in real real-world scenarios, we wanted to capture challenges as close as possible to how humans interact. User interactions with AI agents often involve complex, multifaceted queries that bundle multiple tasks into a single conversation. These requests typically arise from urgent personal or business needs, such as managing the aftermath of an accident, renewing policies, or planning coverage. Here's a breakdown of common request types and the inherent challenges for AI agents.

Common request types

AI agents in insurance must handle a wide array of queries, often requiring precise data handling and quick resolutions. Key examples include:

  • Claims and incident actions: Filing claims for accidents, disputing denials, or setting up policy change alerts. These are common and detail-intensive, involving specifics like incident dates, damages, and witnesses.

  • Policy transfers: Updating beneficiaries, shifting coverage between vehicles or properties, or adjusting for international policies. Requests may include deadlines, recipient details, and verification steps.

  • Status checks: Verifying progress on claims, applications, payments, or renewals. Users demand accurate confirmations, often tied to reference IDs or dates.

  • Payment setups: Configuring automatic premium payments for auto, home, life, or health policies, with customizable amounts, frequencies, and start dates.

  • Information updates: Modifying contact details like phone numbers, addresses, or emails, including temporary changes for relocations.

  • Appointments and consultations: Scheduling sessions with agents for policy reviews, claims adjustments, or new coverage options, specifying preferred times or specialists.

  • Coverage and location services: Generating quotes, comparing plans, or locating adjusters/agents, particularly for multi-state or international scenarios.

  • Product research: Exploring options for auto, home, life, or health insurance, including premium comparisons, terms, and eligibility criteria.

Key challenges for AI agents

These interactions pose significant hurdles due to insurance's regulated, high-stakes nature. What makes them particularly difficult includes:

  • Multi-task bundling: Users often pack 4-7 unrelated goals into one message, forcing the agent to parse, prioritize, sequence actions, and retain context throughout.

  • Urgency and deadlines: Time-sensitive elements like filing claims before expiration or renewals by due dates amplify error risks under pressure.

  • Sensitive data handling: Queries involve personal identifiable information (PII), medical records, and financial data, necessitating strict adherence to regulations like GDPR, HIPAA, or PCI DSS, plus secure verification.

  • Growing regulatory pressure: Frameworks such as NAIC guidelines and the EU AI Act demand transparent, explainable systems. However, LLM-based agents are often "black-box" models, complicating interpretability, auditability, and risk management.

  • Ambiguity and edge cases: Incomplete details (e.g., vague incident reports) or unclear instructions require non-frustrating clarifications, while variations in policies, jurisdictions, and risks add layers of complexity.

  • Interdependencies: Tasks frequently rely on one another, such as validating a claim before coverage updates or eligibility checks prior to approvals, demanding advanced logical reasoning and tool orchestration.

  • Human-like interaction: Users anticipate empathetic, conversational responses in a compliance-heavy field, where mistakes could result in financial losses or legal violations.

AI agents must leverage LLMs that excel in tool selection, context retention, and robust error handling to deliver reliable and compliant performance and ensure seamless support in this demanding sector.

Understanding Agent Leaderboard

Now, let's look at how we baked in these ideas to build a robust dataset and methodology to evaluate insurance agents. Galileo’s Agent Leaderboard v2 is a publicly accessible ranking of 17 leading language models tested on realistic, multi-turn enterprise scenarios across five industries: banking, healthcare, investment, telecom, and insurance Unlike academic benchmarks focusing on one-off tasks, we simulated back-and-forth dialogues with synthetic personas, reflecting the complexity of real customer interaction. Each agent has access to domain-specific tools and is scored on how effectively it navigates them to solve user problems. Results are updated monthly, so you can always compare the latest models and variants.

Example: Insurance scenario

"I need to file a claim for my car accident yesterday, verify my home policy renewal on the 15th, set up automatic premium payments, find an adjuster near my location in Paris, get travel insurance quotes, and configure fraud alerts before I leave for Europe Thursday."

This isn't just about calling the right API. The agent must:

  • Maintain context across 6+ distinct requests

  • Handle time-sensitive coordination

  • Navigate tool dependencies

  • Provide clear confirmations for each goal

https://galileo.ai/agent-leaderboard 

Challenges for Agents in Insurance 

Deploying an AI agent in insurance requires careful planning and robust technology. Here's a breakdown of the key challenges that must be addressed to ensure effective implementation.

Key challenges

These hurdles stem from the sector's emphasis on accuracy, security, and user satisfaction. They include:

  • Regulatory compliance: Models must respect strict data-privacy requirements and maintain audit trails for every action.

  • Multi-step transactions: A single conversation might involve filing a claim, querying a policy schedule, and booking a coverage adjustment.

  • Context preservation: Agents must carry context across turns, coordinating tool calls to internal APIs or external services without losing track of user goals.

  • Latency sensitivity: Customers expect near-real-time responses. Prolonged delays undermine trust and drive users to call human support.

  • Error handling: Edge cases and ambiguous requests require clear validation steps and fallback logic to avoid incomplete or incorrect operations.

Understanding these challenges helps us interpret agent performance metrics and choose models that balance capability, cost, and responsiveness.

Evaluation Criteria

Our analysis focuses on four core metrics drawn from the Agent Leaderboard v2 benchmark:

  • Action Completion (AC)
    Measures end-to-end task success. Did the agent fulfill every user goal in the scenario?

  • Tool Selection Quality (TSQ)
    Captures precision and recall in selecting and parameterizing the correct APIs. High TSQ means fewer unnecessary or erroneous calls.

  • Cost per Session
    Estimates the dollars spent per complete user interaction. Balancing budget constraints with performance needs is critical for high-volume deployments.

  • Average Session Duration
    Reflects latency and turn count. Faster sessions improve user experience but may trade off thoroughness.

With these metrics in mind, let us explore the standout models for insurance agents.

Building a Multi-Domain Synthetic Dataset

Creating a benchmark that truly reflects the demands of enterprise AI requires not just scale, but depth and realism. For Agent Leaderboard v2, we built a multi-domain dataset from the ground up, focusing on five critical sectors: banking, investment, healthcare, telecom, and insurance. Each domain required a distinct set of tools, personas, and scenarios to capture the complexities unique to that sector. Here is how we constructed this dataset.

Step 1: Generating domain-specific tools

The foundation of our dataset is a suite of synthetic tools tailored to each domain. These tools represent the actions, services, or data operations an agent might need when assisting real users. We used Anthropic’s Claude, guided by a structured prompt, to generate each tool in strict JSON schema format. Every tool definition specifies its parameters, required fields, expected input types, and the structure of its response. 

We carefully validated each generated tool to ensure no duplicates and guarantee functional domain coverage. This step ensures the simulated environment is rich and realistic, giving agents a robust toolkit that mirrors actual APIs and services used in enterprise systems.

Example tool call definition:

{
    "description": "Retrieves detailed information about a customer's active insurance policies including coverage details, premium amounts, deductibles, and policy status.",
    "properties": {
      "customer_id": {
        "description": "Unique identifier for the customer whose policies need to be retrieved.",
        "type": "string",
        "title": "Customer_Id"
      },
      "policy_type": {
        "description": "Type of insurance policy to filter results.",
        "type": "string",
        "title": "Policy_Type",
        "enum": [
          "auto",
          "home",
          "life",
          "health",
          "renters",
          "umbrella",
          "business"
        ]
      },
      "policy_status": {
        "description": "Status of policies to include in the search results.",
        "type": "string",
        "title": "Policy_Status",
        "enum": [
          "active",
          "pending",
          "expired",
          "cancelled",
          "all"
        ]
      },
      "include_billing_info": {
        "description": "Whether to include billing and payment information in the response.",
        "type": "boolean",
        "title": "Include_Billing_Info"
      }
    },
    "required": [
      "customer_id",
      "policy_type"
    ],
    "title": "get_customer_policies",
    "type": "object"
  }

Step 2: Designing synthetic personas

After establishing the available tools, we focused on the users themselves. We developed a diverse set of synthetic personas to reflect the range of customers or stakeholders that an enterprise might serve. Each persona is defined by their name, age, occupation, personality traits, tone, and preferred level of communication detail. We prompted Claude to create personas that differ in age group, profession, attitude, and comfort with technology. The validation process checks that each persona is unique and plausible. 

This diversity is key for simulating authentic interactions and ensures that agents are evaluated not only on technical skills but also on adaptability and user-centric behavior.

Example personas:

{
      "description": "List of customer policies with essential coverage and billing details.",
      "type": "object",
      "properties": {
        "policies": {
          "description": "Array of policy objects containing coverage details and current status.",
          "type": "array"
        },
        "total_monthly_premium": {
          "description": "Total monthly premium amount across all returned policies.",
          "type": "number"
        }
      },
      "required": [
        "policies",
        "total_monthly_premium"
      ]
}

Step 3: Crafting challenging scenarios

The final step is where the dataset comes to life. We generated chat scenarios for each domain that combine the available tools and personas. Each scenario is constructed to challenge the agent with 5 to 8 interconnected user goals that must be accomplished within a single conversation. Scenarios are carefully engineered to introduce real-world complexity: hidden parameters, time-sensitive requests, interdependent tasks, tool ambiguity, and potential contradictions. We target a range of failure modes, such as incomplete fulfillment, tool selection errors, or edge-case handling. 

Each scenario also belongs to a specific stress-test category, such as adaptive tool use or scope management, to ensure coverage of different agent capabilities. Every scenario is validated for complexity and correctness before being included in the benchmark.

Example scenarios:

[
  {
    "name": "Margaret Chen",
    "age": 58,
    "occupation": "High School English Teacher nearing retirement",
    "personality_traits": [
      "methodical",
      "skeptical",
      "patient"
    ],
    "tone": "formal",
    "detail_level": "comprehensive"
  },
  {
    "name": "Jamal Williams",
    "age": 32,
    "occupation": "Freelance software developer and tech startup founder",
    "personality_traits": [
      "tech-savvy",
      "impatient",
      "analytical"
    ],
    "tone": "casual",
    "detail_level": "comprehensive"
  }
]

Why a synthetic approach?

We chose a synthetic data approach for several vital reasons. First, generative AI allows us to create an unlimited variety of tools, personas, and scenarios without exposing any real customer data or proprietary information. This eliminates the risk of data leakage or privacy concerns. 

Second, a synthetic approach lets us precisely control each benchmark's difficulty, structure, and coverage. We can systematically probe for known model weaknesses, inject edge cases, and have confidence that no model has “seen” the data before. 

Finally, by designing every component—tools, personas, and scenarios—from scratch, we create a fully isolated, domain-specific testbed that offers fair, repeatable, and transparent evaluation across all models.

The Result

The result is a highly realistic, multi-domain dataset that reflects enterprise AI agents' challenges in the real world. It enables us to benchmark beyond standard basic tool use and explore the agent’s reliability, adaptability, and ability to accomplish real user goals under dynamic and complex conditions.

Simulation For Evaluating AI Agents

Once the tools, personas, and scenarios have been created for each domain, we use a robust simulation pipeline to evaluate how different AI models behave in realistic, multi-turn conversations. This simulation is the heart of the updated Agent Leaderboard, designed to mimic enterprise agents' challenges in production environments.

Step 1: Experiment orchestration

The process begins by selecting which models, domains, and scenario categories to test. For each unique combination, the system spins up parallel experiments. This parallelization allows us to benchmark many models efficiently across hundreds of domain-specific scenarios.

Step 2: The simulation engine

Each experiment simulates a real chat session between three key components:

  • The AI Agent (LLM): This is the model being tested. It acts as an assistant that tries to understand the user’s requests, select the right tools, and complete every goal.

  • The User Simulator: A generative AI system roleplays as the user, using the previously created persona and scenario. It sends the initial message and continues the conversation, adapting based on the agent’s responses and tool outputs.

  • The Tool Simulator: This module responds to the agent’s tool calls, using predefined tool schemas to generate realistic outputs. The agent never interacts with real APIs or sensitive data. Everything is simulated to match the domain's specification.

The simulation loop begins with the user’s first message. The agent reads the conversation history, decides what to do next, and may call one or more tools. Each tool call is simulated, and the responses are fed back into the conversation. The user simulator continues the dialog, pushing the agent through all of the user’s goals and adapting its language and requests based on the agent’s performance.

Step 3: Multi-turn, multi-goal evaluation

Each chat session continues for a fixed number of turns or until the user simulator determines the conversation is complete. The agent must navigate complex, interdependent goals—such as transferring funds, updating preferences, or resolving ambiguous requests—all while coordinating multiple tools and keeping context across turns. We record tool calls, arguments, responses, and conversation flow at every step for later evaluation.

Step 4: Metrics and logging

After each simulation run, we analyze the conversation using two primary metrics:

  • Tool Selection Quality: Did the agent pick the right tool and use it correctly at each turn?

  • Action Completion: Did the agent complete every user goal in the scenario, providing clear confirmation or a correct answer?

  • Cost per Session: Estimates the dollars spent per complete user interaction. 

  • Average Session Duration: Reflects latency and turn count. 

  • Average Turns: Measures the average number of conversational turns (back-and-forth exchanges) per session.

These scores, along with full conversation logs and metadata, are optionally logged to Galileo for advanced tracking and visualization. Results are also saved for each model, domain, and scenario, allowing detailed comparison and reproducibility.

Step 5: Scaling and analysis

Thanks to parallel processing, we can evaluate multiple models across many domains and categories at once. This enables robust benchmarking at scale, with experiment results automatically saved and organized for further analysis.

Why this approach matters

Our simulation pipeline delivers far more than static evaluation. It recreates the back-and-forth, high-pressure conversations agents face in the real world, ensuring models are assessed on accuracy and their ability to adapt, reason, and coordinate actions over multiple turns. This method uncovers strengths and weaknesses that would be missed by simpler, one-shot benchmarks and provides teams with actionable insights into how well their chosen model will work when deployed with real users.

Top Model Picks for Insurance Agents

Here are the results(Aug 13) from our experiments, highlighting key metrics: Action Completion (AC), Tool Selection Quality (TSQ), average cost per session, average session duration, and average turns per session. We've selected the standout models based on their performance in insurance-specific scenarios.

1. Qwen-235b-a22b-instruct-2507 (Alibaba)

  • Action Completion (AC): 0.74

  • Tool Selection Quality (TSQ): 0.85

  • Cost/Session: $0.008

  • Average Duration: 233.5 s

Qwen-235b-a22b-instruct-2507 ranks #1 for insurance workflows, achieving the highest action completion rate among all models with strong tool selection quality. As an open-source reasoning model, it excels at intricate, multi-step processes and adaptive requirements, offering great value for cost-sensitive enterprises that prioritize performance and transparency. Its longer average duration makes it well-suited for compliance-driven, in-depth tasks, making it a strong choice for insurers that need reliability without proprietary constraints.

2. Mistral-medium-2508 (Mistral)

  • Action Completion (AC): 0.70

  • Tool Selection Quality (TSQ): 0.73

  • Cost/Session: $0.019

  • Average Duration: 37.0 s

Mistral-medium-2508 takes second place with solid action completion and balanced tool selection, making it a versatile proprietary option for day-to-day insurance operations. It delivers a great speed-effectiveness balance, ideal for real-time customer interactions and fast claims processing. With its low cost and short processing time, it’s a cost-effective pick for mid-sized insurers aiming for efficiency.

3. GPT-4.1-2025-04-14 (OpenAI)

  • Action Completion (AC): 0.66

  • Tool Selection Quality (TSQ): 0.68

  • Cost/Session: $0.063

  • Average Duration: 25.2 s

GPT-4.1 secures third place, offering strong action completion and reliable tool selection quality. This proprietary model handles complex requirements with ease, making it a dependable option for mission-critical insurance scenarios. Its fast execution and moderate cost make it ideal where speed and compliance are crucial, though it’s now slightly behind emerging leaders in completion rates.

4. Kimi-k2-instruct (Moonshot AI)

  • Action Completion (AC): 0.58

  • Tool Selection Quality (TSQ): 0.88

  • Cost/Session: $0.037

  • Average Duration: 164.9 s

Kimi-k2-instruct shines with excellent tool selection quality that rivals many proprietary models. While its action completion is moderate, it can perform well in complex reasoning tasks like underwriting and fraud detection, where auditability matters. Its longer runtime and affordable cost appeal to insurers who prioritize transparency and control over sheer speed.

Strategic Recommendations

To maximize the value of these LLMs in insurance AI agents, consider the following guidelines:

  • Choose models based on task complexity: For multi-step, compliance-sensitive workflows, favor high Action Completion models. 

  • Plan for error handling: Edge scenarios with low completion rates need fallback guards, such as validation layers or human-in-the-loop checks.

  • Optimize for cost and latency: Balance performance with budget and SLA constraints by benchmarking session costs and response times.

  • Implement safety controls: Enforce policies and restrict tool access for models with inconsistent tool detection to avoid harmful calls.

  • Open source vs. closed source: Use Qwen for baseline operations and closed-source frontrunners for mission-critical tasks.

What Next?

AI agents are reshaping insurance from claims processing to customer service and internal compliance. The Agent Leaderboard v2 offers actionable insights into which LLMs best fit your needs, whether you prioritize completion rates, cost-efficiency, or tool precision. Read the launch blog to dive deeper into our methodology and full benchmark results.

To know more, read our in-depth eBook to learn how to:

  • Choose the right agentic framework for your use case

  • Evaluate and improve AI agent performance

  • Identify failure points and production issues

Generative AI and agent systems are driving massive change in the insurance industry. Success goes beyond policies and premiums. It is about giving fast, personal support during rising risks and stronger regulatory oversight. Customers want smooth solutions for claims, adjustments, and assessments in settings with heavy compliance like NAIC guidelines. 

AI agents powered by advanced LLMs automate underwriting, claims handling, and fraud detection processes while ensuring kind and accurate interactions. With 85% of CEOs expecting positive ROI from scaled AI investments by 2027, choosing the right LLM is key to avoiding problems like wrong denials that damage trust and profits. 

To help solve the LLM selection challenge, we built a realistic insurance dataset into our Agent Leaderboard v2. We evaluated top models for agents within the insurance sector, and found some intriguing insights which we’ll share throughout this piece. 

→ Explore the Live Leaderboard

→ Explore the Dataset

→ Explore the Code

Emerging AI Agents in Insurance

Recent deployments in 2024-2025 highlight generative AI's role in conversational tools and automation.

Lemonade's Maya and AI Jim: Lemonade's AI-driven platform has processed millions of claims since launch, with Maya handling customer queries and AI Jim automating claims via photo analysis. In 2025, they added real-time fraud alerts, saving users billions in prevented losses and boosting satisfaction through proactive notifications.

GEICO's Kate: GEICO's virtual assistant Kate now integrates generative AI for personalized quotes and coverage advice, handling over 100 million interactions annually. 2025 updates focus on multilingual support and integration with telematics for usage-based insurance.

Allstate's Virtual Assistant Suite: Allstate's AI tools, including chatbots for policy management and claims status, manage 60% of inquiries autonomously. Recent 2025 features incorporate predictive analytics for risk prevention, like alerting users to potential home hazards.

AXA's Clara: AXA's Clara provides 24/7 support for policy insights and claims, using AI to categorize risks and offer tailored recommendations. It has delivered billions of interactions, with 2025 updates emphasizing omnichannel engagement across apps and social media.

Takeaways:

  • These agents automate repetitive tasks, slashing error rates in claims and underwriting.

  • Adoption drives massive savings e.g., billions in fraud prevention and elevates customer loyalty via personalization.

  • Internal tools like copilots enhance productivity without amplifying risks.

Real World Failures of AI Support Agents

AI support agents have shown remarkable promise but also exposed critical weaknesses in real-world deployments. In insurance and beyond, we’ve seen AI replacements backfiring so badly that firms had to rehire humans. These failures underscore the importance of robust guardrails, human oversight, and thorough testing before entrusting AI with customer-facing workflows.

Biased Underwriting at Major Insurers: AI algorithms have been criticized for discriminatory pricing based on flawed data, leading to higher premiums for certain demographics and regulatory probes. For instance, systems trained on historical data perpetuated biases in medical histories, resulting in unfair denials.

Inaccurate Claim Denials: Health insurers using AI have faced backlash for wrongful denials due to algorithmic errors, flagging valid claims as fraudulent and delaying payouts. This eroded trust and sparked lawsuits over inaccurate assessments.

Chatbot Hallucinations and Errors: Similar to Air Canada's case, where a chatbot promised unhonored discounts, insurance bots have given misleading policy advice, leading to disputes. Virgin Money's chatbot inappropriately responded to queries, highlighting conversational risks.

Fraud Detection Overreach: AI systems wrongly flagged legitimate transactions, causing policy cancellations and customer churn. This mirrors broader failures where ML models malfunction in edge cases.

Klarna’s AI Replacement Reversal: Klarna replaced 700 customer-service employees with AI-powered systems only to admit that service quality plunged and customer satisfaction collapsed. Within months, the fintech giant announced it would rehire humans to patch the gaps left by its automated agents (Tech.co). In a separate analysis, testers found ways to confuse Klarna’s bot in unexpected scenarios, revealing brittle handling of edge-case queries.

These high-profile failures remind us that AI agents require rigorous evaluation, transparent error-handling strategies, and seamless human-in-the-loop integrations to prevent costly breakdowns in customer-facing applications.

The Role of AI in Insurance Agents

Since our goal was to evaluate insurance agents in real-world scenarios, we wanted to capture challenges as close as possible to how humans actually interact. User interactions with AI agents often involve complex, multifaceted queries that bundle multiple tasks into a single conversation. These requests typically arise from urgent personal or business needs, such as managing the aftermath of an accident, renewing policies, or planning coverage. Here's a breakdown of common request types and the inherent challenges for AI agents.

Common request types

AI agents in insurance must handle a wide array of queries, often requiring precise data handling and quick resolutions. Key examples include:

  • Claims and incident actions: Filing claims for accidents, disputing denials, or setting up policy change alerts. These are common and detail-intensive, involving specifics like incident dates, damages, and witnesses.

  • Policy transfers: Updating beneficiaries, shifting coverage between vehicles or properties, or adjusting for international policies. Requests may include deadlines, recipient details, and verification steps.

  • Status checks: Verifying progress on claims, applications, payments, or renewals. Users demand accurate confirmations, often tied to reference IDs or dates.

  • Payment setups: Configuring automatic premium payments for auto, home, life, or health policies, with customizable amounts, frequencies, and start dates.

  • Information updates: Modifying contact details like phone numbers, addresses, or emails, including temporary changes for relocations.

  • Appointments and consultations: Scheduling sessions with agents for policy reviews, claims adjustments, or new coverage options, specifying preferred times or specialists.

  • Coverage and location services: Generating quotes, comparing plans, or locating adjusters/agents, particularly for multi-state or international scenarios.

  • Product research: Exploring options for auto, home, life, or health insurance, including premium comparisons, terms, and eligibility criteria.

Key challenges for AI agents

These interactions pose significant hurdles due to insurance's regulated, high-stakes nature. What makes them particularly difficult includes:

  • Multi-task bundling: Users often pack 4-7 unrelated goals into one message, forcing the agent to parse, prioritize, sequence actions, and retain context throughout.

  • Urgency and deadlines: Time-sensitive elements like filing claims before expiration or renewals by due dates amplify error risks under pressure.

  • Sensitive data handling: Queries involve personally identifiable information (PII), medical records, and financial data, necessitating strict adherence to regulations like GDPR, HIPAA, or PCI DSS, plus secure verification (a minimal redaction sketch follows this list).

  • Growing regulatory pressure: Frameworks such as NAIC guidelines and the EU AI Act demand transparent, explainable systems. However, LLM-based agents are often "black-box" models, complicating interpretability, auditability, and risk management.

  • Ambiguity and edge cases: Incomplete details (e.g., vague incident reports) or unclear instructions require clarifying questions that don't frustrate the user, while variations in policies, jurisdictions, and risks add layers of complexity.

  • Interdependencies: Tasks frequently rely on one another, such as validating a claim before coverage updates or eligibility checks prior to approvals, demanding advanced logical reasoning and tool orchestration.

  • Human-like interaction: Users anticipate empathetic, conversational responses in a compliance-heavy field, where mistakes could result in financial losses or legal violations.
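To illustrate the sensitive-data point above, here is a minimal sketch of transcript redaction before logging. The regex patterns are deliberately simplistic and the policy-number format is hypothetical; a production system would rely on proper PII-detection tooling rather than this exact code.

import re

# Illustrative masking of common PII patterns before a transcript is logged.
# The regexes are simplistic and the policy-number format is hypothetical.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "policy_id": re.compile(r"\bPOL-\d{6,}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact("Reach me at jane.doe@example.com about policy POL-123456, phone 415-555-0100."))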

AI agents must leverage LLMs that excel in tool selection, context retention, and robust error handling to deliver reliable, compliant support in this demanding sector.

Understanding Agent Leaderboard

Now, let's look at how we baked these ideas into a robust dataset and methodology for evaluating insurance agents. Galileo’s Agent Leaderboard v2 is a publicly accessible ranking of 17 leading language models tested on realistic, multi-turn enterprise scenarios across five industries: banking, healthcare, investment, telecom, and insurance. Unlike academic benchmarks focusing on one-off tasks, we simulated back-and-forth dialogues with synthetic personas, reflecting the complexity of real customer interactions. Each agent has access to domain-specific tools and is scored on how effectively it navigates them to solve user problems. Results are updated monthly, so you can always compare the latest models and variants.

Example: Insurance scenario

"I need to file a claim for my car accident yesterday, verify my home policy renewal on the 15th, set up automatic premium payments, find an adjuster near my location in Paris, get travel insurance quotes, and configure fraud alerts before I leave for Europe Thursday."

This isn't just about calling the right API. The agent must:

  • Maintain context across 6+ distinct requests

  • Handle time-sensitive coordination

  • Navigate tool dependencies

  • Provide clear confirmations for each goal

https://galileo.ai/agent-leaderboard 
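One way to picture this request is as a small plan of goals with dependencies. The sketch below is purely illustrative; the goal names and tool names are hypothetical and not part of the benchmark. It shows how an agent or an evaluation harness might order the work so dependent steps run after the steps they need.

# Illustrative decomposition of the bundled request above into ordered goals.
# Goal names and tool names are hypothetical; they are not the benchmark's tools.
goals = [
    {"id": "file_claim",     "tool": "file_auto_claim",        "depends_on": []},
    {"id": "verify_renewal", "tool": "get_policy_renewal",     "depends_on": []},
    {"id": "setup_autopay",  "tool": "setup_premium_autopay",  "depends_on": ["verify_renewal"]},
    {"id": "find_adjuster",  "tool": "find_adjuster",          "depends_on": ["file_claim"]},
    {"id": "travel_quotes",  "tool": "get_travel_quotes",      "depends_on": []},
    {"id": "fraud_alerts",   "tool": "configure_fraud_alerts", "depends_on": []},
]

def execution_order(goals: list[dict]) -> list[str]:
    """Order goals so every dependency is completed first (assumes no cycles)."""
    done: set[str] = set()
    order: list[str] = []
    while len(order) < len(goals):
        for g in goals:
            if g["id"] not in done and all(d in done for d in g["depends_on"]):
                order.append(g["id"])
                done.add(g["id"])
    return order

print(execution_order(goals))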

Challenges for Agents in Insurance 

Deploying an AI agent in insurance requires careful planning and robust technology. Here's a breakdown of the key challenges that must be addressed to ensure effective implementation.

Key challenges

These hurdles stem from the sector's emphasis on accuracy, security, and user satisfaction. They include:

  • Regulatory compliance: Models must respect strict data-privacy requirements and maintain audit trails for every action.

  • Multi-step transactions: A single conversation might involve filing a claim, querying a policy schedule, and booking a coverage adjustment.

  • Context preservation: Agents must carry context across turns, coordinating tool calls to internal APIs or external services without losing track of user goals.

  • Latency sensitivity: Customers expect near-real-time responses. Prolonged delays undermine trust and drive users to call human support.

  • Error handling: Edge cases and ambiguous requests require clear validation steps and fallback logic to avoid incomplete or incorrect operations.

Understanding these challenges helps us interpret agent performance metrics and choose models that balance capability, cost, and responsiveness.

Evaluation Criteria

Our analysis focuses on four core metrics drawn from the Agent Leaderboard v2 benchmark:

  • Action Completion (AC)
    Measures end-to-end task success. Did the agent fulfill every user goal in the scenario?

  • Tool Selection Quality (TSQ)
    Captures precision and recall in selecting and parameterizing the correct APIs. High TSQ means fewer unnecessary or erroneous calls.

  • Cost per Session
    Estimates the dollars spent per complete user interaction. Balancing budget constraints with performance needs is critical for high-volume deployments (a rough estimation sketch follows this list).

  • Average Session Duration
    Reflects latency and turn count. Faster sessions improve user experience but may trade off thoroughness.
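As a rough illustration of the Cost per Session metric, the sketch below multiplies token counts by per-million-token prices. The prices and token counts are placeholders, not the rates or measurements used on the leaderboard.

# Rough cost-per-session estimate from token usage.
# Prices and token counts below are placeholders, not leaderboard rates.
PRICE_PER_1M_INPUT_TOKENS = 0.40    # USD
PRICE_PER_1M_OUTPUT_TOKENS = 1.60   # USD

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * PRICE_PER_1M_INPUT_TOKENS + \
           (output_tokens / 1_000_000) * PRICE_PER_1M_OUTPUT_TOKENS

# e.g., a long multi-turn insurance session
print(round(session_cost(input_tokens=18_000, output_tokens=2_500), 4))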

With these metrics in mind, let's first look at how we built the benchmark, and then turn to the standout models for insurance agents.

Building a Multi-Domain Synthetic Dataset

Creating a benchmark that truly reflects the demands of enterprise AI requires not just scale, but depth and realism. For Agent Leaderboard v2, we built a multi-domain dataset from the ground up, focusing on five critical sectors: banking, investment, healthcare, telecom, and insurance. Each domain required a distinct set of tools, personas, and scenarios to capture the complexities unique to that sector. Here is how we constructed this dataset.

Step 1: Generating domain-specific tools

The foundation of our dataset is a suite of synthetic tools tailored to each domain. These tools represent the actions, services, or data operations an agent might need when assisting real users. We used Anthropic’s Claude, guided by a structured prompt, to generate each tool in strict JSON schema format. Every tool definition specifies its parameters, required fields, expected input types, and the structure of its response. 

We carefully validated each generated tool to ensure no duplicates and guarantee functional domain coverage. This step ensures the simulated environment is rich and realistic, giving agents a robust toolkit that mirrors actual APIs and services used in enterprise systems.

Example tool definition:

{
    "description": "Retrieves detailed information about a customer's active insurance policies including coverage details, premium amounts, deductibles, and policy status.",
    "properties": {
      "customer_id": {
        "description": "Unique identifier for the customer whose policies need to be retrieved.",
        "type": "string",
        "title": "Customer_Id"
      },
      "policy_type": {
        "description": "Type of insurance policy to filter results.",
        "type": "string",
        "title": "Policy_Type",
        "enum": [
          "auto",
          "home",
          "life",
          "health",
          "renters",
          "umbrella",
          "business"
        ]
      },
      "policy_status": {
        "description": "Status of policies to include in the search results.",
        "type": "string",
        "title": "Policy_Status",
        "enum": [
          "active",
          "pending",
          "expired",
          "cancelled",
          "all"
        ]
      },
      "include_billing_info": {
        "description": "Whether to include billing and payment information in the response.",
        "type": "boolean",
        "title": "Include_Billing_Info"
      }
    },
    "required": [
      "customer_id",
      "policy_type"
    ],
    "title": "get_customer_policies",
    "type": "object"
  }
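Because every tool is defined as a JSON schema, an agent's tool-call arguments can be validated before the call is simulated. Here is a minimal sketch using the jsonschema package; the abbreviated schema mirrors the get_customer_policies definition above, and the sample call is hypothetical.

import jsonschema

# Abbreviated input schema mirroring the get_customer_policies definition above.
tool_schema = {
    "title": "get_customer_policies",
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "policy_type": {"type": "string", "enum": ["auto", "home", "life", "health", "renters", "umbrella", "business"]},
        "policy_status": {"type": "string", "enum": ["active", "pending", "expired", "cancelled", "all"]},
        "include_billing_info": {"type": "boolean"},
    },
    "required": ["customer_id", "policy_type"],
}

# A hypothetical tool call produced by an agent.
call_args = {"customer_id": "CUST-1042", "policy_type": "auto", "policy_status": "active"}

try:
    jsonschema.validate(instance=call_args, schema=tool_schema)
    print("Tool call arguments are valid.")
except jsonschema.ValidationError as err:
    print(f"Invalid tool call: {err.message}")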

Example tool response schema (the structure returned by get_customer_policies):

{
      "description": "List of customer policies with essential coverage and billing details.",
      "type": "object",
      "properties": {
        "policies": {
          "description": "Array of policy objects containing coverage details and current status.",
          "type": "array"
        },
        "total_monthly_premium": {
          "description": "Total monthly premium amount across all returned policies.",
          "type": "number"
        }
      },
      "required": [
        "policies",
        "total_monthly_premium"
      ]
}

Step 2: Designing synthetic personas

After establishing the available tools, we focused on the users themselves. We developed a diverse set of synthetic personas to reflect the range of customers or stakeholders that an enterprise might serve. Each persona is defined by their name, age, occupation, personality traits, tone, and preferred level of communication detail. We prompted Claude to create personas that differ in age group, profession, attitude, and comfort with technology. The validation process checks that each persona is unique and plausible.

This diversity is key for simulating authentic interactions and ensures that agents are evaluated not only on technical skills but also on adaptability and user-centric behavior.

Example personas:

[
  {
    "name": "Margaret Chen",
    "age": 58,
    "occupation": "High School English Teacher nearing retirement",
    "personality_traits": [
      "methodical",
      "skeptical",
      "patient"
    ],
    "tone": "formal",
    "detail_level": "comprehensive"
  },
  {
    "name": "Jamal Williams",
    "age": 32,
    "occupation": "Freelance software developer and tech startup founder",
    "personality_traits": [
      "tech-savvy",
      "impatient",
      "analytical"
    ],
    "tone": "casual",
    "detail_level": "comprehensive"
  }
]
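The uniqueness and plausibility checks mentioned above can be as simple as the following sketch. The field names follow the example personas, but the age bounds and the rules themselves are illustrative, not the benchmark's actual validation code.

# Illustrative uniqueness and plausibility checks; bounds and rules are not the benchmark's actual validation.
REQUIRED_FIELDS = {"name", "age", "occupation", "personality_traits", "tone", "detail_level"}

def validate_personas(personas: list[dict]) -> list[dict]:
    seen_names: set[str] = set()
    valid: list[dict] = []
    for p in personas:
        if not REQUIRED_FIELDS <= p.keys():      # every field must be present
            continue
        if not 18 <= p["age"] <= 90:             # crude plausibility bound
            continue
        if p["name"] in seen_names:              # enforce uniqueness by name
            continue
        seen_names.add(p["name"])
        valid.append(p)
    return valid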

Step 3: Crafting challenging scenarios

The final step is where the dataset comes to life. We generated chat scenarios for each domain that combine the available tools and personas. Each scenario is constructed to challenge the agent with 5 to 8 interconnected user goals that must be accomplished within a single conversation. Scenarios are carefully engineered to introduce real-world complexity: hidden parameters, time-sensitive requests, interdependent tasks, tool ambiguity, and potential contradictions. We target a range of failure modes, such as incomplete fulfillment, tool selection errors, or edge-case handling. 

Each scenario also belongs to a specific stress-test category, such as adaptive tool use or scope management, to ensure coverage of different agent capabilities. Every scenario is validated for complexity and correctness before being included in the benchmark.

A representative scenario is the multi-goal insurance request quoted earlier in the Understanding Agent Leaderboard section: a single message that bundles a new auto claim, a renewal check, automatic payment setup, an adjuster search, travel quotes, and fraud alerts, all under a travel deadline.
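Structurally, such a scenario might be stored along the following lines. The field names and values are illustrative only; they are not the dataset's actual schema.

# Hypothetical shape of one benchmark scenario; field names and values are illustrative only.
scenario = {
    "domain": "insurance",
    "persona": "Margaret Chen",
    "stress_test_category": "adaptive_tool_use",
    "goals": [
        "file an auto claim for yesterday's accident",
        "verify the home policy renewal due on the 15th",
        "set up automatic premium payments",
        "locate an adjuster near Paris",
        "get travel insurance quotes",
        "configure fraud alerts before Thursday",
    ],
    "hidden_parameters": {"policy_id": "withheld until the agent asks"},
}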

Why a synthetic approach?

We chose a synthetic data approach for several reasons. First, generative AI allows us to create an unlimited variety of tools, personas, and scenarios without exposing any real customer data or proprietary information. This eliminates data-leakage and privacy risks.

Second, a synthetic approach lets us precisely control each benchmark's difficulty, structure, and coverage. We can systematically probe for known model weaknesses, inject edge cases, and have confidence that no model has “seen” the data before. 

Finally, by designing every component—tools, personas, and scenarios—from scratch, we create a fully isolated, domain-specific testbed that offers fair, repeatable, and transparent evaluation across all models.

The Result

The result is a highly realistic, multi-domain dataset that reflects the challenges enterprise AI agents face in the real world. It lets us benchmark beyond basic tool use and probe an agent's reliability, adaptability, and ability to accomplish real user goals under dynamic, complex conditions.

Simulation For Evaluating AI Agents

Once the tools, personas, and scenarios have been created for each domain, we use a robust simulation pipeline to evaluate how different AI models behave in realistic, multi-turn conversations. This simulation is the heart of the updated Agent Leaderboard, designed to mimic the challenges enterprise agents face in production environments.

Step 1: Experiment orchestration

The process begins by selecting which models, domains, and scenario categories to test. For each unique combination, the system spins up parallel experiments. This parallelization allows us to benchmark many models efficiently across hundreds of domain-specific scenarios.
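A minimal way to fan out that grid of models, domains, and categories looks roughly like the sketch below. The model list, category names, and run_experiment stub are illustrative; the actual pipeline runs hundreds of scenarios per combination.

from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Illustrative grid; the real pipeline covers 17 models and hundreds of scenarios per combination.
MODELS = ["gpt-4.1-2025-04-14", "qwen-235b-a22b-instruct-2507"]
DOMAINS = ["banking", "investment", "healthcare", "telecom", "insurance"]
CATEGORIES = ["adaptive_tool_use", "scope_management"]

def run_experiment(model: str, domain: str, category: str) -> dict:
    # Placeholder: run every scenario for this combination and return aggregate scores.
    return {"model": model, "domain": domain, "category": category}

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda args: run_experiment(*args),
                            product(MODELS, DOMAINS, CATEGORIES)))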

Step 2: The simulation engine

Each experiment simulates a real chat session between three key components:

  • The AI Agent (LLM): This is the model being tested. It acts as an assistant that tries to understand the user’s requests, select the right tools, and complete every goal.

  • The User Simulator: A generative AI system roleplays as the user, using the previously created persona and scenario. It sends the initial message and continues the conversation, adapting based on the agent’s responses and tool outputs.

  • The Tool Simulator: This module responds to the agent’s tool calls, using predefined tool schemas to generate realistic outputs. The agent never interacts with real APIs or sensitive data. Everything is simulated to match the domain's specification.

The simulation loop begins with the user’s first message. The agent reads the conversation history, decides what to do next, and may call one or more tools. Each tool call is simulated, and the responses are fed back into the conversation. The user simulator continues the dialog, pushing the agent through all of the user’s goals and adapting its language and requests based on the agent’s performance.
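To make the three components concrete, here is a heavily simplified sketch of such a loop. The agent, user_sim, and tool_sim callables are placeholders for the components described above, not the leaderboard's actual implementation.

from typing import Callable, Optional

def run_session(
    agent: Callable[[list[dict]], dict],              # LLM under test: history -> reply, possibly with tool calls
    user_sim: Callable[[list[dict]], Optional[str]],  # persona-driven simulator: history -> next message, or None when done
    tool_sim: Callable[[dict], dict],                 # returns a schema-conformant result for a tool call
    max_turns: int = 20,
) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_turns):
        user_msg = user_sim(history)
        if user_msg is None:                          # user simulator decides every goal has been addressed
            break
        history.append({"role": "user", "content": user_msg})
        reply = agent(history)
        for call in reply.get("tool_calls", []):      # simulate each tool call and feed the result back
            history.append({"role": "tool", "name": call["name"], "content": tool_sim(call)})
        history.append({"role": "assistant", "content": reply.get("content", "")})
    return history                                    # the full log is what gets scored afterwards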

Step 3: Multi-turn, multi-goal evaluation

Each chat session continues for a fixed number of turns or until the user simulator determines the conversation is complete. The agent must navigate complex, interdependent goals—such as transferring funds, updating preferences, or resolving ambiguous requests—all while coordinating multiple tools and keeping context across turns. We record tool calls, arguments, responses, and conversation flow at every step for later evaluation.

Step 4: Metrics and logging

After each simulation run, we analyze the conversation using two primary metrics, Tool Selection Quality and Action Completion, along with session-level statistics:

  • Tool Selection Quality: Did the agent pick the right tool and use it correctly at each turn?

  • Action Completion: Did the agent complete every user goal in the scenario, providing clear confirmation or a correct answer?

  • Cost per Session: Estimates the dollars spent per complete user interaction. 

  • Average Session Duration: Reflects latency and turn count. 

  • Average Turns: Measures the average number of conversational turns (back-and-forth exchanges) per session.

These scores, along with full conversation logs and metadata, are optionally logged to Galileo for advanced tracking and visualization. Results are also saved for each model, domain, and scenario, allowing detailed comparison and reproducibility.
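A per-session result record can be persisted in something as simple as a JSONL file, as sketched below. The field names and values here are placeholders rather than the leaderboard's actual logging schema.

import json
from pathlib import Path

# Illustrative per-session record; field names and values are placeholders, not the leaderboard's schema.
record = {
    "model": "qwen-235b-a22b-instruct-2507",
    "domain": "insurance",
    "scenario_id": "ins-0042",
    "tool_selection_quality": 0.90,
    "action_completion": 1.0,
    "cost_usd": 0.0112,
    "duration_s": 180.4,
    "turns": 11,
}

results_file = Path("results/insurance.jsonl")
results_file.parent.mkdir(parents=True, exist_ok=True)
with results_file.open("a") as f:
    f.write(json.dumps(record) + "\n")   # one JSON line per simulated session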

Step 5: Scaling and analysis

Thanks to parallel processing, we can evaluate multiple models across many domains and categories at once. This enables robust benchmarking at scale, with experiment results automatically saved and organized for further analysis.

Why this approach matters

Our simulation pipeline delivers far more than static evaluation. It recreates the back-and-forth, high-pressure conversations agents face in the real world, ensuring models are assessed on accuracy and their ability to adapt, reason, and coordinate actions over multiple turns. This method uncovers strengths and weaknesses that would be missed by simpler, one-shot benchmarks and provides teams with actionable insights into how well their chosen model will work when deployed with real users.

Top Model Picks for Insurance Agents

Here are the results (as of Aug 13) from our experiments, highlighting key metrics: Action Completion (AC), Tool Selection Quality (TSQ), average cost per session, average session duration, and average turns per session. We've selected the standout models based on their performance in insurance-specific scenarios.

1. Qwen-235b-a22b-instruct-2507 (Alibaba)

  • Action Completion (AC): 0.74

  • Tool Selection Quality (TSQ): 0.85

  • Cost/Session: $0.008

  • Average Duration: 233.5 s

Qwen-235b-a22b-instruct-2507 ranks #1 for insurance workflows, achieving the highest action completion rate among all models with strong tool selection quality. As an open-source reasoning model, it excels at intricate, multi-step processes and adaptive requirements, offering great value for cost-sensitive enterprises that prioritize performance and transparency. Its longer average duration suits compliance-driven, in-depth tasks, making it a strong choice for insurers that need reliability without proprietary constraints.

2. Mistral-medium-2508 (Mistral)

  • Action Completion (AC): 0.70

  • Tool Selection Quality (TSQ): 0.73

  • Cost/Session: $0.019

  • Average Duration: 37.0 s

Mistral-medium-2508 takes second place with solid action completion and balanced tool selection, making it a versatile proprietary option for day-to-day insurance operations. It delivers a great speed-effectiveness balance, ideal for real-time customer interactions and fast claims processing. With its low cost and short processing time, it’s a cost-effective pick for mid-sized insurers aiming for efficiency.

3. GPT-4.1-2025-04-14 (OpenAI)

  • Action Completion (AC): 0.66

  • Tool Selection Quality (TSQ): 0.68

  • Cost/Session: $0.063

  • Average Duration: 25.2 s

GPT-4.1 secures third place, offering strong action completion and reliable tool selection quality. This proprietary model handles complex requirements with ease, making it a dependable option for mission-critical insurance scenarios. Its fast execution and moderate cost make it ideal where speed and compliance are crucial, though it’s now slightly behind emerging leaders in completion rates.

4. Kimi-k2-instruct (Moonshot AI)

  • Action Completion (AC): 0.58

  • Tool Selection Quality (TSQ): 0.88

  • Cost/Session: $0.037

  • Average Duration: 164.9 s

Kimi-k2-instruct shines with excellent tool selection quality that rivals many proprietary models. While its action completion is moderate, it can perform well in complex reasoning tasks like underwriting and fraud detection, where auditability matters. Its longer runtime and affordable cost appeal to insurers who prioritize transparency and control over sheer speed.

Strategic Recommendations

To maximize the value of these LLMs in insurance AI agents, consider the following guidelines:

  • Choose models based on task complexity: For multi-step, compliance-sensitive workflows, favor models with high Action Completion.

  • Plan for error handling: Edge scenarios with low completion rates need fallback guards, such as validation layers or human-in-the-loop checks (see the sketch after this list).

  • Optimize for cost and latency: Balance performance with budget and SLA constraints by benchmarking session costs and response times.

  • Implement safety controls: Enforce policies and restrict tool access for models with inconsistent tool detection to avoid harmful calls.

  • Open source vs. closed source: Use Qwen for baseline operations and closed-source frontrunners for mission-critical tasks.
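As a sketch of the error-handling guidance above, the wrapper below escalates restricted or failing tool calls to a human instead of letting the agent act autonomously. The tool names, the execute and escalate_to_human callables, and the validation rule are all hypothetical.

from typing import Callable

# Hypothetical guard around tool execution; tool names, callables, and rules are illustrative.
RESTRICTED_TOOLS = {"deny_claim", "cancel_policy"}   # irreversible actions the agent may never take alone

def guarded_call(
    tool_name: str,
    args: dict,
    execute: Callable[[str, dict], dict],
    escalate_to_human: Callable[[str, dict], dict],
) -> dict:
    if tool_name in RESTRICTED_TOOLS:
        return escalate_to_human(tool_name, args)    # human-in-the-loop for high-stakes operations
    if not args.get("customer_id"):
        return {"error": "missing customer_id", "action": "ask_user_for_details"}
    try:
        return execute(tool_name, args)
    except Exception as exc:                         # tool failure: fall back instead of guessing
        return escalate_to_human(tool_name, {**args, "failure": str(exc)})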

What Next?

AI agents are reshaping insurance from claims processing to customer service and internal compliance. The Agent Leaderboard v2 offers actionable insights into which LLMs best fit your needs, whether you prioritize completion rates, cost-efficiency, or tool precision. Read the launch blog to dive deeper into our methodology and full benchmark results.

For more, read our in-depth eBook to learn how to:

  • Choose the right agentic framework for your use case

  • Evaluate and improve AI agent performance

  • Identify failure points and production issues
