Jul 2, 2025
Silly Startups, Serious Signals: How to Use Custom Metrics to Measure Domain-Specific AI Success


Erin Mikail Staples
Senior Developer Advocate


Ever dreamed of building an AI that can pitch a startup idea—either as a polished VC-ready business plan or something so absurd it belongs on a meme board? (Oh, and have some way to prove that it actually works.)
In this hands-on tutorial, you’ll build Startup Sim 3000: a Python-powered web app that uses real-time data and large language models (LLMs) to generate creative or professional startup pitches.
But here's the twist: this isn’t just about generating content. You'll also learn how to track, monitor, and measure your system using Agent Reliability tools and custom metrics with Galileo.
What you’ll learn
By the end of this tutorial, you'll have:
Built an AI Agent system that combines multiple tools
Logged tool spans and LLM spans with Galileo
Tracked custom LLM-as-a-Judge Metrics to evaluate AI behavior
But the biggest takeaway? Learning the importance of custom metrics for domain-specific AI Applications, and how to define and measure success using these custom metrics.
> PSST — This application was also featured as a talk at Databricks' 2025 Data + AI Summit.
Why custom metrics matter
Large language models (LLMs) are inherently nondeterministic—they don’t behave the same way every time, and that unpredictability is a feature, not a flaw. It’s what makes them powerful for creative and complex tasks. But it also makes these systems hard to evaluate using traditional software metrics like “test pass/fail” or “code coverage.”
It’s one thing to build an AI tool. It’s another thing entirely to know whether it’s doing what you actually want—especially in domain-specific use cases like comedy, medical, fintech, education, creative writing, or business ideation. In these spaces, there's rarely a single “correct” output. The value of an answer is contextual, not binary.
That’s where custom metrics come in. Instead of asking “Did it work?” you can ask:
Which tools are used most?
Are the LLM responses too long?
Is this successful at completing the [domain-specific] task?
What versions are most accurate?
These are the kinds of questions that matter—but they can’t be answered with traditional developer metrics alone.
With Galileo, you can define and track these domain-specific signals directly. You’ll see what’s working, what’s not, and where your system needs tuning. It’s how you turn a cool demo into a production-ready product.
In this tutorial, we’ll take that idea for a spin in a playful way—by evaluating humor. You’ll learn how to apply custom metrics to a comedy-generating app and measure success based on timing, tone, delivery, and more.
Custom metrics are the bridge between "interesting in theory" and "effective in production."
What you’ll need
Before you get started, have the following on hand.
Some familiarity with Python/Flask
Python Package Manager of choice (we’ll be using uv)
Code editor of choice (VS Code, Cursor, Warp, etc.)
Set up your project
For the sake of jumping right into action, we'll start from an existing application and demonstrate how to add custom metrics to it.
If you want to explore how to add Galileo to a new agentic application, check out this tutorial on how to create a Weather Vibes Agent, or our cookbooks, which walk you through getting started with Galileo step by step.
Set up your Galileo project
If you haven’t already, create a free Galileo account on app.galileo.ai. When prompted, add an organization name. To dive right into this tutorial, you can skip past the onboarding screen by clicking the Galileo logo in the upper left-hand corner.

NOTE: you will not be able to come back to this screen again; however, there are helpful instructions for getting started in the Galileo Docs.
Create a new project by clicking the ‘new project’ button in the upper right-hand corner of the screen. You will be prompted to add a project name as well as a log stream name.
Once that is created, click the profile icon in the upper right-hand corner of the page and navigate to API keys in the drop-down menu. From the API Keys screen, select ‘Create New Key’. Save the key somewhere safe (and secure) for later.

Clone the repo and install dependencies with uv
From within your terminal, run the following:
git clone https://github.com/erinmikailstaples/startup-sim-3000.git
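Then move into the newly cloned project directory (the directory name here follows from the repo URL):
cd startup-sim-3000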
Set up a virtual environment and install dependencies with uv
A virtual environment keeps your project’s dependencies isolated from your global Python installation. For this we’ll be using uv.
On Windows
uv venv
.venv\Scripts\activate
uv pip install -r requirements.txt
On MacOS/Linux
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
This creates and activates a virtual environment for your project, then installs the necessary requirements.
Set up your environment variables
Grab the example .env file (.env.example) and copy it so you can add your own variables. You can do so by running the following:
cp .env.example .env
Then update the new .env file accordingly, filling in your Galileo API key, Galileo project name, and Galileo log stream name in the respective variables.
When complete, your .env values should look something like this:
# Example .env file — copy this file to .env and fill in the values.
# Be sure to add the .env file to your .gitignore file.

# LLM API Key (required)
# For regular keys: sk-...
# For project-based keys: sk-proj-...
OPENAI_API_KEY=your-openai-api-key-here

# OpenAI Project ID (optional for project-based keys; will be auto-extracted if not set)
# OPENAI_PROJECT_ID=your-openai-project-id-here

# Galileo Details (required for Galileo observability)
GALILEO_API_KEY=your-galileo-api-key-here
GALILEO_PROJECT=your project name here
GALILEO_LOG_STREAM=my_log_stream

# Optional LLM configuration
LLM_MODEL=gpt-4
LLM_TEMPERATURE=0.7

# Optional agent configuration
VERBOSITY=low  # Options: none, low, high
ENVIRONMENT=development
ENABLE_LOGGING=true
ENABLE_TOOL_SELECTION=true
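If you want to sanity-check that these values are being picked up, here’s a minimal sketch of how they can be loaded with python-dotenv. The variable names match the .env above; everything else (including whether the repo already does this for you) is an assumption.

# Minimal sketch, assuming python-dotenv is installed; the repo may already load these for you.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

GALILEO_API_KEY = os.environ.get("GALILEO_API_KEY")
GALILEO_PROJECT = os.environ.get("GALILEO_PROJECT")
GALILEO_LOG_STREAM = os.environ.get("GALILEO_LOG_STREAM", "my_log_stream")

if not GALILEO_API_KEY:
    raise RuntimeError("GALILEO_API_KEY is missing; check your .env file")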
Agent reliability and observability
In this example, the application already has Galileo built in, ready to observe the app using the @log decorator as well as the GalileoLogger class.
Check out the agent.py file to see Galileo’s implementation in practice; I’ll call out specific areas below.
As with many other SDKs, Galileo first needs to be initialized before use. This is the step where you set your configuration details (project name and log stream) for Galileo.
See below how the Galileo Logger is initialized in our agent.py file.
# Initialize Galileo Logger for this agent execution
galileo_logger = GalileoLogger(
    project=os.environ.get("GALILEO_PROJECT"),
    log_stream=os.environ.get("GALILEO_LOG_STREAM")
)
Once the SDK is initialized, you’ll create a workflow which will capture the relevant tool calls.
Starting the workflow will look something like this:
# Start the main agent trace - this is the parent trace for the entire workflow
trace = galileo_logger.start_trace(f"agent_workflow_{self.mode}")
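The trace acts as the parent for every span the agent produces, so it needs to be concluded and flushed once the workflow finishes. Here’s a rough sketch of that lifecycle, continuing from the snippet above; the run_tools helper and user_input variable are hypothetical stand-ins, and the exact conclude/flush calls in agent.py may differ slightly.

# ...continuing after start_trace(); run_tools is a hypothetical stand-in for the agent's real tool chain.
result = await self.run_tools(user_input)

# Close out the parent trace and send the buffered spans to Galileo.
galileo_logger.conclude(output=str(result))
galileo_logger.flush()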
Now, having a workflow is great, and our agent.py file tells the LLM which tools to use and when, but it doesn’t contain the tools themselves.
For the tools themselves, navigate into each respective tool file (such as `/tools/news_api_tool.py`), where you’ll notice we’re using the @log decorator to define a span type and give it a name.
This decorator creates a tool span for HTTP API calls. Since this tool makes HTTP requests to NewsAPI (not LLM calls), we use span_type="tool". The name "Tool-NewsAPI" will appear in your Galileo dashboard as a tool span.
See this in action around line 58 of the /tools/news_api_tool.py file:
@log(span_type="tool", name="Tool-NewsAPI")
async def execute(self, ...):
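Stripped down, a tool decorated this way looks roughly like the sketch below. Only the @log usage mirrors the repo; the class name, HTTP client, endpoint, and parameters are illustrative assumptions.

from galileo import log
import httpx  # assumption: any async HTTP client would do


class NewsAPITool:
    @log(span_type="tool", name="Tool-NewsAPI")
    async def execute(self, query: str) -> dict:
        # Everything inside this method is captured as a single tool span,
        # with its inputs and return value logged to Galileo automatically.
        async with httpx.AsyncClient() as client:
            response = await client.get(
                "https://newsapi.org/v2/top-headlines",  # illustrative endpoint
                params={"q": query},
            )
        return response.json()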
In this application, we also have an LLM span. The Startup Simulator 3000 uses an LLM span to compile the output from the different tools into one comical pitch.
Similar to the tool span, the LLM span calls the Galileo Logger. However, in this instance, because we want further customization around model choice and metadata, we use the manual logger rather than the @log decorator.
Check out how this is put into action around line 83 in the /tools/startup_simulator.py file.
logger.add_llm_span(
    input=f"Generate startup pitch for {industry} targeting {audience} with word '{random_word}'",
    output="Tool execution started",
    model="startup_simulator",
    num_input_tokens=len(str(inputs)),
    num_output_tokens=0,
    total_tokens=len(str(inputs)),
    duration_ns=0
)
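The call above records the span when tool execution starts; once the model responds, the same fields can carry the real output, token counts, and duration. Here’s a hedged sketch of what that might look like; the call_llm helper, prompt variable, and usage dictionary are illustrative, not the repo’s exact code.

import os
import time

start_ns = time.perf_counter_ns()
response_text, usage = await self.call_llm(prompt)  # hypothetical wrapper around your LLM client

logger.add_llm_span(
    input=prompt,
    output=response_text,
    model=os.environ.get("LLM_MODEL", "gpt-4"),
    num_input_tokens=usage.get("prompt_tokens", 0),
    num_output_tokens=usage.get("completion_tokens", 0),
    total_tokens=usage.get("total_tokens", 0),
    duration_ns=time.perf_counter_ns() - start_ns,
)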
Run the application
Once your environment variables are set, you’re ready to run the application. It’s built as a Flask application with a JavaScript frontend and a Python backend.
The standard flow of the application is as follows:
User Input → Flask Route → Agent Coordinator → Tool Chain → AI Model → Response
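In code, that flow boils down to a thin Flask route handing the request off to the agent. Here’s a simplified sketch; the route path, AgentCoordinator class, and method names are illustrative stand-ins for what lives in web_server.py.

from flask import Flask, jsonify, request

from agent import AgentCoordinator  # hypothetical import path and class name

app = Flask(__name__)


@app.route("/generate", methods=["POST"])  # illustrative route name
def generate_pitch():
    payload = request.get_json()
    mode = payload.get("mode", "silly")  # "silly" or "serious"

    # Hand off to the agent, which runs the tool chain and the LLM call.
    agent = AgentCoordinator(mode=mode)
    result = agent.run(payload)

    return jsonify(result)


if __name__ == "__main__":
    app.run(port=2021)  # matches http://localhost:2021 used below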
Start the application locally by running the following in your terminal:
python web_server.py
Navigate to http://localhost:2021
You should be presented with a screen that looks like this:

Try both modes:
🎭 Silly Mode for playful pitches
💼 Serious Mode for professional ideas
Once you’ve tested it a few times, pull it up in Galileo and evaluate the logs to see what’s happening under the hood, so to speak.
Navigate to the relevant project by going to app.galileo.ai, logging in with your correct credentials, then selecting the project from the list of projects on the home screen.
💡Helpful tip! Bookmark your projects for future reference to access them more easily.
Once you’ve opened the project, select the correct log stream and explore different sessions by clicking into them. You should see something like this:

Creating Custom Metrics
Out of the box, observability tools can show you that your AI application is running—what data is flowing where, how tools are firing, and whether you’re getting outputs. But just knowing that something works isn’t enough.
What we really need to know is how well does the application work for my domain?
Fortunately, this is where subject-matter experts (SMEs) become a secret weapon. Outside my day job at Galileo, I’m also a stand-up comedian trained at UCB and The PIT. I know how to evaluate timing, delivery, and rhythm. I understand comedic structures and joke arcs. That knowledge helps me define what funny looks like in AI output—and log custom metrics to reflect that.
But what if your domain isn't comedy?
That’s more than okay—because custom metrics work best when they’re grounded in your domain knowledge, whatever that might be.
Not building a comedy app? No problem. The key is knowing what “good” looks like in your world—and turning that into a measurable signal.
Here are a few examples:
Radiology
Measure: Diagnostic accuracy, false positives/negatives
SME: A radiologist knows what a correct diagnosis looks like.
Education
Measure: Reading level, clarity, coverage of learning objectives
SME: A teacher can tell if the explanation actually teaches.
Customer Support
Measure: Resolution rate, escalation flags, helpfulness
SME: A support lead knows what a great customer response sounds like.
Finance
Measure: Regulatory compliance (e.g. SEC language), risk flagging and disclaimers
SME: A financial analyst or compliance officer knows if the language used is accurate, risk-aware, and formatted to firm standards.
You don’t need a background in data science—just expertise in your domain. If you can point to what “good” looks like, you can likely measure it.
👀 PSST — want to learn more about custom metrics? Check out the replay of a recent webinar from Jim Bob Bennett and Roie Schwaber-Cohen where they take a deep-dive into all things custom metrics and how they improve AI reliability.
So how do we create a custom metric?
Right, back to the mission at hand here. If you’re at this point, you’ve got your app working, you’ve got logs and traces showing in Galileo, and you’re ready for the next step.
Let’s first start by navigating to our project home inside of Galileo.
I named my project erin-custom-metric, but look for whatever name you gave your project and open the log stream that contains your traces.
Click on the “trace” view to see your most recent runs listed below; it should look something like this:

From this view, navigate to the upper right-hand side of your screen and click the `Configure Metrics` button. A side panel should appear with a set of different metrics for you to choose from.

Once in this panel, click “Create Metric” in the upper right-hand corner of your screen and select the “LLM-as-a-Judge” metric.

A window will appear where you will be able to generate a metric prompt. A metric prompt is a prompt that generates the final LLM-as-a-Judge evaluation prompt, so you can spend your time focused on what’s important (the success criteria) instead of worrying about the output format.
When writing a good prompt, remember that the goal is to transform subjective evaluation criteria into a consistent, repeatable process that a language model can assess.
Here are some tips that you can use to write a good LLM-as-a-Judge Prompt:
Define the Role and Context: Tell the LLM what role it’s playing and what it is evaluating; be specific. This sets the tone and domain so the LLM knows what type of responses to expect.
Example: You are an expert humor judge specializing in startup culture, satire, and tech industry parody.
List Clear Evaluation Criteria: Break your metric into clear, checkable sections. Use structured criteria with simple TRUE/FALSE scoring where possible. This improves consistency and reduces ambiguity.
Example:
Satire Effectiveness: Content clearly parodies startup culture, humor is recognizable to tech-savvy audiences, balances absurdity with believability
Define a Scoring Threshold: Add a final rule section that tells the LLM how to make a final judgment.
Example: Mark the response as successful if:
At least 80% of all criteria are TRUE
None of the critical categories (e.g. Satire Effectiveness, Humor Consistency) are FALSE
The content would be considered funny or insightful by the target audience
For the sake of this tutorial, we’ll use this established Silliness and Satire metric I’ve created; note how it follows the best practices above.

You are an expert humor judge specializing in startup culture satire and tech industry parody. Your role is to evaluate the humor and effectiveness of startup-related content generated by an AI system.

EVALUATION CRITERIA: For each criterion, answer TRUE if the content meets the standard, FALSE if it doesn't.

1. SATIRE EFFECTIVENESS
- [ ] Content clearly parodies startup culture tropes
- [ ] Parody is recognizable to tech industry insiders
- [ ] Maintains balance between believable and absurd
- [ ] Successfully mocks common startup practices

2. HUMOR CONSISTENCY
- [ ] Humor level remains consistent throughout
- [ ] No significant drops in comedic quality
- [ ] Tone remains appropriate for satire
- [ ] Jokes build upon each other effectively

3. CULTURAL RELEVANCE
- [ ] References are current and timely
- [ ] Captures current startup culture trends
- [ ] Buzzwords are accurately parodied
- [ ] Industry-specific knowledge is evident

4. NARRATIVE COHERENCE
- [ ] Story follows internal logic
- [ ] Pivots make sense within context
- [ ] Character/voice remains consistent
- [ ] Plot points connect logically

5. ORIGINALITY
- [ ] Avoids overused startup jokes
- [ ] Contains unique elements
- [ ] Offers fresh perspective
- [ ] Surprises the audience

6. TECHNICAL ACCURACY
- [ ] Startup concepts are correctly parodied
- [ ] Industry terminology is used appropriately
- [ ] Business concepts are accurately mocked
- [ ] Technical details are correctly referenced

Answer TRUE only if ALL of the following conditions are met:
- [ ] At least 80% of all criteria are rated TRUE
- [ ] No critical criteria (Satire Effectiveness, Humor Consistency) are rated FALSE
- [ ] Content would be considered funny by the target audience
- [ ] Satire successfully achieves its intended purpose
- [ ] Content maintains appropriate tone throughout
When you’ve determined your prompt, press ‘Save’, then test your metric. The evaluation prompt will then be generated (automagically, even ✨), and you’ll be able to see it in a preview.
From within the Custom Metric UI, select Test Metric.

Take the output from an earlier run, paste it into the output section of the ‘Test Metric’ page, and check the response.

Continue tweaking the prompt until you have a metric you feel confident in. The goal isn’t a metric that is perfect 100% of the time, but one that helps you determine what “good” looks like.
Have examples of what a subject matter expert would consider to be “good” and “bad” to test your metric for success.
Once tested, your metric will appear in the list of available metrics. Click on “Configure Metrics” and flip the toggle for the metric you’ve created to on. From there, it can be used to assess future LLM outputs across runs and sessions, giving you visibility into quality over time.

As new runs come in, you’ll be able to quickly identify which tool chains, prompts, or model variants produce better (or worse) results. This is especially helpful for debugging “feels off” outputs that don’t trigger hard failures. No longer will you rely on vibes to ensure AI is put into production safely and securely.
From startup satire to serious signal
Sure, this Startup Sim 3000 might make you laugh—but under the hood, you just built something genuinely powerful (and hopefully had some fun and learned valuable skills as well!)
You didn’t just generate funny fake startup pitches. You:
Structured an agent-based AI system
Logged tool and model activity with Galileo
Created a custom metric to judge the quality of your output
Learned how to translate fuzzy, subjective ideas (like “humor”) into measurable, testable signals
And that’s the big idea:
AI quality is contextual—so your metrics should be too.
Whether you're working with comedy, contracts, curriculums, or customer support, success isn’t binary. It’s “it depends on context.” And the only way to answer that at scale is with custom metrics grounded in your own domain knowledge.
That’s how you move from:
“It runs” → “It works well”
“Cool demo” → “Useful product”
“Kinda funny” → “Funny enough to ship”
For more tutorials like this, follow along at galileo.ai or check out our cookbooks. In the meantime, if you’ve got a joke to share (or if you’ve improved the Startup Simulator 3000), find me on GitHub, drop me an email, or toot at me on BlueSky.
Ever dreamed of building an AI that can pitch a startup idea—either as a polished VC-ready business plan or something so absurd it belongs on a meme board? (Oh, and have some way to prove that it works)
In this hands-on tutorial, you’ll build Startup Sim 3000: a Python-powered web app that uses real-time data and large language models (LLMs) to generate creative or professional startup pitches.
But here's the twist: this isn’t just about generating content. You'll also learn how to track, monitor, and measure your system using Agent Reliability tools and custom metrics with Galileo.
What you’ll learn
By the end of this tutorial, you'll have:
Built an AI Agent system that combines multiple tools
Logged tool spans and LLM spans with Galileo
Tracked custom LLM-as-a-Judge Metrics to evaluate AI behavior
But the biggest takeaway? Learning the importance of custom metrics for domain-specific AI Applications, and how to define and measure success using these custom metrics.
> PSST — This application was also featured as a talk at DataBrick’s 2025 Data and AI Conference.
Why custom metrics matter
Large language models (LLMs) are inherently nondeterministic—they don’t behave the same way every time, and that unpredictability is a feature, not a flaw. It’s what makes them powerful for creative and complex tasks. But it also makes these systems hard to evaluate using traditional software metrics like “test pass/fail” or “code coverage.”
It’s one thing to build an AI tool. It’s another thing entirely to know whether it’s doing what you actually want—especially in domain-specific use cases like comedy, medical, fintech, education, creative writing, or business ideation. In these spaces, there's rarely a single “correct” output. The value of an answer is contextual, not binary.
That’s where custom metrics come in. Instead of asking “Did it work?” you can ask:
Which tools are used most?
Are the LLM responses too long?
Is this successful at completing [domain specific] task?
What versions are most accurate?
These are the kinds of questions that matter—but they can’t be answered with traditional developer metrics alone.
With Galileo, you can define and track these domain-specific signals directly. You’ll see what’s working, what’s not, and where your system needs tuning. It’s how you turn a cool demo into a production-ready product.
In this tutorial, we’ll take that idea for a spin in a playful way—by evaluating humor. You’ll learn how to apply custom metrics to a comedy-generating app and measure success based on timing, tone, delivery, and more.
Custom metrics are the bridge between "interesting in theory" and "effective in production."
What you’ll need
Before you get started, have the following on hand.
Some familiarity with Python/Flask
Python Package Manager of choice (we’ll be using uv)
Code editor of choice (VS Code, Cursor, Warp, etc.)
Set up your project
For the sake of jumping right into action — we’ll be starting from an existing application and demonstrating how to add custom metrics to an existing application.
If you’re wanting to explore how to add Galileo to a new agentic application — check out this tutorial on how to create a Weather Vibes Agent or our cookbooks that will walk you through step-by-step how to get started with Galileo.
Set up your Galileo project
If you haven’t already, create a free Galileo account on app.galileo.ai. When prompted, add an organization name. To dive right into this tutorial, you can skip past the onboarding screen by clicking on the Galileo Logo in the upper left hand corner.

NOTE: you will not be able to come back to this screen again, however there are helpful instructions to getting started in the Galileo Docs.
Create a new project by clicking on the ‘new project’ button on the upper right hand screen. You will be prompted to add a project name, as well as a log stream name.
Once that is created. Click on the profile icon in the upper right hand side of the page, navigate on the drop-down menu to API keys. From the API Keys screen, select ‘Create New Key’. Save the key somewhere safe (and secure) for later.

Clone the repo and install dependencies with uv
From within your terminal, run the following:
git clone https://github.com/erinmikailstaples/startup-sim-3000.git
Set up a virtual environment and install dependencies with uv
A virtual environment keeps your project’s dependencies isolated from your global Python installation. For this we’ll be using uv.
On Windows
uv venv source .venv\Scripts\activate uv pip install -r
On MacOS/Linux
uv venv source .venv/bin/activate uv pip install -r
This creates and activates a virtual environment for your project, then installs the necessary requirements.
Set up your environment variables
Grab the example .env file (.env.example) and copy it, preparing to add your own variables. You can do so by running the following.
cp .env.example .env
Then update the companion .env file accordingly, replacing your Galileo API Key, Galileo Project, and Galileo Project name with the respective variables.
When complete, your .env values should look something like this:
# Example .env file — copy this file to .env and fill in the values. # Be sure to add the .env file to your .gitignore file. # LLM API Key (required) # For regular keys: sk-... # For project-based keys: sk-proj-... OPENAI_API_KEY=your-openai-api-key-here # OpenAI Project ID (optional for project-based keys; will be auto-extracted if not set) # OPENAI_PROJECT_ID=your-openai-project-id-here # Galileo Details (required for Galileo observability) GALILEO_API_KEY=your-galileo-api-key-here GALILEO_PROJECT=your project name here GALILEO_LOG_STREAM=my_log_stream # Optional LLM configuration LLM_MODEL=gpt-4 LLM_TEMPERATURE=0.7 # Optional agent configuration VERBOSITY=low # Options: none, low, high ENVIRONMENT=development ENABLE_LOGGING=true ENABLE_TOOL_SELECTION=true
Agent reliability and observability
In this example, the application already has Galileo built in ready to observe the application using the @log decorator as well as the GalileoLogger.
Check out the agent.py file to see Galileo’s implementation in practice, I’ll call out specific areas below.
As with many other SDKs, Galileo needs to be first initialized to prepare it for use. It's from within this step that you’ll set your configuration details (project name and log stream) for Galileo.
See below how the Galileo Logger is initialized in our agent.py file.
# Initialize Galileo Logger for this agent execution galileo_logger = GalileoLogger( project=os.environ.get("GALILEO_PROJECT"), log_stream=os.environ.get("GALILEO_LOG_STREAM") )
Once the SDK is initialized, you’ll create a workflow which will capture the relevant tool calls.
Starting the workflow will look something like this:
# Start the main agent trace - this is the parent trace for the entire workflow trace = galileo_logger.start_trace(f"agent_workflow_{self.mode}")
Now, having a workflow is great, and our agent.py file tells the LLM what tools to use when, but it doesn’t actually have the tools itself included.
For the tools itself, navigate into each respective tool file (such as ` /tools/news_api_tool.py`) where you’ll notice we’re using the @log decorator to define a span type + give it a name.
This decorator will create a tool span for HTTP API calls Since this tool makes HTTP requests to NewsAPI (not LLM calls), we use span_type="tool". Thus, the name "Tool-NewsAPI" will appear in your Galileo dashboard as a tool span.
See this in action around line 58 of the /tools/news_api_tool.py file:
@log(span_type="tool", name="Tool-NewsAPI") async def execute(self, ...):
In this application, we will also have an LLM Span. The Startup Simulator 3000 leverages an LLM span to compile the output from different tools into one comical output.
Similar to the tool span, an LLM Span will call the Galileo Logger. However, in this instance, because we want to have further customization around model choice and metadata, we are using the manual logger vs. the @log decorator.
Check out how this is put into action around line 83 in the /tools/startup_simulator.py file.
logger.add_llm_span( input=f"Generate startup pitch for {industry} targeting {audience} with word '{random_word}'", output="Tool execution started", model="startup_simulator", num_input_tokens=len(str(inputs)), num_output_tokens=0, total_tokens=len(str(inputs)), duration_ns=0 )
Run the application
After your environment variables are set, you are all set to run the application. The application is designed as a Flask Application, with a JavaScript frontend, and Python backend.
The standard flow of the application is as follows:
User Input → Flask Route → Agent Coordinator → Tool Chain → AI Model → Response
Run the application locally by running the following in the terminal.
python web_server.py
Navigate to http://localhost:2021
You should be presented with a screen that looks like this:

Try both modes:
🎭 Silly Mode for playful pitches
💼 Serious Mode for professional ideas
Once you’ve tested it a few times, pull it up in Galileo and evaluate the logs; see what’s happening under the hood so to speak.
Navigate to the relevant project by going to app.galileo.ai, logging in with your correct credentials, then selecting the project from the list of projects on the home screen.
💡Helpful tip! Bookmark your projects for future reference in order to access them easier.
Once selected, select the correct log stream, and explore different sessions by clicking into the log streams. You should see something like this when created.

Creating Custom Metrics
Out of the box, observability tools can show you that your AI application is running—what data is flowing where, how tools are firing, and whether you’re getting outputs. But just knowing that something works isn’t enough.
What we really need to know is how well does the application work for my domain?
Fortunately, this is where subject-matter experts (SME) becomes a secret weapon. Outside my day job at Galileo, I’m also a stand-up comedian trained at UCB and The PIT. I know how to evaluate timing, delivery, and rhythm. I understand comedic structures and joke arcs. That knowledge helps me define what funny looks like in AI output—and log custom metrics to reflect that.
But what if your domain isn't comedy?
That’s more than okay—because custom metrics work best when they’re grounded in your domain knowledge, whatever that might be.
Not building a comedy app? No problem. The key is knowing what “good” looks like in your world—and turning that into a measurable signal.
Here are a few examples:
Radiology
Measure: Diagnostic accuracy, false positives/negatives
SME: A radiologist knows what a correct diagnosis looks like.
Education
Measure: Reading level, clarity, coverage of learning objectives
SME: A teacher can tell if the explanation actually teaches.
Customer Support
Measure: Resolution rate, escalation flags, helpfulness
SME: A support lead knows what a great customer response sounds like.
Finance
Measure: Regulatory compliance (e.g. SEC language), risk flagging and disclaimers
SME: A financial analyst or compliance officer knows if the language used is accurate, risk-aware, and formatted to firm standards.
You don’t need a background in data science—just expertise in your domain. If you can point to what “good” looks like, you can likely measure it.
👀 PSST — want to learn more about custom metrics? Check out the replay of a recent webinar from Jim Bob Bennett and Roie Schwaber-Cohen where they take a deep-dive into all things custom metrics and how they improve AI reliability.
So how do we create a custom metric?
Right, back to the mission at hand here. If you’re at this point, you’ve got your app working, you’ve got logs and traces showing in Galileo, and you’re ready for the next step.
Let’s first start by navigating to our project home inside of Galileo.
I named my project erin-custom-metric
but look for whatever name you gave your project and open it up to the log stream you’ve got your traces in.
Click on the “trace” view and see your most recent runs listed below, it should look something like this

From this view, navigate to the upper right hand side of your screen and click on the `Configure Metrics` button. A side panel should appear with a set of different metrics from you to choose from.

Once in this panel, navigate to the “Create Metric” in the upper right hand corner of your screen, and select “LLM-as-a-Judge” Metric.

A window will appear where you will be able to generate a metric prompt. A metric prompt is a prompt to prompt the creation of the final LLM as a judge prompt. This is to ensure you spend your time focused on what’s important (the success criteria) instead of worrying about the output format.
When writing a good prompt, remember that the goal is to transform subjective evaluation criteria into a consistent, repeatable process that a language model can assess.
Here are some tips that you can use to write a good LLM-as-a-Judge Prompt:
Define the Role and Context: Tell the LLM what role it’s playing and what it is evaluating, be specific. This sets the tone and domain so the LLM knows what type of responses to expect.
Example: You are an expert humor judge specializing in startup culture, satire, and tech industry parody.
List Clear Evaluation Criteria: Break your metric into clear, checkable sections. Use structured criteria with simple TRUE/FALSE scoring where possible. This improves consistency and reduces ambiguity.
Example:
Satire Effectiveness: Content clearly parodies startup culture, humor is recognizable to tech-savvy audiences, balances absurdity with believability
Define a Scoring Threshold: Add a final rule section that tells the LLM how to make a final judgment.
Example: Mark the response as successful if:
At least 80% of all criteria are TRUE
None of the critical categories (e.g. Satire Effectiveness, Humor Consistency) are FALSE
The content would be considered funny or insightful by the target audience
For the sake of this tutorial, we’ll use this established Sillyness and Satire metric I’ve created, note how it follows the above best practices.

You are an expert humor judge specializing in startup culture satire and tech industry parody. Your role is to evaluate the humor and effectiveness of startup-related content generated by an AI system. EVALUATION CRITERIA: For each criterion, answer TRUE if the content meets the standard, FALSE if it doesn't. 1. SATIRE EFFECTIVENESS - [ ] Content clearly parodies startup culture tropes - [ ] Parody is recognizable to tech industry insiders - [ ] Maintains balance between believable and absurd - [ ] Successfully mocks common startup practices 2. HUMOR CONSISTENCY - [ ] Humor level remains consistent throughout - [ ] No significant drops in comedic quality - [ ] Tone remains appropriate for satire - [ ] Jokes build upon each other effectively 3. CULTURAL RELEVANCE - [ ] References are current and timely - [ ] Captures current startup culture trends - [ ] Buzzwords are accurately parodied - [ ] Industry-specific knowledge is evident 4. NARRATIVE COHERENCE - [ ] Story follows internal logic - [ ] Pivots make sense within context - [ ] Character/voice remains consistent - [ ] Plot points connect logically 5. ORIGINALITY - [ ] Avoids overused startup jokes - [ ] Contains unique elements - [ ] Offers fresh perspective - [ ] Surprises the audience 6. TECHNICAL ACCURACY - [ ] Startup concepts are correctly parodied - [ ] Industry terminology is used appropriately - [ ] Business concepts are accurately mocked - [ ] Technical details are correctly referenced Answer TRUE only if ALL of the following conditions are met: - [ ] At least 80% of all criteria are rated TRUE - [ ] No critical criteria (Satire Effectiveness, Humor Consistency) are rated FALSE - [ ] Content would be considered funny by the target audience - [ ] Satire successfully achieves its intended purpose - [ ] Content maintains appropriate tone throughout
When you’ve determined your prompt, press ‘save’ then test your metric. The evaluation prompt will then be generated (automagically, even✨) and you will be able to see it in a preview.
From within the Custom Metric UI, select Test Metric.

Take the output from an earlier run, and paste it in the output section of the ‘test metric’ page, and check the response.

Continue tweaking the prompt until you have a metric that you feel confident with — the goal isn’t to have a metric that is perfect 100% of the time, but helps you determine what “good” looks like.
Have examples of what a subject matter expert would consider to be “good” and “bad” to test your metric for success.
Once tested, your metric will appear in the list of available metrics. Click on “Configure Metrics” and flip the toggle the metric you’ve created, on. From there, it can be used to assess future LLM outputs across runs and sessions—giving you visibility into quality over time.

As new runs come in, you’ll be able to quickly identify which tool chains, prompts, or model variants produce better (or worse) results. This is especially helpful for debugging “feels off” outputs that don’t trigger hard failures. No longer will you rely on vibes to ensure AI is put into production safely and securely.
From startup satire to serious signal
Sure, this Startup Sim 3000 might make you laugh—but under the hood, you just built something genuinely powerful (and hopefully had some fun and learned valuable skills as well!)
You didn’t just generate funny fake startup pitches. You:
Structured an agent-based AI system
Logged tool and model activity with Galileo
Created a custom metric to judge the quality of your output
Learned how to translate fuzzy, subjective ideas (like “humor”) into measurable, testable signals
And that’s the big idea:
AI quality is contextual—so your metrics should be too.
Whether you're working with comedy, contracts, curriculums, or customer support, success isn’t binary. It’s “it depends on context” And the only way to answer that at scale is with custom metrics grounded in your own domain knowledge.
That’s how you move from:
“It runs” → “It works well”
“Cool demo” → “Useful product”
“Kinda funny” → “Funny enough to ship”
For more tutorials like this, follow along at galileo.ai or check out our cookbooks. In the meantime, if you’ve got a joke to share (or if you’ve improved the Startup Simulator 3000), find me on GitHub, drop me an email, or toot at me on BlueSky.
Ever dreamed of building an AI that can pitch a startup idea—either as a polished VC-ready business plan or something so absurd it belongs on a meme board? (Oh, and have some way to prove that it works)
In this hands-on tutorial, you’ll build Startup Sim 3000: a Python-powered web app that uses real-time data and large language models (LLMs) to generate creative or professional startup pitches.
But here's the twist: this isn’t just about generating content. You'll also learn how to track, monitor, and measure your system using Agent Reliability tools and custom metrics with Galileo.
What you’ll learn
By the end of this tutorial, you'll have:
Built an AI Agent system that combines multiple tools
Logged tool spans and LLM spans with Galileo
Tracked custom LLM-as-a-Judge Metrics to evaluate AI behavior
But the biggest takeaway? Learning the importance of custom metrics for domain-specific AI Applications, and how to define and measure success using these custom metrics.
> PSST — This application was also featured as a talk at DataBrick’s 2025 Data and AI Conference.
Why custom metrics matter
Large language models (LLMs) are inherently nondeterministic—they don’t behave the same way every time, and that unpredictability is a feature, not a flaw. It’s what makes them powerful for creative and complex tasks. But it also makes these systems hard to evaluate using traditional software metrics like “test pass/fail” or “code coverage.”
It’s one thing to build an AI tool. It’s another thing entirely to know whether it’s doing what you actually want—especially in domain-specific use cases like comedy, medical, fintech, education, creative writing, or business ideation. In these spaces, there's rarely a single “correct” output. The value of an answer is contextual, not binary.
That’s where custom metrics come in. Instead of asking “Did it work?” you can ask:
Which tools are used most?
Are the LLM responses too long?
Is this successful at completing [domain specific] task?
What versions are most accurate?
These are the kinds of questions that matter—but they can’t be answered with traditional developer metrics alone.
With Galileo, you can define and track these domain-specific signals directly. You’ll see what’s working, what’s not, and where your system needs tuning. It’s how you turn a cool demo into a production-ready product.
In this tutorial, we’ll take that idea for a spin in a playful way—by evaluating humor. You’ll learn how to apply custom metrics to a comedy-generating app and measure success based on timing, tone, delivery, and more.
Custom metrics are the bridge between "interesting in theory" and "effective in production."
What you’ll need
Before you get started, have the following on hand.
Some familiarity with Python/Flask
Python Package Manager of choice (we’ll be using uv)
Code editor of choice (VS Code, Cursor, Warp, etc.)
Set up your project
For the sake of jumping right into action — we’ll be starting from an existing application and demonstrating how to add custom metrics to an existing application.
If you’re wanting to explore how to add Galileo to a new agentic application — check out this tutorial on how to create a Weather Vibes Agent or our cookbooks that will walk you through step-by-step how to get started with Galileo.
Set up your Galileo project
If you haven’t already, create a free Galileo account on app.galileo.ai. When prompted, add an organization name. To dive right into this tutorial, you can skip past the onboarding screen by clicking on the Galileo Logo in the upper left hand corner.

NOTE: you will not be able to come back to this screen again, however there are helpful instructions to getting started in the Galileo Docs.
Create a new project by clicking on the ‘new project’ button on the upper right hand screen. You will be prompted to add a project name, as well as a log stream name.
Once that is created. Click on the profile icon in the upper right hand side of the page, navigate on the drop-down menu to API keys. From the API Keys screen, select ‘Create New Key’. Save the key somewhere safe (and secure) for later.

Clone the repo and install dependencies with uv
From within your terminal, run the following:
git clone https://github.com/erinmikailstaples/startup-sim-3000.git
Set up a virtual environment and install dependencies with uv
A virtual environment keeps your project’s dependencies isolated from your global Python installation. For this we’ll be using uv.
On Windows
uv venv source .venv\Scripts\activate uv pip install -r
On MacOS/Linux
uv venv source .venv/bin/activate uv pip install -r
This creates and activates a virtual environment for your project, then installs the necessary requirements.
Set up your environment variables
Grab the example .env file (.env.example) and copy it, preparing to add your own variables. You can do so by running the following.
cp .env.example .env
Then update the companion .env file accordingly, replacing your Galileo API Key, Galileo Project, and Galileo Project name with the respective variables.
When complete, your .env values should look something like this:
# Example .env file — copy this file to .env and fill in the values. # Be sure to add the .env file to your .gitignore file. # LLM API Key (required) # For regular keys: sk-... # For project-based keys: sk-proj-... OPENAI_API_KEY=your-openai-api-key-here # OpenAI Project ID (optional for project-based keys; will be auto-extracted if not set) # OPENAI_PROJECT_ID=your-openai-project-id-here # Galileo Details (required for Galileo observability) GALILEO_API_KEY=your-galileo-api-key-here GALILEO_PROJECT=your project name here GALILEO_LOG_STREAM=my_log_stream # Optional LLM configuration LLM_MODEL=gpt-4 LLM_TEMPERATURE=0.7 # Optional agent configuration VERBOSITY=low # Options: none, low, high ENVIRONMENT=development ENABLE_LOGGING=true ENABLE_TOOL_SELECTION=true
Agent reliability and observability
In this example, the application already has Galileo built in ready to observe the application using the @log decorator as well as the GalileoLogger.
Check out the agent.py file to see Galileo’s implementation in practice, I’ll call out specific areas below.
As with many other SDKs, Galileo needs to be first initialized to prepare it for use. It's from within this step that you’ll set your configuration details (project name and log stream) for Galileo.
See below how the Galileo Logger is initialized in our agent.py file.
# Initialize Galileo Logger for this agent execution galileo_logger = GalileoLogger( project=os.environ.get("GALILEO_PROJECT"), log_stream=os.environ.get("GALILEO_LOG_STREAM") )
Once the SDK is initialized, you’ll create a workflow which will capture the relevant tool calls.
Starting the workflow will look something like this:
# Start the main agent trace - this is the parent trace for the entire workflow trace = galileo_logger.start_trace(f"agent_workflow_{self.mode}")
Now, having a workflow is great, and our agent.py file tells the LLM what tools to use when, but it doesn’t actually have the tools itself included.
For the tools itself, navigate into each respective tool file (such as ` /tools/news_api_tool.py`) where you’ll notice we’re using the @log decorator to define a span type + give it a name.
This decorator will create a tool span for HTTP API calls Since this tool makes HTTP requests to NewsAPI (not LLM calls), we use span_type="tool". Thus, the name "Tool-NewsAPI" will appear in your Galileo dashboard as a tool span.
See this in action around line 58 of the /tools/news_api_tool.py file:
@log(span_type="tool", name="Tool-NewsAPI") async def execute(self, ...):
In this application, we will also have an LLM Span. The Startup Simulator 3000 leverages an LLM span to compile the output from different tools into one comical output.
Similar to the tool span, an LLM Span will call the Galileo Logger. However, in this instance, because we want to have further customization around model choice and metadata, we are using the manual logger vs. the @log decorator.
Check out how this is put into action around line 83 in the /tools/startup_simulator.py file.
logger.add_llm_span( input=f"Generate startup pitch for {industry} targeting {audience} with word '{random_word}'", output="Tool execution started", model="startup_simulator", num_input_tokens=len(str(inputs)), num_output_tokens=0, total_tokens=len(str(inputs)), duration_ns=0 )
Run the application
After your environment variables are set, you are all set to run the application. The application is designed as a Flask Application, with a JavaScript frontend, and Python backend.
The standard flow of the application is as follows:
User Input → Flask Route → Agent Coordinator → Tool Chain → AI Model → Response
Run the application locally by running the following in the terminal.
python web_server.py
Navigate to http://localhost:2021
You should be presented with a screen that looks like this:

Try both modes:
🎭 Silly Mode for playful pitches
💼 Serious Mode for professional ideas
Once you’ve tested it a few times, pull it up in Galileo and evaluate the logs; see what’s happening under the hood so to speak.
Navigate to the relevant project by going to app.galileo.ai, logging in with your correct credentials, then selecting the project from the list of projects on the home screen.
💡Helpful tip! Bookmark your projects for future reference in order to access them easier.
Once selected, select the correct log stream, and explore different sessions by clicking into the log streams. You should see something like this when created.

Creating Custom Metrics
Out of the box, observability tools can show you that your AI application is running—what data is flowing where, how tools are firing, and whether you’re getting outputs. But just knowing that something works isn’t enough.
What we really need to know is how well does the application work for my domain?
Fortunately, this is where subject-matter experts (SME) becomes a secret weapon. Outside my day job at Galileo, I’m also a stand-up comedian trained at UCB and The PIT. I know how to evaluate timing, delivery, and rhythm. I understand comedic structures and joke arcs. That knowledge helps me define what funny looks like in AI output—and log custom metrics to reflect that.
But what if your domain isn't comedy?
That’s more than okay—because custom metrics work best when they’re grounded in your domain knowledge, whatever that might be.
Not building a comedy app? No problem. The key is knowing what “good” looks like in your world—and turning that into a measurable signal.
Here are a few examples:
Radiology
Measure: Diagnostic accuracy, false positives/negatives
SME: A radiologist knows what a correct diagnosis looks like.
Education
Measure: Reading level, clarity, coverage of learning objectives
SME: A teacher can tell if the explanation actually teaches.
Customer Support
Measure: Resolution rate, escalation flags, helpfulness
SME: A support lead knows what a great customer response sounds like.
Finance
Measure: Regulatory compliance (e.g. SEC language), risk flagging and disclaimers
SME: A financial analyst or compliance officer knows if the language used is accurate, risk-aware, and formatted to firm standards.
You don’t need a background in data science—just expertise in your domain. If you can point to what “good” looks like, you can likely measure it.
👀 PSST — want to learn more about custom metrics? Check out the replay of a recent webinar from Jim Bob Bennett and Roie Schwaber-Cohen where they take a deep-dive into all things custom metrics and how they improve AI reliability.
So how do we create a custom metric?
Right, back to the mission at hand here. If you’re at this point, you’ve got your app working, you’ve got logs and traces showing in Galileo, and you’re ready for the next step.
Let’s first start by navigating to our project home inside of Galileo.
I named my project erin-custom-metric
but look for whatever name you gave your project and open it up to the log stream you’ve got your traces in.
Click on the “trace” view and see your most recent runs listed below, it should look something like this

From this view, navigate to the upper right hand side of your screen and click on the `Configure Metrics` button. A side panel should appear with a set of different metrics from you to choose from.

Once in this panel, navigate to the “Create Metric” in the upper right hand corner of your screen, and select “LLM-as-a-Judge” Metric.

A window will appear where you will be able to generate a metric prompt. A metric prompt is a prompt to prompt the creation of the final LLM as a judge prompt. This is to ensure you spend your time focused on what’s important (the success criteria) instead of worrying about the output format.
When writing a good prompt, remember that the goal is to transform subjective evaluation criteria into a consistent, repeatable process that a language model can assess.
Here are some tips that you can use to write a good LLM-as-a-Judge Prompt:
Define the Role and Context: Tell the LLM what role it’s playing and what it is evaluating, be specific. This sets the tone and domain so the LLM knows what type of responses to expect.
Example: You are an expert humor judge specializing in startup culture, satire, and tech industry parody.
List Clear Evaluation Criteria: Break your metric into clear, checkable sections. Use structured criteria with simple TRUE/FALSE scoring where possible. This improves consistency and reduces ambiguity.
Example:
Satire Effectiveness: Content clearly parodies startup culture, humor is recognizable to tech-savvy audiences, balances absurdity with believability
Define a Scoring Threshold: Add a final rule section that tells the LLM how to make a final judgment.
Example: Mark the response as successful if:
At least 80% of all criteria are TRUE
None of the critical categories (e.g. Satire Effectiveness, Humor Consistency) are FALSE
The content would be considered funny or insightful by the target audience
For the sake of this tutorial, we’ll use this established Sillyness and Satire metric I’ve created, note how it follows the above best practices.

You are an expert humor judge specializing in startup culture satire and tech industry parody. Your role is to evaluate the humor and effectiveness of startup-related content generated by an AI system. EVALUATION CRITERIA: For each criterion, answer TRUE if the content meets the standard, FALSE if it doesn't. 1. SATIRE EFFECTIVENESS - [ ] Content clearly parodies startup culture tropes - [ ] Parody is recognizable to tech industry insiders - [ ] Maintains balance between believable and absurd - [ ] Successfully mocks common startup practices 2. HUMOR CONSISTENCY - [ ] Humor level remains consistent throughout - [ ] No significant drops in comedic quality - [ ] Tone remains appropriate for satire - [ ] Jokes build upon each other effectively 3. CULTURAL RELEVANCE - [ ] References are current and timely - [ ] Captures current startup culture trends - [ ] Buzzwords are accurately parodied - [ ] Industry-specific knowledge is evident 4. NARRATIVE COHERENCE - [ ] Story follows internal logic - [ ] Pivots make sense within context - [ ] Character/voice remains consistent - [ ] Plot points connect logically 5. ORIGINALITY - [ ] Avoids overused startup jokes - [ ] Contains unique elements - [ ] Offers fresh perspective - [ ] Surprises the audience 6. TECHNICAL ACCURACY - [ ] Startup concepts are correctly parodied - [ ] Industry terminology is used appropriately - [ ] Business concepts are accurately mocked - [ ] Technical details are correctly referenced Answer TRUE only if ALL of the following conditions are met: - [ ] At least 80% of all criteria are rated TRUE - [ ] No critical criteria (Satire Effectiveness, Humor Consistency) are rated FALSE - [ ] Content would be considered funny by the target audience - [ ] Satire successfully achieves its intended purpose - [ ] Content maintains appropriate tone throughout
When you’ve determined your prompt, press ‘save’ then test your metric. The evaluation prompt will then be generated (automagically, even✨) and you will be able to see it in a preview.
From within the Custom Metric UI, select Test Metric.

Take the output from an earlier run, and paste it in the output section of the ‘test metric’ page, and check the response.

Continue tweaking the prompt until you have a metric that you feel confident with — the goal isn’t to have a metric that is perfect 100% of the time, but helps you determine what “good” looks like.
Have examples of what a subject matter expert would consider to be “good” and “bad” to test your metric for success.
Once tested, your metric will appear in the list of available metrics. Click on “Configure Metrics” and flip the toggle the metric you’ve created, on. From there, it can be used to assess future LLM outputs across runs and sessions—giving you visibility into quality over time.

As new runs come in, you’ll be able to quickly identify which tool chains, prompts, or model variants produce better (or worse) results. This is especially helpful for debugging “feels off” outputs that don’t trigger hard failures. No longer will you rely on vibes to ensure AI is put into production safely and securely.
From startup satire to serious signal
Sure, this Startup Sim 3000 might make you laugh—but under the hood, you just built something genuinely powerful (and hopefully had some fun and learned valuable skills as well!)
You didn’t just generate funny fake startup pitches. You:
Structured an agent-based AI system
Logged tool and model activity with Galileo
Created a custom metric to judge the quality of your output
Learned how to translate fuzzy, subjective ideas (like “humor”) into measurable, testable signals
And that’s the big idea:
AI quality is contextual—so your metrics should be too.
Whether you're working with comedy, contracts, curriculums, or customer support, success isn’t binary. It’s “it depends on context” And the only way to answer that at scale is with custom metrics grounded in your own domain knowledge.
That’s how you move from:
“It runs” → “It works well”
“Cool demo” → “Useful product”
“Kinda funny” → “Funny enough to ship”
For more tutorials like this, follow along at galileo.ai or check out our cookbooks. In the meantime, if you’ve got a joke to share (or if you’ve improved the Startup Simulator 3000), find me on GitHub, drop me an email, or toot at me on BlueSky.