AG-UI Integration
Ragas can run experiments on agents that stream events via the AG-UI protocol. This notebook shows how to build experiment datasets, configure metrics, and score AG-UI endpoints using the modern @experiment decorator pattern.
Prerequisites
- Install dependencies:
  pip install "ragas[ag-ui]" python-dotenv nest_asyncio
- Start an AG-UI compatible agent locally (Google ADK, PydanticAI, CrewAI, etc.)
- Create an .env file with your evaluator LLM credentials (e.g. OPENAI_API_KEY, GOOGLE_API_KEY, etc.)
- If you run this notebook, call nest_asyncio.apply() (shown below) so you can await coroutines in-place.
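Before starting, it can help to confirm that the evaluator credentials were actually picked up from the environment. The following is a minimal sketch using only the standard library; `missing_credentials` is a hypothetical helper written for this walkthrough, not part of Ragas, and the variable name checked is just the OpenAI example from the list above.

```python
import os


def missing_credentials(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Check for the evaluator key before creating any clients.
missing = missing_credentials(["OPENAI_API_KEY"])
if missing:
    print("Missing evaluator credentials:", ", ".join(missing))
```

Run it after load_dotenv() so that values from the .env file are included in the check.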
Imports and environment setup
Load environment variables and import the classes used throughout the walkthrough.
import json
import nest_asyncio
import pandas as pd
from dotenv import load_dotenv
from IPython.display import display
from ragas.dataset import Dataset
from ragas.messages import HumanMessage
load_dotenv()
# Patch the existing notebook loop so we can await coroutines safely
nest_asyncio.apply()
Build single-turn experiment data
When you only need to grade the final answer text, create dataset entries with user_input and reference columns using Dataset.from_pandas().
scientist_questions = Dataset.from_pandas(
    pd.DataFrame(
        [
            {
                "user_input": "Who originated the theory of relativity?",
                "reference": "Albert Einstein originated the theory of relativity.",
            },
            {
                "user_input": "Who discovered penicillin and when?",
                "reference": "Alexander Fleming discovered penicillin in 1928.",
            },
        ]
    ),
    name="scientist_questions",
    backend="inmemory",
)
scientist_questions
Build multi-turn conversations
For tool-usage and goal accuracy metrics, provide:
- reference_tool_calls: Expected tool calls as JSON for ToolCallF1
- reference: Expected outcome description for AgentGoalAccuracyWithReference
weather_queries = Dataset.from_pandas(
    pd.DataFrame(
        [
            {
                "user_input": [HumanMessage(content="What's the weather in Paris?")],
                "reference_tool_calls": json.dumps(
                    [{"name": "get_weather", "args": {"location": "Paris"}}]
                ),
                # Expected outcome - phrased to match what LLM extracts as end_state
                "reference": "The AI provided the current weather conditions for Paris.",
            },
            {
                "user_input": [
                    HumanMessage(content="Is it raining in London right now?")
                ],
                "reference_tool_calls": json.dumps(
                    [{"name": "get_weather", "args": {"location": "London"}}]
                ),
                "reference": "The AI provided the current weather conditions for London.",
            },
        ]
    ),
    name="weather_queries",
    backend="inmemory",
)
weather_queries
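The reference_tool_calls column above stores each expected call list as a JSON string, so it has to be decoded again before scoring. The sketch below shows that round trip with plain dictionaries and the standard library only; `parse_reference_tool_calls` is a hypothetical helper for illustration, and it deliberately returns dicts rather than Ragas ToolCall objects.

```python
import json


def parse_reference_tool_calls(raw):
    """Normalize a reference_tool_calls cell into a list of dicts.

    Accepts a JSON string (e.g. loaded from CSV), an already-parsed
    list, or an empty/None value. Hypothetical helper for illustration.
    """
    if not raw:
        return []
    if isinstance(raw, str):
        raw = json.loads(raw)
    # Keep only the two fields used in this walkthrough: name and args.
    return [
        {"name": item["name"], "args": dict(item.get("args", {}))}
        for item in raw
    ]


encoded = json.dumps([{"name": "get_weather", "args": {"location": "Paris"}}])
decoded = parse_reference_tool_calls(encoded)
print(decoded)
```

Storing the calls as JSON keeps the column a plain string, which survives export to formats such as CSV.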
Configure metrics and the evaluator LLM
For single-turn Q&A experiments, we use:
- FactualCorrectness: Compares response facts against reference
- AnswerRelevancy: Measures how relevant the response is to the question
- DiscreteMetric: Custom metric for conciseness
For multi-turn agent experiments, we use:
- ToolCallF1: Rule-based metric comparing actual vs expected tool calls
- AgentGoalAccuracyWithReference: LLM-based metric evaluating whether the agent achieved the user's goal
from openai import AsyncOpenAI
from ragas.embeddings.base import embedding_factory
from ragas.llms import llm_factory
from ragas.metrics import DiscreteMetric
from ragas.metrics.collections import (
    AgentGoalAccuracyWithReference,
    AnswerRelevancy,
    FactualCorrectness,
    ToolCallF1,
)
# Async client for evaluator prompts
async_llm_client = AsyncOpenAI()
evaluator_llm = llm_factory("gpt-4o-mini", client=async_llm_client)
embedding_client = AsyncOpenAI()
evaluator_embeddings = embedding_factory(
    "openai",
    model="text-embedding-3-small",
    client=embedding_client,
    interface="modern",
)
conciseness_metric = DiscreteMetric(
    name="conciseness",
    allowed_values=["verbose", "concise"],
    prompt=(
        "Is the response concise, and does it convey information efficiently?\n\n"
        "Response: {response}\n\n"
        "Answer with only 'verbose' or 'concise'."
    ),
)
# Metrics for single-turn Q&A experiments
qa_metrics = [
    FactualCorrectness(
        llm=evaluator_llm,
        mode="f1",
        atomicity="high",
        coverage="high",
    ),
    AnswerRelevancy(
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
        strictness=2,
    ),
    conciseness_metric,
]
# Metrics for multi-turn agent experiments
# - ToolCallF1: Rule-based metric for tool call accuracy
# - AgentGoalAccuracyWithReference: LLM-based metric for goal achievement
tool_metrics = [
    ToolCallF1(),
    AgentGoalAccuracyWithReference(llm=evaluator_llm),
]
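ToolCallF1 is described above as a rule-based comparison of actual and expected tool calls. To make the idea of an F1 score over tool calls concrete, here is a small standalone sketch. It is not Ragas's implementation: the matching rule (same name and identical arguments) and the handling of the empty case are assumptions made for this example.

```python
import json


def _call_key(call):
    # Canonical, hashable form of a call: name plus args as sorted-key JSON.
    return (call["name"], json.dumps(call.get("args", {}), sort_keys=True))


def tool_call_f1(predicted, expected):
    """Set-based F1 between predicted and expected tool calls (sketch)."""
    pred = {_call_key(c) for c in predicted}
    ref = {_call_key(c) for c in expected}
    if not pred and not ref:
        # Nothing expected and nothing called: treated as a perfect
        # match in this sketch (an assumption, not a Ragas guarantee).
        return 1.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


expected = [{"name": "get_weather", "args": {"location": "Paris"}}]
predicted = [
    {"name": "get_weather", "args": {"location": "Paris"}},
    {"name": "get_weather", "args": {"location": "London"}},
]
# One correct call out of two predicted: precision 0.5, recall 1.0.
print(round(tool_call_f1(predicted, expected), 3))
```

An extra, unexpected call lowers precision while a missing expected call lowers recall, so the score penalizes both kinds of mistake.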
Run experiments against a live AG-UI endpoint
Set the endpoint URL exposed by your agent. The run_ag_ui_row() function calls your endpoint and returns enriched row data. Combine this with the @experiment decorator for evaluation pipelines.
Toggle the flags when you are ready to run the experiments. In Jupyter/IPython you can await the experiment directly once nest_asyncio.apply() has been called.
AG_UI_ENDPOINT = "http://localhost:8000" # Update to match your agent
RUN_FACTUAL_EXPERIMENT = True
RUN_TOOL_EXPERIMENT = True
from ragas import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
@experiment()
async def factual_experiment(row):
    """Single-turn Q&A experiment with factual correctness scoring."""
    # Call AG-UI endpoint and get enriched row
    enriched = await run_ag_ui_row(row, AG_UI_ENDPOINT, metadata=True)
    # Score with factual correctness metric
    fc_result = await qa_metrics[0].ascore(
        response=enriched["response"],
        reference=row["reference"],
    )
    # Score with answer relevancy metric
    ar_result = await qa_metrics[1].ascore(
        user_input=row["user_input"],
        response=enriched["response"],
    )
    # Score with conciseness metric
    concise_result = await conciseness_metric.ascore(
        response=enriched["response"],
        llm=evaluator_llm,
    )
    return {
        **enriched,
        "factual_correctness": fc_result.value,
        "answer_relevancy": ar_result.value,
        "conciseness": concise_result.value,
    }
if RUN_FACTUAL_EXPERIMENT:
    # Run the experiment against the dataset
    factual_result = await factual_experiment.arun(
        scientist_questions, name="scientist_qa_experiment"
    )
    display(factual_result.to_pandas())
from ragas.messages import ToolCall
@experiment()
async def tool_experiment(row):
    """Multi-turn experiment with tool call and goal accuracy scoring."""
    # Call AG-UI endpoint and get enriched row
    enriched = await run_ag_ui_row(row, AG_UI_ENDPOINT)
    # Parse reference_tool_calls from JSON string (e.g., from CSV)
    ref_tool_calls_raw = row.get("reference_tool_calls")
    if isinstance(ref_tool_calls_raw, str):
        ref_tool_calls = [ToolCall(**tc) for tc in json.loads(ref_tool_calls_raw)]
    else:
        ref_tool_calls = ref_tool_calls_raw or []
    # Score with tool metrics using the modern collections API
    f1_result = await tool_metrics[0].ascore(
        user_input=enriched["messages"],
        reference_tool_calls=ref_tool_calls,
    )
    goal_result = await tool_metrics[1].ascore(
        user_input=enriched["messages"],
        reference=row.get("reference", ""),
    )
    return {
        **enriched,
        "tool_call_f1": f1_result.value,
        "agent_goal_accuracy": goal_result.value,
    }
if RUN_TOOL_EXPERIMENT:
    # Run the experiment against the dataset
    tool_result = await tool_experiment.arun(
        weather_queries, name="weather_tool_experiment"
    )
    display(tool_result.to_pandas())
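The experiments above return one dictionary of scores per row. A common next step is to average each numeric metric across rows. The sketch below does this with the standard library; `mean_scores` is a hypothetical helper, and the rows shown are placeholder values for illustration only, not results from any real run.

```python
from statistics import mean


def mean_scores(rows, metric_names):
    """Average each named metric over rows, skipping non-numeric values."""
    summary = {}
    for name in metric_names:
        values = [
            row[name] for row in rows
            if isinstance(row.get(name), (int, float))
        ]
        # None signals that no row had a usable value for this metric.
        summary[name] = mean(values) if values else None
    return summary


# Placeholder rows for illustration only; these are not real results.
rows = [
    {"tool_call_f1": 1.0, "agent_goal_accuracy": 1.0},
    {"tool_call_f1": 0.5, "agent_goal_accuracy": None},
]
print(mean_scores(rows, ["tool_call_f1", "agent_goal_accuracy"]))
```

With pandas, calling .mean() on the score columns of the DataFrame returned by to_pandas() gives a similar summary.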
Advanced: Lower-Level Control
The run_ag_ui_row() function is the recommended API, but sometimes you need more control. You can use the lower-level call_ag_ui_endpoint() function directly.
This approach lets you:
- Customize event handling
- Add per-row endpoint configuration
- Implement custom message processing
- Add additional logging or debugging
from ragas.integrations.ag_ui import (
call_ag_ui_endpoint,
convert_to_ragas_messages,
extract_response,
)
@experiment()
async def custom_ag_ui_experiment(row):
    """
    Custom experiment function with full control over endpoint calls.
    """
    # Call the AG-UI endpoint directly (lower-level than run_ag_ui_row)
    events = await call_ag_ui_endpoint(
        endpoint_url=AG_UI_ENDPOINT,
        user_input=row["user_input"],
        timeout=60.0,
    )
    # Convert AG-UI events to Ragas messages
    messages = convert_to_ragas_messages(events, metadata=True)
    # Extract response using helper (or custom logic)
    response = extract_response(messages)
    # Score with a custom metric
    score_result = await conciseness_metric.ascore(
        response=response,
        llm=evaluator_llm,
    )
    # Return result with custom fields
    return {
        **row,
        "response": response or "[No response]",
        "message_count": len(messages),
        "conciseness": score_result.value,
    }
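Per-row endpoint configuration, one of the uses listed above, could look like the following minimal sketch. The row columns endpoint_url and timeout_s are hypothetical names invented for this example; the returned keys mirror the endpoint_url and timeout keyword arguments used in the call above.

```python
DEFAULT_ENDPOINT = "http://localhost:8000"


def endpoint_config(row, default_url=DEFAULT_ENDPOINT, default_timeout=60.0):
    """Build per-row keyword arguments for the endpoint call.

    Falls back to the defaults when the (hypothetical) row columns
    "endpoint_url" and "timeout_s" are missing or empty.
    """
    return {
        "endpoint_url": row.get("endpoint_url") or default_url,
        "timeout": float(row.get("timeout_s") or default_timeout),
    }


# Inside the experiment function this could be used as:
#   events = await call_ag_ui_endpoint(
#       user_input=row["user_input"], **endpoint_config(row)
#   )
print(endpoint_config({"user_input": "Who discovered penicillin and when?"}))
```

Rows that omit both columns behave exactly like the fixed-endpoint version, so the dataset only needs extra columns for the rows that differ.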
Run the custom experiment against a dataset. The @experiment decorator provides .arun() for parallel execution and automatic result collection:
RUN_CUSTOM_EXPERIMENT = True
if RUN_CUSTOM_EXPERIMENT:
    # Run the custom experiment
    custom_result = await custom_ag_ui_experiment.arun(
        scientist_questions, name="custom_ag_ui_experiment"
    )
    display(custom_result.to_pandas())
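The text above notes that .arun() executes rows in parallel and collects the results. As a rough mental model only, and not a description of how Ragas implements it, running an async per-row function over a dataset with bounded concurrency can be sketched with asyncio alone. `run_rows`, `fake_experiment`, and the max_concurrency parameter are all hypothetical names for this example.

```python
import asyncio


async def run_rows(row_fn, rows, max_concurrency=4):
    """Run row_fn over rows concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(row):
        async with semaphore:
            return await row_fn(row)

    # gather returns results in the same order as the input rows.
    return await asyncio.gather(*(guarded(row) for row in rows))


async def fake_experiment(row):
    # Stand-in for an experiment function; no network call is made.
    await asyncio.sleep(0)
    return {**row, "length": len(row["user_input"])}


results = asyncio.run(
    run_rows(fake_experiment, [{"user_input": "hi"}, {"user_input": "hello"}])
)
print(results)
```

The semaphore caps how many rows are in flight at once, which matters when each row triggers a request to a locally hosted agent.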
API Comparison
| API Level | Function | When to Use |
|---|---|---|
| High-level | run_ag_ui_row() | Standard experiments - handles endpoint call, conversion, and extraction |
| Low-level | call_ag_ui_endpoint() + convert_to_ragas_messages() | Custom event handling, per-row endpoint config, advanced debugging |
Both approaches work with the @experiment decorator - choose based on how much control you need.