AG-UI
AG-UI is an event-based protocol for streaming agent updates to user interfaces. The protocol standardizes message, tool-call, and state events, which makes it easy to plug different agent runtimes into visual frontends. The ragas.integrations.ag_ui module helps you transform those event streams into Ragas message objects and run experiments against live AG-UI endpoints using the modern @experiment decorator pattern.
This guide assumes you already have an AG-UI compatible agent running (for example, one built with Google ADK, PydanticAI, or CrewAI) and that you are familiar with creating datasets in Ragas.
Install the integration
The AG-UI helpers live behind an optional extra. Install it together with the dependencies required by your evaluator LLM. When running inside Jupyter or IPython, include nest_asyncio so you can reuse the notebook's event loop.
Configure your evaluator LLM credentials. For example, if you are using OpenAI models, set OPENAI_API_KEY in your shell environment or in a .env file.
Load the environment variables inside Python before running the examples:
from dotenv import load_dotenv
import nest_asyncio
load_dotenv()
# If you're inside Jupyter/IPython, patch the running event loop once.
nest_asyncio.apply()
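A quick sanity check after load_dotenv() can catch missing credentials before any evaluator call is made. The following is a minimal sketch that assumes an OpenAI evaluator, so it checks OPENAI_API_KEY; the helper name missing_env_vars is hypothetical and not part of Ragas.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# OPENAI_API_KEY is the variable the OpenAI client reads; adjust for other providers.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
```

Passing an explicit env mapping makes the check easy to exercise in tests without touching the real environment.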
Build an experiment dataset
A dataset can contain single-turn or multi-turn samples. With AG-UI you can test either pattern: single questions with free-form responses, or longer conversations that include tool calls.
Single-turn samples
Use Dataset.from_pandas() with user_input and reference columns when you only need to grade the final answer text.
import pandas as pd
from ragas.dataset import Dataset
scientist_questions = Dataset.from_pandas(
    pd.DataFrame([
        {
            "user_input": "Who originated the theory of relativity?",
            "reference": "Albert Einstein originated the theory of relativity.",
        },
        {
            "user_input": "Who discovered penicillin and when?",
            "reference": "Alexander Fleming discovered penicillin in 1928.",
        },
    ]),
    name="scientist_questions",
    backend="inmemory",
)
Multi-turn samples with tool expectations
When you want to grade intermediate agent behavior—like whether it calls tools correctly and achieves the user's goal—use conversation lists as user_input. Provide expected tool calls as JSON and optionally a reference outcome for goal accuracy evaluation.
import json
import pandas as pd
from ragas.dataset import Dataset
from ragas.messages import HumanMessage
weather_queries = Dataset.from_pandas(
    pd.DataFrame([
        {
            "user_input": [HumanMessage(content="What's the weather in Paris?")],
            "reference_tool_calls": json.dumps([
                {"name": "get_weather", "args": {"location": "Paris"}}
            ]),
            # Expected outcome for AgentGoalAccuracyWithReference
            "reference": "The user received the current weather conditions for Paris.",
        },
        {
            "user_input": [HumanMessage(content="Is it raining in London right now?")],
            "reference_tool_calls": json.dumps([
                {"name": "get_weather", "args": {"location": "London"}}
            ]),
            "reference": "The user received the current weather conditions for London.",
        },
    ]),
    name="weather_queries",
    backend="inmemory",
)
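Storing reference_tool_calls as a JSON string keeps each row portable across backends, but the string has to be parsed back into a list before scoring. The sketch below shows that round trip with the standard library only; the helper names encode_tool_calls and decode_tool_calls are hypothetical, and the validation rule (every call needs "name" and "args") is an assumption based on the rows shown above.

```python
import json


def encode_tool_calls(calls):
    """Serialize a list of {"name": ..., "args": ...} dicts to a JSON string."""
    return json.dumps(calls, sort_keys=True)


def decode_tool_calls(raw):
    """Accept a JSON string or an already-parsed list and return a list of dicts."""
    if raw is None or raw == "":
        return []
    calls = json.loads(raw) if isinstance(raw, str) else list(raw)
    for call in calls:
        # Assumed shape: each expected call carries a tool name and its arguments.
        if "name" not in call or "args" not in call:
            raise ValueError(f"tool call needs 'name' and 'args': {call!r}")
    return calls


expected = [{"name": "get_weather", "args": {"location": "Paris"}}]
round_tripped = decode_tool_calls(encode_tool_calls(expected))
```

Accepting both strings and lists in the decoder lets the same experiment code handle in-memory datasets and datasets loaded from disk.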
Loading from CSV
For larger datasets, store your test cases in CSV files and load them with the Dataset API:
from ragas.dataset import Dataset
dataset = Dataset.load(
    name="scientist_biographies",
    backend="local/csv",
    root_dir="./test_data",
)
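When preparing such a CSV by hand, the tool-call column needs to survive CSV quoting. The following sketch, using only the standard library, writes rows with the column names used earlier in this guide and reads them back. Where the local/csv backend expects the file under root_dir is not covered here, so the example stops at producing CSV text; the helper name rows_to_csv is hypothetical.

```python
import csv
import io
import json

FIELDNAMES = ("user_input", "reference_tool_calls", "reference")

rows = [
    {
        "user_input": "What's the weather in Paris?",
        "reference_tool_calls": json.dumps(
            [{"name": "get_weather", "args": {"location": "Paris"}}]
        ),
        "reference": "The user received the current weather conditions for Paris.",
    },
]


def rows_to_csv(rows, fieldnames=FIELDNAMES):
    """Render rows as CSV text; the csv module quotes embedded commas and quotes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
# Read the text back to confirm the JSON column survives the round trip.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
```

Every value comes back as a string, which is why the experiment code further down parses reference_tool_calls with json.loads when it receives a str.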
Choose metrics and evaluator model
The integration works with any Ragas metric. To unlock the modern collections portfolio (and mix in custom checks), build an Instructor-compatible LLM for the evaluator prompts and use a synchronous OpenAI client for embeddings.
from openai import AsyncOpenAI, OpenAI
from ragas.llms import llm_factory
from ragas.embeddings import embedding_factory
from ragas.metrics import DiscreteMetric
from ragas.metrics.collections import (
    AgentGoalAccuracyWithReference,
    AnswerRelevancy,
    FactualCorrectness,
    ToolCallF1,
)
async_llm_client = AsyncOpenAI()
evaluator_llm = llm_factory("gpt-4o-mini", client=async_llm_client)
# AnswerRelevancy's embeddings still run synchronously, so pair it with a sync client.
embedding_client = OpenAI()
evaluator_embeddings = embedding_factory(
    "openai", model="text-embedding-3-small", client=embedding_client, interface="modern"
)
conciseness_metric = DiscreteMetric(
    name="conciseness",
    allowed_values=["verbose", "concise"],
    prompt=(
        "Is the response concise and efficiently conveys information?\n\n"
        "Response: {response}\n\n"
        "Answer with only 'verbose' or 'concise'."
    ),
)
# Metrics for single-turn Q&A evaluation
qa_metrics = [
    FactualCorrectness(
        llm=evaluator_llm, mode="f1", atomicity="high", coverage="high"
    ),
    AnswerRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=2),
    conciseness_metric,
]
# Metrics for multi-turn agent evaluation
# - ToolCallF1: Rule-based metric for tool call accuracy
# - AgentGoalAccuracyWithReference: LLM-based metric for goal achievement
tool_metrics = [
    ToolCallF1(),
    AgentGoalAccuracyWithReference(llm=evaluator_llm),
]
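To build intuition for what a rule-based tool-call F1 measures, the sketch below computes precision, recall, and F1 over sets of (name, args) pairs. It is an illustrative approximation and not the Ragas implementation of ToolCallF1: exact matching on name and arguments, and a score of 1.0 when both sides are empty, are assumptions made for this example.

```python
import json


def _call_key(call):
    """Build a hashable key from a tool call dict (name plus canonical JSON args)."""
    return (call["name"], json.dumps(call["args"], sort_keys=True))


def tool_call_f1(predicted, reference):
    """F1 over exact (name, args) matches between predicted and reference calls."""
    pred = {_call_key(c) for c in predicted}
    ref = {_call_key(c) for c in reference}
    if not pred and not ref:
        return 1.0  # Assumption: nothing expected and nothing called counts as perfect.
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


paris = {"name": "get_weather", "args": {"location": "Paris"}}
london = {"name": "get_weather", "args": {"location": "London"}}
# One correct call plus one extra call: precision 0.5, recall 1.0.
score = tool_call_f1([paris, london], [paris])
```

An extra, unexpected tool call lowers precision while a missing expected call lowers recall, so the combined score penalizes both kinds of mistake.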
Run experiments with @experiment
The AG-UI integration provides run_ag_ui_row() to call your endpoint and enrich each row with the agent's response. Combine this with the @experiment decorator to build evaluation pipelines.
⚠️ The endpoint must expose the AG-UI SSE stream. Common paths include /chat, /agent, or /agentic_chat.
Basic single-turn evaluation
In Jupyter or IPython, use top-level await (after nest_asyncio.apply()) instead of asyncio.run to avoid the "event loop is already running" error. For scripts you can keep asyncio.run.
from ragas import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.metrics.collections import FactualCorrectness
@experiment()
async def factual_experiment(row):
    # Call AG-UI endpoint and get enriched row
    enriched = await run_ag_ui_row(row, "http://localhost:8000/chat")
    # Score with metrics
    score = await FactualCorrectness(llm=evaluator_llm).ascore(
        response=enriched["response"],
        reference=row["reference"],
    )
    return {**enriched, "factual_correctness": score.value}
# Run the experiment against the dataset
# In Jupyter/IPython (after calling nest_asyncio.apply())
factual_result = await factual_experiment.arun(
    scientist_questions,
    name="scientist_qa_eval"
)
# In a standalone script, add `import asyncio` and use:
# factual_result = asyncio.run(factual_experiment.arun(scientist_questions, name="scientist_qa_eval"))
factual_result.to_pandas()
The resulting dataframe includes per-sample scores, raw agent responses, and any retrieved contexts (tool results). Results are automatically saved by the experiment framework, and you can export to CSV through pandas.
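Per-sample scores are often easier to read once aggregated. The sketch below averages numeric score columns over a list of result rows using only the standard library. The rows are made-up stand-ins for what an experiment function returns, and summarize_scores is a hypothetical helper, not a Ragas API.

```python
from statistics import mean


def summarize_scores(rows, score_keys):
    """Average each score column, skipping rows where the score is missing or non-numeric."""
    summary = {}
    for key in score_keys:
        values = [row[key] for row in rows if isinstance(row.get(key), (int, float))]
        summary[key] = mean(values) if values else None
    return summary


# Hypothetical result rows shaped like the dicts returned by factual_experiment.
example_rows = [
    {"user_input": "q1", "factual_correctness": 1.0},
    {"user_input": "q2", "factual_correctness": 0.5},
    {"user_input": "q3", "factual_correctness": None},  # e.g. a failed sample
]
summary = summarize_scores(example_rows, ["factual_correctness"])
```

Skipping missing values keeps failed samples from distorting the average, although it is worth reporting how many samples were skipped alongside the mean.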
Multi-turn tool evaluation
For multi-turn datasets and tool evaluation, pass the messages and reference tool calls directly to the metrics:
import json
from ragas import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.messages import ToolCall
from ragas.metrics.collections import AgentGoalAccuracyWithReference, ToolCallF1
@experiment()
async def tool_experiment(row):
    # Call AG-UI endpoint and get enriched row
    enriched = await run_ag_ui_row(row, "http://localhost:8000/chat")
    # Parse reference_tool_calls from JSON string (e.g., from CSV)
    ref_tool_calls_raw = row.get("reference_tool_calls")
    if isinstance(ref_tool_calls_raw, str):
        ref_tool_calls = [ToolCall(**tc) for tc in json.loads(ref_tool_calls_raw)]
    else:
        ref_tool_calls = ref_tool_calls_raw or []
    # Score with tool metrics using the modern collections API
    f1_result = await ToolCallF1().ascore(
        user_input=enriched["messages"],
        reference_tool_calls=ref_tool_calls,
    )
    goal_result = await AgentGoalAccuracyWithReference(llm=evaluator_llm).ascore(
        user_input=enriched["messages"],
        reference=row.get("reference", ""),
    )
    return {
        **enriched,
        "tool_call_f1": f1_result.value,
        "agent_goal_accuracy": goal_result.value,
    }
# Run the experiment
# In Jupyter/IPython
tool_result = await tool_experiment.arun(
    weather_queries,
    name="weather_tool_eval"
)
# Or in a script (after `import asyncio`)
# tool_result = asyncio.run(tool_experiment.arun(weather_queries, name="weather_tool_eval"))
tool_result.to_pandas()
If a request fails, the experiment logs the error and returns placeholder values for that sample so the experiment can continue with remaining samples.
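The same "log and continue" pattern can be applied to your own calls inside an experiment function. The sketch below wraps an arbitrary async call and returns placeholder values when it raises. The placeholder shapes (empty string and empty lists) and the names call_with_placeholder and flaky_agent are assumptions for illustration; they do not reproduce the integration's internal behavior.

```python
import asyncio
import logging

logger = logging.getLogger("ag_ui_experiment")


def placeholder_fields():
    """Fresh placeholder values for the fields an enriched row normally carries."""
    return {"response": "", "messages": [], "tool_calls": [], "contexts": []}


async def call_with_placeholder(call, row):
    """Await call(row); on failure, log the error and return placeholder values."""
    try:
        return await call(row)
    except Exception as exc:
        logger.warning("Request failed for %r: %s", row.get("user_input"), exc)
        return {**row, **placeholder_fields(), "error": str(exc)}


async def flaky_agent(row):
    """Stand-in for an endpoint call; fails when the input contains 'fail'."""
    if "fail" in row["user_input"]:
        raise ConnectionError("endpoint unreachable")
    return {**row, **placeholder_fields(), "response": "ok"}


async def main():
    rows = [{"user_input": "hello"}, {"user_input": "please fail"}]
    return await asyncio.gather(*(call_with_placeholder(flaky_agent, r) for r in rows))


results = asyncio.run(main())
```

Recording the error message in the row makes it possible to filter failed samples out of aggregate scores afterwards.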
Working directly with AG-UI events
Sometimes you may want to collect event logs separately—perhaps from a recorded run or a staging environment—and evaluate them offline. The conversion helpers expose the same parsing logic used by run_ag_ui_row().
from ragas.integrations.ag_ui import convert_to_ragas_messages
from ag_ui.core import TextMessageChunkEvent
events = [
    TextMessageChunkEvent(
        message_id="assistant-1",
        role="assistant",
        delta="Hello from AG-UI!",
        timestamp="2024-12-01T00:00:00Z",
    )
]
ragas_messages = convert_to_ragas_messages(events, metadata=True)
If you already have a MessagesSnapshotEvent you can skip streaming reconstruction and call convert_messages_snapshot.
from ragas.integrations.ag_ui import convert_messages_snapshot
from ag_ui.core import MessagesSnapshotEvent, UserMessage, AssistantMessage
snapshot = MessagesSnapshotEvent(
    messages=[
        UserMessage(id="msg-1", content="Hello?"),
        AssistantMessage(id="msg-2", content="Hi! How can I help you today?"),
    ]
)
ragas_messages = convert_messages_snapshot(snapshot)
The converted messages can be used to build custom evaluation workflows or passed directly to metric scoring functions.
Extraction helpers
The integration provides helper functions to extract specific data from messages:
from ragas.integrations.ag_ui import (
    convert_to_ragas_messages,  # Needed to turn raw events into messages first
    extract_response,  # Get concatenated AI response text
    extract_tool_calls,  # Get all tool calls from AI messages
    extract_contexts,  # Get tool results/contexts
)
# `events` is a list of AG-UI events collected from a run
messages = convert_to_ragas_messages(events)
response = extract_response(messages)  # e.g. "Hello! The weather is sunny."
tool_calls = extract_tool_calls(messages)  # e.g. [ToolCall(name="get_weather", args={"location": "SF"})]
contexts = extract_contexts(messages)  # e.g. ["Sunny, 72F in San Francisco"]
Tips for production experiments
- Custom headers: pass authentication tokens or tenant IDs via the extra_headers parameter to run_ag_ui_row().
- Timeouts: tune the timeout parameter if your agent performs long-running tool calls.
- Metadata debugging: set metadata=True to keep AG-UI run, thread, and message IDs on every message for easier traceability.
- Experiment naming: use descriptive name arguments to .arun() for easy identification of results.
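The tips above can be combined into one small helper that assembles the keyword arguments for run_ag_ui_row(). This is a sketch under stated assumptions: the header names Authorization and X-Tenant-ID are hypothetical, the timeout is assumed to be in seconds, metadata is assumed to be accepted by run_ag_ui_row() as the tip suggests, and build_run_kwargs and run_row are illustrative names rather than part of the integration.

```python
def build_run_kwargs(token=None, tenant_id=None, timeout=None, metadata=False):
    """Assemble keyword arguments for run_ag_ui_row() from optional settings."""
    kwargs = {"metadata": metadata}
    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"  # hypothetical auth scheme
    if tenant_id:
        headers["X-Tenant-ID"] = tenant_id  # hypothetical header name
    if headers:
        kwargs["extra_headers"] = headers
    if timeout is not None:
        kwargs["timeout"] = timeout  # assumed to be seconds
    return kwargs


async def run_row(row, endpoint_url, **kwargs):
    """Thin wrapper; imports lazily so this module loads without Ragas installed."""
    from ragas.integrations.ag_ui import run_ag_ui_row

    return await run_ag_ui_row(row, endpoint_url, **kwargs)


run_kwargs = build_run_kwargs(token="secret-token", tenant_id="acme", timeout=120.0, metadata=True)
# Inside an experiment function:
#     enriched = await run_row(row, "http://localhost:8000/chat", **run_kwargs)
```

Keeping the settings in one place makes it easier to switch between staging and production endpoints without editing each experiment function.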
For a complete production example, see examples/ragas_examples/ag_ui_agent_experiments/experiments.py which provides:
- CLI arguments for endpoint configuration
- CSV-based test datasets
- Proper logging and error handling
- Timestamped result output
An interactive walkthrough notebook is also available at howtos/integrations/ag_ui.ipynb.
API Reference
Primary API
- run_ag_ui_row(row, endpoint_url, ...) - Run a single row against an AG-UI endpoint and return enriched data with response, messages, tool_calls, and contexts.
Conversion Functions
- convert_to_ragas_messages(events, metadata=False) - Convert AG-UI event sequences to Ragas messages
- convert_messages_snapshot(snapshot, metadata=False) - Convert AG-UI message snapshots to Ragas messages
- convert_messages_to_ag_ui(messages) - Convert Ragas messages to AG-UI format
Extraction Helpers
- extract_response(messages) - Extract concatenated AI response text
- extract_tool_calls(messages) - Extract all tool calls from AI messages
- extract_contexts(messages) - Extract tool results/contexts from messages
Low-Level
- call_ag_ui_endpoint(endpoint_url, user_input, ...) - Call an AG-UI endpoint and collect streaming events
- AGUIEventCollector - Collect and reconstruct messages from streaming events