Evaluating LlamaStack Web Search Groundedness with Llama 4
In this tutorial we will measure the groundedness of responses generated by LlamaStack's web search agent. LlamaStack is an open-source framework maintained by Meta that streamlines the development and deployment of large language model-powered applications. The evaluation will be done using Ragas metrics, with Meta Llama 4 Maverick as the judge LLM.
Setup and Running a LlamaStack server
This command installs all the dependencies needed for the LlamaStack server with the Together inference provider.
Use this command with conda:
!pip install ragas langchain-together uv
!uv run --with llama-stack llama stack build --template together --image-type conda
Use this command with venv:
!pip install ragas langchain-together uv
!uv run --with llama-stack llama stack build --template together --image-type venv
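Before building and running the server, make sure the required API keys are available in your environment. The exact variable names depend on your distribution template; as a sketch, the Together template typically reads TOGETHER_API_KEY for inference, and the built-in web search tool commonly relies on a Tavily key.
import os

# Assumed variable names: TOGETHER_API_KEY for the Together inference provider
# and TAVILY_SEARCH_API_KEY for the builtin::websearch tool. Replace the
# placeholders with your own keys.
os.environ["TOGETHER_API_KEY"] = "your-together-api-key"
os.environ["TAVILY_SEARCH_API_KEY"] = "your-tavily-search-api-key"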
import os
import subprocess


def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack run together --image-type venv",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True,
    )

    print(f"Starting LlamaStack server with PID: {process.pid}")
    return process
def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError
    import time

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            print(".", end="", flush=True)
            time.sleep(retry_interval)

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False
# use this helper if needed to kill the server
def kill_llama_stack_server():
    # Kill any existing llama stack server processes
    os.system(
        "ps aux | grep -v grep | grep llama_stack.distribution.server.server | awk '{print $2}' | xargs kill -9"
    )
Starting the LlamaStack Server
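With the helpers defined above, a minimal way to launch the server in the background and wait for its health endpoint to respond looks like this:
# Start the server and block until the health check at port 8321 succeeds.
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()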
Building a Search Agent
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
)

agent = Agent(
    client,
    model="meta-llama/Llama-3.1-8B-Instruct",
    instructions="You are a helpful assistant. Use web search tool to answer the questions.",
    tools=["builtin::websearch"],
)
user_prompts = [
    "In which major did Demis Hassabis complete his undergraduate degree? Search the web for the answer.",
    "Ilya Sutskever is one of the key figures in AI. From which institution did he earn his PhD in machine learning? Search the web for the answer.",
    "Sam Altman, widely known for his role at OpenAI, was born in which American city? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in AgentEventLogger().log(response):
        log.print()
Now, let's look deeper into the agent's execution steps and see how well our agent performs.
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)
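Each turn in the retrieved session records the input message, the intermediate steps (including tool executions), and the final output message. A quick way to see what the agent actually did is to iterate over the turns and print the step types, as in this small sketch:
# Print a short summary of each turn: the question, the steps the agent took,
# and the final answer (truncated for readability).
for turn in session_response.turns:
    print("User:", turn.input_messages[0].content[:80])
    for step in turn.steps:
        print("  step:", step.step_type)
    print("  Answer:", turn.output_message.content[:80])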
Evaluate Agent Responses
We want to measure the groundedness of the responses generated by the LlamaStack web search agent. To do this we need an EvaluationDataset and metrics to assess how well the responses are grounded. Ragas provides a wide array of off-the-shelf metrics that can be used to measure various aspects of retrieval and generation.
For measuring the groundedness of responses we will use AnswerAccuracy, Faithfulness, and ResponseGroundedness.
Constructing a Ragas EvaluationDataset
To perform evaluations using Ragas we will create an EvaluationDataset from the agent's turns.
import json


# This function extracts the search results from the trace of each query
def extract_retrieved_contexts(turn_object):
    results = []
    for step in turn_object.steps:
        if step.step_type == "tool_execution":
            tool_responses = step.tool_responses
            for response in tool_responses:
                content = response.content
                if content:
                    try:
                        parsed_result = json.loads(content)
                        results.append(parsed_result)
                    except json.JSONDecodeError:
                        print("Warning: Unable to parse tool response content as JSON.")
                        continue

    retrieved_context = []
    for result in results:
        top_content_list = [item["content"] for item in result["top_k"]]
        retrieved_context.extend(top_content_list)

    return retrieved_context
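As a quick sanity check, you can run the helper on a single turn before building the full dataset; the exact output depends on what the web search returned during your run.
# Peek at the contexts recovered from the first turn of the session.
example_contexts = extract_retrieved_contexts(session_response.turns[0])
print(f"Retrieved {len(example_contexts)} context snippets")
if example_contexts:
    print(example_contexts[0][:200])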
from ragas.dataset_schema import EvaluationDataset

samples = []

references = [
    "Demis Hassabis completed his undergraduate degree in Computer Science.",
    "Ilya Sutskever earned his PhD from the University of Toronto.",
    "Sam Altman was born in Chicago, Illinois.",
]

for i, turn in enumerate(session_response.turns):
    samples.append(
        {
            "user_input": turn.input_messages[0].content,
            "response": turn.output_message.content,
            "reference": references[i],
            "retrieved_contexts": extract_retrieved_contexts(turn),
        }
    )

ragas_eval_dataset = EvaluationDataset.from_list(samples)
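To inspect the dataset we just built, it can be viewed as a DataFrame (EvaluationDataset exposes a to_pandas helper); the table below shows the result.
# Display the evaluation dataset as a pandas DataFrame.
ragas_eval_dataset.to_pandas()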
| | user_input | retrieved_contexts | response | reference |
|---|---|---|---|---|
| 0 | In which major did Demis Hassabis complete his... | [Demis Hassabis holds a Bachelor's degree in C... | Demis Hassabis completed his undergraduate deg... | Demis Hassabis completed his undergraduate deg... |
| 1 | Ilya Sutskever is one of the key figures in AI... | [Jump to content Main menu Search Donate Creat... | Ilya Sutskever earned his PhD in machine learn... | Ilya Sutskever earned his PhD from the Univers... |
| 2 | Sam Altman, widely known for his role at OpenA... | [Sam Altman \| Biography, OpenAI, Microsoft, & ... | Sam Altman was born in Chicago, Illinois, USA. | Sam Altman was born in Chicago, Illinois. |
Setting the Ragas Metrics
from ragas.metrics import AnswerAccuracy, Faithfulness, ResponseGroundedness
from langchain_together import ChatTogether
from ragas.llms import LangchainLLMWrapper

llm = ChatTogether(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
)
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
    AnswerAccuracy(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]
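ChatTogether picks up the Together API key from the TOGETHER_API_KEY environment variable set earlier. If you want the judge to behave more consistently across runs, you can also lower the sampling temperature; a minimal sketch, assuming the standard LangChain temperature parameter:
# Optional: a more deterministic judge configuration.
llm = ChatTogether(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    temperature=0,
)
evaluator_llm = LangchainLLMWrapper(llm)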
Evaluation
Finally, let's run the evaluation.
from ragas import evaluate
results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
results.to_pandas()
| | user_input | retrieved_contexts | response | reference | nv_accuracy | faithfulness | nv_response_groundedness |
|---|---|---|---|---|---|---|---|
| 0 | In which major did Demis Hassabis complete his... | [Demis Hassabis holds a Bachelor's degree in C... | Demis Hassabis completed his undergraduate deg... | Demis Hassabis completed his undergraduate deg... | 1.0 | 1.0 | 1.00 |
| 1 | Ilya Sutskever is one of the key figures in AI... | [Jump to content Main menu Search Donate Creat... | Ilya Sutskever earned his PhD in machine learn... | Ilya Sutskever earned his PhD from the Univers... | 1.0 | 0.5 | 0.75 |
| 2 | Sam Altman, widely known for his role at OpenA... | [Sam Altman \| Biography, OpenAI, Microsoft, & ... | Sam Altman was born in Chicago, Illinois, USA. | Sam Altman was born in Chicago, Illinois. | 1.0 | 1.0 | 1.00 |