Evaluating LlamaStack Web Search Groundedness with Llama 4
In this tutorial we will measure the groundedness of responses generated by LlamaStack's web search agent. LlamaStack is an open-source framework maintained by Meta that streamlines the development and deployment of large language model-powered applications. The evaluation will be done using Ragas metrics, with Meta Llama 4 Maverick as the judge LLM.
Setup and Running a LlamaStack server
This command installs all the dependencies needed for the LlamaStack server with the Together inference provider.
Use this command with conda:
!pip install ragas langchain-together uv
!uv run --with llama-stack llama stack build --template together --image-type conda
Use this command with venv:
!pip install ragas langchain-together uv
!uv run --with llama-stack llama stack build --template together --image-type venv
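Before building and running the server, make sure the required API keys are available in your environment. The exact variable names depend on your distribution template; as a sketch, the Together template typically reads TOGETHER_API_KEY for inference, and the built-in web search tool commonly relies on a Tavily key.
import os

# Assumed variable names: TOGETHER_API_KEY for the Together inference provider
# and TAVILY_SEARCH_API_KEY for the builtin::websearch tool. Replace the
# placeholders with your own keys.
os.environ["TOGETHER_API_KEY"] = "your-together-api-key"
os.environ["TAVILY_SEARCH_API_KEY"] = "your-tavily-search-api-key"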
import os
import subprocess


def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack run together --image-type venv",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True,
    )

    print(f"Starting LlamaStack server with PID: {process.pid}")
    return process
def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError
    import time

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            print(".", end="", flush=True)
            time.sleep(retry_interval)

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False
# use this helper if needed to kill the server
def kill_llama_stack_server():
    # Kill any existing llama stack server processes
    os.system(
        "ps aux | grep -v grep | grep llama_stack.distribution.server.server | awk '{print $2}' | xargs kill -9"
    )
Starting the LlamaStack Server
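With the helpers defined above, a minimal way to launch the server in the background and wait for its health endpoint to respond looks like this:
# Start the server and block until the health check at port 8321 succeeds.
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()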
Building a Search Agent
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
)

agent = Agent(
    client,
    model="meta-llama/Llama-3.1-8B-Instruct",
    instructions="You are a helpful assistant. Use web search tool to answer the questions.",
    tools=["builtin::websearch"],
)
user_prompts = [
    "In which major did Demis Hassabis complete his undergraduate degree? Search the web for the answer.",
    "Ilya Sutskever is one of the key figures in AI. From which institution did he earn his PhD in machine learning? Search the web for the answer.",
    "Sam Altman, widely known for his role at OpenAI, was born in which American city? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in AgentEventLogger().log(response):
        log.print()
Now, let's look deeper into the agent's execution steps and see how well our agent performs.
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)
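Each turn in the retrieved session records the input message, the intermediate steps (including tool executions), and the final output message. A quick way to see what the agent actually did is to iterate over the turns and print the step types, as in this small sketch:
# Print a short summary of each turn: the question, the steps the agent took,
# and the final answer (truncated for readability).
for turn in session_response.turns:
    print("User:", turn.input_messages[0].content[:80])
    for step in turn.steps:
        print("  step:", step.step_type)
    print("  Answer:", turn.output_message.content[:80])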
Evaluate Agent Responses
We want to measure the groundedness of the responses generated by the LlamaStack web search agent. To do this we need an EvaluationDataset and metrics to assess how well the responses are grounded. Ragas provides a wide array of off-the-shelf metrics that can be used to measure various aspects of retrieval and generation.
For measuring the groundedness of responses we will use AnswerAccuracy, Faithfulness, and ResponseGroundedness.
Constructing a Ragas EvaluationDataset
To perform evaluations using Ragas we will create an EvaluationDataset from the agent's turns.
import json


# This function extracts the search results from the trace of each query
def extract_retrieved_contexts(turn_object):
    results = []
    for step in turn_object.steps:
        if step.step_type == "tool_execution":
            tool_responses = step.tool_responses
            for response in tool_responses:
                content = response.content
                if content:
                    try:
                        parsed_result = json.loads(content)
                        results.append(parsed_result)
                    except json.JSONDecodeError:
                        print("Warning: Unable to parse tool response content as JSON.")
                        continue

    retrieved_context = []
    for result in results:
        top_content_list = [item["content"] for item in result["top_k"]]
        retrieved_context.extend(top_content_list)

    return retrieved_context
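As a quick sanity check, you can run the helper on a single turn before building the full dataset; the exact output depends on what the web search returned during your run.
# Peek at the contexts recovered from the first turn of the session.
example_contexts = extract_retrieved_contexts(session_response.turns[0])
print(f"Retrieved {len(example_contexts)} context snippets")
if example_contexts:
    print(example_contexts[0][:200])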
from ragas.dataset_schema import EvaluationDataset

samples = []

references = [
    "Demis Hassabis completed his undergraduate degree in Computer Science.",
    "Ilya Sutskever earned his PhD from the University of Toronto.",
    "Sam Altman was born in Chicago, Illinois.",
]

for i, turn in enumerate(session_response.turns):
    samples.append(
        {
            "user_input": turn.input_messages[0].content,
            "response": turn.output_message.content,
            "reference": references[i],
            "retrieved_contexts": extract_retrieved_contexts(turn),
        }
    )

ragas_eval_dataset = EvaluationDataset.from_list(samples)
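To inspect the dataset we just built, it can be viewed as a DataFrame (EvaluationDataset exposes a to_pandas helper); the table below shows the result.
# Display the evaluation dataset as a pandas DataFrame.
ragas_eval_dataset.to_pandas()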
| | user_input | retrieved_contexts | response | reference |
|---|---|---|---|---|
| 0 | In which major did Demis Hassabis complete his... | [Demis Hassabis holds a Bachelor's degree in C... | Demis Hassabis completed his undergraduate deg... | Demis Hassabis completed his undergraduate deg... |
| 1 | Ilya Sutskever is one of the key figures in AI... | [Jump to content Main menu Search Donate Creat... | Ilya Sutskever earned his PhD in machine learn... | Ilya Sutskever earned his PhD from the Univers... |
| 2 | Sam Altman, widely known for his role at OpenA... | [Sam Altman \| Biography, OpenAI, Microsoft, & ... | Sam Altman was born in Chicago, Illinois, USA. | Sam Altman was born in Chicago, Illinois. |
Setting the Ragas Metrics
from ragas.metrics import AnswerAccuracy, Faithfulness, ResponseGroundedness
from langchain_together import ChatTogether
from ragas.llms import LangchainLLMWrapper

llm = ChatTogether(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
)
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
    AnswerAccuracy(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]
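ChatTogether picks up the Together API key from the TOGETHER_API_KEY environment variable set earlier. If you want the judge to behave more consistently across runs, you can also lower the sampling temperature; a minimal sketch, assuming the standard LangChain temperature parameter:
# Optional: a more deterministic judge configuration.
llm = ChatTogether(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    temperature=0,
)
evaluator_llm = LangchainLLMWrapper(llm)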
Evaluation
Finally, let's run the evaluation.
from ragas import evaluate
results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
results.to_pandas()
| | user_input | retrieved_contexts | response | reference | nv_accuracy | faithfulness | nv_response_groundedness |
|---|---|---|---|---|---|---|---|
| 0 | In which major did Demis Hassabis complete his... | [Demis Hassabis holds a Bachelor's degree in C... | Demis Hassabis completed his undergraduate deg... | Demis Hassabis completed his undergraduate deg... | 1.0 | 1.0 | 1.00 |
| 1 | Ilya Sutskever is one of the key figures in AI... | [Jump to content Main menu Search Donate Creat... | Ilya Sutskever earned his PhD in machine learn... | Ilya Sutskever earned his PhD from the Univers... | 1.0 | 0.5 | 0.75 |
| 2 | Sam Altman, widely known for his role at OpenA... | [Sam Altman \| Biography, OpenAI, Microsoft, & ... | Sam Altman was born in Chicago, Illinois, USA. | Sam Altman was born in Chicago, Illinois. | 1.0 | 1.0 | 1.00 |