LlamaIndex Agent Evaluation Quickstart
The llamaIndex_agent_evals template evaluates LlamaIndex workflow agents with tool call accuracy metrics.
Create the Project
Install Dependencies
Set Your API Keys
Run the Evaluation
Project Structure
llamaIndex_agent_evals/
├── README.md # Project documentation
├── pyproject.toml # Project configuration
├── llamaindex_agent.py # LlamaIndex agent with tools
├── evals.py # Evaluation workflow
├── __init__.py # Python package marker
└── evals/
├── datasets/
│ └── contexts/ # Test context files (JSON)
├── experiments/ # Evaluation results
└── logs/ # Execution logs
What It Evaluates
The template evaluates a LlamaIndex agent's tool calling accuracy:
- Agent: LlamaIndex
FunctionAgentwith list management tools (add, remove, list items) - Test Cases: Complex scenarios like duplicate additions, ambiguous removal requests
- Metrics: Tool call accuracy, response correctness
Understanding the Code
The Agent (llamaindex_agent.py)
LlamaIndex agent with simple tools:
from llama_index.core.agent.workflow import FunctionAgent
agent = FunctionAgent(
name="list_manager",
tools=[add_item, remove_item, list_items],
llm=llm
)
The Evaluation (evals.py)
Tests tool call accuracy using F1 score:
@numeric_metric(name="tool_call_accuracy")
def tool_call_accuracy_metric(predicted_calls: List[Dict], ground_truth_calls: List[Dict]):
# Compares predicted vs ground truth tool calls
# Returns F1 score between 0.0 and 1.0
Test Data
The template includes JSON test contexts in evals/datasets/contexts/:
ambiguous_removal_request.json- Tests handling of ambiguous requestsduplicate_addition.json- Tests handling of duplicate operationsrepeated_removal.json- Tests repeated operations
Next Steps
- Agent Evaluation - Evaluate general AI agents
- Workflow Evaluation - Evaluate complex workflows