Workflow Evaluation Quickstart

The workflow_eval template evaluates complex LLM workflows with email classification and routing.

Create the Project

ragas quickstart workflow_eval
cd workflow_eval

Install Dependencies

uv sync

Set Your API Key

export OPENAI_API_KEY="your-openai-key"

Run the Evaluation

uv run python evals.py

Project Structure

workflow_eval/
├── README.md              # Project documentation
├── pyproject.toml         # Project configuration
├── workflow.py            # Workflow implementation
├── evals.py               # Evaluation workflow
├── __init__.py            # Python package marker
└── evals/
    ├── datasets/          # Test datasets
    ├── experiments/       # Evaluation results
    └── logs/              # Execution logs

What It Evaluates

The template evaluates a customer support email classification workflow:

Workflow: Multi-step email processing (classification → extraction → response)
Categories: Bug Report, Feature Request, Billing
Test Cases: Customer emails with expected categories and extracted fields
Metric: Custom discrete metric checking classification accuracy

Understanding the Code

The Workflow (`workflow.py`)

Implements a customer support email workflow:

from workflow import default_workflow_client

workflow = default_workflow_client()
result = workflow.process_email("I found a bug in version 2.1.4...")
# Returns: category, extracted fields, response

The Evaluation (`evals.py`)

Tests workflow accuracy against pass criteria:

def load_dataset():
    dataset_dict = [
        {
            "email": "Hi, I'm getting error code XYZ-123 when using version 2.1.4...",
            "pass_criteria": "category Bug Report; product_version 2.1.4; error_code XYZ-123",
        },
        # More test cases...
    ]

The metric evaluates if the workflow correctly: - Classifies the email category - Extracts relevant fields (version, error code, invoice number, etc.) - Generates appropriate responses

Test Cases

The template includes diverse scenarios:

Bug Reports: With version numbers and error codes
Feature Requests: With urgency levels and product areas
Billing Issues: With invoice numbers and amounts

Customization

Add Your Own Workflow

Replace the example workflow with your own:

from your_workflow import YourWorkflow

workflow = YourWorkflow()

@experiment()
async def run_experiment(row):
    result = await workflow.process(row["input"])
    # Evaluate result...

Next Steps

Agent Evaluation - Evaluate AI agents
LlamaIndex Agent Evaluation - Evaluate LlamaIndex workflows