LLM Benchmarking Quickstart
The benchmark_llm template benchmarks and compares LLMs on a discount calculation task.
Create the Project
Install Dependencies
Set Your API Keys
Run the Evaluation
To benchmark a specific model, pass its name with the --model flag, for example: uv run python evals.py --model gpt-4o
Project Structure
benchmark_llm/
├── README.md # Project documentation
├── pyproject.toml # Project configuration
├── prompt.py # Prompt implementation
├── evals.py # Evaluation workflow
├── __init__.py # Python package marker
└── evals/
    ├── datasets/
    │   └── discount_benchmark.csv # Customer profiles and expected discounts
    ├── experiments/ # Evaluation results
    └── logs/ # Execution logs
What It Evaluates
The template benchmarks LLM performance on structured output tasks:
- Task: Calculate customer discount percentages based on profile
- Models: Compare GPT-4, GPT-3.5, Claude, Gemini, etc.
- Output Format: JSON with discount percentage
- Metric: Discount accuracy (correct/incorrect)
Understanding the Code
The Prompt (prompt.py)
Calculates discounts from customer profiles:
import asyncio
from prompt import run_prompt

async def main():
    profile = "Premium customer, 5 years tenure, $50k annual spend"
    return await run_prompt(profile, model="gpt-4o")

result = asyncio.run(main())  # Returns: {"discount_percentage": 15}
The Evaluation (evals.py)
Benchmarks model accuracy:
import json

@discrete_metric(name="discount_accuracy", allowed_values=["correct", "incorrect"])
def discount_accuracy(prediction: str, expected_discount):
    # The model reply is a JSON string; extract the predicted percentage.
    parsed_json = json.loads(prediction)
    predicted_discount = parsed_json.get("discount_percentage")
    # Exact match against the expected value from the dataset.
    if predicted_discount == int(expected_discount):
        return MetricResult(value="correct", ...)
    else:
        return MetricResult(value="incorrect", ...)
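The metric above passes the model reply straight to json.loads, so a reply that is not valid JSON would raise instead of being scored "incorrect". The following is a minimal sketch of a more defensive parser, assuming the reply is a string that should contain an object with a discount_percentage key. The names parse_discount and score are hypothetical helpers and are not part of the template.

```python
import json
from typing import Optional


def parse_discount(prediction: str) -> Optional[int]:
    """Return the predicted discount as an int, or None if it cannot be read.

    Hypothetical helper, not part of the template.
    """
    try:
        parsed = json.loads(prediction)
    except (json.JSONDecodeError, TypeError):
        return None  # malformed or non-string output
    if not isinstance(parsed, dict):
        return None  # valid JSON, but not an object
    value = parsed.get("discount_percentage")
    if isinstance(value, bool):
        return None  # bool is a subclass of int; reject it explicitly
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)  # accept 15.0 as 15
    return None


def score(prediction: str, expected_discount) -> str:
    """Map a raw model reply to "correct" or "incorrect" without raising."""
    predicted = parse_discount(prediction)
    return "correct" if predicted == int(expected_discount) else "incorrect"
```

With this shape, a malformed reply counts against the model's accuracy rather than aborting the whole benchmark run. Whether a string value such as "15" should also be accepted is a design choice; this sketch rejects it.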
Test Data
The template includes evals/datasets/discount_benchmark.csv with:
- Customer profiles (tenure, spend, tier, etc.)
- Expected discount percentages
- Business rules for discount calculation
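As a rough illustration of working with a dataset of this kind, the sketch below reads rows with the standard csv module. The column names customer_profile and expected_discount are assumptions, and the sample rows are made up for the example; check the header of the real file before relying on either.

```python
import csv
import io

# Made-up sample standing in for discount_benchmark.csv; the column names
# "customer_profile" and "expected_discount" are assumptions.
SAMPLE_CSV = """customer_profile,expected_discount
"Premium customer, 5 years tenure, $50k annual spend",15
"New customer, 0 years tenure, $1k annual spend",0
"""


def load_rows(handle):
    """Read benchmark rows, converting the expected discount to int."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append(
            {
                "customer_profile": record["customer_profile"].strip(),
                "expected_discount": int(record["expected_discount"]),
            }
        )
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["expected_discount"])
```

To read the real file, pass open("evals/datasets/discount_benchmark.csv", newline="") instead of the in-memory sample.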
Benchmarking Multiple Models
Run the same evaluation across different models:
# GPT-4o
uv run python evals.py --model gpt-4o
# GPT-3.5
uv run python evals.py --model gpt-3.5-turbo
# Claude
uv run python evals.py --model claude-3-5-sonnet-20241022
# Compare results
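The per-model commands above can also be driven from a small Python script. This is a sketch only: it assumes uv is on the PATH and that it is run from the project directory, and build_command and run_all are hypothetical helper names.

```python
import subprocess

# Model names taken from the commands above; extend the list as needed.
MODELS = ["gpt-4o", "gpt-3.5-turbo", "claude-3-5-sonnet-20241022"]


def build_command(model: str) -> list[str]:
    """Return the argv list for one benchmark run."""
    return ["uv", "run", "python", "evals.py", "--model", model]


def run_all(models: list[str]) -> dict[str, int]:
    """Run each benchmark in turn and collect process exit codes."""
    exit_codes = {}
    for model in models:
        # check=False so one failing model does not stop the remaining runs.
        completed = subprocess.run(build_command(model), check=False)
        exit_codes[model] = completed.returncode
    return exit_codes


# Print the commands that would be executed; call run_all(MODELS) to run them.
for model in MODELS:
    print(" ".join(build_command(model)))
```

Passing the command as a list rather than a single shell string avoids quoting problems if a model name ever contains unusual characters.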
Customization
Add Your Own Task
Modify the prompt to benchmark different capabilities:
# Code generation
prompt = "Generate Python code to {task}"
# Summarization
prompt = "Summarize this text in 50 words: {text}"
# Classification
prompt = "Classify this email as spam/not-spam: {email}"
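The templates above contain placeholders such as {task} and {text}. One possible way to organize them, sketched here with a hypothetical TASK_PROMPTS registry and build_prompt helper, is to keep the templates in a dictionary and fill them with str.format.

```python
# Hypothetical registry of task templates, reusing the strings above.
TASK_PROMPTS = {
    "code_generation": "Generate Python code to {task}",
    "summarization": "Summarize this text in 50 words: {text}",
    "classification": "Classify this email as spam/not-spam: {email}",
}


def build_prompt(task_name: str, **fields: str) -> str:
    """Fill the template for task_name.

    Raises KeyError for an unknown task or a missing placeholder value.
    """
    return TASK_PROMPTS[task_name].format(**fields)


print(build_prompt("summarization", text="Quarterly revenue rose while costs fell."))
```

Because only the template is parsed by str.format, braces inside the user-supplied text are passed through unchanged.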
Compare Cost and Latency
Track additional metrics:
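import time
from prompt import run_prompt

async def timed_run(profile, model_name):
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    response = await run_prompt(profile, model=model_name)
    latency = time.perf_counter() - start
    # Log cost and latency alongside accuracy
    return response, latency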
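Once latency is measured per call, it can be stored next to the accuracy label and summarized per model. The sketch below is one way to do that under stated assumptions: RunRecord, estimate_cost, and summarize are hypothetical names, token counts are assumed to be available from the provider's response, and the prices used are placeholders rather than real provider rates.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One benchmark call (hypothetical structure, not part of the template)."""
    model: str
    correct: bool
    latency_s: float
    input_tokens: int
    output_tokens: int


def estimate_cost(record, input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one call from token counts and per-1k-token prices."""
    return (
        record.input_tokens / 1000 * input_price_per_1k
        + record.output_tokens / 1000 * output_price_per_1k
    )


def summarize(records, input_price_per_1k, output_price_per_1k):
    """Aggregate accuracy, mean latency, and total cost for a non-empty list."""
    return {
        "accuracy": mean(1.0 if r.correct else 0.0 for r in records),
        "mean_latency_s": mean(r.latency_s for r in records),
        "total_cost": sum(
            estimate_cost(r, input_price_per_1k, output_price_per_1k) for r in records
        ),
    }


# Made-up records and placeholder prices, for illustration only.
records = [
    RunRecord("gpt-4o", True, 0.5, 1000, 100),
    RunRecord("gpt-4o", False, 1.5, 1000, 100),
]
summary = summarize(records, input_price_per_1k=1.0, output_price_per_1k=2.0)
print(summary)
```

Note that statistics.mean raises StatisticsError on an empty list, so guard the call if a run can produce no records.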
Analyzing Results
Compare model performance:
import pandas as pd
gpt4_results = pd.read_csv("evals/experiments/gpt4_benchmark.csv")
gpt35_results = pd.read_csv("evals/experiments/gpt35_benchmark.csv")
print(f"GPT-4 Accuracy: {(gpt4_results['discount_accuracy'] == 'correct').mean():.1%}")
print(f"GPT-3.5 Accuracy: {(gpt35_results['discount_accuracy'] == 'correct').mean():.1%}")
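The two-model comparison above generalizes to any number of models. The following sketch ranks models by accuracy given a mapping from model name to a list of metric labels; accuracy and rank_models are hypothetical helpers, and the labels shown are made up for illustration.

```python
def accuracy(labels):
    """Fraction of labels equal to "correct"; 0.0 for an empty list."""
    if not labels:
        return 0.0
    return sum(label == "correct" for label in labels) / len(labels)


def rank_models(results):
    """Return (model, accuracy) pairs sorted from most to least accurate."""
    scored = [(model, accuracy(labels)) for model, labels in results.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up labels for illustration only; real values would come from the
# discount_accuracy column of each experiment CSV.
example = {
    "model_a": ["correct", "incorrect", "correct", "correct"],
    "model_b": ["correct", "incorrect", "incorrect", "incorrect"],
}
for model, acc in rank_models(example):
    print(f"{model}: {acc:.1%}")
```

With pandas, the label list for one model can be obtained from a results DataFrame via its discount_accuracy column and tolist().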
Next Steps
- Judge Alignment - Measure judge alignment
- Prompt Evaluation - Compare different prompts