RAG Evaluation Quickstart
The rag_eval template provides a complete RAG evaluation setup with custom metrics, dataset management, and experiment tracking.
Create the Project
# Using uvx (no installation required)
uvx ragas quickstart rag_eval
cd rag_eval
# Or with ragas installed
ragas quickstart rag_eval
cd rag_eval
Install Dependencies
Or with pip:
Set Your API Key
Update evals.py:
Run the Evaluation
The evaluation will:
- Load test data from the load_dataset() function
- Query your RAG application with test questions
- Evaluate responses using custom metrics
- Display results in the console
- Save results to CSV in evals/experiments/
Project Structure
rag_eval/
├── README.md           # Project documentation
├── pyproject.toml      # Project configuration
├── rag.py              # RAG application implementation
├── evals.py            # Evaluation workflow
├── __init__.py         # Python package marker
└── evals/
    ├── datasets/       # Test data files
    ├── experiments/    # Evaluation results (CSV)
    └── logs/           # Execution logs and traces
Understanding the Code
The RAG Application (rag.py)
A simple RAG implementation with:
- Document storage: In-memory document collection
- Keyword retrieval: Simple keyword matching for document retrieval
- Response generation: OpenAI API for generating answers
- Tracing: Logs each query for debugging
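from openai import OpenAI
from rag import default_rag_client
# Create the OpenAI client (reads OPENAI_API_KEY from the environment)
openai_client = OpenAI()
# Initialize with OpenAI client
rag_client = default_rag_client(llm_client=openai_client, logdir="evals/logs")
# Query the RAG system
response = rag_client.query("What is Ragas?")
print(response["answer"])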
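To make the keyword retrieval step listed above more concrete, here is a minimal sketch of one way such matching could work. It is not the template's actual rag.py code: the `tokenize` and `keyword_retrieve` helpers, the ranking by shared-word count, and the sample documents are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_retrieve(query, documents, top_k=2):
    """Return up to top_k documents ranked by how many query words they share."""
    query_words = tokenize(query)
    scored = []
    for doc in documents:
        overlap = len(query_words & tokenize(doc))
        if overlap > 0:  # skip documents with no shared words
            scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Hypothetical in-memory document collection
DOCUMENTS = [
    "Ragas is an evaluation framework for LLM applications.",
    "Experiments track results and compare runs.",
    "Bananas are yellow.",
]

print(keyword_retrieve("What is Ragas?", DOCUMENTS))
```

Word-overlap scoring is easy to debug but misses synonyms and paraphrases, which is one reason an evaluation loop is useful before swapping in a stronger retriever.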
The Evaluation Script (evals.py)
The evaluation workflow:
- Dataset loading: Creates test cases with questions and grading notes
- Metric definition: Custom DiscreteMetric for pass/fail evaluation
- Experiment execution: Runs queries and evaluates responses
- Result storage: Saves to CSV for analysis
from ragas import Dataset, experiment
from ragas.metrics import DiscreteMetric
# Define your metric
my_metric = DiscreteMetric(
    name="correctness",
    prompt="Check if the response contains points from grading notes...",
    allowed_values=["pass", "fail"],
)
# Run experiment
@experiment()
async def run_experiment(row):
    response = rag_client.query(row["question"])
    score = my_metric.score(llm=llm, response=response["answer"], ...)
    return {**row, "response": response["answer"], "score": score.value}
Customization
Add Test Cases
Edit the load_dataset() function in evals.py:
def load_dataset():
    dataset = Dataset(
        name="test_dataset",
        backend="local/csv",
        root_dir="evals",
    )
    data_samples = [
        {
            "question": "What is Ragas?",
            "grading_notes": "- evaluation framework - LLM applications",
        },
        {
            "question": "How do experiments work?",
            "grading_notes": "- track results - compare runs - store metrics",
        },
        # Add more test cases...
    ]
    for sample in data_samples:
        dataset.append(sample)
    dataset.save()
    return dataset
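As the number of test cases grows, keeping them in a file can be easier than editing a Python list. The sketch below uses only the standard library to read rows into the same list-of-dicts shape as data_samples. The helper name `load_samples_from_csv`, the required-column check, and the file path mentioned in the comment are hypothetical and not part of the template.

```python
import csv
import io

REQUIRED_COLUMNS = ("question", "grading_notes")


def load_samples_from_csv(file_obj):
    """Read test cases from a CSV file object into a list of dicts."""
    reader = csv.DictReader(file_obj)
    samples = []
    for row in reader:
        # Treat absent or blank cells as missing
        missing = [c for c in REQUIRED_COLUMNS if not (row.get(c) or "").strip()]
        if missing:
            raise ValueError(f"line {reader.line_num}: missing {missing}")
        samples.append({c: row[c].strip() for c in REQUIRED_COLUMNS})
    return samples


# In-memory stand-in for a file such as evals/datasets/my_cases.csv (hypothetical)
csv_text = (
    "question,grading_notes\n"
    "What is Ragas?,- evaluation framework - LLM applications\n"
    "How do experiments work?,- track results - compare runs - store metrics\n"
)
data_samples = load_samples_from_csv(io.StringIO(csv_text))
print(len(data_samples))
```

The resulting list can then be appended to the dataset with the same loop used in load_dataset().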
Modify the Metric
Change evaluation criteria by updating the metric prompt:
my_metric = DiscreteMetric(
    name="quality",
    prompt="""Evaluate the response quality:
Response: {response}
Expected Points: {grading_notes}
Rate as:
- 'excellent': All points covered with clear explanation
- 'good': Most points covered
- 'poor': Missing key points
Rating:""",
    allowed_values=["excellent", "good", "poor"],
)
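The {response} and {grading_notes} names in the prompt look like placeholders that are filled from the keyword arguments passed when scoring. Before spending LLM calls, it can help to preview a rendered prompt and catch a misspelled placeholder. The sketch below does this with plain Python string formatting; that Ragas renders prompts the same way is an assumption, and `placeholder_names` and `preview_prompt` are hypothetical helpers.

```python
from string import Formatter

PROMPT_TEMPLATE = """Evaluate the response quality:
Response: {response}
Expected Points: {grading_notes}
Rating:"""


def placeholder_names(template):
    """Return the set of {placeholder} names used in a prompt template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def preview_prompt(template, **fields):
    """Render the template, failing early if a placeholder has no value."""
    missing = placeholder_names(template) - set(fields)
    if missing:
        raise KeyError(f"missing prompt fields: {sorted(missing)}")
    return template.format(**fields)


rendered = preview_prompt(
    PROMPT_TEMPLATE,
    response="Ragas is an evaluation framework.",
    grading_notes="- evaluation framework - LLM applications",
)
print(rendered)
```

Comparing `placeholder_names(...)` with the columns of your dataset is a quick way to confirm that every field the prompt expects is actually available in each row.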
Add Multiple Metrics
Create additional metrics for different evaluation aspects:
from ragas.metrics import DiscreteMetric, NumericalMetric
correctness = DiscreteMetric(
    name="correctness",
    prompt="Is the response factually correct? {response}",
    allowed_values=["correct", "incorrect"],
)
relevance = NumericalMetric(
    name="relevance",
    prompt="Rate relevance 1-5: {response} for question: {question}",
    allowed_values=(1, 5),
)
Use Your Own RAG System
Replace the example RAG with your production system:
# In evals.py
from your_rag_module import YourRAGClient
rag_client = YourRAGClient(...)
@experiment()
async def run_experiment(row):
    # Call your RAG system
    response = await rag_client.query(row["question"])
    score = my_metric.score(
        llm=llm,
        response=response,
        grading_notes=row["grading_notes"],
    )
    return {
        **row,
        "response": response,
        "score": score.value,
    }
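The example above awaits rag_client.query(), which only works if your client exposes an async method. If your production client is synchronous, one option is a thin adapter that runs the blocking call in a worker thread. The sketch below assumes Python 3.9+ for asyncio.to_thread; `SyncRAGClient` and `AsyncRAGAdapter` are hypothetical names standing in for your own code.

```python
import asyncio


class SyncRAGClient:
    """Hypothetical stand-in for a production client with a blocking query()."""

    def query(self, question):
        return f"answer to: {question}"


class AsyncRAGAdapter:
    """Expose a blocking client's query() as an awaitable method."""

    def __init__(self, client):
        self._client = client

    async def query(self, question):
        # Run the blocking call in a worker thread so the event loop stays free
        return await asyncio.to_thread(self._client.query, question)


async def main():
    rag_client = AsyncRAGAdapter(SyncRAGClient())
    questions = ["What is Ragas?", "How do experiments work?"]
    # gather() returns results in the same order as its inputs
    return await asyncio.gather(*(rag_client.query(q) for q in questions))


answers = asyncio.run(main())
print(answers)
```

With an adapter like this, the experiment function can keep its `await rag_client.query(...)` line unchanged regardless of how the underlying client is implemented.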
Viewing Results
Results are saved to evals/experiments/ as CSV files. Each experiment run creates a new file with:
- Input data (questions, grading notes)
- Model responses
- Evaluation scores
- Timestamps
import pandas as pd
# Load results
results = pd.read_csv("evals/experiments/your_experiment.csv")
# Calculate pass rate
pass_rate = (results["score"] == "pass").mean()
print(f"Pass rate: {pass_rate:.1%}")
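For metrics with more than two allowed values, such as excellent/good/poor, a single pass rate hides detail. The sketch below computes the share of each score value using only the standard library. The `score_distribution` helper is hypothetical, the "score" column name follows the example above, and the sample rows are made up for illustration rather than real results.

```python
import csv
import io
from collections import Counter


def score_distribution(file_obj, column="score"):
    """Return {score_value: fraction_of_rows} for one column of a results CSV."""
    counts = Counter(row[column] for row in csv.DictReader(file_obj))
    total = sum(counts.values())
    if total == 0:
        return {}  # no data rows
    return {value: count / total for value, count in counts.items()}


# Made-up rows standing in for a file in evals/experiments/
sample_csv = (
    "question,response,score\n"
    "q1,a1,pass\n"
    "q2,a2,fail\n"
    "q3,a3,pass\n"
    "q4,a4,pass\n"
)
distribution = score_distribution(io.StringIO(sample_csv))
for value, fraction in sorted(distribution.items()):
    print(f"{value}: {fraction:.1%}")
```

To use it on a real run, pass an opened results file instead of the in-memory sample.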
Next Steps
- Improve RAG Guide - Compare naive vs agentic RAG
- Custom Metrics - Write your own metrics
- Datasets - Learn about dataset management
- Experimentation - Advanced experiment tracking