Experiments
What is an experiment?
An experiment is a deliberate change made to your application to test a hypothesis or idea. For example, in a Retrieval-Augmented Generation (RAG) system, you might replace the retriever model to evaluate how a new embedding model impacts chatbot responses.
Principles of a Good Experiment
- Define measurable metrics: Use metrics like accuracy, precision, or recall to quantify the impact of your changes.
- Systematic result storage: Ensure results are stored in an organized manner for easy comparison and tracking.
- Isolate changes: Make one change at a time to identify its specific impact. Avoid making multiple changes simultaneously, as this can obscure the results.
- Iterative process: Follow a structured approach: *Make a change โ Run evaluations โ Observe results โ
graph LR A[Make a change] --> B[Run evaluations] B --> C[Observe results] C --> D[Hypothesize next change] D --> A
Experiments in Ragas
Components of an Experiment
- Test dataset: The data used to evaluate the system.
- Application endpoint: The application, component or model being tested.
- Metrics: Quantitative measures to assess performance.
Execution Process
Running an experiment involves:
- Executing the dataset against the application endpoint.
- Calculating metrics to quantify performance.
- Returning and storing the results.
Using the @experiment Decorator
The @experiment decorator in Ragas simplifies the orchestration, scaling, and storage of experiments. Here's an example:
from ragas import experiment
# Define your metric and dataset
my_metric = ...
dataset = ...
@experiment
async def my_experiment(row):
# Process the query through your application
response = my_app(row.query)
# Calculate the metric
metric = my_metric.score(response, row.ground_truth)
# Return results
return {**row, "response": response, "accuracy": metric.value}
# Run the experiment
my_experiment.arun(dataset)
Passing Additional Parameters
You can pass additional parameters to your experiment function through arun(). This is useful for models, configurations, or any other parameters your experiment needs:
@experiment
async def my_experiment(row, model):
# Process the query with the specified parameters
response = my_app(row.query, model=model)
# Calculate the metric
metric = my_metric.score(response, row.ground_truth)
# Return results
return {**row, "response": response, "accuracy": metric.value}
# Run with specific parameters
my_experiment.arun(dataset, "gpt-4")
# Or use keyword arguments
my_experiment.arun(dataset, model="gpt-4o")
Using Data Models
You can specify a data model for your experiment results at the decorator level:
from pydantic import BaseModel
class ExperimentResult(BaseModel):
response: str
accuracy: float
model_used: str
@experiment(experiment_model=ExperimentResult)
async def my_experiment(row, model):
response = my_app(row.query, model)
metric = my_metric.score(response, row.ground_truth)
return ExperimentResult(
response=response,
accuracy=metric.value,
model_used=model
)
# Run experiment with specific model
my_experiment.arun(dataset, model="gpt-4o")
Complete Example: LLM Parameter Passing
Here's a complete example showing how to pass different LLM models to your experiment function:
from pydantic import BaseModel
from ragas.experimental import Dataset
from ragas import experiment
class ExperimentResult(BaseModel):
query: str
response: str
accuracy: float
model_used: str
latency_ms: float
@experiment(experiment_model=ExperimentResult)
async def llm_experiment(row, llm_model, temperature=0.7):
"""Experiment function that accepts LLM model and other parameters."""
import time
start_time = time.time()
# Use the passed LLM model
response = await my_llm_app(
query=row.query,
model=llm_model,
temperature=temperature
)
# Calculate metrics
metric = my_metric.score(response, row.ground_truth)
end_time = time.time()
return ExperimentResult(
query=row.query,
response=response,
accuracy=metric.value,
model_used=llm_model,
latency_ms=(end_time - start_time) * 1000
)
# Run experiments with different models
gpt4_results = await llm_experiment.arun(
dataset,
llm_model="gpt-4o",
temperature=0.1
)
claude_results = await llm_experiment.arun(
dataset,
llm_model="claude-4-sonnet",
temperature=0.7
)
Result Storage
Once executed, Ragas processes each row in the dataset, runs it through the function, and stores the results in the experiments folder. The storage backend can be configured based on your preferences.