Experiments

What is an experiment?

An experiment is a deliberate change made to your application to test a hypothesis or idea. For example, in a Retrieval-Augmented Generation (RAG) system, you might replace the retriever model to evaluate how a new embedding model impacts chatbot responses.

Principles of a Good Experiment

Define measurable metrics: Use metrics like accuracy, precision, or recall to quantify the impact of your changes.
Systematic result storage: Ensure results are stored in an organized manner for easy comparison and tracking.
Isolate changes: Make one change at a time to identify its specific impact. Avoid making multiple changes simultaneously, as this can obscure the results.
Iterative process: Follow a structured approach: *Make a change → Run evaluations → Observe results →

graph LR
    A[Make a change] --> B[Run evaluations]
    B --> C[Observe results]
    C --> D[Hypothesize next change]
    D --> A

Experiments in Ragas

Components of an Experiment

Test dataset: The data used to evaluate the system.
Application endpoint: The application, component or model being tested.
Metrics: Quantitative measures to assess performance.

Execution Process

Setup: Define the experiment parameters and load the test dataset.
Run: Execute the application on each sample in the dataset.
Evaluate: Apply metrics to measure performance.
Store: Save results for analysis and comparison.

Creating Experiments with Ragas

Ragas provides an @experiment decorator to streamline the experiment creation process. If you prefer a hands-on intro first, see the Quick Start guide.

Basic Experiment Structure

from ragas import experiment
import asyncio

@experiment()
async def my_experiment(row):
    # Process the input through your system
    response = await asyncio.to_thread(my_system_function, row["input"])

    # Return results for evaluation
    return {
        **row,  # Include original data
        "response": response,
        "experiment_name": "baseline_v1",
        # Add any additional metadata
        "model_version": "gpt-4o",
        "timestamp": datetime.now().isoformat()
    }

Running Experiments

from ragas import Dataset

# Load your test dataset
dataset = Dataset.load(name="test_data", backend="local/csv", root_dir="./data")

# Run the experiment
results = await my_experiment.arun(dataset)

Parameterized Experiments

You can create parameterized experiments to test different configurations:

@experiment()
async def model_comparison_experiment(row, model_name: str, temperature: float):
    # Configure your system with the parameters
    response = await my_system_function(
        row["input"], 
        model=model_name, 
        temperature=temperature
    )

    return {
        **row,
        "response": response,
        "experiment_name": f"{model_name}_temp_{temperature}",
        "model_name": model_name,
        "temperature": temperature
    }

# Run with different parameters
results_gpt4 = await model_comparison_experiment.arun(
    dataset, 
    model_name="gpt-4o", 
    temperature=0.1
)

results_gpt35 = await model_comparison_experiment.arun(
    dataset, 
    model_name="gpt-3.5-turbo", 
    temperature=0.1
)

Experiment Management Best Practices

1. Consistent Naming

Use descriptive names that include: - What changed (model, prompt, parameters) - Version numbers - Date/time if relevant

experiment_name = "gpt4o_v2_prompt_temperature_0.1_20241201"

2. Result Storage

Experiments automatically save results to CSV files in the experiments/ directory with timestamps:

experiments/
├── 20241201-143022-baseline_v1.csv
├── 20241201-143515-gpt4o_improved_prompt.csv
└── 20241201-144001-comparison.csv

3. Metadata Tracking

Include relevant metadata in your experiment results:

return {
    **row,
    "response": response,
    "experiment_name": "baseline_v1",
    "git_commit": "a1b2c3d",
    "environment": "staging",
    "model_version": "gpt-4o-2024-08-06",
    "total_tokens": response.usage.total_tokens,
    "response_time_ms": response_time
}

Advanced Experiment Patterns

A/B Testing

Test two different approaches simultaneously:

@experiment()
async def ab_test_experiment(row, variant: str):
    if variant == "A":
        response = await system_variant_a(row["input"])
    else:
        response = await system_variant_b(row["input"])

    return {
        **row,
        "response": response,
        "variant": variant,
        "experiment_name": f"ab_test_variant_{variant}"
    }

# Run both variants
results_a = await ab_test_experiment.arun(dataset, variant="A")
results_b = await ab_test_experiment.arun(dataset, variant="B")

Multi-Stage Experiments

For complex systems with multiple components:

@experiment()
async def multi_stage_experiment(row):
    # Stage 1: Retrieval
    retrieved_docs = await retriever(row["query"])

    # Stage 2: Generation
    response = await generator(row["query"], retrieved_docs)

    return {
        **row,
        "retrieved_docs": retrieved_docs,
        "response": response,
        "num_docs_retrieved": len(retrieved_docs),
        "experiment_name": "multi_stage_v1"
    }

Error Handling in Experiments

Handle errors gracefully to avoid losing partial results:

@experiment()
async def robust_experiment(row):
    try:
        response = await my_system_function(row["input"])
        error = None
    except Exception as e:
        response = None
        error = str(e)

    return {
        **row,
        "response": response,
        "error": error,
        "success": error is None,
        "experiment_name": "robust_v1"
    }

Integrating with Metrics

Experiments work seamlessly with Ragas metrics:

from ragas.metrics import FactualCorrectness

@experiment()
async def evaluated_experiment(row):
    response = await my_system_function(row["input"])

    # Calculate metrics inline
    factual_score = FactualCorrectness().score(
        response=response,
        reference=row["expected_output"]
    )

    return {
        **row,
        "response": response,
        "factual_correctness": factual_score.value,
        "factual_reason": factual_score.reason,
        "experiment_name": "evaluated_v1"
    }

This integration allows you to automatically calculate and store metric scores alongside your experiment results, making it easy to track performance improvements over time.