Evaluation Dataset

An evaluation dataset is a homogeneous collection of data samples designed to assess the performance and capabilities of an AI application. In Ragas, evaluation datasets are represented using the EvaluationDataset class, which provides a structured way to organize and manage data samples for evaluation purposes.

Structure of an Evaluation Dataset

An evaluation dataset consists of:

Samples: A collection of SingleTurnSample or MultiTurnSample instances. Each sample represents a unique interaction or scenario.
Consistency: All samples within the dataset should be of the same type (either all single-turn or all multi-turn samples) to maintain consistency in evaluation.
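
The consistency rule above can be sketched as a small check. This is a minimal illustration, not the Ragas implementation: SingleTurn, MultiTurn, and check_homogeneous are hypothetical stand-ins written with the standard library so the idea can be run in isolation.

```python
from dataclasses import dataclass, field


# Stand-in sample types for illustration only; these are NOT the Ragas classes.
@dataclass
class SingleTurn:
    user_input: str
    response: str


@dataclass
class MultiTurn:
    messages: list = field(default_factory=list)


def check_homogeneous(samples: list) -> type:
    """Return the shared sample type, or raise if the list is empty or mixed."""
    if not samples:
        raise ValueError("dataset has no samples")
    first_type = type(samples[0])
    for index, sample in enumerate(samples):
        if type(sample) is not first_type:
            raise TypeError(
                f"sample {index} is {type(sample).__name__}, "
                f"expected {first_type.__name__}"
            )
    return first_type


single_only = [SingleTurn("Hi", "Hello"), SingleTurn("Bye", "Goodbye")]
mixed = single_only + [MultiTurn(["Hi", "Hello", "Bye"])]

# Accepted: every element is a SingleTurn.
shared_type = check_homogeneous(single_only)

# Rejected: the element at index 2 is a MultiTurn.
try:
    check_homogeneous(mixed)
    mixed_error = None
except TypeError as exc:
    mixed_error = str(exc)

print(shared_type.__name__)
print(mixed_error)
```

Failing early on a mixed list keeps a metric written for one sample type from silently receiving the other.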

Guidelines for Curating an Effective Evaluation Dataset

Define Clear Objectives: Identify the specific aspects of the AI application that you want to evaluate and the scenarios you want to test. Collect data samples that reflect these objectives.
Collect Representative Data: Ensure that the dataset covers a diverse range of scenarios, user inputs, and expected responses to provide a comprehensive evaluation of the AI application. This can be achieved by collecting data from various sources or generating synthetic data.
Quality and Size: Aim for a dataset that is large enough to provide meaningful insights but not so large that it becomes unwieldy. Ensure that the data is of high quality and accurately reflects the real-world scenarios you want to evaluate.
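
One way to act on these guidelines is to audit the raw records before turning them into samples. The sketch below is an assumption-laden illustration rather than part of Ragas: the "scenario" tag, the audit function, and the choice of checks (coverage per scenario, duplicate questions, empty references) are all hypothetical.

```python
from collections import Counter

# Hypothetical raw records; the "scenario" tag is an illustrative field
# used only for this audit, not something Ragas requires.
records = [
    {"scenario": "geography", "user_input": "What is the capital of Germany?", "reference": "Berlin"},
    {"scenario": "literature", "user_input": "Who wrote 'Pride and Prejudice'?", "reference": "Jane Austen"},
    {"scenario": "geography", "user_input": "What is the capital of Germany?", "reference": "Berlin"},
    {"scenario": "chemistry", "user_input": "What's the chemical formula for water?", "reference": ""},
]


def audit(rows):
    """Count rows per scenario and collect indices of duplicate or incomplete rows."""
    coverage = Counter(row["scenario"] for row in rows)
    seen = set()
    duplicates, incomplete = [], []
    for index, row in enumerate(rows):
        # Normalize the question so trivial case or spacing differences still match.
        key = row["user_input"].strip().lower()
        if key in seen:
            duplicates.append(index)
        seen.add(key)
        if not row["reference"].strip():
            incomplete.append(index)
    return coverage, duplicates, incomplete


coverage, duplicates, incomplete = audit(records)
print("rows per scenario:", dict(coverage))
print("duplicate rows:", duplicates)
print("rows missing a reference:", incomplete)
```

A scenario with very few rows suggests a coverage gap, while duplicate or incomplete rows inflate the dataset size without adding signal.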

Example

In this example, we’ll demonstrate how to create an EvaluationDataset using multiple SingleTurnSample instances. We’ll walk through the process step by step: importing the classes, creating individual samples, and assembling them into a dataset.

Step 1: Import Necessary Classes

First, import the SingleTurnSample and EvaluationDataset classes from the ragas package.

from ragas import SingleTurnSample, EvaluationDataset

Step 2: Create Individual Samples

Create several SingleTurnSample instances that represent individual evaluation samples.

# Sample 1
sample1 = SingleTurnSample(
    user_input="What is the capital of Germany?",
    retrieved_contexts=["Berlin is the capital and largest city of Germany."],
    response="The capital of Germany is Berlin.",
    reference="Berlin",
)

# Sample 2
sample2 = SingleTurnSample(
    user_input="Who wrote 'Pride and Prejudice'?",
    retrieved_contexts=["'Pride and Prejudice' is a novel by Jane Austen."],
    response="'Pride and Prejudice' was written by Jane Austen.",
    reference="Jane Austen",
)

# Sample 3
sample3 = SingleTurnSample(
    user_input="What's the chemical formula for water?",
    retrieved_contexts=["Water has the chemical formula H2O."],
    response="The chemical formula for water is H2O.",
    reference="H2O",
)

Step 3: Create the EvaluationDataset

Create an EvaluationDataset by passing a list of SingleTurnSample instances.

dataset = EvaluationDataset(samples=[sample1, sample2, sample3])

EvaluationDataset API Reference