Evaluation

ragas.evaluation.evaluate(dataset: Dataset, metrics: list[ragas.metrics.base.Metric] | None = None, column_map: dict[str, str] = {'answer': 'answer', 'contexts': 'contexts', 'ground_truths': 'ground_truths', 'question': 'question'}) Result

Run the evaluation on the dataset with the given metrics.

Parameters:
  • dataset (Dataset[question: list[str], contexts: list[list[str]], answer: list[str]]) – The dataset, in the ragas format, that the metrics will use to score the RAG pipeline.

  • metrics (list[Metric], optional) – List of metrics to use for evaluation. If not provided, ragas runs the evaluation with a default set of metrics chosen to give a complete view of the pipeline (the second example below shows how to pass specific metrics).

  • column_map (dict[str, str], optional) – The column names of the dataset to use for evaluation. If your dataset's column names differ from the defaults, provide the mapping as a dictionary here (see the second example below).

Returns:

Result object containing the scores of each metric. You can use this to do further analysis. If the top 3 metrics are provided, it also returns the ragas_score for the entire pipeline.

Return type:

Result

Raises:

ValueError – if validation fails because the columns required for the metrics are missing or if the columns are of the wrong format.

Examples

The basic usage is as follows:

```
>>> from ragas import evaluate
>>> dataset
Dataset({
    features: ['question', 'ground_truths', 'answer', 'contexts'],
    num_rows: 30
})
>>> result = evaluate(dataset)
>>> print(result)
{'ragas_score': 0.860, 'context_precision': 0.817, 'faithfulness': 0.892,
'answer_relevancy': 0.874}
```
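
To evaluate only a chosen subset of metrics, or a dataset whose columns use different names, combine the metrics and column_map arguments. The sketch below is illustrative: the dataset and its column names ('query', 'retrieved_contexts', 'model_answer', 'reference_answers') are hypothetical, the metric objects are assumed to be importable from ragas.metrics as in standard ragas usage, and the mapping is written with ragas's default names as keys and your dataset's names as values, mirroring the default dictionary in the signature; verify the direction against the version you have installed.

```
>>> from datasets import Dataset
>>> from ragas import evaluate
>>> from ragas.metrics import faithfulness, answer_relevancy
>>> # hypothetical dataset whose columns do not use the default ragas names
>>> ds = Dataset.from_dict({
...     "query": ["When were the first modern Olympics held?"],
...     "retrieved_contexts": [["The first modern Olympics were held in Athens in 1896."]],
...     "model_answer": ["They were held in 1896."],
...     "reference_answers": [["1896"]],  # shape of the ground truths column is assumed here
... })
>>> result = evaluate(
...     ds,
...     metrics=[faithfulness, answer_relevancy],
...     column_map={
...         "question": "query",
...         "contexts": "retrieved_contexts",
...         "answer": "model_answer",
...         "ground_truths": "reference_answers",
...     },
... )
>>> result["faithfulness"]
```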
class ragas.evaluation.Result(scores: 'Dataset', dataset: 'Dataset | None' = None, binary_columns: 'list[str]' = <factory>)
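
Result is a dict-like container of the metric scores, as the example above suggests. A minimal sketch of working with it, assuming the scores shown earlier and assuming a to_pandas() helper for row-level analysis (present in recent ragas versions, but verify against the one you have installed):

```
>>> result = evaluate(dataset)
>>> result["ragas_score"]    # overall pipeline score
0.860
>>> result["faithfulness"]   # score for a single metric
0.892
>>> # to_pandas() is an assumption about the installed version; it is expected
>>> # to return per-row metric scores alongside the evaluated dataset
>>> result.to_pandas().head()
```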