Evaluation

ragas.evaluation.evaluate(dataset: Dataset, metrics: list[Metric] | None = None, llm: t.Optional[BaseRagasLLM | LangchainLLM] = None, embeddings: t.Optional[BaseRagasEmbeddings | LangchainEmbeddings] = None, callbacks: Callbacks = None, is_async: bool = False, run_config: t.Optional[RunConfig] = None, raise_exceptions: bool = True, column_map: t.Optional[t.Dict[str, str]] = None) → Result

Run the evaluation on the dataset with the given metrics.

Parameters:
  • dataset (Dataset[question: list[str], contexts: list[list[str]], answer: list[str], ground_truth: list[list[str]]]) – The dataset, in the ragas format, that the metrics will use to score the RAG pipeline.

  • metrics (list[Metric], optional) – List of metrics to use for evaluation. If not provided, ragas runs the evaluation with a default set of metrics that gives a complete view (see the sketch after this parameter list for passing an explicit list).

  • llm (BaseRagasLLM, optional) – The language model to use for the metrics. If not provided, ragas uses the default language model for metrics that require an LLM. This can be overridden by the LLM specified at the metric level with metric.llm.

  • embeddings (BaseRagasEmbeddings, optional) – The embeddings to use for the metrics. If not provided, ragas uses the default embeddings for metrics that require embeddings. This can be overridden by the embeddings specified at the metric level with metric.embeddings.

  • callbacks (Callbacks, optional) – Lifecycle Langchain Callbacks to run during evaluation. Check the [langchain documentation](https://python.langchain.com/docs/modules/callbacks/) for more information.

  • is_async (bool, optional) – Whether to run the evaluation in async mode. If set to True, the evaluation runs by calling the metric.ascore method. If the llm or embeddings do not support async, run the evaluation in sync mode with is_async=False. Default is False.

  • run_config (RunConfig, optional) – Configuration for runtime settings like timeout and retries. If not provided, default values are used.

  • raise_exceptions (bool, optional) – Whether to raise exceptions or not. If set to True, the evaluation raises an exception if any metric fails. If set to False, the evaluation returns np.nan for the rows that failed. Default is True.

  • column_map (dict[str, str], optional) – The column names of the dataset to use for evaluation. If the dataset's column names differ from the defaults, provide the mapping as a dictionary here. Example: if the dataset column is named contexts_v1, pass column_map={"contexts": "contexts_v1"}.
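
A minimal sketch of passing several of these parameters explicitly. It assumes the metric instances faithfulness and answer_relevancy are importable from ragas.metrics and that RunConfig lives in ragas.run_config; adjust the imports to your ragas version.

```
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy  # assumed metric instances
from ragas.run_config import RunConfig                    # assumed import path

# Score only the selected metrics, relax the runtime limits, and map a
# non-default column name onto the expected "contexts" column.
result = evaluate(
    dataset,                                   # a datasets.Dataset as described above
    metrics=[faithfulness, answer_relevancy],
    run_config=RunConfig(timeout=60, max_retries=3),
    column_map={"contexts": "contexts_v1"},
    raise_exceptions=False,                    # failed rows score as np.nan instead of raising
)
```

Per the signature above, llm= and embeddings= accept either the ragas wrappers or plain LangChain objects, so a LangChain chat model or embedding model can also be passed directly.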

Returns:

Result object containing the scores of each metric. You can use this for analysis later; see the example after the usage snippet below.

Return type:

Result

Raises:

ValueError – if validation fails because the columns required by the metrics are missing or are in the wrong format.

Examples

The basic usage is as follows:

```
>>> from ragas import evaluate
>>> dataset
Dataset({
    features: ['question', 'ground_truth', 'answer', 'contexts'],
    num_rows: 30
})
>>> result = evaluate(dataset)
>>> print(result)
{'context_precision': 0.817,
 'faithfulness': 0.892,
 'answer_relevancy': 0.874}
```
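
For further analysis of the returned Result, a minimal sketch, assuming Result exposes a to_pandas() helper that returns the per-row scores alongside the original dataset columns:

```
>>> df = result.to_pandas()   # assumed helper: one row per sample, one column per metric
>>> df.head()
```
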
class ragas.evaluation.Result(scores: 'Dataset', dataset: 't.Optional[Dataset]' = None, binary_columns: 't.List[str]' = <factory>)
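
A brief usage sketch, assuming Result behaves like a mapping from metric name to aggregate score (as the printed output above suggests):

```
>>> result["faithfulness"]    # assumed dict-style access to one metric's aggregate score
0.892
```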