Evaluation¶
- ragas.evaluation.evaluate(dataset: Dataset, metrics: list[Metric] | None = None, llm: t.Optional[BaseRagasLLM | LangchainLLM] = None, embeddings: t.Optional[BaseRagasEmbeddings | LangchainEmbeddings] = None, callbacks: Callbacks = None, in_ci: bool = False, run_config: RunConfig = RunConfig(timeout=180, max_retries=10, max_wait=60, max_workers=16, exception_types=(<class 'Exception'>, ), log_tenacity=False, seed=42), token_usage_parser: t.Optional[TokenUsageParser] = None, raise_exceptions: bool = False, column_map: t.Optional[t.Dict[str, str]] = None) → Result¶
Run the evaluation on the dataset with the given metrics.
- Parameters:
dataset (Dataset[question: list[str], contexts: list[list[str]], answer: list[str], ground_truth: list[list[str]]]) – The dataset, in the ragas format, that the metrics will use to score the RAG pipeline.
metrics (list[Metric], optional) – List of metrics to use for evaluation. If not provided, ragas will run the evaluation with its default set of metrics to give a complete view; passing an explicit list is shown in the second example under Examples below.
llm (BaseRagasLLM, optional) – The language model to use for the metrics. If not provided, ragas will use the default language model for metrics that require an LLM. This can be overridden by the llm specified at the metric level with metric.llm.
embeddings (BaseRagasEmbeddings, optional) – The embeddings to use for the metrics. If not provided, ragas will use the default embeddings for metrics that require embeddings. This can be overridden by the embeddings specified at the metric level with metric.embeddings.
callbacks (Callbacks, optional) – Lifecycle LangChain callbacks to run during evaluation. See the [LangChain documentation](https://python.langchain.com/docs/modules/callbacks/) for more information.
in_ci (bool) – Whether the evaluation is running in CI or not. If set to True, some metrics will be run so as to increase the reproducibility of the evaluations; this will increase the runtime and cost of the evaluations. Default is False.
run_config (RunConfig, optional) – Configuration for runtime settings like timeout and retries. If not provided, default values are used.
token_usage_parser (TokenUsageParser, optional) – Parser to get the token usage from the LLM result. If not provided, the cost and total tokens will not be calculated. Default is None.
raise_exceptions (bool) – Whether to raise exceptions or not. If set to True, the evaluation will raise an exception if any of the metrics fail. If set to False, the evaluation will return np.nan for the rows that failed. Default is False.
column_map (dict[str, str], optional) – The column names of the dataset to use for evaluation. If the column names of the dataset differ from the defaults, you can provide the mapping as a dictionary here. Example: if the dataset column name is contexts_v1, column_map can be given as {"contexts": "contexts_v1"}.
- Returns:
Result object containing the scores of each metric. You can use this to do further analysis later.
- Return type:
Result
- Raises:
ValueError – if validation fails because the columns required for the metrics are missing or if the columns are of the wrong format.
Examples
The basic usage is as follows:
```
>>> from ragas import evaluate
>>> dataset
Dataset({
    features: ['question', 'ground_truth', 'answer', 'contexts'],
    num_rows: 30
})
>>> result = evaluate(dataset)
>>> print(result)
{'context_precision': 0.817, 'faithfulness': 0.892, 'answer_relevancy': 0.874}
```
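You can also pass the components explicitly rather than relying on the defaults. The following is a minimal sketch, assuming the same dataset as above and that the langchain-openai package is installed; the model name gpt-4o-mini is purely illustrative:
```
>>> from langchain_openai import ChatOpenAI, OpenAIEmbeddings
>>> from ragas import evaluate
>>> from ragas.metrics import answer_relevancy, faithfulness
>>> from ragas.run_config import RunConfig
>>> result = evaluate(
...     dataset,
...     metrics=[faithfulness, answer_relevancy],         # explicit metric selection
...     llm=ChatOpenAI(model="gpt-4o-mini"),              # LangChain LLM, used by metrics that need one
...     embeddings=OpenAIEmbeddings(),                    # LangChain embeddings for embedding-based metrics
...     run_config=RunConfig(timeout=60, max_retries=3),  # tighter timeout/retry settings
... )
```
If your dataset uses non-default column names, column_map (e.g. {"contexts": "contexts_v1"}) can be passed alongside these arguments.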
- class ragas.evaluation.Result(scores: 'Dataset', dataset: 't.Optional[Dataset]' = None, binary_columns: 't.List[str]' = <factory>, cost_cb: 't.Optional[CostCallbackHandler]' = None)¶
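Once you have a Result, you typically inspect the aggregate scores and export the per-row scores for analysis. The sketch below reuses the dataset from the Examples above; dict-style access follows from the scores the Result stores, while to_pandas(), total_tokens(), and total_cost() are assumed helper methods, get_token_usage_for_openai is assumed to be the OpenAI token-usage parser in ragas.cost, and the per-token prices are placeholder values:
```
>>> from ragas import evaluate
>>> from ragas.cost import get_token_usage_for_openai   # assumed OpenAI token-usage parser
>>> from ragas.metrics import answer_relevancy, faithfulness
>>> result = evaluate(
...     dataset,
...     metrics=[faithfulness, answer_relevancy],
...     token_usage_parser=get_token_usage_for_openai,   # enables cost tracking via cost_cb
... )
>>> result["faithfulness"]     # aggregate score for a single metric (dict-style access)
>>> df = result.to_pandas()    # assumed helper: per-row scores next to the dataset columns
>>> result.total_tokens()      # assumed helper: summed token usage across LLM calls
>>> result.total_cost(cost_per_input_token=5 / 1e6, cost_per_output_token=15 / 1e6)  # assumed helper
```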