Metrics

ragas.metrics.answer_relevancy

Scores the relevancy of the answer according to the given question.

ragas.metrics.answer_similarity

Scores the semantic similarity of the ground truth with the generated answer.

ragas.metrics.answer_correctness

Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.

ragas.metrics.context_precision

Average Precision is a metric that evaluates whether all of the relevant items retrieved by the model are ranked higher than the irrelevant ones.

ragas.metrics.context_recall

Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.

ragas.metrics.context_entity_recall

Calculates recall based on the entities present in the ground truth and the retrieved context.
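
The names above are ready-to-use metric instances. A minimal usage sketch, assuming the standard ragas evaluate() entry point, a datasets.Dataset with question, answer, contexts and ground_truth columns (column names vary slightly across ragas versions), and an LLM/embeddings backend configured at run time:

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        answer_correctness,
        context_precision,
        context_recall,
    )

    # One evaluation row; contexts holds the list of retrieved chunks for the question.
    dataset = Dataset.from_dict({
        "question": ["When was the Eiffel Tower built?"],
        "answer": ["The Eiffel Tower was completed in 1889."],
        "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair in Paris."]],
        "ground_truth": ["The Eiffel Tower was completed in 1889."],
    })

    result = evaluate(
        dataset,
        metrics=[answer_relevancy, answer_correctness, context_precision, context_recall],
    )
    print(result)  # dict-like mapping of metric name to score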

class ragas.metrics.AnswerCorrectness(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_correctness', evaluation_mode: EvaluationMode = EvaluationMode.qga, correctness_prompt: Prompt = <factory>, weights: list[float] = <factory>, answer_similarity: AnswerSimilarity | None = None, max_retries: int = 1)

Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.

name

The name of the metric.

Type:

str

weights

A list of two weights corresponding to factuality and semantic similarity. Defaults to [0.75, 0.25].

Type:

list[float]

answer_similarity

The AnswerSimilarity object

Type:

AnswerSimilarity | None

adapt(language: str, cache_dir: str | None = None) → None

Adapt the metric to a different language.

init(run_config: RunConfig)

Initialize any models in the metric. This is invoked before evaluate() to load all the models, and also checks whether the API key is valid for OpenAI and AzureOpenAI.

save(cache_dir: str | None = None) → None

Save the metric to a path.
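
A minimal configuration sketch for the weights attribute, assuming the documented order (factuality first, semantic similarity second); the instance is then passed to evaluate() like any other metric:

    from ragas.metrics import AnswerCorrectness

    # Weight factuality at 0.6 and semantic similarity at 0.4 instead of the default [0.75, 0.25].
    answer_correctness = AnswerCorrectness(weights=[0.6, 0.4])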

class ragas.metrics.AnswerRelevancy(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_relevancy', evaluation_mode: EvaluationMode = EvaluationMode.qac, question_generation: Prompt = <factory>, strictness: int = 3)

Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary information are penalized. The score ranges from 0 to 1, with 1 being the best.

name

The name of the metric.

Type:

str

strictness

The number of questions generated per answer. The ideal range is 3 to 5.

Type:

int

embeddings

The LangChain wrapper of an Embedding object, e.g. HuggingFaceEmbeddings('BAAI/bge-base-en').

Type:

Embedding

adapt(language: str, cache_dir: str | None = None) → None

Adapt the metric to a different language.

save(cache_dir: str | None = None) → None

Save the metric to a path.
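
A minimal configuration sketch; the llm and embeddings arguments are left unset here on the assumption that evaluate() injects defaults at run time:

    from ragas.metrics import AnswerRelevancy

    # Generate 5 synthetic questions per answer instead of the default 3.
    answer_relevancy = AnswerRelevancy(strictness=5)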

class ragas.metrics.AnswerSimilarity(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_similarity', evaluation_mode: EvaluationMode = EvaluationMode.ga, is_cross_encoder: bool = False, threshold: t.Optional[float] = None)

Scores the semantic similarity of the ground truth with the generated answer. A cross-encoder score is used to quantify semantic similarity. SAS paper: https://arxiv.org/pdf/2108.06130.pdf

name
Type:

str

model_name

The model used to calculate semantic similarity. Defaults to OpenAI embeddings; select a cross-encoder model for best results (see https://huggingface.co/spaces/mteb/leaderboard).

threshold

The threshold, if given, is used to map the output to binary. Defaults to 0.5.

Type:

t.Optional[float]
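
A minimal configuration sketch for the threshold attribute; whether binarising the score is useful depends on the application:

    from ragas.metrics import AnswerSimilarity

    # Map the continuous similarity score to 0/1 with a 0.5 cut-off.
    answer_similarity = AnswerSimilarity(threshold=0.5)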

class ragas.metrics.AspectCritique(llm: BaseRagasLLM | None = None, name: str = '', evaluation_mode: EvaluationMode = EvaluationMode.qac, critic_prompt: Prompt = <factory>, definition: str = '', strictness: int = 1, max_retries: int = 1)

Judges the submission to give binary results using the criteria specified in the metric definition.

name

The name of the metric.

Type:

str

definition

The criteria used to judge the submission, for example "Is the submission spreading fake information?"

Type:

str

strictness

The number of self-consistency checks performed. The final judgement is made using a majority vote.

Type:

int

llm

The LLM API of your choice.

Type:

LangchainLLM

adapt(language: str, cache_dir: str | None = None) → None

Adapt the metric to a different language.

save(cache_dir: str | None = None) → None

Save the metric to a path.
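
A minimal sketch of a custom critique built from the attributes documented above; the aspect name and definition string are illustrative, not built-in aspects:

    from ragas.metrics import AspectCritique

    # Binary judgement of a user-defined aspect, decided by majority vote over 3 checks.
    conciseness = AspectCritique(
        name="conciseness",
        definition="Does the submission convey the answer without unnecessary detail?",
        strictness=3,  # an odd value avoids ties in the majority vote
    )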

class ragas.metrics.ContextEntityRecall(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_entity_recall', evaluation_mode: EvaluationMode = EvaluationMode.gc, context_entity_recall_prompt: Prompt = <factory>, batch_size: int = 15, max_retries: int = 1)

Calculates recall based on the entities present in the ground truth and the retrieved context. Let CN be the set of entities present in the context and GN be the set of entities present in the ground truth.

We then define context entity recall as follows: Context Entity Recall = |CN ∩ GN| / |GN|

If this quantity is 1, the retrieval mechanism has retrieved context that covers all entities present in the ground truth, making it a useful retrieval. This metric can therefore be used to evaluate retrieval mechanisms in use cases where entities matter, for example a tourism help chatbot.

name
Type:

str

batch_size

Batch size for OpenAI completions.

Type:

int

save(cache_dir: str | None = None) → None

Save the metric to a path.
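
A toy illustration of the formula above using plain Python sets; the real metric extracts CN and GN from the context and ground truth with an LLM prompt rather than taking them as input:

    # CN: entities found in the retrieved context, GN: entities in the ground truth.
    CN = {"Eiffel Tower", "Paris", "1889", "World's Fair"}
    GN = {"Eiffel Tower", "Paris", "1889"}

    context_entity_recall = len(CN & GN) / len(GN)  # 3 / 3 = 1.0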

class ragas.metrics.ContextPrecision(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_precision', evaluation_mode: EvaluationMode = EvaluationMode.qcg, context_precision_prompt: Prompt = <factory>, max_retries: int = 1)

Average Precision is a metric that evaluates whether all of the relevant items retrieved by the model are ranked higher than the irrelevant ones.

name
Type:

str

evaluation_mode
Type:

EvaluationMode

context_precision_prompt
Type:

Prompt

adapt(language: str, cache_dir: str | None = None) → None

Adapt the metric to a different language.

save(cache_dir: str | None = None) → None

Save the metric to a path.
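
A toy illustration of the average-precision-style calculation, assuming one relevance verdict per retrieved chunk; in the metric itself the verdicts come from an LLM prompt:

    # Verdicts per retrieved chunk in rank order: 1 = useful for the ground truth, 0 = not.
    verdicts = [1, 0, 1]

    # Precision@k at each rank holding a relevant chunk, averaged over the relevant chunks.
    precisions_at_relevant_ranks = [
        sum(verdicts[: k + 1]) / (k + 1) for k, v in enumerate(verdicts) if v
    ]
    context_precision = sum(precisions_at_relevant_ranks) / max(sum(verdicts), 1)
    # (1/1 + 2/3) / 2 ≈ 0.83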

class ragas.metrics.ContextRecall(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_recall', evaluation_mode: EvaluationMode = EvaluationMode.qcg, context_recall_prompt: Prompt = <factory>, max_retries: int = 1)

Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.

name
Type:

str

adapt(language: str, cache_dir: str | None = None) → None

Adapt the metric to a different language.

save(cache_dir: str | None = None) → None

Save the metric to a path.
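
A toy illustration of the TP/FN estimate, assuming each sentence of the annotated answer has already been classified as attributable to the retrieved context or not; the classification itself is done by an LLM prompt:

    # True  -> the ground-truth sentence is supported by the retrieved context (TP)
    # False -> it is not (FN)
    attributable = [True, True, False]

    context_recall = sum(attributable) / len(attributable)  # TP / (TP + FN) = 2/3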

class ragas.metrics.ContextRelevancy(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_relevancy', evaluation_mode: EvaluationMode = EvaluationMode.qc, context_relevancy_prompt: Prompt = <factory>, show_deprecation_warning: bool = False)

Extracts sentences from the context that are relevant to the question, with self-consistency checks. The ratio of relevant sentences to the total number of sentences in the context is used as the score.

name
Type:

str

adapt(language: str, cache_dir: str | None = None) → None

Adapt the metric to a different language.

save(cache_dir: str | None = None) → None

Save the metric to a path.
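
A toy illustration of the score, assuming the relevant sentences have already been extracted; the extraction is performed by an LLM prompt in the metric:

    relevant_sentences = 3   # context sentences judged relevant to the question
    total_sentences = 5      # total sentences in the retrieved context
    context_relevancy = relevant_sentences / total_sentences  # 0.6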

class ragas.metrics.ContextUtilization(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_utilization', evaluation_mode: EvaluationMode = EvaluationMode.qac, context_precision_prompt: Prompt = <factory>, max_retries: int = 1)

Same calculation as ContextPrecision, but the usefulness of each retrieved context is judged against the generated answer instead of the ground truth (evaluation mode qac), reusing the context_precision_prompt.

class ragas.metrics.Faithfulness(llm: t.Optional[BaseRagasLLM] = None, name: str = 'faithfulness', evaluation_mode: EvaluationMode = EvaluationMode.qac, long_form_answer_prompt: Prompt = <factory>, nli_statements_message: Prompt = <factory>, max_retries: int = 1)

Measures the factual consistency of the generated answer against the retrieved context. The answer is decomposed into individual statements (long_form_answer_prompt), each statement is checked against the context with an NLI-style prompt (nli_statements_message), and the score is the fraction of statements supported by the context. The score ranges from 0 to 1, with 1 being the best.

adapt(language: str, cache_dir: str | None = None) → None

Adapt the metric to a different language.

save(cache_dir: str | None = None) → None

Save the metric to a path.
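
A toy illustration of the faithfulness score, assuming the answer has already been decomposed into statements and each statement has received an NLI-style verdict against the retrieved context; both steps are performed by the metric's LLM prompts:

    # Verdict per statement extracted from the answer: True if the context supports it.
    supported = [True, True, False, True]

    faithfulness = sum(supported) / len(supported)  # 3 / 4 = 0.75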