Metrics

ragas.metrics.answer_relevancy

Scores the relevancy of the answer according to the given question.

ragas.metrics.answer_similarity

Scores the semantic similarity of the ground truth with the generated answer.

ragas.metrics.answer_correctness

Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.

ragas.metrics.context_precision

Average Precision is a metric that evaluates whether the relevant items selected by the model are ranked higher than the irrelevant ones.

ragas.metrics.context_recall

Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.

ragas.metrics.context_entity_recall

Calculates recall based on entities present in the ground truth and the context.

ragas.metrics.summarization_score

Measures how well the summary captures the important information from the contexts.
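
Each of the lowercase names above is also exposed as a ready-to-use metric instance that can be passed to ragas.evaluate(). A minimal sketch, assuming a Hugging Face Dataset laid out with the question, answer, contexts, and ground_truth columns these metrics read from (the sample row is illustrative):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        answer_correctness,
        context_precision,
        context_recall,
    )

    # Toy single-row dataset; column names follow the ragas evaluation schema.
    dataset = Dataset.from_dict({
        "question": ["When was the Eiffel Tower completed?"],
        "answer": ["The Eiffel Tower was completed in 1889."],
        "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair in Paris."]],
        "ground_truth": ["The Eiffel Tower was completed in 1889."],
    })

    result = evaluate(
        dataset,
        metrics=[answer_relevancy, answer_correctness, context_precision, context_recall],
    )
    print(result)  # aggregate score per metric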

class ragas.metrics.AnswerCorrectness(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_correctness', evaluation_mode: EvaluationMode = EvaluationMode.qga, correctness_prompt: Prompt = <factory>, long_form_answer_prompt: Prompt = <factory>, weights: list[float] = <factory>, answer_similarity: t.Optional[AnswerSimilarity] = None, sentence_segmenter: t.Optional[HasSegmentMethod] = None, max_retries: int = 1)

Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.

name

The name of the metric.

Type:

string

weights

A list of two weights corresponding to factuality and semantic similarity. Defaults to [0.75, 0.25].

Type:

list[float]

answer_similarity

The AnswerSimilarity object

Type:

t.Optional[AnswerSimilarity]

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

init(run_config: RunConfig)

Initialize any models in the metric. This is invoked before evaluate() to load all the models, and also checks whether the API key is valid for OpenAI and AzureOpenAI.

save(cache_dir: str | None = None) None

Save the metric to a path.
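
A minimal sketch of tuning the factuality/semantic-similarity weighting; the 0.5/0.5 split and the sample row are illustrative, and the column names assume the standard ragas evaluation schema:

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import AnswerCorrectness

    # Weight factuality and semantic similarity equally instead of the
    # default [0.75, 0.25].
    balanced_correctness = AnswerCorrectness(weights=[0.5, 0.5])

    dataset = Dataset.from_dict({
        "question": ["At what temperature does water boil at sea level?"],
        "answer": ["Water boils at 100 degrees Celsius at sea level."],
        "ground_truth": ["Water boils at 100 °C (212 °F) at standard atmospheric pressure."],
    })
    print(evaluate(dataset, metrics=[balanced_correctness]))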

class ragas.metrics.AnswerRelevancy(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_relevancy', evaluation_mode: EvaluationMode = EvaluationMode.qac, question_generation: Prompt = <factory>, strictness: int = 3)

Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary information are penalized. The score ranges from 0 to 1, with 1 being the best.

name

The name of the metric.

Type:

string

strictness

Indicates the number of questions generated per answer. The ideal range is 3 to 5.

Type:

int

embeddings

The LangChain wrapper of the embeddings object, e.g. HuggingFaceEmbeddings('BAAI/bge-base-en').

Type:

Embedding

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
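
A minimal sketch of raising strictness to the upper end of the suggested range; the dataset row is illustrative and uses the question/answer/contexts columns the qac evaluation mode reads:

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import AnswerRelevancy

    # Generate 5 probe questions per answer instead of the default 3.
    strict_relevancy = AnswerRelevancy(strictness=5)

    dataset = Dataset.from_dict({
        "question": ["What causes ocean tides?"],
        "answer": ["Tides are caused mainly by the gravitational pull of the Moon."],
        "contexts": [["Ocean tides result from the gravitational forces of the Moon and the Sun."]],
    })
    print(evaluate(dataset, metrics=[strict_relevancy]))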

class ragas.metrics.AnswerSimilarity(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_similarity', evaluation_mode: EvaluationMode = EvaluationMode.ga, is_cross_encoder: bool = False, threshold: t.Optional[float] = None)

Scores the semantic similarity of the ground truth with the generated answer. A cross-encoder score is used to quantify semantic similarity. SAS paper: https://arxiv.org/pdf/2108.06130.pdf

name
Type:

str

model_name

The model to be used for calculating semantic similarity. Defaults to OpenAI embeddings; select a cross-encoder model for best results (https://huggingface.co/spaces/mteb/leaderboard).

threshold

The threshold, if given, is used to map the output to binary. Defaults to 0.5.

Type:

t.Optional[float]
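
A minimal sketch of mapping the continuous similarity score to a binary outcome via the threshold field; the cut-off value is illustrative:

    from ragas.metrics import AnswerSimilarity

    # With a threshold, scores are mapped to 0/1; leave it as None to keep
    # the raw semantic-similarity value.
    binary_similarity = AnswerSimilarity(threshold=0.8)

The instance is then passed to evaluate() like any other metric, with the ground_truth and answer columns present (ga evaluation mode).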

class ragas.metrics.AspectCritique(llm: BaseRagasLLM | None = None, name: str = '', evaluation_mode: EvaluationMode = EvaluationMode.qac, critic_prompt: Prompt = <factory>, definition: str = '', strictness: int = 1, max_retries: int = 1)

Judges the submission to give binary results using the criteria specified in the metric definition.

name

The name of the metric.

Type:

str

definition

The criteria used to judge the submission, for example: "Is the submission spreading fake information?"

Type:

str

strictness

The number of times self-consistency checks are performed. The final judgement is made using a majority vote.

Type:

int

llm

The LLM API of your choice.

Type:

LangchainLLM

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
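
A minimal sketch of a custom binary critique; the name and definition below are illustrative, not built-in aspects:

    from ragas.metrics import AspectCritique

    # A custom aspect: the judge answers the definition with yes/no, and an
    # odd strictness avoids ties in the majority vote.
    conciseness = AspectCritique(
        name="conciseness",
        definition="Does the submission answer the question without unnecessary detail?",
        strictness=3,
    )

The instance is then passed to evaluate() alongside a dataset that has the question, answer, and contexts columns (qac evaluation mode).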

class ragas.metrics.ContextEntityRecall(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_entity_recall', evaluation_mode: EvaluationMode = EvaluationMode.gc, context_entity_recall_prompt: Prompt = <factory>, batch_size: int = 15, max_retries: int = 1)

Calculates recall based on entities present in the ground truth and the context. Let CN be the set of entities present in the context and GN be the set of entities present in the ground truth.

Then we define context entity recall as follows: Context Entity Recall = |CN ∩ GN| / |GN|

If this quantity is 1, the retrieval mechanism has retrieved context that covers all entities present in the ground truth, making it a useful retrieval. This metric can therefore be used to evaluate retrieval mechanisms in use cases where entities matter, for example a tourism help chatbot.

name
Type:

str

batch_size

Batch size for OpenAI completions.

Type:

int

save(cache_dir: str | None = None) None

Save the metric to a path.
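
The formula above reduces to a simple set ratio. A worked example with hand-picked entity sets (in practice the metric extracts CN and GN with its LLM prompt):

    # Entities in the ground truth (GN) and in the retrieved context (CN).
    GN = {"Eiffel Tower", "Paris", "1889"}
    CN = {"Eiffel Tower", "1889", "Gustave Eiffel"}

    # Context Entity Recall = |CN ∩ GN| / |GN|
    recall = len(CN & GN) / len(GN)
    print(recall)  # 2 / 3 ≈ 0.67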

class ragas.metrics.ContextPrecision(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_precision', evaluation_mode: EvaluationMode = EvaluationMode.qcg, context_precision_prompt: Prompt = <factory>, max_retries: int = 1, _reproducibility: int = 1)

Average Precision is a metric that evaluates whether the relevant items selected by the model are ranked higher than the irrelevant ones.

name
Type:

str

evaluation_mode
Type:

EvaluationMode

context_precision_prompt
Type:

Prompt

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
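
A minimal sketch, assuming the question/contexts/ground_truth columns the qcg evaluation mode reads; the two retrieved chunks are illustrative, with the relevant one ranked first:

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import context_precision

    dataset = Dataset.from_dict({
        "question": ["Who wrote Pride and Prejudice?"],
        "contexts": [[
            "Pride and Prejudice is an 1813 novel by Jane Austen.",  # relevant, ranked first
            "The novel is set in rural England in the early 19th century.",
        ]],
        "ground_truth": ["Jane Austen wrote Pride and Prejudice."],
    })
    print(evaluate(dataset, metrics=[context_precision]))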

class ragas.metrics.ContextRecall(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_recall', evaluation_mode: EvaluationMode = EvaluationMode.qcg, context_recall_prompt: Prompt = <factory>, max_retries: int = 1, _reproducibility: int = 1)

Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.

name
Type:

str

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
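
The TP/FN estimation amounts to classifying each ground-truth statement as attributable to the retrieved context or not; a toy calculation of the resulting ratio (the counts are illustrative):

    # TP: ground-truth statements supported by the retrieved context.
    # FN: ground-truth statements the context does not support.
    tp, fn = 3, 1
    context_recall_score = tp / (tp + fn)
    print(context_recall_score)  # 0.75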

class ragas.metrics.ContextUtilization(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'context_utilization', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qac: 1>, context_precision_prompt: 'Prompt' = <factory>, max_retries: 'int' = 1, _reproducibility: 'int' = 1)
class ragas.metrics.Faithfulness(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'faithfulness', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qac: 1>, nli_statements_message: 'Prompt' = <factory>, statement_prompt: 'Prompt' = <factory>, sentence_segmenter: 't.Optional[HasSegmentMethod]' = None, max_retries: 'int' = 1, _reproducibility: 'int' = 1)
adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
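
A minimal sketch, assuming the module also exposes a ready-made faithfulness instance like the metrics listed at the top; the row is illustrative, and the metric checks whether the answer's statements can be inferred from the retrieved contexts:

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness

    dataset = Dataset.from_dict({
        "question": ["Where is the Great Barrier Reef?"],
        "answer": ["The Great Barrier Reef lies off the coast of Queensland, Australia."],
        "contexts": [["The Great Barrier Reef is located off Queensland, in northeastern Australia."]],
    })
    print(evaluate(dataset, metrics=[faithfulness]))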

class ragas.metrics.FaithulnesswithHHEM(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'faithfulness_with_hhem', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qac: 1>, nli_statements_message: 'Prompt' = <factory>, statement_prompt: 'Prompt' = <factory>, sentence_segmenter: 't.Optional[HasSegmentMethod]' = None, max_retries: 'int' = 1, _reproducibility: 'int' = 1, device: 'str' = 'cpu', batch_size: 'int' = 10)
class ragas.metrics.LabelledRubricsScore(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'labelled_rubrics_score', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qcg: 7>, rubrics: 't.Dict[str, str]' = <factory>, scoring_prompt: 'Prompt' = <factory>, max_retries: 'int' = 1)
adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
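
A minimal sketch of supplying a custom rubric; the score keys and descriptions below are placeholders for a domain-specific grading scheme, not the library's defaults:

    from ragas.metrics import LabelledRubricsScore

    # Illustrative rubric mapping score labels to grading criteria.
    rubrics = {
        "score1_description": "The answer contradicts the ground truth.",
        "score3_description": "The answer is partially consistent with the ground truth.",
        "score5_description": "The answer fully matches the ground truth.",
    }
    rubric_score = LabelledRubricsScore(rubrics=rubrics)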

class ragas.metrics.NoiseSensitivity(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'noise_sensitivity', focus: 'str' = 'relevant', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qga: 6>, nli_statements_message: 'Prompt' = <factory>, statement_prompt: 'Prompt' = <factory>, sentence_segmenter: 't.Optional[HasSegmentMethod]' = None, max_retries: 'int' = 1, _reproducibility: 'int' = 1)
adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
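
A minimal sketch of switching the focus field away from its "relevant" default; "irrelevant" is an assumed alternative value here, used to probe how much noise in the retrieved chunks leaks into the answer:

    from ragas.metrics import NoiseSensitivity

    # focus defaults to "relevant"; "irrelevant" is assumed here as the
    # complementary setting.
    noise_sensitivity_irrelevant = NoiseSensitivity(focus="irrelevant")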

class ragas.metrics.ReferenceFreeRubricsScore(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'reference_free_rubrics_score', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qga: 6>, rubrics: 't.Dict[str, str]' = <factory>, scoring_prompt: 'Prompt' = <factory>, max_retries: 'int' = 1)
class ragas.metrics.SummarizationScore(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'summary_score', max_retries: 'int' = 1, length_penalty: 'bool' = True, coeff: 'float' = 0.5, evaluation_mode: 'EvaluationMode' = <EvaluationMode.ca: 8>, question_generation_prompt: 'Prompt' = <factory>, answer_generation_prompt: 'Prompt' = <factory>, extract_keyphrases_prompt: 'Prompt' = <factory>)
adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.
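
A minimal sketch of adjusting the length-penalty behaviour; the interpretation of coeff as the weight of the conciseness component is an assumption based on the defaults shown in the signature:

    from ragas.metrics import SummarizationScore

    # Disable the length penalty so only question-answer coverage drives the
    # score; coeff (assumed to weight the conciseness term) is left at 0.5.
    summary_metric = SummarizationScore(length_penalty=False)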