Metrics

ragas.metrics.answer_relevancy

Scores the relevancy of the answer according to the given question.

ragas.metrics.answer_similarity

Scores the semantic similarity between the ground truth and the generated answer.

ragas.metrics.answer_correctness

Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.

ragas.metrics.context_precision

Average precision evaluates whether the relevant items retrieved by the model are ranked higher than the irrelevant ones.

ragas.metrics.context_recall

Estimates context recall by estimating true positives (TP) and false negatives (FN) from the annotated answer and the retrieved context.
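
Example (a minimal sketch of the typical workflow; the column names "question", "answer", "contexts" and "ground_truths" are assumptions about the dataset schema these metrics expect and may differ between ragas versions):

>>> from datasets import Dataset
>>> from ragas import evaluate
>>> from ragas.metrics import answer_relevancy, context_precision, context_recall
>>> # Build a one-row evaluation dataset; real evaluations use the outputs of
>>> # your RAG pipeline instead of hand-written strings.
>>> data = Dataset.from_dict({
...     "question": ["When did the first Moon landing take place?"],
...     "answer": ["The first crewed Moon landing was Apollo 11, in July 1969."],
...     "contexts": [["Apollo 11 landed on the Moon on 20 July 1969."]],
...     "ground_truths": [["Apollo 11 landed on the Moon in July 1969."]],
... })
>>> # Each metric is scored per row and aggregated into a single result object.
>>> result = evaluate(data, metrics=[answer_relevancy, context_precision, context_recall])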

class ragas.metrics.AnswerCorrectness(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_correctness', evaluation_mode: EvaluationMode = EvaluationMode.qga, correctness_prompt: Prompt = <factory>, weights: list[float] = <factory>, answer_similarity: AnswerSimilarity | None = None)

Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.

name

The name of the metric

Type:

str

weights

A list of two weights corresponding to factuality and semantic similarity. Defaults to [0.75, 0.25].

Type:

list[float]

answer_similarity

The AnswerSimilarity object

Type:

AnswerSimilarity | None

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

init(run_config: RunConfig)

Initializes any models in the metric. This is invoked before evaluate() to load all the models, and also checks whether the API key is valid for OpenAI and AzureOpenAI.

save(cache_dir: str | None = None) None

Save the metric to a path.
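
Example (a minimal sketch of overriding the default weights; the instance can then be passed to evaluate() like any other metric):

>>> from ragas.metrics import AnswerCorrectness
>>> # Weight factuality at 0.4 and semantic similarity at 0.6 instead of the
>>> # default [0.75, 0.25].
>>> answer_correctness = AnswerCorrectness(weights=[0.4, 0.6])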

class ragas.metrics.AnswerRelevancy(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_relevancy', evaluation_mode: EvaluationMode = EvaluationMode.qac, question_generation: Prompt = <factory>, strictness: int = 3)

Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary information are penalized. The score ranges from 0 to 1, with 1 being the best.

name

The name of the metric

Type:

str

strictness

The number of questions generated per answer. The ideal range is 3 to 5.

Type:

int

embeddings

The LangChain wrapper of the embedding object, e.g. HuggingFaceEmbeddings('BAAI/bge-base-en').

Type:

Embedding

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
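
Example (a minimal sketch of raising strictness within the suggested range of 3 to 5):

>>> from ragas.metrics import AnswerRelevancy
>>> # Generate five questions per answer instead of the default three.
>>> answer_relevancy = AnswerRelevancy(strictness=5)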

class ragas.metrics.AnswerSimilarity(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_similarity', evaluation_mode: EvaluationMode = EvaluationMode.ga, is_cross_encoder: bool = False, threshold: t.Optional[float] = None)

Scores the semantic similarity between the ground truth and the generated answer. A cross-encoder score is used to quantify semantic similarity. SAS paper: https://arxiv.org/pdf/2108.06130.pdf

name
Type:

str

model_name

The model used to calculate semantic similarity. Defaults to OpenAI embeddings; select a cross-encoder model for best results (see https://huggingface.co/spaces/mteb/leaderboard).

threshold

The threshold, if given, is used to map the output to binary. Defaults to 0.5.

Type:

t.Optional[float]
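
Example (a minimal sketch of mapping the similarity score to a binary output via threshold; the cut-off value here is illustrative):

>>> from ragas.metrics import AnswerSimilarity
>>> # With a threshold set, the raw similarity score is converted to 0/1
>>> # (0.7 is an illustrative cut-off, not a recommended value).
>>> answer_similarity = AnswerSimilarity(threshold=0.7)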

class ragas.metrics.AspectCritique(llm: BaseRagasLLM | None = None, name: str = '', evaluation_mode: EvaluationMode = EvaluationMode.qac, critic_prompt: Prompt = <factory>, definition: str = '', strictness: int = 1)

Judges the submission to give binary results using the criteria specified in the metric definition.

name

The name of the metric

Type:

str

definition

The criteria used to judge the submission, e.g. “Is the submission spreading fake information?”

Type:

str

strictness

The number of self-consistency checks made. The final judgement is made using a majority vote.

Type:

int

llm

The LLM API of your choice.

Type:

BaseRagasLLM | None

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
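
Example (a minimal sketch of defining a custom aspect; the aspect name and definition here are illustrative, not built-in):

>>> from ragas.metrics import AspectCritique
>>> conciseness = AspectCritique(
...     name="conciseness",
...     definition="Does the submission convey the answer without unnecessary detail?",
...     strictness=3,
... )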

class ragas.metrics.ContextPrecision(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_precision', evaluation_mode: EvaluationMode = EvaluationMode.qcg, context_precision_prompt: Prompt = <factory>)

Average precision evaluates whether the relevant items retrieved by the model are ranked higher than the irrelevant ones.

name
Type:

str

evaluation_mode
Type:

EvaluationMode

context_precision_prompt
Type:

Prompt

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
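
Example (a minimal sketch using the module-level context_precision instance; data is the dataset from the example near the top of this page):

>>> from ragas import evaluate
>>> from ragas.metrics import context_precision
>>> # Scores how well the relevant context chunks are ranked for each question.
>>> result = evaluate(data, metrics=[context_precision])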

class ragas.metrics.ContextRecall(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_recall', evaluation_mode: EvaluationMode = EvaluationMode.qcg, context_recall_prompt: Prompt = <factory>)

Estimates context recall by estimating true positives (TP) and false negatives (FN) from the annotated answer and the retrieved context.

name
Type:

str

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
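
Example (a minimal sketch of adapting the metric prompts to another language and caching them; the cache_dir path is illustrative):

>>> from ragas.metrics import context_recall
>>> # Translate the internal prompts to Hindi, then persist them locally so
>>> # the adaptation does not have to be recomputed on the next run.
>>> context_recall.adapt(language="hindi", cache_dir="./ragas_prompt_cache")
>>> context_recall.save(cache_dir="./ragas_prompt_cache")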

class ragas.metrics.ContextRelevancy(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_relevancy', evaluation_mode: EvaluationMode = EvaluationMode.qc, context_relevancy_prompt: Prompt = <factory>, show_deprecation_warning: bool = False)

Extracts sentences from the context that are relevant to the question, with self-consistency checks. The number of relevant sentences is used as the score.

name
Type:

str

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
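
Example (a minimal sketch; judging by its evaluation mode this metric appears to need only the question and retrieved contexts, and the show_deprecation_warning field in the signature suggests it is on a deprecation path):

>>> from ragas import evaluate
>>> from ragas.metrics import ContextRelevancy
>>> # Direct instantiation; defaults match the signature above.
>>> context_relevancy = ContextRelevancy()
>>> result = evaluate(data, metrics=[context_relevancy])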

class ragas.metrics.ContextUtilization(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'context_utilization', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qac: 1>, context_precision_prompt: 'Prompt' = <factory>)

class ragas.metrics.Faithfulness(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'faithfulness', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qac: 1>, long_form_answer_prompt: 'Prompt' = <factory>, nli_statements_message: 'Prompt' = <factory>)

adapt(language: str, cache_dir: str | None = None) None

Adapt the metric to a different language.

save(cache_dir: str | None = None) None

Save the metric to a path.
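
Example (a minimal sketch, assuming a module-level faithfulness instance parallel to the ones listed at the top of this page; data is the dataset from the earlier example):

>>> from ragas import evaluate
>>> from ragas.metrics import faithfulness
>>> # Checks whether the statements made in the answer are supported by the
>>> # retrieved contexts.
>>> result = evaluate(data, metrics=[faithfulness])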