Metrics¶
- AnswerRelevancy: Scores the relevancy of the answer according to the given question.
- AnswerSimilarity: Scores the semantic similarity of the ground truth with the generated answer.
- AnswerCorrectness: Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.
- ContextPrecision: Average precision, which evaluates whether all of the relevant items selected by the model are ranked higher than the irrelevant ones.
- ContextRecall: Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.
- ContextEntityRecall: Calculates recall based on entities present in the ground truth and the context.
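The sketch below is not part of the reference itself; it shows one common way these metric instances are used, passing them to ragas.evaluate over a small dataset. The column names follow the usual ragas dataset conventions and the toy data is purely illustrative.

```python
# Minimal usage sketch, assuming the usual ragas dataset columns
# ("question", "answer", "contexts", "ground_truth") and an OpenAI key in the
# environment for the default LLM and embeddings.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, context_recall

data = {
    "question": ["Where is the Eiffel Tower located?"],
    "answer": ["The Eiffel Tower is in Paris."],
    "contexts": [["The Eiffel Tower is a landmark in Paris, France."]],
    "ground_truth": ["The Eiffel Tower is located in Paris, France."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[answer_relevancy, answer_correctness, context_recall],
)
print(result)  # dict-like mapping of metric name to score
```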
- class ragas.metrics.AnswerCorrectness(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_correctness', evaluation_mode: EvaluationMode = EvaluationMode.qga, correctness_prompt: Prompt = <factory>, long_form_answer_prompt: Prompt = <factory>, weights: list[float] = <factory>, answer_similarity: t.Optional[AnswerSimilarity] = None, sentence_segmenter: t.Optional[HasSegmentMethod] = None, max_retries: int = 1)¶
Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.
- name¶
The name of the metric.
- Type:
string
- weights¶
A list of two weights corresponding to factuality and semantic similarity. Defaults to [0.75, 0.25].
- Type:
list[float]
- answer_similarity¶
The AnswerSimilarity object
- Type:
t.Optional[AnswerSimilarity]
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.
- init(run_config: RunConfig)¶
Initialize any models in the metric. This is invoked before evaluate() to load all the models and to check that the API key is valid for OpenAI and AzureOpenAI.
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
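A minimal sketch of customizing the weights attribute documented above; the 0.9/0.1 split is an arbitrary illustration, not a recommendation.

```python
# Sketch: AnswerCorrectness with custom weights for factuality vs. semantic
# similarity (defaults are [0.75, 0.25] as documented above).
from ragas.metrics import AnswerCorrectness

answer_correctness = AnswerCorrectness(weights=[0.9, 0.1])
# Pass the instance to ragas.evaluate(...) like any other metric.
```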
- class ragas.metrics.AnswerRelevancy(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_relevancy', evaluation_mode: EvaluationMode = EvaluationMode.qac, question_generation: Prompt = <factory>, strictness: int = 3)¶
Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary information are penalized. The score ranges from 0 to 1, with 1 being the best.
- name¶
The name of the metric.
- Type:
string
- strictness¶
Indicates the number of questions generated per answer. The ideal range is 3 to 5.
- Type:
int
- embeddings¶
The LangChain wrapper of the embedding object, e.g. HuggingFaceEmbeddings('BAAI/bge-base-en').
- Type:
Embedding
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
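A hedged sketch combining the strictness and embeddings attributes above. The wrapper class is assumed to be available as ragas.embeddings.LangchainEmbeddingsWrapper in your ragas version, and the LangChain import path may differ depending on your LangChain version; the model name is illustrative.

```python
# Sketch: AnswerRelevancy with custom embeddings and strictness. The wrapper
# class and the HuggingFace model name are assumptions for illustration.
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import AnswerRelevancy

answer_relevancy = AnswerRelevancy(
    embeddings=LangchainEmbeddingsWrapper(
        HuggingFaceEmbeddings(model_name="BAAI/bge-base-en")
    ),
    strictness=3,  # questions generated per answer; 3-5 is the suggested range
)
```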
- class ragas.metrics.AnswerSimilarity(embeddings: t.Optional[BaseRagasEmbeddings] = None, llm: t.Optional[BaseRagasLLM] = None, name: str = 'answer_similarity', evaluation_mode: EvaluationMode = EvaluationMode.ga, is_cross_encoder: bool = False, threshold: t.Optional[float] = None)¶
Scores the semantic similarity of the ground truth with the generated answer. A cross-encoder score is used to quantify semantic similarity. SAS paper: https://arxiv.org/pdf/2108.06130.pdf
- name¶
- Type:
str
- model_name¶
The model to be used for calculating semantic similarity. Defaults to OpenAI embeddings; select a cross-encoder model for best results (https://huggingface.co/spaces/mteb/leaderboard).
- threshold¶
The threshold, if given, is used to map the output to a binary score. Defaults to 0.5.
- Type:
t.Optional[float]
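A hedged sketch of the threshold attribute above; the exact binarization behaviour is assumed, not quoted from the source.

```python
# Sketch: AnswerSimilarity with a threshold so the continuous similarity score
# is mapped to a binary outcome (assumed: scores at or above the threshold
# become 1, others 0).
from ragas.metrics import AnswerSimilarity

answer_similarity = AnswerSimilarity(threshold=0.7)
```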
- class ragas.metrics.AspectCritique(llm: BaseRagasLLM | None = None, name: str = '', evaluation_mode: EvaluationMode = EvaluationMode.qac, critic_prompt: Prompt = <factory>, definition: str = '', strictness: int = 1, max_retries: int = 1)¶
Judges the submission to give binary results using the criteria specified in the metric definition.
- name¶
The name of the metric.
- Type:
str
- definition¶
Criteria to judge the submission, e.g. "Is the submission spreading fake information?"
- Type:
str
- strictness¶
The number of self-consistency checks made. The final judgement is made using a majority vote.
- Type:
int
- llm¶
The LLM API of your choice.
- Type:
LangchainLLM
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
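A hedged sketch of defining a custom aspect with the name, definition and strictness parameters above; the aspect name and wording are made up for illustration.

```python
# Sketch: a custom AspectCritique. The metric returns a binary verdict per sample.
from ragas.metrics import AspectCritique

conciseness = AspectCritique(
    name="conciseness",
    definition="Is the submission concise and free of unnecessary information?",
    strictness=3,  # odd number of self-consistency checks so the majority vote cannot tie
)
```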
- class ragas.metrics.ContextEntityRecall(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_entity_recall', evaluation_mode: EvaluationMode = EvaluationMode.gc, context_entity_recall_prompt: Prompt = <factory>, batch_size: int = 15, max_retries: int = 1)¶
Calculates recall based on entities present in the ground truth and the context. Let CN be the set of entities present in the context and GN be the set of entities present in the ground truth.
Then we define context entity recall as follows: Context Entity Recall = |CN ∩ GN| / |GN|
If this quantity is 1, the retrieval mechanism has retrieved context which covers all entities present in the ground truth, making it a useful retrieval. This metric can therefore be used to evaluate retrieval mechanisms in use cases where entities matter, for example, a tourism help chatbot. A worked sketch of the formula follows this entry.
- name¶
- Type:
str
- batch_size¶
Batch size for OpenAI completion.
- Type:
int
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
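A worked sketch of the formula quoted above, Context Entity Recall = |CN ∩ GN| / |GN|, using hand-picked entity sets; in the metric itself the entities are extracted by an LLM prompt.

```python
# Illustrative only: the entity sets are hand-written, not LLM-extracted.
GN = {"Eiffel Tower", "Paris", "1889"}            # entities in the ground truth
CN = {"Eiffel Tower", "Paris", "Gustave Eiffel"}  # entities in the retrieved context

context_entity_recall = len(CN & GN) / len(GN)
print(context_entity_recall)  # 2/3, about 0.67: the context misses the "1889" entity
```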
- class ragas.metrics.ContextPrecision(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_precision', evaluation_mode: EvaluationMode = EvaluationMode.qcg, context_precision_prompt: Prompt = <factory>, max_retries: int = 1, _reproducibility: int = 1)¶
Average precision is a metric that evaluates whether all of the relevant items selected by the model are ranked higher than the irrelevant ones.
- name¶
- Type:
str
- evaluation_mode¶
- Type:
EvaluationMode
- context_precision_prompt¶
- Type:
Prompt
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
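A hedged sketch of the average-precision idea referred to above: precision@k is accumulated at the positions of the relevant chunks and normalized by the number of relevant chunks. This is a common formulation and is assumed, not quoted from the library source; in the metric itself the relevance verdicts come from the LLM.

```python
def average_precision(relevance: list[int]) -> float:
    """Average precision over binary relevance verdicts, one per ranked chunk."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at each relevant position
    return score / hits if hits else 0.0

print(average_precision([1, 0, 1]))  # ~0.83: relevant chunks at ranks 1 and 3
```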
- class ragas.metrics.ContextRecall(llm: t.Optional[BaseRagasLLM] = None, name: str = 'context_recall', evaluation_mode: EvaluationMode = EvaluationMode.qcg, context_recall_prompt: Prompt = <factory>, max_retries: int = 1, _reproducibility: int = 1)¶
Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.
- name¶
- Type:
str
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
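A hedged sketch of the TP/FN estimate described above: each sentence of the annotated answer is judged as attributable to the retrieved context (TP) or not (FN), and recall = TP / (TP + FN). The verdicts below are illustrative; in the metric they are produced by the LLM.

```python
# 1 = ground-truth sentence supported by the retrieved context, 0 = not supported
verdicts = [1, 1, 0, 1]
context_recall = sum(verdicts) / len(verdicts)  # TP / (TP + FN)
print(context_recall)  # 0.75
```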
- class ragas.metrics.ContextUtilization(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'context_utilization', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qac: 1>, context_precision_prompt: 'Prompt' = <factory>, max_retries: 'int' = 1, _reproducibility: 'int' = 1)¶
- class ragas.metrics.Faithfulness(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'faithfulness', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qac: 1>, nli_statements_message: 'Prompt' = <factory>, statement_prompt: 'Prompt' = <factory>, sentence_segmenter: 't.Optional[HasSegmentMethod]' = None, max_retries: 'int' = 1, _reproducibility: 'int' = 1)¶
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
- class ragas.metrics.FaithulnesswithHHEM(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'faithfulness_with_hhem', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qac: 1>, nli_statements_message: 'Prompt' = <factory>, statement_prompt: 'Prompt' = <factory>, sentence_segmenter: 't.Optional[HasSegmentMethod]' = None, max_retries: 'int' = 1, _reproducibility: 'int' = 1, device: 'str' = 'cpu', batch_size: 'int' = 10)¶
- class ragas.metrics.LabelledRubricsScore(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'labelled_rubrics_score', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qcg: 7>, rubrics: 't.Dict[str, str]' = <factory>, scoring_prompt: 'Prompt' = <factory>, max_retries: 'int' = 1)¶
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
- class ragas.metrics.NoiseSensitivity(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'noise_sensitivity', focus: 'str' = 'relevant', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qga: 6>, nli_statements_message: 'Prompt' = <factory>, statement_prompt: 'Prompt' = <factory>, sentence_segmenter: 't.Optional[HasSegmentMethod]' = None, max_retries: 'int' = 1, _reproducibility: 'int' = 1)¶
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.
- save(cache_dir: str | None = None) → None¶
Save the metric to a path.
- class ragas.metrics.ReferenceFreeRubricsScore(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'reference_free_rubrics_score', evaluation_mode: 'EvaluationMode' = <EvaluationMode.qga: 6>, rubrics: 't.Dict[str, str]' = <factory>, scoring_prompt: 'Prompt' = <factory>, max_retries: 'int' = 1)¶
- class ragas.metrics.SummarizationScore(llm: 't.Optional[BaseRagasLLM]' = None, name: 'str' = 'summary_score', max_retries: 'int' = 1, length_penalty: 'bool' = True, coeff: 'float' = 0.5, evaluation_mode: 'EvaluationMode' = <EvaluationMode.ca: 8>, question_generation_prompt: 'Prompt' = <factory>, answer_generation_prompt: 'Prompt' = <factory>, extract_keyphrases_prompt: 'Prompt' = <factory>)¶
- adapt(language: str, cache_dir: str | None = None) → None¶
Adapt the metric to a different language.