Metrics
MetricType
Bases: Enum
Enumeration of metric types in Ragas.
Attributes:
| Name | Type | Description |
|---|---|---|
| SINGLE_TURN | str | Represents a single-turn metric type. |
| MULTI_TURN | str | Represents a multi-turn metric type. |
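As a minimal sketch of how such an enumeration can key a required-columns mapping, the snippet below defines a stand-in enum with the standard library. The string values, the example mapping, and the helper `missing_columns` are assumptions made for illustration and are not taken from the Ragas source.

```python
from enum import Enum
from typing import Dict, Set


class MetricType(Enum):
    """Stand-in for the Ragas enum; the string values are assumed."""

    SINGLE_TURN = "single_turn"
    MULTI_TURN = "multi_turn"


# Hypothetical required-columns mapping, keyed by metric type.
required_columns: Dict[MetricType, Set[str]] = {
    MetricType.SINGLE_TURN: {"user_input", "response"},
}


def missing_columns(metric_type: MetricType, available: Set[str]) -> Set[str]:
    """Return the required columns that the sample does not provide."""
    return required_columns.get(metric_type, set()) - available
```

Calling `missing_columns(MetricType.SINGLE_TURN, {"user_input"})` reports that `response` is still needed, which is the kind of check a metric can run before scoring.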
Metric
dataclass
Metric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: ABC
Abstract base class for metrics in Ragas.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| required_columns | Dict[str, Set[str]] | A dictionary mapping metric type names to sets of required column names. This is a property and raises |
score
Calculates the score for a single row of data.
Note
This method is deprecated and will be removed in 0.3. Please use single_turn_ascore or multi_turn_ascore instead.
Source code in src/ragas/metrics/base.py
ascore
async
Asynchronously calculates the score for a single row of data.
Note
This method is deprecated and will be removed in 0.3. Please use single_turn_ascore instead.
Source code in src/ragas/metrics/base.py
MetricWithLLM
dataclass
MetricWithLLM(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: Metric, PromptMixin
A metric class that uses a language model for evaluation.
Attributes:
| Name | Type | Description |
|---|---|---|
| llm | Optional[BaseRagasLLM] | The language model used for the metric. |
train
train(path: str, demonstration_config: Optional[DemonstrationConfig] = None, instruction_config: Optional[InstructionConfig] = None, callbacks: Optional[Callbacks] = None, run_config: Optional[RunConfig] = None, batch_size: Optional[int] = None, with_debugging_logs=False, raise_exceptions: bool = True) -> None
Train the metric using local JSON data
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to local JSON training data file | required |
| demonstration_config | DemonstrationConfig | Configuration for demonstration optimization | None |
| instruction_config | InstructionConfig | Configuration for instruction optimization | None |
| callbacks | Callbacks | List of callback functions | None |
| run_config | RunConfig | Run configuration | None |
| batch_size | int | Batch size for training | None |
| with_debugging_logs | bool | Enable debugging logs | False |
| raise_exceptions | bool | Whether to raise exceptions during training | True |
Raises:
| Type | Description |
|---|---|
| ValueError | If path is not provided or not a JSON file |
Source code in src/ragas/metrics/base.py
SingleTurnMetric
dataclass
SingleTurnMetric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: Metric
A metric class for evaluating single-turn interactions.
This class provides methods to score single-turn samples, both synchronously and asynchronously.
single_turn_score
single_turn_score(sample: SingleTurnSample, callbacks: Callbacks = None) -> float
Synchronously score a single-turn sample.
May raise ImportError if nest_asyncio is not installed in a Jupyter-like environment.
Source code in src/ragas/metrics/base.py
single_turn_ascore
async
single_turn_ascore(sample: SingleTurnSample, callbacks: Callbacks = None, timeout: Optional[float] = None) -> float
Asynchronously score a single-turn sample with an optional timeout.
May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
Source code in src/ragas/metrics/base.py
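The timeout behaviour described for single_turn_ascore can be illustrated with the standard library alone. This is a sketch under the assumption that the timeout is enforced in the style of `asyncio.wait_for`; `slow_score` and `score_with_timeout` are hypothetical names and do not call Ragas.

```python
import asyncio
from typing import Optional


async def slow_score() -> float:
    """Hypothetical scorer that takes 0.2 s, standing in for an LLM call."""
    await asyncio.sleep(0.2)
    return 0.8


async def score_with_timeout(timeout: Optional[float] = None) -> Optional[float]:
    """Await the scorer and return None if the timeout elapses first."""
    try:
        return await asyncio.wait_for(slow_score(), timeout=timeout)
    except asyncio.TimeoutError:
        return None


# No timeout: the score is returned. Very short timeout: the call is cancelled.
completed = asyncio.run(score_with_timeout(timeout=None))
timed_out = asyncio.run(score_with_timeout(timeout=0.01))
```

Catching `asyncio.TimeoutError` at the call site, as done here, lets a batch evaluation record a missing score instead of aborting.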
MultiTurnMetric
dataclass
MultiTurnMetric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: Metric
A metric class for evaluating multi-turn conversations.
This class extends the base Metric class to provide functionality for scoring multi-turn conversation samples.
multi_turn_score
multi_turn_score(sample: MultiTurnSample, callbacks: Callbacks = None) -> float
Score a multi-turn conversation sample synchronously.
May raise ImportError if nest_asyncio is not installed in Jupyter-like environments.
Source code in src/ragas/metrics/base.py
multi_turn_ascore
async
multi_turn_ascore(sample: MultiTurnSample, callbacks: Callbacks = None, timeout: Optional[float] = None) -> float
Score a multi-turn conversation sample asynchronously.
May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
Source code in src/ragas/metrics/base.py
Ensember
Combines multiple LLM outputs for the same input (n > 1) into a single output.
from_discrete
Simple majority voting for binary values, e.g. [0, 0, 1] -> 0.
inputs: a list of lists of dicts, each containing the verdict for a single input.
Source code in src/ragas/metrics/base.py
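The majority-voting idea can be sketched on a flat list of verdicts. The helper `majority_vote` is hypothetical and simplified: it ignores the nested list-of-dicts input structure mentioned above and only shows the voting step.

```python
from collections import Counter
from typing import List


def majority_vote(verdicts: List[int]) -> int:
    """Return the most common verdict; ties resolve to the value seen first."""
    return Counter(verdicts).most_common(1)[0][0]
```

Using an odd number of samples avoids ties for binary verdicts, so the tie-breaking rule only matters when n is even.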
AnswerCorrectness
dataclass
AnswerCorrectness(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'reference'}})(), name: str = 'answer_correctness', embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding]] = None, llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None, correctness_prompt: PydanticPrompt = CorrectnessClassifier(), statement_generator_prompt: PydanticPrompt = StatementGeneratorPrompt(), weights: list[float] = (lambda: [0.75, 0.25])(), beta: float = 1.0, answer_similarity: Optional[AnswerSimilarity] = None, max_retries: int = 1)
Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric
Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | string | The name of the metric. |
| weights | list[float] | A list of two weights corresponding to factuality and semantic similarity. Defaults to [0.75, 0.25]. |
| answer_similarity | Optional[AnswerSimilarity] | The AnswerSimilarity object. |
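One way to read the weights attribute is as a weighted average of a factuality score and a semantic similarity score. The sketch below assumes that interpretation; `combine_scores` is a hypothetical helper, not the library's implementation.

```python
from typing import Sequence


def combine_scores(
    factuality: float,
    similarity: float,
    weights: Sequence[float] = (0.75, 0.25),
) -> float:
    """Weighted average of a factuality score and a similarity score."""
    if len(weights) != 2 or sum(weights) == 0:
        raise ValueError("weights must hold two values that do not sum to zero")
    return (weights[0] * factuality + weights[1] * similarity) / sum(weights)
```

With the default weights, a factuality score of 0.8 and a similarity score of 0.4 combine to 0.7, so factual agreement dominates the result.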
ResponseRelevancy
dataclass
ResponseRelevancy(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response'}})(), name: str = 'answer_relevancy', embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding]] = None, llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None, question_generation: PydanticPrompt = ResponseRelevancePrompt(), strictness: int = 3)
Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric
Scores the relevancy of the answer with respect to the given question. Answers with incomplete, redundant, or unnecessary information are penalized. The score ranges from 0 to 1, with 1 being the best.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | string | The name of the metric. |
| strictness | int | The number of questions generated per answer. The ideal range is 3 to 5. |
| embeddings | Embedding | The LangChain wrapper of the Embedding object, e.g. HuggingFaceEmbeddings('BAAI/bge-base-en'). |
SemanticSimilarity
dataclass
SemanticSimilarity(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'reference', 'response'}})(), name: str = 'semantic_similarity', embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding]] = None, is_cross_encoder: bool = False, threshold: Optional[float] = None)
Bases: MetricWithEmbeddings, SingleTurnMetric
Scores the semantic similarity of the ground truth with the generated answer. A cross-encoder score is used to quantify semantic similarity. SAS paper: https://arxiv.org/pdf/2108.06130.pdf
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | |
| model_name | | The model to be used for calculating semantic similarity. Defaults to open-ai-embeddings. Select a cross-encoder model for best results: https://huggingface.co/spaces/mteb/leaderboard |
| threshold | Optional[float] | The threshold, if given, used to map the output to binary. Default 0.5. |
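The effect of the threshold attribute can be sketched with plain cosine similarity over two embedding vectors. This assumes the similarity is a cosine score between embeddings and that values at or above the threshold map to 1.0; both helpers are hypothetical and do not use any embedding model.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_score(
    a: Sequence[float],
    b: Sequence[float],
    threshold: Optional[float] = None,
) -> float:
    """Return the raw similarity, or a binary 0.0/1.0 when a threshold is set."""
    score = cosine_similarity(a, b)
    if threshold is None:
        return score
    return float(score >= threshold)
```

Leaving the threshold unset keeps the continuous score, which is usually more informative when comparing systems.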
AspectCritic
AspectCritic(name: str, definition: str, llm: Optional[BaseRagasLLM] = None, required_columns: Optional[Dict[MetricType, Set[str]]] = None, output_type: Optional[MetricOutputType] = BINARY, single_turn_prompt: Optional[PydanticPrompt] = None, multi_turn_prompt: Optional[PydanticPrompt] = None, strictness: int = 1, max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric, MultiTurnMetric
Judges the submission to give binary results using the criteria specified in the metric definition.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Name of the metric. |
| definition | str | Criteria to judge the submission, for example "Is the submission spreading fake information?" |
| strictness | int | The number of self-consistency checks made. The final judgement is made by majority vote. |
Source code in src/ragas/metrics/_aspect_critic.py
ContextEntityRecall
dataclass
ContextEntityRecall(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'reference', 'retrieved_contexts'}})(), name: str = 'context_entity_recall', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None, context_entity_recall_prompt: PydanticPrompt = ExtractEntitiesPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
Calculates recall based on entities present in ground truth and context. Let CN be the set of entities present in context, GN be the set of entities present in the ground truth.
Then we define the context entity recall as follows: Context Entity Recall = |CN ∩ GN| / |GN|
If this quantity is 1, we can say that the retrieval mechanism has retrieved context which covers all entities present in the ground truth, thus being a useful retrieval. Thus this can be used to evaluate retrieval mechanisms in specific use cases where entities matter, for example, a tourism help chatbot.
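The recall definition above translates directly into a set operation. In this sketch the entity sets are given by hand; in practice they would come from an entity-extraction step, and the function name and zero-entity convention are assumptions.

```python
from typing import Set


def context_entity_recall(context_entities: Set[str], reference_entities: Set[str]) -> float:
    """|CN ∩ GN| / |GN|, returning 0.0 when the reference has no entities."""
    if not reference_entities:
        return 0.0
    return len(context_entities & reference_entities) / len(reference_entities)


cn = {"Paris", "Eiffel Tower", "1889", "Seine"}
gn = {"Paris", "Eiffel Tower", "1889", "Gustave Eiffel"}
recall = context_entity_recall(cn, gn)
```

Here three of the four ground-truth entities appear in the context, so the recall is 0.75; extra context entities such as "Seine" do not affect the score.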
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | |
| batch_size | int | Batch size for OpenAI completion. |
LLMContextPrecisionWithReference
dataclass
LLMContextPrecisionWithReference(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'retrieved_contexts', 'reference'}})(), name: str = 'llm_context_precision_with_reference', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None, context_precision_prompt: PydanticPrompt = ContextPrecisionPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
Average Precision is a metric that evaluates whether all of the relevant items selected by the model are ranked higher or not.
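Average precision over a ranked list can be sketched as follows, assuming each retrieved context has already been given a binary relevance verdict (1 = relevant) in rank order. The function is a textbook formulation with a hypothetical name, not a copy of the library code.

```python
from typing import List


def average_precision(verdicts: List[int]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / relevant_total
```

The same two relevant items score 1.0 when ranked first and second, but less when an irrelevant item sits between them, which is what makes the metric rank-sensitive.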
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | |
| evaluation_mode | EvaluationMode | |
| context_precision_prompt | Prompt | |
LLMContextRecall
dataclass
LLMContextRecall(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'retrieved_contexts', 'reference'}})(), name: str = 'context_recall', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, context_recall_prompt: PydanticPrompt = ContextRecallClassificationPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
Estimates context recall by estimating TP and FN using annotated answer and retrieved context.
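Recall from true positives (TP) and false negatives (FN) can be sketched like this, assuming each statement in the annotated answer has been classified as attributable to the retrieved context (TP) or not (FN). The function name and input shape are hypothetical.

```python
from typing import List


def context_recall(attributed: List[bool]) -> float:
    """TP / (TP + FN) over per-statement attribution flags."""
    if not attributed:
        return 0.0
    true_positives = sum(attributed)
    false_negatives = len(attributed) - true_positives
    return true_positives / (true_positives + false_negatives)
```

For example, if three of four reference statements are supported by the context, the recall is 0.75.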
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | |
FactualCorrectness
dataclass
FactualCorrectness(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'response', 'reference'}})(), name: str = 'factual_correctness', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, mode: Literal['precision', 'recall', 'f1'] = 'f1', beta: float = 1.0, atomicity: Literal['low', 'high'] = 'low', coverage: Literal['low', 'high'] = 'low', claim_decomposition_prompt: PydanticPrompt = ClaimDecompositionPrompt(), nli_prompt: PydanticPrompt = NLIStatementPrompt(), language: str = 'english')
Bases: MetricWithLLM, SingleTurnMetric
FactualCorrectness is a metric class that evaluates the factual correctness of responses generated by a language model. It uses claim decomposition and natural language inference (NLI) to verify the claims made in the responses against reference texts.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. Default is "factual_correctness". |
| _required_columns | Dict[MetricType, Set[str]] | A dictionary specifying the required columns for each metric type. Default is {"SINGLE_TURN": {"response", "reference"}}. |
| mode | Literal["precision", "recall", "f1"] | The mode of evaluation: "precision", "recall", or "f1". Default is "f1". |
| beta | float | The beta value used for the F1 score calculation. A beta > 1 gives more weight to recall, while beta < 1 favors precision. Default is 1.0. |
| atomicity | Literal["low", "high"] | The level of atomicity for claim decomposition. Default is "low". |
| coverage | Literal["low", "high"] | The level of coverage for claim decomposition. Default is "low". |
| claim_decomposition_prompt | PydanticPrompt | The prompt used for claim decomposition. |
| nli_prompt | PydanticPrompt | The prompt used for natural language inference (NLI). |
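The role of beta can be made concrete with the standard F-beta formula. This is a generic sketch with a hypothetical function name; it assumes precision and recall have already been computed from the verified claims.

```python
def fbeta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta: (1 + beta^2) * P * R / (beta^2 * P + R), or 0.0 if undefined."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator
```

With precision 0.5 and recall 1.0, beta = 1 gives about 0.667 while beta = 2 gives about 0.833, showing how a larger beta rewards the higher recall.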
Faithfulness
dataclass
Faithfulness(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'retrieved_contexts'}})(), name: str = 'faithfulness', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, nli_statements_prompt: PydanticPrompt = NLIStatementPrompt(), statement_generator_prompt: PydanticPrompt = StatementGeneratorPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
FaithfulnesswithHHEM
dataclass
FaithfulnesswithHHEM(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'retrieved_contexts'}})(), name: str = 'faithfulness_with_hhem', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, nli_statements_prompt: PydanticPrompt = NLIStatementPrompt(), statement_generator_prompt: PydanticPrompt = StatementGeneratorPrompt(), max_retries: int = 1, device: str = 'cpu', batch_size: int = 10)
Bases: Faithfulness
NoiseSensitivity
dataclass
NoiseSensitivity(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'reference', 'retrieved_contexts'}})(), name: str = 'noise_sensitivity', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, mode: Literal['relevant', 'irrelevant'] = 'relevant', nli_statements_prompt: PydanticPrompt = NLIStatementPrompt(), statement_generator_prompt: PydanticPrompt = StatementGeneratorPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
AnswerAccuracy
dataclass
AnswerAccuracy(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'reference'}})(), name: str = 'nv_accuracy', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: MetricWithLLM, SingleTurnMetric
Measures answer accuracy compared to ground truth given a user_input. This metric averages two distinct judge prompts to evaluate.
Top 10, zero-shot LLM-as-a-Judge leaderboard:
1) mistralai/mixtral-8x22b-instruct-v0.1
2) mistralai/mixtral-8x7b-instruct-v0.1
3) meta/llama-3.1-70b-instruct
4) meta/llama-3.3-70b-instruct
5) meta/llama-3.1-405b-instruct
6) mistralai/mistral-nemo-12b-instruct
7) nvidia/llama-3.1-nemotron-70b-instruct
8) meta/llama-3.1-8b-instruct
9) google/gemma-2-2b-it
10) nvidia/nemotron-mini-4b-instruct
The top-ranked leaderboard model has a high correlation with human judges (~0.90).
Attributes:
| Name | Type | Description |
|---|---|---|
| name | string | The name of the metric. |
| answer_accuracy | | The AnswerAccuracy object. |
ContextRelevance
dataclass
ContextRelevance(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'retrieved_contexts'}})(), name: str = 'nv_context_relevance', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: MetricWithLLM, SingleTurnMetric
Scores the relevance of the retrieved contexts based on the user input.
Input:
data: list of dicts with keys user_input, retrieved_contexts
Output:
0.0: retrieved_contexts is not relevant for the user_input
0.5: retrieved_contexts is partially relevant for the user_input
1.0: retrieved_contexts is fully relevant for the user_input
ResponseGroundedness
dataclass
ResponseGroundedness(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'response', 'retrieved_contexts'}})(), name: str = 'nv_response_groundedness', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: MetricWithLLM, SingleTurnMetric
Scores the groundedness of the response based on the retrieved contexts.
Input:
data: list of dicts with keys response, retrieved_contexts
Output:
0.0: response is not grounded in the retrieved contexts
0.5: response is partially grounded in the retrieved contexts
1.0: response is fully grounded in the retrieved contexts
SimpleCriteriaScore
SimpleCriteriaScore(name: str, definition: str, llm: Optional[BaseRagasLLM] = None, required_columns: Optional[Dict[MetricType, Set[str]]] = None, output_type: Optional[MetricOutputType] = DISCRETE, single_turn_prompt: Optional[PydanticPrompt] = None, multi_turn_prompt: Optional[PydanticPrompt] = None, strictness: int = 1)
Bases: MetricWithLLM, SingleTurnMetric, MultiTurnMetric
Judges the submission and assigns it a score using the criteria specified in the metric definition.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Name of the metric. |
| definition | str | Criteria to score the submission. |
| strictness | int | The number of self-consistency checks made. The final judgement is made by majority vote. |
Source code in src/ragas/metrics/_simple_criteria.py
DiscreteMetric
dataclass
DiscreteMetric(name: str, prompt: Optional[Union[str, Prompt]] = None, allowed_values: List[str] = (lambda: ['pass', 'fail'])())
Bases: LLMMetric
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/discrete.py
BaseLLMMetric
dataclass
Bases: ABC
Base class for simple LLM-based metrics that return MetricResult objects.
LLMMetric
dataclass
Bases: BaseLLMMetric
LLM-based metric that uses prompts to generate structured responses.
get_correlation
abstractmethod
Calculate the correlation between gold scores and predicted scores. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/llm_based.py
align_and_validate
align_and_validate(dataset: Dataset, embedding_model: Union[BaseRagasEmbeddings, BaseRagasEmbedding], llm: InstructorBaseRagasLLM, test_size: float = 0.2, random_state: int = 42, **kwargs: Dict[str, Any])
Align the metric with the specified experiments and validate it against a gold standard experiment. This method combines alignment and validation into a single step.
Args:
dataset: The experiment to align the metric with.
embedding_model: The embedding model used for dynamic few-shot prompting.
llm: The LLM instance to use for scoring.
Source code in src/ragas/metrics/llm_based.py
align
align(train_dataset: Dataset, embedding_model: Union[BaseRagasEmbeddings, BaseRagasEmbedding], **kwargs: Dict[str, Any])
Align the metric with the specified experiments using different optimization methods.
Args:
train_dataset: The training dataset to align the metric with.
embedding_model: The embedding model used for dynamic few-shot prompting.
Source code in src/ragas/metrics/llm_based.py
validate_alignment
validate_alignment(llm: InstructorBaseRagasLLM, test_dataset: Dataset, mapping: Dict[str, str] = {})
Validate the alignment of the metric by comparing the scores against a gold standard experiment. This method computes the Cohen's Kappa score and agreement rate between the gold standard scores and the predicted scores from the metric.
Args:
llm: The LLM instance to use for scoring.
test_dataset: A Dataset instance containing the gold standard scores.
mapping: A dictionary mapping variable names expected by metrics to their corresponding names in the gold experiment.
Source code in src/ragas/metrics/llm_based.py
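The two statistics named for validate_alignment, agreement rate and Cohen's Kappa, can be computed from paired labels as in the sketch below. The function name, input shape, and the convention of returning a kappa of 1.0 when chance agreement is already perfect are assumptions for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(gold: Sequence[str], predicted: Sequence[str]) -> Tuple[float, float]:
    """Return (agreement rate, Cohen's kappa) for two equal-length label lists."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    n = len(gold)
    # Observed agreement: share of positions where the labels match.
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    expected = sum(gold_counts[label] * pred_counts[label] for label in gold_counts) / (n * n)
    if expected == 1.0:
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


rate, kappa = agreement_and_kappa(
    ["pass", "pass", "fail", "fail"],
    ["pass", "fail", "fail", "fail"],
)
```

In this example the labels agree on three of four items (rate 0.75), but half of that agreement is expected by chance, so kappa is 0.5.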
NumericMetric
dataclass
NumericMetric(name: str, prompt: Optional[Union[str, Prompt]] = None, allowed_values: Union[Tuple[float, float], range] = (0.0, 1.0))
Bases: LLMMetric
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/numeric.py
RankingMetric
dataclass
Bases: LLMMetric
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/ranking.py
MetricResult
Class to hold the result of a metric evaluation.
This class behaves like its underlying result value but still provides access to additional metadata like reasoning.
Works with:
- DiscreteMetrics (string results)
- NumericMetrics (float/int results)
- RankingMetrics (list results)
Source code in src/ragas/metrics/result.py
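The idea of a result object that behaves like its underlying value while carrying metadata can be sketched with a small wrapper. `ResultSketch` and its `value` and `reason` attributes are hypothetical names chosen for this illustration; the real class may expose a different interface.

```python
from typing import Any, Optional


class ResultSketch:
    """Wraps a score and an optional explanation; compares like the score."""

    def __init__(self, value: Any, reason: Optional[str] = None) -> None:
        self.value = value
        self.reason = reason

    def __eq__(self, other: object) -> bool:
        # Compare against another wrapper's value or against a raw value.
        other_value = other.value if isinstance(other, ResultSketch) else other
        return self.value == other_value

    def __float__(self) -> float:
        return float(self.value)

    def __str__(self) -> str:
        return str(self.value)


numeric = ResultSketch(0.9, reason="Most claims are supported by the context.")
discrete = ResultSketch("pass")
```

A caller can then write `numeric == 0.9` or `float(numeric)` for computation and still read `numeric.reason` when an explanation is needed.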
to_dict
validate
classmethod
Provide compatibility with older Pydantic versions.