Metrics
MetricType
Bases: Enum
Enumeration of metric types in Ragas.
Attributes:
| Name | Type | Description |
|---|---|---|
| SINGLE_TURN | str | Represents a single-turn metric type. |
| MULTI_TURN | str | Represents a multi-turn metric type. |
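As a rough illustration of how such an enumeration can key a required-columns mapping, consider the sketch below. The string values and the column names are assumptions made for this example and are not taken from the Ragas source.

```python
from enum import Enum
from typing import Dict, Set


class MetricType(Enum):
    # String values are assumed for illustration only.
    SINGLE_TURN = "single_turn"
    MULTI_TURN = "multi_turn"


# Hypothetical mapping from metric type to the columns a metric needs.
required_columns: Dict[MetricType, Set[str]] = {
    MetricType.SINGLE_TURN: {"user_input", "response"},
}

# A metric that only declares single-turn columns does not support multi-turn.
supports_multi_turn = MetricType.MULTI_TURN in required_columns
```

Keying the mapping by enum member rather than by raw string lets a typo fail at attribute lookup instead of silently missing a key.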
Metric
dataclass
Metric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: ABC
Abstract base class for metrics in Ragas.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| required_columns | Dict[str, Set[str]] | A dictionary mapping metric type names to sets of required column names. This is a property and raises |
MetricWithLLM
dataclass
MetricWithLLM(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: Metric, PromptMixin
A metric class that uses a language model for evaluation.
Attributes:
| Name | Type | Description |
|---|---|---|
| llm | Optional[BaseRagasLLM] | The language model used for the metric. Both BaseRagasLLM and InstructorBaseRagasLLM are accepted at runtime via duck typing (both have compatible methods). |
init
init(run_config: RunConfig) -> None
Initialize the metric with run configuration and validate LLM is present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_config | RunConfig | Configuration for the metric run. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If no LLM is provided to the metric. |
Source code in src/ragas/metrics/base.py
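The validation described for init can be pictured with a minimal sketch. The class below is a hypothetical stand-in, not the Ragas implementation; the dictionary passed as run_config is likewise a placeholder for a real RunConfig.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchMetricWithLLM:
    """Hypothetical stand-in for a metric that needs an LLM."""

    name: str = ""
    llm: Optional[Any] = None

    def init(self, run_config: Any) -> None:
        # Fail early when no LLM has been attached to the metric.
        if self.llm is None:
            raise ValueError(f"Metric '{self.name}' has no LLM provided.")
        self.run_config = run_config


def try_init(metric: SketchMetricWithLLM) -> str:
    """Run init and report whether validation passed."""
    try:
        metric.init(run_config={"timeout": 60})
    except ValueError as exc:
        return f"error: {exc}"
    return "ok"
```

Checking for the LLM at initialization surfaces a misconfigured metric before any sample is scored.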
train
train(path: str, demonstration_config: Optional[DemonstrationConfig] = None, instruction_config: Optional[InstructionConfig] = None, callbacks: Optional[Callbacks] = None, run_config: Optional[RunConfig] = None, batch_size: Optional[int] = None, with_debugging_logs=False, raise_exceptions: bool = True) -> None
Train the metric using local JSON data
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to local JSON training data file | required |
| demonstration_config | DemonstrationConfig | Configuration for demonstration optimization | None |
| instruction_config | InstructionConfig | Configuration for instruction optimization | None |
| callbacks | Callbacks | List of callback functions | None |
| run_config | RunConfig | Run configuration | None |
| batch_size | int | Batch size for training | None |
| with_debugging_logs | bool | Enable debugging logs | False |
| raise_exceptions | bool | Whether to raise exceptions during training | True |
Raises:
| Type | Description |
|---|---|
| ValueError | If path is not provided or not a JSON file |
Source code in src/ragas/metrics/base.py
SingleTurnMetric
dataclass
SingleTurnMetric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: Metric
A metric class for evaluating single-turn interactions.
This class provides methods to score single-turn samples, both synchronously and asynchronously.
single_turn_score
single_turn_score(sample: SingleTurnSample, callbacks: Callbacks = None) -> float
Synchronously score a single-turn sample.
May raise ImportError if nest_asyncio is not installed in a Jupyter-like environment.
Source code in src/ragas/metrics/base.py
single_turn_ascore
async
single_turn_ascore(sample: SingleTurnSample, callbacks: Callbacks = None, timeout: Optional[float] = None) -> float
Asynchronously score a single-turn sample with an optional timeout.
May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
Source code in src/ragas/metrics/base.py
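The timeout behaviour can be illustrated with the standard library alone. This is a sketch under the assumption that the timeout is enforced with something like asyncio.wait_for; slow_score is a hypothetical stand-in for an LLM-backed scoring call.

```python
import asyncio
from typing import Optional


async def slow_score(delay: float) -> float:
    # Stand-in for an LLM-backed scoring call.
    await asyncio.sleep(delay)
    return 1.0


async def score_with_timeout(delay: float, timeout: Optional[float] = None) -> float:
    # asyncio.wait_for raises asyncio.TimeoutError when the deadline passes;
    # timeout=None waits indefinitely.
    return await asyncio.wait_for(slow_score(delay), timeout=timeout)


def run_example() -> str:
    """Show how a caller might handle the timeout."""
    try:
        asyncio.run(score_with_timeout(delay=0.2, timeout=0.01))
    except asyncio.TimeoutError:
        return "timed out"
    return "finished"
```

Callers that score many samples may prefer to catch the timeout per sample so that one slow call does not abort the whole run.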
MultiTurnMetric
dataclass
MultiTurnMetric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: Metric
A metric class for evaluating multi-turn conversations.
This class extends the base Metric class to provide functionality for scoring multi-turn conversation samples.
multi_turn_score
multi_turn_score(sample: MultiTurnSample, callbacks: Callbacks = None) -> float
Score a multi-turn conversation sample synchronously.
May raise ImportError if nest_asyncio is not installed in Jupyter-like environments.
Source code in src/ragas/metrics/base.py
multi_turn_ascore
async
multi_turn_ascore(sample: MultiTurnSample, callbacks: Callbacks = None, timeout: Optional[float] = None) -> float
Score a multi-turn conversation sample asynchronously.
May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
Source code in src/ragas/metrics/base.py
Ensember
Combine multiple LLM outputs for the same input (n > 1) into a single output.
from_discrete
Simple majority voting for binary values, i.e. [0, 0, 1] -> 0.
inputs: a list of lists of dicts, each containing the verdict for a single input.
Source code in src/ragas/metrics/base.py
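A minimal sketch of majority voting over repeated outputs follows. The layout (outer list = one entry per LLM run, inner list = one dict per item) and the attribute parameter are assumptions for illustration; the actual method may differ.

```python
from collections import Counter
from typing import Any, Dict, List


def majority_vote(inputs: List[List[Dict[str, Any]]], attribute: str = "verdict") -> List[Any]:
    """Return the most common `attribute` value at each position across runs."""
    if not inputs:
        return []
    # Only compare positions that are present in every run.
    length = min(len(run) for run in inputs)
    results = []
    for i in range(length):
        votes = Counter(run[i][attribute] for run in inputs)
        results.append(votes.most_common(1)[0][0])
    return results


# Three runs, each judging the same two items.
runs = [
    [{"verdict": 0}, {"verdict": 1}],
    [{"verdict": 0}, {"verdict": 1}],
    [{"verdict": 1}, {"verdict": 0}],
]
voted = majority_vote(runs)
```

Using an odd number of runs avoids ties for binary verdicts.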
SimpleBaseMetric
dataclass
Bases: ABC
Base class for simple metrics that return MetricResult objects.
This class provides the foundation for metrics that evaluate inputs and return structured MetricResult objects containing scores and reasoning.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| allowed_values | AllowedValuesType | Allowed values for the metric output. Can be a list of strings for discrete metrics, a tuple of floats for numeric metrics, or an integer for ranking metrics. |
Examples:
>>> from ragas.metrics import discrete_metric
>>>
>>> @discrete_metric(name="sentiment", allowed_values=["positive", "negative"])
... def sentiment_metric(user_input: str, response: str) -> str:
...     return "positive" if "good" in response else "negative"
>>>
>>> result = sentiment_metric(user_input="How are you?", response="I'm good!")
>>> print(result.value)  # "positive"
score
abstractmethod
Synchronously calculate the metric score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | dict | Input parameters required by the specific metric implementation. | {} |
Returns:
| Type | Description |
|---|---|
| MetricResult | The evaluation result containing the score and reasoning. |
Source code in src/ragas/metrics/base.py
ascore
abstractmethod
async
Asynchronously calculate the metric score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | dict | Input parameters required by the specific metric implementation. | {} |
Returns:
| Type | Description |
|---|---|
| MetricResult | The evaluation result containing the score and reasoning. |
Source code in src/ragas/metrics/base.py
batch_score
Synchronously calculate scores for a batch of inputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Dict[str, Any]] | List of input dictionaries, each containing parameters for the metric. | required |
Returns:
| Type | Description |
|---|---|
| List[MetricResult] | List of evaluation results, one for each input. |
Source code in src/ragas/metrics/base.py
abatch_score
async
Asynchronously calculate scores for a batch of inputs in parallel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Dict[str, Any]] | List of input dictionaries, each containing parameters for the metric. | required |
Returns:
| Type | Description |
|---|---|
| List[MetricResult] | List of evaluation results, one for each input. |
Source code in src/ragas/metrics/base.py
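Parallel batch scoring of this kind is commonly built on asyncio.gather. The sketch below assumes that approach; the ascore function here is a hypothetical word-overlap scorer, not a Ragas metric.

```python
import asyncio
from typing import Any, Dict, List


async def ascore(**kwargs: Any) -> float:
    # Hypothetical scorer: fraction of expected words found in the response.
    await asyncio.sleep(0)  # yield control, as a real I/O-bound call would
    expected = kwargs["expected"].lower().split()
    response = kwargs["response"].lower()
    return sum(word in response for word in expected) / len(expected)


async def abatch_score(inputs: List[Dict[str, Any]]) -> List[float]:
    # gather runs all coroutines concurrently and preserves input order.
    return await asyncio.gather(*(ascore(**item) for item in inputs))


batch = [
    {"expected": "paris france", "response": "Paris is in France"},
    {"expected": "berlin germany", "response": "It is Berlin"},
]
scores = asyncio.run(abatch_score(batch))
```

Because gather preserves order, each result can be matched back to its input by index.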
SimpleLLMMetric
dataclass
SimpleLLMMetric(name: str, allowed_values: AllowedValuesType = (lambda: ['pass', 'fail'])(), prompt: Optional[Union[str, 'Prompt']] = None)
Bases: SimpleBaseMetric
LLM-based metric that uses prompts to generate structured responses.
save
Save the metric configuration to a JSON file.
Parameters:
path : str, optional
    File path to save to. If not provided, saves to "./{metric.name}.json".
    Use .gz extension for compression.
Note:
If the metric has a response_model, its schema will be saved for reference but the model itself cannot be serialized. You'll need to provide it when loading.
Examples:
All these work:
metric.save()                      # → ./response_quality.json
metric.save("custom.json")         # → ./custom.json
metric.save("/path/to/metrics/")   # → /path/to/metrics/response_quality.json
metric.save("no_extension")        # → ./no_extension.json
metric.save("compressed.json.gz")  # → ./compressed.json.gz (compressed)
Source code in src/ragas/metrics/base.py
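The path rules shown in the save examples can be sketched as a small helper. resolve_save_path is a hypothetical function written for this illustration; it treats a trailing slash as a directory, which is an assumption, and the real method may inspect the filesystem instead.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_save_path(metric_name: str, path: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the path rules in the save examples."""
    if path is None:
        # Default: a JSON file named after the metric, in the current directory.
        return f"{metric_name}.json"
    if path.endswith("/"):
        # A trailing slash is treated as a directory (assumption).
        return str(PurePosixPath(path) / f"{metric_name}.json")
    if path.endswith(".json") or path.endswith(".json.gz"):
        return path
    # No recognised extension: append ".json".
    return f"{path}.json"
```

The returned paths are relative where the examples show a leading "./"; both forms refer to the same file.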
load
classmethod
load(path: str, response_model: Optional[Type['BaseModel']] = None, embedding_model: Optional['EmbeddingModelType'] = None) -> 'SimpleLLMMetric'
Load a metric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
response_model : Optional[Type[BaseModel]]
    Pydantic model to use for response validation. Required for custom SimpleLLMMetrics.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
SimpleLLMMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded, is invalid, or missing required models
Source code in src/ragas/metrics/base.py
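Reading a JSON file that may be gzip-compressed, and reporting failures as ValueError, can be sketched with the standard library. The helper names and the keys in the sample payload are hypothetical and do not reflect the actual file schema.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Dict


def write_json(path: str, data: Dict[str, Any]) -> None:
    # Pick the opener from the extension so ".gz" files are compressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as fh:
        json.dump(data, fh)


def read_json(path: str) -> Dict[str, Any]:
    opener = gzip.open if path.endswith(".gz") else open
    try:
        with opener(path, "rt", encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        # Surface read and parse failures as ValueError.
        raise ValueError(f"Cannot load metric from {path}: {exc}") from exc


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "metric.json.gz")
    write_json(target, {"name": "quality", "allowed_values": ["pass", "fail"]})
    loaded = read_json(target)
```

Wrapping the lower-level exceptions gives callers a single error type to handle.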
get_correlation
abstractmethod
Calculate the correlation between gold scores and predicted scores. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/base.py
align_and_validate
align_and_validate(dataset: 'Dataset', embedding_model: 'EmbeddingModelType', llm: 'BaseRagasLLM', test_size: float = 0.2, random_state: int = 42, **kwargs: Dict[str, Any])
Align the metric with the specified experiments and validate it against a gold standard experiment. This method combines alignment and validation into a single step.
Args:
    dataset: Experiment to align the metric with.
    embedding_model: The embedding model used for dynamic few-shot prompting.
    llm: The LLM instance to use for scoring.
Source code in src/ragas/metrics/base.py
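The test_size and random_state parameters in the signature suggest a seeded hold-out split before alignment. The sketch below shows one way such a split could work; split_dataset is a hypothetical helper and the exact splitting logic used by the method is not confirmed here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    rows: Sequence[T], test_size: float = 0.2, random_state: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically, then hold out `test_size` of the rows."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(random_state).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_dataset(list(range(10)))
```

With the same random_state, repeated calls yield the same split, which keeps validation scores comparable between runs.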
align
Align the metric with the specified experiments by different optimization methods.
Args:
    train_dataset: The training dataset to align the metric with.
    embedding_model: The embedding model used for dynamic few-shot prompting.
Source code in src/ragas/metrics/base.py
validate_alignment
Validate the alignment of the metric by comparing the scores against a gold standard experiment. This method computes the Cohen's Kappa score and agreement rate between the gold standard scores and the predicted scores from the metric.
Args:
    llm: The LLM instance to use for scoring.
    test_dataset: A Dataset instance containing the gold standard scores.
    mapping: A dictionary mapping variable names expected by metrics to their corresponding names in the gold experiment.
Source code in src/ragas/metrics/base.py
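For reference, the agreement rate and Cohen's Kappa can be computed as in the following standard-library sketch. It applies the textbook formula kappa = (p_o - p_e) / (1 - p_e); the function name is hypothetical and the method itself may rely on a library routine.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def agreement_and_kappa(
    gold: Sequence[Hashable], predicted: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (agreement rate, Cohen's kappa) for two label sequences."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    n = len(gold)
    # Observed agreement: fraction of items where both labels match.
    p_o = sum(g == p for g, p in zip(gold, predicted)) / n
    # Expected chance agreement from each rater's label frequencies.
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    p_e = sum(gold_counts[label] * pred_counts[label] for label in gold_counts) / (n * n)
    if p_e == 1.0:
        return p_o, 1.0  # both raters used one identical label throughout
    return p_o, (p_o - p_e) / (1.0 - p_e)


gold = ["pass", "pass", "fail", "fail"]
pred = ["pass", "fail", "fail", "fail"]
rate, kappa = agreement_and_kappa(gold, pred)
```

Here three of four labels match, so the agreement rate is 0.75, while kappa is 0.5 because half of that agreement would be expected by chance.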
create_auto_response_model
Create a response model and mark it as auto-generated by Ragas.
This function creates a Pydantic model using create_model and marks it with a special attribute to indicate it was auto-generated. This allows the save() method to distinguish between auto-generated models (which are recreated on load) and custom user models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name for the model class | required |
| **fields | | Field definitions in create_model format. Each field is specified as: field_name=(type, default_or_field_info) | {} |
Returns:
| Type | Description |
|---|---|
| Type[BaseModel] | Pydantic model class marked as auto-generated |
Examples:
>>> from pydantic import Field
>>> # Simple model with required fields
>>> ResponseModel = create_auto_response_model(
...     "ResponseModel",
...     value=(str, ...),
...     reason=(str, ...)
... )
>>>
>>> # Model with Field validators and descriptions
>>> ResponseModel = create_auto_response_model(
...     "ResponseModel",
...     value=(str, Field(..., description="The predicted value")),
...     reason=(str, Field(..., description="Reasoning for the prediction"))
... )
Source code in src/ragas/metrics/base.py
BaseMetric
dataclass
Bases: ABC
Base class for simple metrics that return MetricResult objects.
This class provides the foundation for metrics that evaluate inputs and return structured MetricResult objects containing scores and reasoning.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| allowed_values | AllowedValuesType | Allowed values for the metric output. Can be a list of strings for discrete metrics, a tuple of floats for numeric metrics, or an integer for ranking metrics. |
Examples:
>>> from ragas.metrics import discrete_metric
>>>
>>> @discrete_metric(name="sentiment", allowed_values=["positive", "negative"])
... def sentiment_metric(user_input: str, response: str) -> str:
...     return "positive" if "good" in response else "negative"
>>>
>>> result = sentiment_metric(user_input="How are you?", response="I'm good!")
>>> print(result.value)  # "positive"
score
abstractmethod
Synchronously calculate the metric score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | dict | Input parameters required by the specific metric implementation. | {} |
Returns:
| Type | Description |
|---|---|
| MetricResult | The evaluation result containing the score and reasoning. |
Source code in src/ragas/metrics/base.py
ascore
abstractmethod
async
Asynchronously calculate the metric score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | dict | Input parameters required by the specific metric implementation. | {} |
Returns:
| Type | Description |
|---|---|
| MetricResult | The evaluation result containing the score and reasoning. |
Source code in src/ragas/metrics/base.py
batch_score
Synchronously calculate scores for a batch of inputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Dict[str, Any]] | List of input dictionaries, each containing parameters for the metric. | required |
Returns:
| Type | Description |
|---|---|
| List[MetricResult] | List of evaluation results, one for each input. |
Source code in src/ragas/metrics/base.py
abatch_score
async
Asynchronously calculate scores for a batch of inputs in parallel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Dict[str, Any]] | List of input dictionaries, each containing parameters for the metric. | required |
Returns:
| Type | Description |
|---|---|
| List[MetricResult] | List of evaluation results, one for each input. |
Source code in src/ragas/metrics/base.py
LLMMetric
dataclass
LLMMetric(name: str, allowed_values: AllowedValuesType = (lambda: ['pass', 'fail'])(), prompt: Optional[Union[str, 'Prompt']] = None)
Bases: SimpleBaseMetric
LLM-based metric that uses prompts to generate structured responses.
save
Save the metric configuration to a JSON file.
Parameters:
path : str, optional
    File path to save to. If not provided, saves to "./{metric.name}.json".
    Use .gz extension for compression.
Note:
If the metric has a response_model, its schema will be saved for reference but the model itself cannot be serialized. You'll need to provide it when loading.
Examples:
All these work:
metric.save()                      # → ./response_quality.json
metric.save("custom.json")         # → ./custom.json
metric.save("/path/to/metrics/")   # → /path/to/metrics/response_quality.json
metric.save("no_extension")        # → ./no_extension.json
metric.save("compressed.json.gz")  # → ./compressed.json.gz (compressed)
Source code in src/ragas/metrics/base.py
load
classmethod
load(path: str, response_model: Optional[Type['BaseModel']] = None, embedding_model: Optional['EmbeddingModelType'] = None) -> 'SimpleLLMMetric'
Load a metric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
response_model : Optional[Type[BaseModel]]
    Pydantic model to use for response validation. Required for custom SimpleLLMMetrics.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
SimpleLLMMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded, is invalid, or missing required models
Source code in src/ragas/metrics/base.py
get_correlation
abstractmethod
Calculate the correlation between gold scores and predicted scores. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/base.py
align_and_validate
align_and_validate(dataset: 'Dataset', embedding_model: 'EmbeddingModelType', llm: 'BaseRagasLLM', test_size: float = 0.2, random_state: int = 42, **kwargs: Dict[str, Any])
Align the metric with the specified experiments and validate it against a gold standard experiment. This method combines alignment and validation into a single step.
Args:
    dataset: Experiment to align the metric with.
    embedding_model: The embedding model used for dynamic few-shot prompting.
    llm: The LLM instance to use for scoring.
Source code in src/ragas/metrics/base.py
align
Align the metric with the specified experiments by different optimization methods.
Args:
    train_dataset: The training dataset to align the metric with.
    embedding_model: The embedding model used for dynamic few-shot prompting.
Source code in src/ragas/metrics/base.py
validate_alignment
Validate the alignment of the metric by comparing the scores against a gold standard experiment. This method computes the Cohen's Kappa score and agreement rate between the gold standard scores and the predicted scores from the metric.
Args:
    llm: The LLM instance to use for scoring.
    test_dataset: A Dataset instance containing the gold standard scores.
    mapping: A dictionary mapping variable names expected by metrics to their corresponding names in the gold experiment.
Source code in src/ragas/metrics/base.py
DiscreteMetric
dataclass
DiscreteMetric(name: str, allowed_values: List[str] = (lambda: ['pass', 'fail'])(), prompt: Optional[Union[str, 'Prompt']] = None)
Bases: SimpleLLMMetric, DiscreteValidator
Metric for categorical/discrete evaluations with predefined allowed values.
This class is used for metrics that output categorical values like "pass/fail", "good/bad/excellent", or custom discrete categories. Uses the instructor library for structured LLM outputs.
Attributes:
| Name | Type | Description |
|---|---|---|
| allowed_values | List[str] | List of allowed categorical values the metric can output. Default is ["pass", "fail"]. |
| prompt | Optional[Union[str, Prompt]] | The prompt template for the metric. Should contain placeholders for evaluation inputs that will be formatted at runtime. |
Examples:
>>> from ragas.metrics import DiscreteMetric
>>> from ragas.llms import llm_factory
>>> from openai import OpenAI
>>>
>>> # Create an LLM instance
>>> client = OpenAI(api_key="your-api-key")
>>> llm = llm_factory("gpt-4o-mini", client=client)
>>>
>>> # Create a custom discrete metric
>>> metric = DiscreteMetric(
...     name="quality_check",
...     prompt="Check the quality of the response: {response}. Return 'excellent', 'good', or 'poor'.",
...     allowed_values=["excellent", "good", "poor"]
... )
>>>
>>> # Score with the metric
>>> result = metric.score(
...     llm=llm,
...     response="This is a great response!"
... )
>>> print(result.value)  # Output: "excellent" or similar
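Checking an output against allowed_values might look like the sketch below. validate_discrete is a hypothetical helper written for illustration; how DiscreteValidator actually reports invalid values is not shown in this reference.

```python
from typing import List, Optional


def validate_discrete(value: str, allowed_values: List[str]) -> Optional[str]:
    """Return an error message, or None when the value is allowed."""
    if value not in allowed_values:
        return f"Value '{value}' is not one of {allowed_values}"
    return None


allowed = ["excellent", "good", "poor"]
ok_error = validate_discrete("good", allowed)
bad_error = validate_discrete("average", allowed)
```

Returning a message instead of raising lets the caller decide whether to retry the LLM call or fail.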
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/discrete.py
load
classmethod
load(path: str, embedding_model: Optional[EmbeddingModelType] = None) -> DiscreteMetric
Load a DiscreteMetric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
DiscreteMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded or is not a DiscreteMetric
Source code in src/ragas/metrics/discrete.py
NumericMetric
dataclass
NumericMetric(name: str, allowed_values: Union[Tuple[float, float], range] = (0.0, 1.0), prompt: Optional[Union[str, 'Prompt']] = None)
Bases: SimpleLLMMetric, NumericValidator
Metric for continuous numeric evaluations within a specified range.
This class is used for metrics that output numeric scores within a defined range, such as 0.0 to 1.0 for similarity scores or 1-10 ratings. Uses the instructor library for structured LLM outputs.
Attributes:
| Name | Type | Description |
|---|---|---|
| allowed_values | Union[Tuple[float, float], range] | The valid range for metric outputs. Can be a tuple of (min, max) floats or a range object. Default is (0.0, 1.0). |
| llm | Optional[BaseRagasLLM] | The language model instance for evaluation. Can be created using llm_factory(). |
| prompt | Optional[Union[str, Prompt]] | The prompt template for the metric. Should contain placeholders for evaluation inputs that will be formatted at runtime. |
Examples:
>>> from ragas.metrics import NumericMetric
>>> from ragas.llms import llm_factory
>>> from openai import OpenAI
>>>
>>> # Create an LLM instance
>>> client = OpenAI(api_key="your-api-key")
>>> llm = llm_factory("gpt-4o-mini", client=client)
>>>
>>> # Create a custom numeric metric with 0-10 range
>>> metric = NumericMetric(
...     name="quality_score",
...     prompt="Rate the quality of this response on a scale of 0-10: {response}",
...     allowed_values=(0.0, 10.0)
... )
>>>
>>> # Score with the metric
>>> result = metric.score(
...     llm=llm,
...     response="This is a great response!"
... )
>>> print(result.value)  # Output: a float between 0.0 and 10.0
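A range check that accepts either form of allowed_values could be sketched as follows. validate_numeric is a hypothetical helper; it assumes inclusive bounds and, for range objects, a positive step of 1, neither of which is confirmed by this reference.

```python
from typing import Optional, Tuple, Union


def validate_numeric(
    value: float, allowed_values: Union[Tuple[float, float], range]
) -> Optional[str]:
    """Return an error message, or None when the value lies in the range."""
    if isinstance(allowed_values, range):
        # range(0, 11) covers 0..10, so the upper bound is stop - 1.
        low, high = allowed_values.start, allowed_values.stop - 1
    else:
        low, high = allowed_values
    if not low <= value <= high:
        return f"Value {value} is outside [{low}, {high}]"
    return None
```

For example, validate_numeric(7.5, (0.0, 10.0)) returns None, while validate_numeric(11, range(0, 11)) returns an error message.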
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/numeric.py
load
classmethod
load(path: str, embedding_model: Optional[EmbeddingModelType] = None) -> NumericMetric
Load a NumericMetric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
NumericMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded or is not a NumericMetric
Source code in src/ragas/metrics/numeric.py
RankingMetric
dataclass
Bases: SimpleLLMMetric, RankingValidator
Metric for evaluations that produce ranked lists of items.
This class is used for metrics that output ordered lists, such as ranking search results, prioritizing features, or ordering responses by relevance. Uses the instructor library for structured LLM outputs.
Attributes:
| Name | Type | Description |
|---|---|---|
| allowed_values | int | Expected number of items in the ranking list. Default is 2. |
| llm | Optional[BaseRagasLLM] | The language model instance for evaluation. Can be created using llm_factory(). |
| prompt | Optional[Union[str, Prompt]] | The prompt template for the metric. Should contain placeholders for evaluation inputs that will be formatted at runtime. |
Examples:
>>> from ragas.metrics import RankingMetric
>>> from ragas.llms import llm_factory
>>> from openai import OpenAI
>>>
>>> # Create an LLM instance
>>> client = OpenAI(api_key="your-api-key")
>>> llm = llm_factory("gpt-4o-mini", client=client)
>>>
>>> # Create a ranking metric that returns top 3 items
>>> metric = RankingMetric(
...     name="relevance_ranking",
...     prompt="Rank these results by relevance: {results}",
...     allowed_values=3
... )
>>>
>>> # Score with the metric
>>> result = metric.score(
...     llm=llm,
...     results="result1, result2, result3"
... )
>>> print(result.value)  # Output: a list of 3 ranked items
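Since allowed_values is the expected number of ranked items, a length check is the natural validation. The sketch below is a hypothetical helper for illustration and does not reproduce RankingValidator.

```python
from typing import Optional


def validate_ranking(value: object, expected_items: int) -> Optional[str]:
    """Return an error message, or None when the ranking has the expected size."""
    if not isinstance(value, list):
        return "Ranking must be a list"
    if len(value) != expected_items:
        return f"Expected {expected_items} items, got {len(value)}"
    return None


ranking_ok = validate_ranking(["result2", "result1", "result3"], 3)
ranking_short = validate_ranking(["result2", "result1"], 3)
```

A stricter variant could also check that the ranked items are unique and drawn from the inputs.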
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/ranking.py
load
classmethod
load(path: str, embedding_model: Optional[EmbeddingModelType] = None) -> RankingMetric
Load a RankingMetric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
RankingMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded or is not a RankingMetric
Source code in src/ragas/metrics/ranking.py
MetricResult
Class to hold the result of a metric evaluation.
This class behaves like its underlying result value but still provides access to additional metadata like reasoning.
Works with:
- DiscreteMetrics (string results)
- NumericMetrics (float/int results)
- RankingMetrics (list results)
Source code in src/ragas/metrics/result.py
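The idea of a result that behaves like its underlying value while carrying extra metadata can be sketched with a small wrapper. ResultSketch is hypothetical; the reason attribute and the shape returned by to_dict are assumptions for illustration, not the MetricResult API.

```python
from typing import Any, Optional


class ResultSketch:
    """Hypothetical value wrapper that also carries a reason."""

    def __init__(self, value: Any, reason: Optional[str] = None) -> None:
        self.value = value
        self.reason = reason

    def __eq__(self, other: object) -> bool:
        # Compare by underlying value, whether or not `other` is wrapped.
        other_value = other.value if isinstance(other, ResultSketch) else other
        return self.value == other_value

    def __float__(self) -> float:
        return float(self.value)

    def __str__(self) -> str:
        return str(self.value)

    def __iter__(self):
        # Lets list-valued (ranking) results be iterated directly.
        return iter(self.value)

    def to_dict(self) -> dict:
        return {"value": self.value, "reason": self.reason}


label = ResultSketch("pass", reason="matches the reference")
score = ResultSketch(0.8)
ranking = ResultSketch(["b", "a"])
```

Delegating comparison and conversion to the wrapped value lets existing code treat the result as a plain string, number, or list, while the reason stays available for debugging.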
to_dict
validate
classmethod
Provide compatibility with older Pydantic versions.
discrete_metric
discrete_metric(*, name: Optional[str] = None, allowed_values: Optional[List[str]] = None, **metric_params: Any) -> Callable[[Callable[..., Any]], DiscreteMetricProtocol]
Decorator for creating discrete/categorical metrics.
This decorator transforms a regular function into a DiscreteMetric instance that can be used for evaluation with predefined categorical outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Name for the metric. If not provided, uses the function name. |
None
|
allowed_values
|
List[str]
|
List of allowed categorical values for the metric output. Default is ["pass", "fail"]. |
None
|
**metric_params
|
Any
|
Additional parameters to pass to the metric initialization. |
{}
|
Returns:
| Type | Description |
|---|---|
| Callable[[Callable[..., Any]], DiscreteMetricProtocol] | A decorator that transforms a function into a DiscreteMetric instance. |
Examples:
>>> from ragas.metrics import discrete_metric
>>>
>>> @discrete_metric(name="sentiment", allowed_values=["positive", "neutral", "negative"])
... def sentiment_analysis(user_input: str, response: str) -> str:
...     '''Analyze sentiment of the response.'''
...     if "great" in response.lower() or "good" in response.lower():
...         return "positive"
...     elif "bad" in response.lower() or "poor" in response.lower():
...         return "negative"
...     return "neutral"
>>>
>>> result = sentiment_analysis(
...     user_input="How was your day?",
...     response="It was great!"
... )
>>> print(result.value) # "positive"
Source code in src/ragas/metrics/discrete.py
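For readers unfamiliar with keyword-only decorator factories, the general pattern behind a signature like discrete_metric(*, name=None, allowed_values=None) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Ragas implementation: discrete_sketch and the metric_name attribute are hypothetical, it returns a plain function rather than a DiscreteMetric instance, and rejecting out-of-set values with ValueError is an assumption.

```python
import functools
from typing import Any, Callable, List, Optional


def discrete_sketch(
    *, name: Optional[str] = None, allowed_values: Optional[List[str]] = None
) -> Callable[[Callable[..., Any]], Callable[..., str]]:
    """Hypothetical stand-in showing a keyword-only decorator factory."""
    # Apply the documented fallback when no values are supplied.
    allowed = allowed_values if allowed_values is not None else ["pass", "fail"]

    def decorator(func: Callable[..., Any]) -> Callable[..., str]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> str:
            value = func(*args, **kwargs)
            # Assumption: values outside the allowed set are rejected.
            if value not in allowed:
                raise ValueError(f"{value!r} is not one of {allowed}")
            return value

        # Fall back to the function name when no explicit name is given.
        wrapper.metric_name = name or func.__name__
        return wrapper

    return decorator


@discrete_sketch(allowed_values=["positive", "neutral", "negative"])
def tone(response: str) -> str:
    return "positive" if "great" in response.lower() else "neutral"
```

The outer call captures the configuration, and the inner decorator receives the function, which is why the parentheses are required even when no arguments are passed.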
numeric_metric
numeric_metric(*, name: Optional[str] = None, allowed_values: Optional[Union[Tuple[float, float], range]] = None, **metric_params: Any) -> Callable[[Callable[..., Any]], NumericMetricProtocol]
Decorator for creating numeric/continuous metrics.
This decorator transforms a regular function into a NumericMetric instance that outputs continuous values within a specified range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name for the metric. If not provided, uses the function name. | None |
| allowed_values | Union[Tuple[float, float], range] | The valid range for metric outputs as a (min, max) tuple or range object. If None, defaults to (0.0, 1.0). | None |
| **metric_params | Any | Additional parameters to pass to the metric initialization. | {} |
Returns:
| Type | Description |
|---|---|
| Callable[[Callable[..., Any]], NumericMetricProtocol] | A decorator that transforms a function into a NumericMetric instance. |
Examples:
>>> from ragas.metrics import numeric_metric
>>>
>>> @numeric_metric(name="relevance_score", allowed_values=(0.0, 1.0))
... def calculate_relevance(user_input: str, response: str) -> float:
...     '''Calculate relevance score between 0 and 1.'''
...     # Simple word overlap example
...     user_words = set(user_input.lower().split())
...     response_words = set(response.lower().split())
...     if not user_words:
...         return 0.0
...     overlap = len(user_words & response_words)
...     return overlap / len(user_words)
>>>
>>> result = calculate_relevance(
...     user_input="What is Python?",
...     response="Python is a programming language"
... )
>>> print(result.value) # Numeric score between 0.0 and 1.0
Source code in src/ragas/metrics/numeric.py
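Since allowed_values may be either a (min, max) tuple or a range object, code that validates a score has to normalise both forms. The helper below is a hypothetical sketch of one way to do that. Treating tuple bounds as inclusive and reading a range as the span from its smallest to its largest member are assumptions, not documented Ragas behaviour.

```python
from typing import Tuple, Union


def in_allowed_range(
    value: float, allowed: Union[Tuple[float, float], range]
) -> bool:
    """Hypothetical check that a score falls inside the allowed bounds (inclusive)."""
    if isinstance(allowed, range):
        if len(allowed) == 0:
            return False  # an empty range admits nothing
        # Assumption: range(0, 11) is read as the closed interval 0..10.
        low, high = min(allowed), max(allowed)
    else:
        low, high = allowed
    return low <= value <= high
```

For example, in_allowed_range(0.5, (0.0, 1.0)) is True, while in_allowed_range(11, range(0, 11)) is False because the range stops at 10.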
ranking_metric
ranking_metric(*, name: Optional[str] = None, allowed_values: Optional[int] = None, **metric_params: Any) -> Callable[[Callable[..., Any]], RankingMetricProtocol]
Decorator for creating ranking/ordering metrics.
This decorator transforms a regular function into a RankingMetric instance that outputs ordered lists of items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name for the metric. If not provided, uses the function name. | None |
| allowed_values | int | Expected number of items in the ranking list. If None, defaults to 2. | None |
| **metric_params | Any | Additional parameters to pass to the metric initialization. | {} |
Returns:
| Type | Description |
|---|---|
| Callable[[Callable[..., Any]], RankingMetricProtocol] | A decorator that transforms a function into a RankingMetric instance. |
Examples:
>>> from ragas.metrics import ranking_metric
>>>
>>> @ranking_metric(name="priority_ranker", allowed_values=3)
... def rank_by_urgency(user_input: str, responses: list) -> list:
...     '''Rank responses by urgency keywords.'''
...     urgency_keywords = ["urgent", "asap", "critical"]
...     scored = []
...     for resp in responses:
...         score = sum(kw in resp.lower() for kw in urgency_keywords)
...         scored.append((score, resp))
...     # Sort by score descending and return top items
...     ranked = sorted(scored, key=lambda x: x[0], reverse=True)
...     return [item[1] for item in ranked[:3]]
>>>
>>> result = rank_by_urgency(
...     user_input="What should I do first?",
...     responses=["This is urgent", "Take your time", "Critical issue!"]
... )
>>> print(result.value) # Ranked list of responses