Write your own Metrics
While Ragas has a number of built-in metrics, you may find yourself needing to create a custom metric for your use case. This guide will help you do just that.
For the sake of this tutorial, let's assume we want to build a custom metric that measures hallucinations in an LLM application. Ragas does have a built-in metric called Faithfulness which is similar, but not exactly the same: Faithfulness measures the factual consistency of the generated answer against the given context, while Hallucinations measures the presence of hallucinations in the generated answer.
Before we start, let's load the dataset and define the LLM.
# dataset
from datasets import load_dataset
from ragas import EvaluationDataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v3")
eval_dataset = EvaluationDataset.from_hf_dataset(amnesty_qa["eval"])
Output
EvaluationDataset(features=['user_input', 'retrieved_contexts', 'response', 'reference'], len=20)
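Each entry of the dataset is a single-turn sample with the fields listed above; you can index the dataset to inspect one, for example (a quick sketch):
# peek at the first sample; the metrics below score samples like this one
sample = eval_dataset[0]
print(sample.user_input)
print(sample.response)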
Install the langchain-openai package
Ensure you have your OpenAI key ready and available in your environment.
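For example, a minimal sketch of setting the key from within Python (the placeholder value is an assumption, replace it with your own key):
import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"  # assumption: placeholder key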
Wrap the LLMs in LangchainLLMWrapper so that they can be used with ragas.
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
Install the langchain-aws package
Then set your AWS credentials and configuration:
config = {
"credentials_profile_name": "your-profile-name", # E.g "default"
"region_name": "your-region-name", # E.g. "us-east-1"
"llm": "your-llm-model-id", # E.g "anthropic.claude-3-5-sonnet-20241022-v2:0"
"embeddings": "your-embedding-model-id", # E.g "amazon.titan-embed-text-v2:0"
"temperature": 0.4,
}
Define your LLMs and wrap them in LangchainLLMWrapper so that they can be used with ragas.
from langchain_aws import ChatBedrockConverse
from langchain_aws import BedrockEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_llm = LangchainLLMWrapper(ChatBedrockConverse(
credentials_profile_name=config["credentials_profile_name"],
region_name=config["region_name"],
base_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
model=config["llm"],
temperature=config["temperature"],
))
evaluator_embeddings = LangchainEmbeddingsWrapper(BedrockEmbeddings(
credentials_profile_name=config["credentials_profile_name"],
region_name=config["region_name"],
model_id=config["embeddings"],
))
If you want more information on how to use other AWS services, please refer to the langchain-aws documentation.
Install the langchain-openai package
Ensure you have your Azure OpenAI key ready and available in your environment.
import os
os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-openai-key"
# other configuration
azure_config = {
"base_url": "", # your endpoint
"model_deployment": "", # your model deployment name
"model_name": "", # your model name
"embedding_deployment": "", # your embedding deployment name
"embedding_name": "", # your embedding name
}
Define your LLMs and wrap them in LangchainLLMWrapper so that they can be used with ragas.
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
openai_api_version="2023-05-15",
azure_endpoint=azure_config["base_url"],
azure_deployment=azure_config["model_deployment"],
model=azure_config["model_name"],
validate_base_url=False,
))
# init the embeddings for answer_relevancy, answer_correctness and answer_similarity
evaluator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
openai_api_version="2023-05-15",
azure_endpoint=azure_config["base_url"],
azure_deployment=azure_config["embedding_deployment"],
model=azure_config["embedding_name"],
))
If you want more information on how to use other Azure services, please refer to the langchain-azure documentation.
If you are using a different LLM provider and using Langchain to interact with it, you can wrap your LLM in LangchainLLMWrapper
so that it can be used with ragas.
For a more detailed guide, check out the guide on customizing models.
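As a minimal sketch (assuming, purely for illustration, the langchain-anthropic package; any LangChain chat model can be wrapped the same way):
from ragas.llms import LangchainLLMWrapper
from langchain_anthropic import ChatAnthropic  # assumption: substitute your provider's chat model
# wrap the LangChain chat model so ragas can call it as an evaluator LLM
evaluator_llm = LangchainLLMWrapper(ChatAnthropic(model="claude-3-5-sonnet-20241022"))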
If you are using LlamaIndex, you can use the LlamaIndexLLMWrapper
to wrap your LLM so that it can be used with ragas.
For more information on how to use LlamaIndex, please refer to the LlamaIndex Integration guide.
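For instance, a minimal sketch assuming the llama-index OpenAI integration (swap in whichever LlamaIndex LLM you actually use):
from ragas.llms import LlamaIndexLLMWrapper
from llama_index.llms.openai import OpenAI  # assumption: any LlamaIndex LLM can be wrapped here
# wrap the LlamaIndex LLM so ragas can call it as an evaluator LLM
evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))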
If you're still not able to use Ragas with your favorite LLM provider, please let us know by commenting on this issue and we'll add support for it 🙂.
Aspect Critic - Simple Criteria Scoring
Aspect Critic outputs a binary score for the definition
you provide. A simple pass/fail metric can bring clarity and focus to what you are trying to measure, and is a better allocation of effort than building a more complex metric from scratch, especially when starting out.
Check out these resources to learn more about the effectiveness of having a simple pass/fail metric:
Now let's create a simple pass/fail metric to measure the hallucinations in the dataset with Ragas.
from ragas.metrics import AspectCritic
# you can init the metric with the evaluator llm
hallucinations_binary = AspectCritic(
name="hallucinations_binary",
definition="Did the model hallucinate or add any information that was not present in the retrieved context?",
llm=evaluator_llm,
)
await hallucinations_binary.single_turn_ascore(eval_dataset[0])
Domain Specific Metrics or Rubric based Metrics
Here we will build a rubric-based metric that evaluates the data on a scale of 1 to 5 based on the rubric we provide. You can read more about rubric-based metrics here
For our example of building a hallucination metric, we will use the following rubric:
rubric = {
"score1_description": "There is no hallucination in the response. All the information in the response is present in the retrieved context.",
"score2_description": "There are no factual statements that are not present in the retrieved context but the response is not fully accurate and lacks important details.",
"score3_description": "There are many factual statements that are not present in the retrieved context.",
"score4_description": "The response contains some factual errors and lacks important details.",
"score5_description": "The model adds new information and statements that contradict the retrieved context.",
}
Now let's initialize the metric with the rubric and evaluator LLM and evaluate the dataset.
from ragas.metrics import RubricsScore
hallucinations_rubric = RubricsScore(
name="hallucinations_rubric", llm=evaluator_llm, rubrics=rubric
)
await hallucinations_rubric.single_turn_ascore(eval_dataset[0])
Custom Metrics
If your use case is not covered by those two, you can build a custom metric by subclassing the base Metric
class in Ragas, but before that, ask yourself the following questions:
- Am I trying to build a single-turn or multi-turn metric? Subclass the Metric class along with either SingleTurnMetric or MultiTurnMetric, depending on whether you are evaluating single-turn or multi-turn interactions.
- Do I need to use LLMs to evaluate my metric? If yes, instead of subclassing the Metric class, subclass the MetricWithLLM class.
- Do I need to use embeddings to evaluate my metric? If yes, instead of subclassing the Metric class, subclass the MetricWithEmbeddings class.
- Do I need to use both LLMs and embeddings to evaluate my metric? If yes, subclass both the MetricWithLLM and MetricWithEmbeddings classes.
For our example, we need to use LLMs to evaluate our metric, so we will subclass the MetricWithLLM class, and since we are working with only single-turn interactions for now, we will also subclass the SingleTurnMetric class.
As for the implementation, we will use the Faithfulness metric to compute our hallucinations score with the formula Hallucinations = 1 - Faithfulness.
# we are going to create a dataclass that subclasses `MetricWithLLM` and `SingleTurnMetric`
from dataclasses import dataclass, field
# import the base classes
from ragas.metrics.base import MetricWithLLM, SingleTurnMetric, MetricType
from ragas.metrics import Faithfulness
# import types
import typing as t
from ragas.callbacks import Callbacks
from ragas.dataset_schema import SingleTurnSample
@dataclass
class HallucinationsMetric(MetricWithLLM, SingleTurnMetric):
    # name of the metric
    name: str = "hallucinations_metric"
    # we need to define the required columns for the metric
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {"user_input", "response", "retrieved_contexts"}
        }
    )

    def __post_init__(self):
        # init the faithfulness metric
        self.faithfulness_metric = Faithfulness(llm=self.llm)

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float:
        faithfulness_score = await self.faithfulness_metric.single_turn_ascore(
            sample, callbacks
        )
        return 1 - faithfulness_score
hallucinations_metric = HallucinationsMetric(llm=evaluator_llm)
await hallucinations_metric.single_turn_ascore(eval_dataset[0])
Now let's evaluate the entire dataset with the metrics we have created.
from ragas import evaluate
results = evaluate(
eval_dataset,
metrics=[hallucinations_metric, hallucinations_rubric, hallucinations_binary],
)
Output
|   | user_input | retrieved_contexts | response | reference | hallucinations_metric | hallucinations_rubric | hallucinations_binary |
|---|---|---|---|---|---|---|---|
| 0 | What are the global implications of the USA Su... | [- In 2022, the USA Supreme Court handed down ... | The global implications of the USA Supreme Cou... | The global implications of the USA Supreme Cou... | 0.423077 | 3 | 0 |
| 1 | Which companies are the main contributors to G... | [In recent years, there has been increasing pr... | According to the Carbon Majors database, the m... | According to the Carbon Majors database, the m... | 0.862069 | 3 | 0 |
| 2 | Which private companies in the Americas are th... | [The issue of greenhouse gas emissions has bec... | According to the Carbon Majors database, the l... | The largest private companies in the Americas ... | 1.000000 | 3 | 0 |
| 3 | What action did Amnesty International urge its... | [In the case of the Ogoni 9, Amnesty Internati... | Amnesty International urged its supporters to ... | Amnesty International urged its supporters to ... | 0.400000 | 3 | 0 |
| 4 | What are the recommendations made by Amnesty I... | [In recent years, Amnesty International has fo... | Amnesty International made several recommendat... | The recommendations made by Amnesty Internatio... | 0.952381 | 3 | 0 |
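If you want the per-sample scores as a table like the one above, the result object returned by evaluate can be converted to a pandas DataFrame, for example:
# per-sample scores for each metric as a DataFrame
df = results.to_pandas()
df.head()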
If you want to learn more about how to build custom metrics, you can read the Custom Metrics Advanced guide.