Evaluate Using Metrics
Run ragas metrics for evaluating RAG
In this tutorial, we will take a sample test dataset, select a few of the available metrics that Ragas offers, and evaluate a simple RAG pipeline.
Working with Data
The dataset used here is from Amnesty QA RAG that contains the necessary data points we need for this tutorial. Here I am loading it from huggingface hub, but you may use file from any source.
from datasets import load_dataset
dataset = load_dataset("explodinggradients/amnesty_qa","english_v3")
Load the dataset into Ragas EvaluationDataset object.
from ragas import EvaluationDataset
eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
Selecting required metrics
Ragas offers a wide variety of metrics that one can select from to evaluate LLM applications. You can also build your own metrics on top of ragas. For this tutorial, we will select a few metrics that are commonly used to evaluate single turn RAG systems.
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity
from ragas import evaluate
Since all of the metrics we have chosen are LLM-based metrics, we need to choose the evaluator LLMs we want to use for evaluation.
Choosing evaluator LLM
Install the langchain-openai package
Ensure you have your OpenAI key ready and available in your environment.
Wrap the LLMs inLangchainLLMWrapper
so that it can be used with ragas.
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
Install the langchain-aws package
then you have to set your AWS credentials and configurations
config = {
"credentials_profile_name": "your-profile-name", # E.g "default"
"region_name": "your-region-name", # E.g. "us-east-1"
"llm": "your-llm-model-id", # E.g "anthropic.claude-3-5-sonnet-20240620-v1:0"
"embeddings": "your-embedding-model-id", # E.g "amazon.titan-embed-text-v2:0"
"temperature": 0.4,
}
define you LLMs and wrap them in LangchainLLMWrapper
so that it can be used with ragas.
from langchain_aws import ChatBedrockConverse
from langchain_aws import BedrockEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_llm = LangchainLLMWrapper(ChatBedrockConverse(
credentials_profile_name=config["credentials_profile_name"],
region_name=config["region_name"],
base_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
model=config["llm"],
temperature=config["temperature"],
))
evaluator_embeddings = LangchainEmbeddingsWrapper(BedrockEmbeddings(
credentials_profile_name=config["credentials_profile_name"],
region_name=config["region_name"],
model_id=config["embeddings"],
))
If you want more information on how to use other AWS services, please refer to the langchain-aws documentation.
Running Evaluation
metrics = [
LLMContextRecall(llm=evaluator_llm),
FactualCorrectness(llm=evaluator_llm),
Faithfulness(llm=evaluator_llm),
SemanticSimilarity(embeddings=evaluator_embeddings)
]
results = evaluate(dataset=eval_dataset, metrics=metrics)