Evaluate Using Metrics
Run ragas metrics for evaluating RAG
In this tutorial, we will take a sample test dataset, select a few of the available metrics that Ragas offers, and evaluate a simple RAG pipeline.
Working with Data
The dataset used here is the Amnesty QA RAG dataset, which contains the data points needed for this tutorial. It is loaded here from the Hugging Face Hub, but you can use a file from any source.
from datasets import load_dataset
dataset = load_dataset("explodinggradients/amnesty_qa","english_v3")
Converting data to ragas evaluation dataset
from ragas import EvaluationDataset
eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
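If your data lives in a local file rather than on the hub, a small loader can check its shape before it reaches Ragas. The following is a minimal sketch using only the standard library. The JSON Lines layout, the `load_records` helper, and the four field names are assumptions for illustration; adjust them to match your own data and Ragas version.

```python
import json
from pathlib import Path

# Assumed single-turn column names; adjust to match your own data.
REQUIRED_FIELDS = ("user_input", "retrieved_contexts", "response", "reference")


def load_records(path):
    """Read a JSON Lines file and check each record has the assumed fields."""
    records = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if not isinstance(record["retrieved_contexts"], list):
            raise ValueError(f"line {line_no}: retrieved_contexts must be a list")
        records.append(record)
    return records
```

Recent Ragas versions also provide `EvaluationDataset.from_list`, which should accept a list of dictionaries like the one returned here; check the documentation for your installed version before relying on it.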
Selecting required metrics
Ragas offers a wide variety of metrics that one can select from to evaluate LLM applications. You can also build your own metrics on top of ragas. For this tutorial, we will select a few metrics that are commonly used to evaluate single turn RAG systems.
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity
from ragas import evaluate
Since all of the metrics we have chosen are LLM-based metrics, we need to choose the evaluator LLMs we want to use for evaluation.
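For intuition about what the evaluator LLM contributes, metrics such as faithfulness and context recall can be thought of as the fraction of claims that the LLM judges to be supported. The sketch below is a toy illustration and not Ragas's implementation: the verdict lists are hypothetical, hand-written stand-ins for judgments an evaluator LLM would produce, and `supported_fraction` is a made-up helper name.

```python
def supported_fraction(verdicts):
    """Fraction of claims judged as supported (1) out of all claims.

    Returns 0.0 for an empty list, which is a simplification for this sketch.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for one sample (1 = supported, 0 = not supported).
response_claims_supported_by_context = [1, 1, 0, 1]  # faithfulness-style
reference_claims_found_in_context = [1, 0]           # context-recall-style

faithfulness_score = supported_fraction(response_claims_supported_by_context)
context_recall_score = supported_fraction(reference_claims_found_in_context)
print(faithfulness_score, context_recall_score)  # 0.75 0.5
```

In the real metrics, the evaluator LLM both extracts the claims and produces the verdicts, which is why its choice matters.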
Choosing evaluator LLM
Some of the metrics call an evaluator LLM, so make sure the credentials for your provider (for example, an OpenAI API key) are ready and available in your environment. The example below uses a model hosted on Amazon Bedrock.
Wrap the LLM in LangchainLLMWrapper
First, set your AWS credentials and configuration:
config = {
    "credentials_profile_name": "your-profile-name",  # e.g. "default"
    "region_name": "your-region-name",  # e.g. "us-east-1"
    "model_id": "your-model-id",  # e.g. "anthropic.claude-v2"
    "model_kwargs": {"temperature": 0.4},
}
# ChatBedrock is the current name of the chat model class in langchain_aws
# (BedrockChat is the older, deprecated name).
from langchain_aws import ChatBedrock
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatBedrock(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    endpoint_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
    model_id=config["model_id"],
    model_kwargs=config["model_kwargs"],
))
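Because the configuration above ships with placeholder values, it can be helpful to fail fast if any of them were left unchanged. The following is a minimal sketch under that assumption. `check_bedrock_config` is a hypothetical helper, not part of Ragas or LangChain, and the example values are the ones suggested in the comments of the configuration.

```python
def check_bedrock_config(config):
    """Raise if placeholder values remain; return the runtime endpoint URL."""
    required = ("credentials_profile_name", "region_name", "model_id")
    # Treat empty values and the "your-..." placeholders as not filled in.
    problems = [
        key for key in required
        if not config.get(key) or str(config[key]).startswith("your-")
    ]
    if problems:
        raise ValueError(f"Replace placeholder values for: {', '.join(problems)}")
    # Same URL pattern as used when constructing the chat model.
    return f"https://bedrock-runtime.{config['region_name']}.amazonaws.com"


example_config = {
    "credentials_profile_name": "default",
    "region_name": "us-east-1",
    "model_id": "anthropic.claude-v2",
    "model_kwargs": {"temperature": 0.4},
}
print(check_bedrock_config(example_config))
# https://bedrock-runtime.us-east-1.amazonaws.com
```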
Running Evaluation
metrics = [LLMContextRecall(), FactualCorrectness(), Faithfulness()]
results = evaluate(dataset=eval_dataset, metrics=metrics, llm=evaluator_llm)
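Once the evaluation finishes, you will usually want one number per metric. The sketch below assumes the per-sample scores have been exported as a list of dictionaries (for example, by converting the result to a pandas DataFrame and then to records). The rows, the metric column names, and the `mean_scores` helper are all hypothetical and only illustrate the aggregation step.

```python
import math
from statistics import mean


def mean_scores(rows, metric_names):
    """Average each metric over the rows, skipping missing or NaN scores."""
    summary = {}
    for name in metric_names:
        values = [
            row[name] for row in rows
            if row.get(name) is not None and not math.isnan(row[name])
        ]
        summary[name] = mean(values) if values else float("nan")
    return summary


# Hypothetical per-sample scores, not real evaluation output.
rows = [
    {"faithfulness": 0.5, "context_recall": 1.0},
    {"faithfulness": 1.0, "context_recall": None},
]
print(mean_scores(rows, ["faithfulness", "context_recall"]))
# {'faithfulness': 0.75, 'context_recall': 1.0}
```

Skipping missing scores rather than counting them as zero keeps a single failed LLM call from dragging down the average; whether that is the right choice depends on how you want failures reported.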