Evaluate Using Metrics

Run ragas metrics for evaluating RAG

In this tutorial, we will take a sample test dataset, select a few of the available metrics that Ragas offers, and evaluate a simple RAG pipeline.

Working with Data

The dataset used here is the Amnesty QA RAG dataset, which contains the data points we need for this tutorial. Here we load it from the Hugging Face Hub, but you can load your data from any source.

from datasets import load_dataset

# Load the English subset of the Amnesty QA dataset from the Hugging Face Hub
dataset = load_dataset("explodinggradients/amnesty_qa", "english_v3")
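To verify what was loaded, you can print the dataset object and a sample row to see the available splits and fields:

print(dataset)
print(dataset["eval"][0])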

Converting data to ragas evaluation dataset

from ragas import EvaluationDataset

# Convert the "eval" split into a Ragas EvaluationDataset
eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
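If your data is not on the Hugging Face Hub, you can also build the dataset directly from a list of records with EvaluationDataset.from_list. A minimal sketch; the values below are placeholders:

# Each record is a single-turn RAG sample. The user_input, retrieved_contexts,
# response, and reference fields are the ones consumed by the metrics below.
custom_dataset = EvaluationDataset.from_list([
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
        "reference": "Paris is the capital of France.",
    },
])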

Selecting required metrics

Ragas offers a wide variety of metrics to evaluate LLM applications, and you can also build your own metrics on top of it. For this tutorial, we will select a few metrics that are commonly used to evaluate single-turn RAG systems.

from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness
from ragas import evaluate
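As an example of building on top of Ragas, AspectCritic lets you turn a free-form definition into a binary, LLM-judged metric. A minimal sketch; the name and definition here are illustrative:

from ragas.metrics import AspectCritic

# Scores each response 1 if it satisfies the definition, 0 otherwise.
# The name and definition are illustrative placeholders.
harmfulness = AspectCritic(
    name="harmfulness",
    definition="Does the response cause or have the potential to cause harm?",
)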

Since all of the metrics we have chosen are LLM-based, we need to choose an evaluator LLM to run them with.

Choosing evaluator LLM

This guide uses OpenAI as the evaluator LLM, so make sure your OpenAI API key is available in your environment.

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"

Wrap the LLM in LangchainLLMWrapper so it can be used with Ragas:

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
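The evaluator LLM can be passed to evaluate later on, as we do below, but metrics also accept their own llm argument if you want to configure them individually. A brief sketch:

from ragas.metrics import Faithfulness

# A metric configured with its own evaluator LLM keeps it, even when a
# different llm is passed to evaluate().
faithfulness_metric = Faithfulness(llm=evaluator_llm)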

Alternatively, if you want to use AWS Bedrock as the evaluator LLM, first set your AWS credentials and configuration:

config = {
    "credentials_profile_name": "your-profile-name",  # E.g "default"
    "region_name": "your-region-name",  # E.g. "us-east-1"
    "model_id": "your-model-id",  # E.g "anthropic.claude-v2"
    "model_kwargs": {"temperature": 0.4},
}

Then define your evaluator LLM and wrap it:

from langchain_aws.chat_models import BedrockChat
from ragas.llms import LangchainLLMWrapper
evaluator_llm = LangchainLLMWrapper(BedrockChat(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    endpoint_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
    model_id=config["model_id"],
    model_kwargs=config["model_kwargs"],
))

Running Evaluation

metrics = [LLMContextRecall(), FactualCorrectness(), Faithfulness()]
results = evaluate(dataset=eval_dataset, metrics=metrics, llm=evaluator_llm)
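evaluate returns a result object; printing it shows the aggregate score for each metric:

print(results)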

Exporting and analyzing results

df = results.to_pandas()
df.head()
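From here you can slice the dataframe to find weak spots. A small sketch, assuming the per-metric scores appear as columns named after the metrics (check df.columns in your run):

# Assumption: scores live in columns named after the metrics, e.g. "faithfulness".
low_faithfulness = df.sort_values("faithfulness").head(5)
print(low_faithfulness[["user_input", "response", "faithfulness"]])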