
Evaluate Using Metrics

Run Ragas metrics to evaluate a simple RAG pipeline

In this tutorial, we will take a sample test dataset, select a few of the available metrics that Ragas offers, and evaluate a simple RAG pipeline.

Working with Data

The dataset used here is the Amnesty QA RAG dataset, which contains the data points we need for this tutorial. Here it is loaded from the Hugging Face Hub, but you may load it from any source.

from datasets import load_dataset
dataset = load_dataset("explodinggradients/amnesty_qa","english_v3")
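
Before converting it, you can sanity-check that the split carries the fields the selected metrics will read (user input, retrieved contexts, response, and reference). A minimal peek using the standard datasets API:

# Inspect the evaluation split: column names and number of rows.
print(dataset["eval"].column_names)
print(dataset["eval"].num_rows)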

Load the dataset into a Ragas EvaluationDataset object.

from ragas import EvaluationDataset

eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])

Selecting required metrics

Ragas offers a wide variety of metrics that one can select from to evaluate LLM applications. You can also build your own metrics on top of Ragas. For this tutorial, we will select a few metrics that are commonly used to evaluate single-turn RAG systems.

from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity
from ragas import evaluate

Most of the metrics we have chosen are LLM-based, so we need to choose the evaluator LLM we want to use for evaluation; SemanticSimilarity is embedding-based, so we also need an evaluator embedding model.

Choosing evaluator LLM

To use OpenAI as the evaluator, install the langchain-openai package:

pip install langchain-openai

Ensure you have your OpenAI key ready and available in your environment.

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
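
If you prefer not to hard-code the key, a common alternative (not specific to Ragas) is to prompt for it at runtime:

import getpass
import os

# Ask for the key interactively instead of embedding it in the script.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")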

Wrap the LLM and the embedding model in LangchainLLMWrapper and LangchainEmbeddingsWrapper so that they can be used with Ragas.

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
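
OpenAIEmbeddings falls back to its default embedding model when none is given; if you want to pin a specific one, you can pass it explicitly (the model name below is only an example):

# Example only: pin an explicit embedding model instead of relying on the default.
evaluator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)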

Alternatively, to use Amazon Bedrock as the evaluator, install the langchain-aws package:

pip install langchain-aws

Then set your AWS credentials and configuration:

config = {
    "credentials_profile_name": "your-profile-name",  # E.g "default"
    "region_name": "your-region-name",  # E.g. "us-east-1"
    "llm": "your-llm-model-id",  # E.g "anthropic.claude-3-5-sonnet-20240620-v1:0"
    "embeddings": "your-embedding-model-id",  # E.g "amazon.titan-embed-text-v2:0"
    "temperature": 0.4,
}

Define your LLM and embedding model and wrap them in LangchainLLMWrapper and LangchainEmbeddingsWrapper so that they can be used with Ragas.

from langchain_aws import ChatBedrockConverse
from langchain_aws import BedrockEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatBedrockConverse(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    base_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
    model=config["llm"],
    temperature=config["temperature"],
))
evaluator_embeddings = LangchainEmbeddingsWrapper(BedrockEmbeddings(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    model_id=config["embeddings"],
))

If you want more information on how to use other AWS services, please refer to the langchain-aws documentation.

Running Evaluation

metrics = [
    LLMContextRecall(llm=evaluator_llm), 
    FactualCorrectness(llm=evaluator_llm), 
    Faithfulness(llm=evaluator_llm),
    SemanticSimilarity(embeddings=evaluator_embeddings)
]
results = evaluate(dataset=eval_dataset, metrics=metrics)
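
evaluate returns a result object; printing it shows the aggregate score for each metric (the exact values will depend on your evaluator LLM and sampling):

# Aggregate score per metric for the whole dataset.
print(results)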

Exporting and analyzing results

df = results.to_pandas()
df.head()
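
Ordinary pandas operations are enough for a first pass at the results, for example averaging each metric column or surfacing the weakest samples. A small sketch that avoids hard-coding metric column names (the metric scores are the numeric columns in the dataframe):

# Mean score per metric column.
score_cols = df.select_dtypes("number").columns
print(df[score_cols].mean())

# Lowest-scoring rows on the first metric column, useful for error analysis.
print(df.sort_values(score_cols[0]).head())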
