Bring your own LLMs

Ragas uses LangChain under the hood to connect to LLMs for the metrics that require them. This means you can swap the default LLM (gpt-3.5-turbo-16k) for any of the hundreds of LLM APIs that LangChain supports out of the box.

This guide will show you how to use another LLM API for evaluation.


If you're looking to use Azure OpenAI for evaluation, check out this guide.

Evaluating with GPT4

Ragas uses gpt-3.5 by default, but using gpt-4 for evaluation can improve the results, so let's use it for the Faithfulness metric.

To start off, we initialise the gpt-4 chat model from LangChain.

# make sure you have your OpenAI API key ready
import os

os.environ["OPENAI_API_KEY"] = "your-openai-key"
from langchain_openai.chat_models import ChatOpenAI

gpt4 = ChatOpenAI(model_name="gpt-4")
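
Setting the key inline as above is convenient in a notebook. If the key is exported in your shell instead, a small check can fail early with a clear message when it is missing. The following is a minimal sketch: `require_env` and the `DEMO_API_KEY` variable are hypothetical names used for illustration and are not part of Ragas or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the evaluation")
    return value


# Demonstration with a throwaway variable name so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "your-openai-key"
print(require_env("DEMO_API_KEY"))
```

In practice you would call `require_env("OPENAI_API_KEY")` once before creating the `ChatOpenAI` instance.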

Now that we have set up the LLM, we can use faithfulness with GPT-4 under the hood for evaluations.

Now let's run the evaluation using the example from the quickstart.

# data
from datasets import load_dataset

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
Repo card metadata block was not found. Setting CardData to empty.
DatasetDict({
    eval: Dataset({
        features: ['question', 'ground_truths', 'answer', 'contexts'],
        num_rows: 20
    })
})
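
The output above lists four columns. If you bring your own data, it can help to check that each record carries the same columns before evaluating. This is a minimal sketch that assumes your records are plain dictionaries; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and which columns a given metric actually needs is not covered here.

```python
# Column names copied from the dataset output shown above.
REQUIRED_COLUMNS = ("question", "ground_truths", "answer", "contexts")


def missing_columns(record: dict) -> list:
    """Return the expected column names that are absent from a record."""
    return [name for name in REQUIRED_COLUMNS if name not in record]


# Illustrative record, not taken from the amnesty_qa dataset.
record = {
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "contexts": ["Paris is the capital and largest city of France."],
}
print(missing_columns(record))
```
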
# evaluate
from ragas import evaluate
from ragas.metrics import faithfulness

result = evaluate(
    amnesty_qa["eval"].select(range(10)),  # showing only 10 for demonstration
    metrics=[faithfulness],
    llm=gpt4,  # use GPT-4 instead of the default LLM
)

Evaluating: 100%|██████████| 10/10 [01:21<00:00,  8.18s/it]
{'faithfulness': 0.3389}

Evaluating with Open-Source LLMs

You can also use any open-source LLM for evaluation. Ragas supports most deployment methods, such as HuggingFace TGI, Anyscale, vLLM and many more, through LangChain.

When it comes to selecting open-source language models, there are some rules of thumb to follow, given that the quality of evaluation metrics depends heavily on the model’s quality:

  1. Opt for models with 7 billion parameters or more. This choice ensures a minimum level of quality in the results for Ragas metrics. Models like Llama-2 or Mistral can be an excellent starting point.

  2. Always prioritize fine-tuned models over base models. Fine-tuned models tend to follow instructions more effectively, which can significantly improve their performance.

  3. If your project focuses on a specific domain, such as science or finance, prioritize models that have been pre-trained on a larger volume of tokens from your domain of interest. For instance, if you are working with research data, consider models pre-trained on a substantial number of tokens from platforms like arXiv or Semantic Scholar.
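
The three rules of thumb above can be sketched as a simple shortlist filter. This is only an illustration: `CandidateModel`, its fields, `rank_candidates` and the example entries are all hypothetical, and the 7-billion floor is treated as inclusive, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CandidateModel:
    # Hypothetical bookkeeping for this sketch, not a Ragas concept.
    name: str
    params_billions: float
    fine_tuned: bool
    domain_token_share: float  # rough share of pre-training tokens from your domain, 0.0 to 1.0


def rank_candidates(candidates, min_params_billions=7.0):
    """Drop models below the size floor, then prefer fine-tuned and domain-heavy models."""
    large_enough = [c for c in candidates if c.params_billions >= min_params_billions]
    return sorted(
        large_enough,
        key=lambda c: (c.fine_tuned, c.domain_token_share),
        reverse=True,
    )


# Made-up candidates; the numbers are placeholders, not measurements.
candidates = [
    CandidateModel("small-base", 3.0, False, 0.0),
    CandidateModel("general-base", 7.0, False, 0.1),
    CandidateModel("general-chat", 7.0, True, 0.1),
    CandidateModel("science-chat", 13.0, True, 0.4),
]
ranked_names = [c.name for c in rank_candidates(candidates)]
print(ranked_names)
```

A ranking like this only narrows the shortlist; the final choice still depends on how the model performs on your own evaluation data.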


Choosing the right open-source LLM for evaluation can be tricky. You can also fine-tune these models to get even better performance on Ragas metrics. If you need help or advice with that, feel free to talk to us.

In this example we are going to use vLLM to host HuggingFaceH4/zephyr-7b-alpha. Check out the quickstart for more details on how to get started with vLLM.

# start the vLLM server
!python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceH4/zephyr-7b-alpha \
    --host \
    --port 8080
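
Before pointing the evaluation at the server, it can help to confirm that it is up and serving the expected model. The sketch below assumes the server is reachable on localhost at port 8080 and exposes the OpenAI-style model listing under `/v1/models`. The helper names are hypothetical, and the sample payload is illustrative rather than captured output.

```python
import json
from urllib.request import urlopen


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def served_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style model-listing payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str, timeout: float = 5.0) -> list:
    """Query a running server; raises URLError if it is not reachable."""
    with urlopen(models_url(base_url), timeout=timeout) as response:
        return served_model_ids(json.load(response))


# Offline demonstration with an illustrative payload, so no server is needed here.
sample_payload = {
    "object": "list",
    "data": [{"id": "HuggingFaceH4/zephyr-7b-alpha", "object": "model"}],
}
print(models_url("http://localhost:8080/v1/"))
print(served_model_ids(sample_payload))
```

With the server running, `list_served_models("http://localhost:8080/v1")` would return the ids it reports.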

Now let's create a LangChain LLM instance. Because vLLM can run in OpenAI compatibility mode, we can use the ChatOpenAI class as is, with small tweaks.

from langchain_openai.chat_models import ChatOpenAI

inference_server_url = "http://localhost:8080/v1"

# create vLLM Langchain instance
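chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",  # placeholder; the server started above was not given an API key to check
    openai_api_base=inference_server_url,
)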

Now let's import the metrics you want to use and pass the new LLM to the evaluation.

# evaluate
from ragas.metrics import faithfulness
from ragas import evaluate

result = evaluate(
    amnesty_qa["eval"].select(range(1)),  # showing only 1 for demonstration
    metrics=[faithfulness],
    llm=chat,  # use the vLLM-hosted model instead of the default LLM
)

evaluating with [faithfulness]
100%|████████████████████████████████████████████████████████████| 1/1 [06:25<00:00, 385.74s/it]
{'faithfulness': 0.7167}