Using Vertex AIΒΆ

Vertex AI offers everything you need to build and use generative AI: AI solutions, Search and Conversation, 100+ foundation models, and a unified AI platform. This gives you access to models like PaLM 2, which you can use with Ragas to score your RAG responses and pipelines instead of the default OpenAI models.

This tutorial will show you how you can use PaLM 2 with Ragas for evaluation.

Note

this guide is for folks who are using the Google Cloud Vertex AI endpoints. Check the evaluation guide if you're using OpenAI endpoints.
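
Before you start, make sure the required packages are installed. A minimal setup, assuming a notebook environment, might look like this (the package list is an assumption based on the LangChain and Vertex AI integrations used below):

```python
# install Ragas plus the LangChain, Vertex AI, and datasets dependencies used in this guide
!pip install ragas langchain google-cloud-aiplatform datasets
```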

Load Sample DatasetΒΆ

```python
# load the sample dataset from the Hugging Face Hub
from datasets import load_dataset

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa
```
DatasetDict({
    eval: Dataset({
        features: ['question', 'ground_truth', 'answer', 'contexts'],
        num_rows: 30
    })
})
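
To get a feel for the data before evaluating, you can peek at a single row; the field names below come from the dataset features shown above:

```python
# inspect one example to see the fields the metrics will consume
row = amnesty_qa["eval"][0]
print(row["question"])
print(row["answer"][:200])  # the generated answer, truncated for display
print(len(row["contexts"]), "retrieved contexts")
```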

Now let's import the metrics we are going to use:

```python
from ragas.metrics import (
    context_precision,
    answer_relevancy,  # AnswerRelevancy
    faithfulness,
    context_recall,
)
from ragas.metrics.critique import harmfulness

# list of metrics we're going to use
metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    harmfulness,
]
```

By default Ragas uses ChatOpenAI for evaluations, so let's swap it out with ChatVertexAI. We also need to change the embeddings used for evaluation from OpenAIEmbeddings to VertexAIEmbeddings for the metrics that need them, which in our case is answer_relevancy.

```python
import google.auth
from langchain.chat_models import ChatVertexAI
from langchain.embeddings import VertexAIEmbeddings


config = {
    "project_id": "tmp-project-404003",
}

# authenticate to GCP
creds, _ = google.auth.default(quota_project_id=config["project_id"])

# create LangChain LLM and Embeddings
ragas_vertexai_llm = ChatVertexAI(credentials=creds)
vertexai_embeddings = VertexAIEmbeddings(credentials=creds)
```
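
Before wiring these into Ragas, it can help to confirm that authentication actually works. A quick smoke test, using LangChain's standard message interface, might look like this:

```python
from langchain.schema import HumanMessage

# sanity-check the chat model and the embeddings endpoint
reply = ragas_vertexai_llm([HumanMessage(content="Hello!")])
print(reply.content)

vector = vertexai_embeddings.embed_query("Hello!")
print(len(vector), "embedding dimensions")
```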

Now let's swap out the defaults with the VertexAI LLM and embeddings we just created.

```python
for m in metrics:
    # change the LLM the metric uses
    m.llm = ragas_vertexai_llm

    # if the metric also needs embeddings, swap in the VertexAI embeddings
    if hasattr(m, "embeddings"):
        m.embeddings = vertexai_embeddings
```
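
You can quickly verify the swap took effect by printing the LLM each metric now holds; every entry should report ChatVertexAI:

```python
# confirm every metric now points at the VertexAI LLM
for m in metrics:
    print(m.name, "->", type(m.llm).__name__)
```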

EvaluationΒΆ

Running the evaluation is as simple as calling evaluate on the Dataset with the metrics of your choice.

```python
from ragas import evaluate

result = evaluate(
    amnesty_qa["eval"].select(range(1)),  # using 1 example due to quota constraints
    metrics=metrics,
)

result
```
evaluating with [faithfulness]
100%|██████████| 1/1 [00:04<00:00,  4.53s/it]
evaluating with [answer_relevancy]
100%|██████████| 1/1 [00:03<00:00,  3.88s/it]
evaluating with [context_recall]
100%|██████████| 1/1 [00:02<00:00,  2.71s/it]
evaluating with [context_precision]
100%|██████████| 1/1 [00:00<00:00,  1.05it/s]
evaluating with [harmfulness]
100%|██████████| 1/1 [00:00<00:00,  1.02it/s]
{'faithfulness': 1.0000, 'answer_relevancy': 0.9113, 'context_recall': 0.0000, 'context_precision': 0.0000, 'harmfulness': 0.0000}
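
The result prints as a mapping of metric names to aggregate scores, and you can index it the same way to pull out a single score (shown here for faithfulness):

```python
# pull a single aggregate score out of the result
print(result["faithfulness"])
```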

And there you have it, all the scores you need.

Now, if we want to dig into the results and find examples where the pipeline performed poorly or really well, we can easily convert the result into a pandas DataFrame and use standard analytics tools.

```python
df = result.to_pandas()
df.head()
```
|   | question | contexts | answer | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | harmfulness |
|---|----------|----------|--------|--------------|--------------|------------------|----------------|-------------------|-------------|
| 0 | How to deposit a cheque issued to an associate... | [Just have the associate sign the back and the... | \nThe best way to deposit a cheque issued to a... | [Have the check reissued to the proper payee.J... | 1.0 | 0.911332 | 0.0 | 0.0 | 0 |
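
For instance, once more rows have been evaluated, you can sort or filter the DataFrame to surface the weakest responses; the column names match the metrics above, and the threshold here is purely illustrative:

```python
# surface the least faithful answers for manual inspection
worst = df.sort_values("faithfulness").head(5)
print(worst[["question", "faithfulness", "answer_relevancy"]])

# or filter on an illustrative recall threshold
low_recall = df[df["context_recall"] < 0.5]
print(len(low_recall), "examples with low context recall")
```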

And that's it!

If you have any suggestions, feedback, or things you're not happy about, please share them in the issues section. We love hearing from you 😁