Using Vertex AI¶
Vertex AI offers everything you need to build and use generative AI—from AI solutions, to Search and Conversation, to 100+ foundation models, to a unified AI platform. You get access to models like PaLM 2 which can be used to score your RAG responses and pipelines with Ragas instead of the default OpenAI.
This tutorial shows how you can use PaLM 2 with Ragas for evaluation.
Note
This guide is for folks who are using Google Vertex AI endpoints. Check the evaluation guide if you're using OpenAI endpoints.
Load Sample Dataset¶
# data
from datasets import load_dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2", trust_remote_code=True)
amnesty_qa
DatasetDict({
    eval: Dataset({
        features: ['question', 'ground_truth', 'answer', 'contexts'],
        num_rows: 20
    })
})
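Before evaluating your own data, it can help to confirm that it exposes the same columns as the sample dataset printed above. The following is a minimal sketch of such a check; `missing_columns` is a hypothetical helper written for this guide, not part of Ragas or `datasets`, and the two column lists are made up for illustration.

```python
# Columns exposed by the sample dataset (taken from the DatasetDict printout above).
REQUIRED_COLUMNS = ("question", "ground_truth", "answer", "contexts")


def missing_columns(column_names, required=REQUIRED_COLUMNS):
    """Return the required columns absent from column_names, in order."""
    present = set(column_names)
    return [name for name in required if name not in present]


# Hypothetical column lists, for illustration only.
complete = ["question", "ground_truth", "answer", "contexts"]
incomplete = ["question", "answer"]

print(missing_columns(complete))
print(missing_columns(incomplete))
```

With the dataset loaded above, you would likely pass `amnesty_qa["eval"].column_names` instead of a hand-written list.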
Now let's import the metrics we are going to use:
from ragas.metrics import (
    context_precision,
    answer_relevancy,  # AnswerRelevancy
    faithfulness,
    context_recall,
)
from ragas.metrics.critique import harmfulness

# list of metrics we're going to use
metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    harmfulness,
]
By default Ragas uses ChatOpenAI for evaluations, so let’s swap that out for ChatVertexAI. We’ll wrap ChatVertexAI, which comes from the langchain-google-vertexai package, with Ragas’ LangchainLLMWrapper object so that Ragas can use it. We also need to change the embeddings used for evaluation from OpenAIEmbeddings to VertexAIEmbeddings for the metrics that need them, which in our case is answer_relevancy.
import google.auth
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from ragas.llms import LangchainLLMWrapper

config = {
    "project_id": "genai-embedding-creation",  # replace with your own GCP project ID
}

# authenticate to GCP, billing quota to the project configured above
creds, _ = google.auth.default(quota_project_id=config["project_id"])

# create Langchain LLM and Embeddings
llm = ChatVertexAI(model_name="chat-bison@002", credentials=creds)
ragas_vertexai_llm = LangchainLLMWrapper(llm)
vertexai_embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003", credentials=creds)
Now let's swap out the defaults with the Vertex AI LLM and embeddings we created.
for m in metrics:
    # change LLM for metric
    m.llm = ragas_vertexai_llm

    # check if this metric needs embeddings
    if hasattr(m, "embeddings"):
        # if so change with VertexAI Embeddings
        m.embeddings = vertexai_embeddings
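To see why the `hasattr` check matters, here is a small self-contained sketch of the same swap pattern. The two metric classes are hypothetical stand-ins (they are not Ragas classes), and the LLM and embeddings are plain strings, so the snippet runs without any cloud credentials.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LLMOnlyMetric:
    """Stand-in for a metric that only needs an LLM (hypothetical)."""
    name: str
    llm: Optional[Any] = None


@dataclass
class EmbeddingMetric:
    """Stand-in for a metric that also needs embeddings (hypothetical)."""
    name: str
    llm: Optional[Any] = None
    embeddings: Optional[Any] = None


def swap_models(metrics, llm, embeddings):
    """Point every metric at llm; set embeddings only where the attribute exists."""
    for m in metrics:
        m.llm = llm
        if hasattr(m, "embeddings"):
            m.embeddings = embeddings
    return metrics


demo_metrics = [LLMOnlyMetric("faithfulness"), EmbeddingMetric("answer_relevancy")]
swap_models(demo_metrics, llm="vertex-llm", embeddings="vertex-embeddings")

for m in demo_metrics:
    print(m.name, m.llm, getattr(m, "embeddings", "<no embeddings attribute>"))
```

Only the stand-in that declares an `embeddings` attribute receives the new embeddings; the other one is left without that attribute, which mirrors what the loop above does for the real metrics.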
Evaluation¶
Running the evaluation is as simple as calling evaluate on the Dataset with the metrics of your choice.
from ragas import evaluate

result = evaluate(
    amnesty_qa["eval"].select(range(1)),  # using 1 row as an example due to quota constraints
    metrics=metrics,
)
result
Evaluating: 100%|██████████| 5/5 [00:03<00:00, 1.53it/s]
{'faithfulness': 0.5000, 'answer_relevancy': 0.8608, 'context_recall': 1.0000, 'context_precision': 1.0000, 'harmfulness': 1.0000}
And there you have it: all the scores you need.
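A quick way to read these scores is to sort them and look at the lowest one first. The sketch below assumes the scores are available as a plain dict (the values are copied from the output above); `rank_metrics` is a hypothetical helper, not a Ragas function.

```python
# Scores copied from the evaluation output above.
scores = {
    "faithfulness": 0.5000,
    "answer_relevancy": 0.8608,
    "context_recall": 1.0000,
    "context_precision": 1.0000,
    "harmfulness": 1.0000,
}


def rank_metrics(scores):
    """Return (metric, score) pairs sorted from lowest to highest score."""
    return sorted(scores.items(), key=lambda item: item[1])


ranked = rank_metrics(scores)
lowest_metric, lowest_score = ranked[0]
print(f"lowest score: {lowest_metric} = {lowest_score:.4f}")
```

Whether a low or a high value is desirable depends on how each metric is defined, so treat the ranking as a starting point for inspection rather than a verdict.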
Now if you want to dig into the results and find examples where your pipeline performed poorly or really well, you can easily convert them into a pandas DataFrame and use your standard analytics tools too!
df = result.to_pandas()
df.head()
| | question | ground_truth | answer | contexts | faithfulness | answer_relevancy | context_recall | context_precision | harmfulness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the global implications of the USA Su... | The global implications of the USA Supreme Cou... | The global implications of the USA Supreme Cou... | [- In 2022, the USA Supreme Court handed down ... | 0.5 | 0.86077 | 1.0 | 1.0 | 1 |
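Once the results are in a DataFrame, ordinary pandas filtering can surface the rows worth a closer look. The following is a minimal sketch: the single row mirrors the score columns from the table above (text columns omitted), `rows_below` is a hypothetical helper, and the 0.6 cutoff is an arbitrary illustrative value rather than a Ragas recommendation.

```python
import pandas as pd

# One row mirroring the score columns in the table above; text columns omitted.
df = pd.DataFrame(
    [
        {
            "faithfulness": 0.5,
            "answer_relevancy": 0.86077,
            "context_recall": 1.0,
            "context_precision": 1.0,
            "harmfulness": 1,
        }
    ]
)


def rows_below(df, metric, threshold):
    """Return the rows whose score for metric is strictly below threshold."""
    return df[df[metric] < threshold]


# 0.6 is an arbitrary cutoff chosen for illustration.
low_faithfulness = rows_below(df, "faithfulness", threshold=0.6)
print(low_faithfulness.index.tolist())
```

With a real run you would apply the same helper to the DataFrame returned by `result.to_pandas()`, which also carries the question, answer and contexts for each flagged row.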
And that's it!
If you have any suggestions, feedback, or things you're not happy about, please share them in the issues section. We love hearing from you 😁