%load_ext autoreload
%autoreload 2
LlamaIndex
LlamaIndex is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. It makes it easy to connect LLMs with your own data. But to figure out the best configuration for LlamaIndex and your data, you need an objective measure of performance. This is where Ragas comes in. Ragas helps you evaluate your QueryEngine and gives you the confidence to tweak the configuration to get the highest score.
This guide assumes you have familiarity with the LlamaIndex framework.
Building the Testset
You will need a testset to evaluate your QueryEngine against. You can either build one yourself or use the Testset Generator Module in Ragas to get started with a small synthetic one.
Let’s see how that works with LlamaIndex.
# load the documents
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./nyc_wikipedia").load_data()
Now let's initialize the TestsetGenerator object with the corresponding generator and critic LLMs.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# generator with openai models
generator_llm = OpenAI(model="gpt-3.5-turbo-16k")
critic_llm = OpenAI(model="gpt-4")
embeddings = OpenAIEmbedding()
generator = TestsetGenerator.from_llama_index(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
)
Now you are all set to generate the dataset.
# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents,
    test_size=5,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
Filename and doc_id are the same for all nodes.
df = testset.to_pandas()
df.head()
| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What cultural movement began in New York City ... | [ Others cite the end of the crack epidemic an... | The Harlem Renaissance | simple | [{'file_path': '/home/jjmachan/jjmachan/explod... | True |
| 1 | What is the significance of New York City's tr... | [ consisting of 51 council members whose distr... | New York City's transportation system is both ... | simple | [{'file_path': '/home/jjmachan/jjmachan/explod... | True |
| 2 | What factors led to the creation of Central Pa... | [ next ten years with British troops stationed... | Public-minded members of the contemporaneous b... | reasoning | [{'file_path': '/home/jjmachan/jjmachan/explod... | True |
| 3 | What was the impact of the Treaty of Breda on ... | [ British raids. In 1626, the Dutch colonial D... | The Treaty of Breda confirmed the transfer of ... | multi_context | [{'file_path': '/home/jjmachan/jjmachan/explod... | True |
| 4 | What role did New York play in the American Re... | [ British raids. In 1626, the Dutch colonial D... | New York played a significant role in the Amer... | simple | [{'file_path': '/home/jjmachan/jjmachan/explod... | True |
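As a quick sanity check, the evolution_type column above can be compared with the requested distribution. The sketch below is only illustrative: it uses plain strings in place of the Ragas evolution objects, and the largest-remainder rounding in the hypothetical `allocate` helper is an assumption, not necessarily how Ragas splits `test_size` internally.

```python
from collections import Counter
from math import floor


def allocate(test_size, distributions):
    """Split test_size across evolution types using largest-remainder rounding.

    Hypothetical helper for illustration; not part of the Ragas API.
    """
    total = sum(distributions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("distribution weights should sum to 1.0")
    exact = {name: test_size * weight for name, weight in distributions.items()}
    counts = {name: floor(value) for name, value in exact.items()}
    leftover = test_size - sum(counts.values())
    # Give the remaining questions to the types with the largest fractional part.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Same weights as in the generate call, keyed by name instead of evolution object.
requested = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}
expected = allocate(5, requested)

# evolution_type values copied from the table above.
observed = Counter(["simple", "simple", "reasoning", "multi_context", "simple"])

print(expected)
print(dict(observed))
```

For this run the two counts agree, which suggests the generated testset follows the requested mix.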
With a test dataset ready to test our QueryEngine, let's now build one and evaluate it.
Building the QueryEngine
To start, let's build a VectorStoreIndex over New York City's Wikipedia page as an example and use Ragas to evaluate it.
Since we already loaded the dataset into documents, let's use that.
# build query engine
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.settings import Settings
vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine()
Let's try a sample question from the generated testset to see if it is working.
# convert it to pandas dataset
df = testset.to_pandas()
df["question"][0]
'What cultural movement began in New York City and established the African-American literary canon in the United States?'
response_vector = query_engine.query(df["question"][0])
print(response_vector)
The Harlem Renaissance was the cultural movement that began in New York City and established the African-American literary canon in the United States.
Evaluating the QueryEngine
Now that we have a QueryEngine for the VectorStoreIndex, we can use the LlamaIndex integration Ragas has to evaluate it.
To run an evaluation with Ragas and LlamaIndex you need three things:
1. LlamaIndex QueryEngine: what we will be evaluating.
2. Metrics: Ragas defines a set of metrics that can measure different aspects of the QueryEngine. The available metrics and their meanings are described in the Ragas documentation.
3. Questions: a list of questions that Ragas will test the QueryEngine against.
First, let's generate the questions. Ideally you should use questions that you see in production, so that the distribution of questions used for evaluation matches the distribution seen in production. This ensures that the scores reflect production performance, but to start off we’ll be using a few example questions.
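If you do have a log of production questions, one simple way to keep the evaluation set representative is to sample from it uniformly at random. The sketch below is a minimal illustration under that assumption; `sample_questions` and the entries in `production_log` are hypothetical and not part of Ragas or LlamaIndex.

```python
import random


def sample_questions(logged_questions, k, seed=0):
    """Draw up to k distinct questions uniformly at random from a question log.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed keeps the evaluation set reproducible
    k = min(k, len(logged_questions))
    return rng.sample(logged_questions, k)


# Made-up log entries, standing in for real production traffic.
production_log = [
    "How many boroughs does New York City have?",
    "When was Central Park created?",
    "What is the largest rapid transit system in New York City?",
    "Who was New York named after?",
    "What happened in New York during the American Revolution?",
]

eval_questions = sample_questions(production_log, k=3)
print(eval_questions)
```

Uniform sampling keeps frequent question types frequent in the evaluation set, which is what makes the resulting scores comparable to production behaviour.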
Now let's import the metrics we will be using to evaluate.
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from ragas.metrics.critique import harmfulness

metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    harmfulness,
]
Now let's initialize the evaluator model.
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# using GPT 3.5, use GPT 4 / 4-turbo for better accuracy
evaluator_llm = OpenAI(model="gpt-3.5-turbo")
The evaluate() function expects a dict with “question” and “ground_truth” entries for the metrics. You can easily convert the testset to that format.
# convert to HF dataset
ds = testset.to_dataset()
ds_dict = ds.to_dict()
ds_dict["question"]
ds_dict["ground_truth"]
['The Harlem Renaissance',
"New York City's transportation system is both complex and extensive, with a comprehensive mass transit system that accounts for one in every three users of mass transit in the United States. The New York City Subway system is the largest rapid transit system in the world, and the city has a high usage of public transport, with a majority of households not owning a car. Due to their reliance on mass transit, New Yorkers spend less of their household income on transportation compared to the national average.",
'Public-minded members of the contemporaneous business elite lobbied for the establishment of Central Park',
'The Treaty of Breda confirmed the transfer of New Amsterdam to English control and the renaming of the settlement as New York. The Duke of York, who would later become King James II and VII, played a significant role in the naming of New York City.',
'New York played a significant role in the American Revolution. The Stamp Act Congress met in New York in October 1765, and the city became a center for the Sons of Liberty organization. Skirmishes and battles took place in and around New York, including the Battle of Long Island and the Battle of Saratoga. The city was occupied by British forces for much of the war, but it was eventually liberated by American troops in 1783.']
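Before running a potentially slow evaluation, it can help to confirm that the dict has the expected shape. The following is a minimal sketch using a plain dict as a stand-in for `ds_dict`; the `check_eval_dict` helper is hypothetical and not part of the Ragas API.

```python
def check_eval_dict(data, required=("question", "ground_truth")):
    """Return the number of samples, raising if a key is missing or lengths differ.

    Hypothetical helper for illustration only.
    """
    missing = [key for key in required if key not in data]
    if missing:
        raise KeyError(f"missing required keys: {missing}")
    lengths = {key: len(data[key]) for key in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return lengths[required[0]]


# Stand-in for ds_dict, using the first row of the testset shown earlier.
sample = {
    "question": [
        "What cultural movement began in New York City and established "
        "the African-American literary canon in the United States?"
    ],
    "ground_truth": ["The Harlem Renaissance"],
}

print(check_eval_dict(sample))
```

With the real data you would pass `ds_dict` instead of `sample`.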
Finally, let's run the evaluation.
from ragas.integrations.llama_index import evaluate
result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=ds_dict,
    llm=evaluator_llm,
    embeddings=OpenAIEmbedding(),
)
n values greater than 1 not support for LlamaIndex LLMs
# final scores
print(result)
{'faithfulness': 0.9000, 'answer_relevancy': 0.8993, 'context_precision': 0.9000, 'context_recall': 1.0000, 'harmfulness': 0.0000}
You can convert the result into a pandas DataFrame to run more analysis on it.
result.to_pandas()
| | question | contexts | answer | ground_truth | faithfulness | answer_relevancy | context_precision | context_recall | harmfulness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What cultural movement began in New York City ... | [=== 19th century ===\n\nOver the course of th... | The Harlem Renaissance of literary and cultura... | The Harlem Renaissance | 0.5 | 0.907646 | 0.5 | 1.0 | 0 |
| 1 | What is the significance of New York City's tr... | [== Transportation ==\n\nNew York City's compr... | New York City's transportation system is signi... | New York City's transportation system is both ... | 1.0 | 0.986921 | 1.0 | 1.0 | 0 |
| 2 | What factors led to the creation of Central Pa... | [=== 19th century ===\n\nOver the course of th... | Prominent American literary figures lived in N... | Public-minded members of the contemporaneous b... | 1.0 | 0.805014 | 1.0 | 1.0 | 0 |
| 3 | What was the impact of the Treaty of Breda on ... | [=== Dutch rule ===\n\nA permanent European pr... | The Treaty of Breda resulted in the transfer o... | The Treaty of Breda confirmed the transfer of ... | 1.0 | 0.860931 | 1.0 | 1.0 | 0 |
| 4 | What role did New York play in the American Re... | [=== Province of New York and slavery ===\n\nI... | New York served as a significant location duri... | New York played a significant role in the Amer... | 1.0 | 0.935846 | 1.0 | 1.0 | 0 |
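As one example of such analysis, the per-question scores above can be averaged and compared with the final scores printed earlier. The sketch below copies the numbers from the table into plain Python so it runs without pandas; for this run the column means match the printed aggregate, though that is an observation about these results rather than a statement about how Ragas computes it.

```python
from statistics import mean

# Per-question scores copied from the table above.
rows = [
    {"faithfulness": 0.5, "answer_relevancy": 0.907646, "context_precision": 0.5, "context_recall": 1.0, "harmfulness": 0},
    {"faithfulness": 1.0, "answer_relevancy": 0.986921, "context_precision": 1.0, "context_recall": 1.0, "harmfulness": 0},
    {"faithfulness": 1.0, "answer_relevancy": 0.805014, "context_precision": 1.0, "context_recall": 1.0, "harmfulness": 0},
    {"faithfulness": 1.0, "answer_relevancy": 0.860931, "context_precision": 1.0, "context_recall": 1.0, "harmfulness": 0},
    {"faithfulness": 1.0, "answer_relevancy": 0.935846, "context_precision": 1.0, "context_recall": 1.0, "harmfulness": 0},
]

# Column-wise mean, rounded to four decimals like the printed result.
aggregate = {name: round(mean(row[name] for row in rows), 4) for name in rows[0]}
print(aggregate)

# Indices of questions whose answers were not fully faithful to the retrieved context.
weak = [i for i, row in enumerate(rows) if row["faithfulness"] < 1.0]
print(weak)
```

Rows flagged this way are a good starting point for inspecting the retrieved contexts and the generated answer side by side.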