
LlamaIndex

LlamaIndex is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. It makes it easy to connect LLMs with your own data. But to figure out the best configuration for LlamaIndex and your data, you need an objective measure of performance. This is where Ragas comes in. Ragas helps you evaluate your QueryEngine and gives you the confidence to tweak the configuration to get the highest score.

This guide assumes you are familiar with the LlamaIndex framework.

Building the Testset

You will need a testset to evaluate your QueryEngine against. You can either build one yourself or use the Testset Generator Module in Ragas to get started with a small synthetic one.
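If you build the testset yourself, a plain list of records is enough to start with. The sketch below assumes the field names `user_input` and `reference`, matching the columns of the generated testset in this guide. The helpers `validate_testset` and `to_jsonl` are hypothetical and not part of Ragas, and the single record is illustrative only.

```python
import json

# Field names assumed to match the generated testset columns in this guide.
REQUIRED_KEYS = {"user_input", "reference"}

# Hand-written test cases; add one record per question you care about.
manual_testset = [
    {
        "user_input": "Why was New York named after the Duke of York?",
        "reference": "In 1664, the city was named in honor of the Duke of York.",
    },
]


def validate_testset(rows):
    """Return rows unchanged, raising ValueError on missing or empty fields."""
    for i, row in enumerate(rows):
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing {sorted(missing)}")
        for key in REQUIRED_KEYS:
            if not str(row[key]).strip():
                raise ValueError(f"row {i} has an empty {key!r}")
    return rows


def to_jsonl(rows):
    """Serialize rows as JSON Lines so the testset can be versioned as a file."""
    return "\n".join(json.dumps(row) for row in rows)


validated = validate_testset(manual_testset)
jsonl_text = to_jsonl(validated)
print(jsonl_text)
```

Keeping the hand-written cases in a simple, validated format makes it easier to grow the testset over time and to load it into whatever dataset class your Ragas version provides.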

Let's see how the Testset Generator works with LlamaIndex.

First, load the documents:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./nyc_wikipedia").load_data()

Now let's initialize the TestsetGenerator object with the generator LLM and the embedding model:

from ragas.testset import TestsetGenerator

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# generator with openai models
generator_llm = OpenAI(model="gpt-4o")
embeddings = OpenAIEmbedding(model="text-embedding-3-large")

generator = TestsetGenerator.from_llama_index(
    llm=generator_llm,
    embedding_model=embeddings,
)

Now you are all set to generate the dataset

# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents,
    testset_size=5,
)
df = testset.to_pandas()
df.head()
|   | user_input | reference_contexts | reference | synthesizer_name |
| 0 | Why was New York named after the Duke of York? | [Etymology ==\n\nIn 1664, New York was named i... | New York was named after the Duke of York in 1... | AbstractQuerySynthesizer |
| 1 | How did the early Europan exploraton and setle... | [History ==\n\n\n=== Early history ===\nIn the... | The early European exploration and settlement ... | AbstractQuerySynthesizer |
| 2 | New York City population culture finance diver... | [New York City, the most populous city in the ... | New York City is a global cultural, financial,... | ComparativeAbstractQuerySynthesizer |
| 3 | How do the economic aspects of New York City, ... | [New York City, the most populous city in the ... | New York City's economic aspects, such as its ... | ComparativeAbstractQuerySynthesizer |
| 4 | What role do biomedical research institutions ... | [Education ==\n\n \n\nNew York City has the la... | Biomedical research institutions in New York C... | SpecificQuerySynthesizer |

With a test dataset in hand, let's now build a QueryEngine and evaluate it.

Building the QueryEngine

To start, let's build a VectorStoreIndex over New York City's Wikipedia page as an example and use Ragas to evaluate it.

Since we already loaded the dataset into documents, let's use that.

# build query engine
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

query_engine = vector_index.as_query_engine()

Let's try a sample question from the generated testset to see if it is working.

# convert the testset to a pandas DataFrame
df = testset.to_pandas()
df["user_input"][0]
'Why was New York named after the Duke of York?'
response_vector = query_engine.query(df["user_input"][0])

print(response_vector)
New York was named after the Duke of York because in 1664, the city was named in honor of the Duke of York, who later became King James II of England.

Evaluating the QueryEngine

Now that we have a QueryEngine for the VectorStoreIndex, we can use the LlamaIndex integration in Ragas to evaluate it.

To run an evaluation with Ragas and LlamaIndex you need three things:

  1. LlamaIndex QueryEngine: what we will be evaluating.
  2. Metrics: Ragas defines a set of metrics that measure different aspects of the QueryEngine. The available metrics and their meanings are described in the Ragas metrics documentation.
  3. Questions: A list of questions that Ragas will test the QueryEngine against.

First, let's generate the questions. Ideally you should use questions that you see in production, so that the distribution of questions used for evaluation matches the distribution seen in production. This ensures that the scores reflect production performance, but to start off we'll use a few example questions.
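One way to keep the evaluation set close to the production distribution is to sample logged questions proportionally per category. The sketch below is only an illustration under stated assumptions: the production log, its category labels, and the helper `sample_like_production` are all hypothetical and not part of Ragas or LlamaIndex.

```python
import random
from collections import Counter, defaultdict


def sample_like_production(logged, k, seed=0):
    """Pick about k questions while keeping each category's share of the log."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for category, question in logged:
        by_category[category].append(question)

    total = len(logged)
    sample = []
    for category, questions in sorted(by_category.items()):
        # Proportional allocation, with at least one question per category.
        n = max(1, round(k * len(questions) / total))
        sample.extend(rng.sample(questions, min(n, len(questions))))
    return sample


# Hypothetical production log of (category, question) pairs.
logged_questions = (
    [("history", f"history question {i}") for i in range(60)]
    + [("economy", f"economy question {i}") for i in range(30)]
    + [("culture", f"culture question {i}") for i in range(10)]
)

questions = sample_like_production(logged_questions, k=10)
category_counts = Counter(q.split()[0] for q in questions)
print(category_counts)
```

With this log the sample keeps the 60/30/10 split of the original traffic. Because of rounding and the one-per-category minimum, the sample size can differ slightly from `k` for other inputs.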

Now let's import the metrics we will use for the evaluation.

# import metrics
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
)

# init metrics with evaluator LLM
from ragas.llms import LlamaIndexLLMWrapper

evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))
metrics = [
    Faithfulness(llm=evaluator_llm),
    AnswerRelevancy(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
]

The evaluate() function expects a Ragas EvaluationDataset containing the questions (user_input) and ground truths (reference) that the metrics need. You can easily convert the testset to that format.

# convert to Ragas Evaluation Dataset
ragas_dataset = testset.to_evaluation_dataset()
ragas_dataset
EvaluationDataset(features=['user_input', 'reference_contexts', 'reference'], len=7)
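Test data written for older Ragas versions often uses column names such as `question` and `ground_truth` rather than the `user_input` and `reference` fields shown in the dataset above. The sketch below renames such records before they are loaded. The mapping is an assumption to check against your Ragas version, and the helper `rename_legacy_keys` is hypothetical, not a Ragas API.

```python
# Assumed mapping from older column names to the ones used in this guide.
LEGACY_TO_CURRENT = {
    "question": "user_input",
    "ground_truth": "reference",
    "contexts": "retrieved_contexts",
    "answer": "response",
}


def rename_legacy_keys(row):
    """Return a copy of row with legacy keys renamed; other keys are kept."""
    return {LEGACY_TO_CURRENT.get(key, key): value for key, value in row.items()}


legacy_row = {
    "question": "Why was New York named after the Duke of York?",
    "ground_truth": "In 1664, the city was named in honor of the Duke of York.",
}

converted_row = rename_legacy_keys(legacy_row)
print(sorted(converted_row))
```

Records that already use the current names pass through unchanged, so the helper can be applied to mixed data.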

Finally, let's run the evaluation.

from ragas.integrations.llama_index import evaluate

result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=ragas_dataset,
)
# final scores
print(result)
{'faithfulness': 0.9746, 'answer_relevancy': 0.9421, 'context_precision': 0.9286, 'context_recall': 0.6857}

You can convert the result into a pandas DataFrame to run more analysis on it.

result.to_pandas()
|   | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
| 0 | What events led to New York being named after ... | [New York City is the headquarters of the glob... | [Etymology ==\n\nIn 1664, New York was named i... | New York was named in honor of the Duke of Yor... | New York was named after the Duke of York in 1... | 1.000000 | 0.950377 | 1.0 | 1.0 |
| 1 | How early European explorers and Native Americ... | [=== Dutch rule ===\n\nA permanent European pr... | [History ==\n\n\n=== Early history ===\nIn the... | Early European explorers established a permane... | Early European explorers and Native Americans ... | 1.000000 | 0.896300 | 1.0 | 0.8 |
| 2 | New York City population economy challenges | [=== Wealth and income disparity ===\nNew York... | [New York City, the most populous city in the ... | New York City has faced challenges related to ... | New York City, as the most populous city in th... | 1.000000 | 0.915717 | 1.0 | 0.0 |
| 3 | How do the economic aspects of New York City, ... | [=== Wealth and income disparity ===\nNew York... | [New York City, the most populous city in the ... | The economic aspects of New York City, as a gl... | New York City's economic aspects as a global c... | 0.913043 | 0.929317 | 1.0 | 0.0 |
| 4 | What are some of the cultural and architectura... | [==== Staten Island ====\nStaten Island (Richm... | [Geography ==\n\nDuring the Wisconsin glaciati... | Brooklyn is known for its cultural diversity, ... | Brooklyn is distinct within New York City due ... | 1.000000 | 0.902664 | 0.5 | 1.0 |
| 5 | What measures has New York City implemented to... | [==== International events ====\nIn terms of h... | [Environment ==\n\n \nEnvironmental issues in ... | New York City has implemented various measures... | New York City has implemented several measures... | 0.909091 | 1.000000 | 1.0 | 1.0 |
| 6 | What role did New York City play during the Am... | [=== Province of New York and slavery ===\n\nI... | [History ==\n\n\n=== Early history ===\nIn the... | New York City served as a significant military... | During the American Revolution, New York City ... | 1.000000 | 1.000000 | 1.0 | 1.0 |
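As a quick check on how the per-row scores relate to the final scores printed earlier, the sketch below averages the four metric columns from the table above and flags the rows with weak context recall. The values are copied by hand from that table. The 0.5 threshold is an arbitrary choice for illustration, and treating the final scores as plain means is an observation from these numbers, not a documented guarantee.

```python
from statistics import mean

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")

# Per-row scores copied from the result table above, in METRICS order.
scores = [
    (1.000000, 0.950377, 1.0, 1.0),
    (1.000000, 0.896300, 1.0, 0.8),
    (1.000000, 0.915717, 1.0, 0.0),
    (0.913043, 0.929317, 1.0, 0.0),
    (1.000000, 0.902664, 0.5, 1.0),
    (0.909091, 1.000000, 1.0, 1.0),
    (1.000000, 1.000000, 1.0, 1.0),
]

# Average each metric column and round to four decimals.
summary = {
    name: round(mean(row[i] for row in scores), 4)
    for i, name in enumerate(METRICS)
}
print(summary)

# Rows whose context_recall falls below an arbitrary 0.5 threshold.
recall_index = METRICS.index("context_recall")
low_recall = [i for i, row in enumerate(scores) if row[recall_index] < 0.5]
print(low_recall)
```

The averages reproduce the final scores reported by `print(result)`, and the flagged rows (2 and 3) are the ones whose retrieved contexts differ most from the reference contexts, which makes them a natural starting point for tuning the retriever.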