Synthetic test data generation
This tutorial shows how to create a synthetic evaluation dataset for assessing your RAG pipeline. We will use OpenAI models for this, so please ensure you have your OpenAI API key ready and accessible within your environment.
```python
import os

os.environ["OPENAI_API_KEY"] = "your-openai-key"
```
Documents
To begin, we need a collection of documents from which to generate synthetic Question/Context/Answer samples. Here, we use a LlamaIndex document loader to retrieve papers from Semantic Scholar.
```python
from llama_index import download_loader

# Download and instantiate the Semantic Scholar loader
SemanticScholarReader = download_loader("SemanticScholarReader")
loader = SemanticScholarReader()

# Narrow down the search space
query_space = "large language models"

# Increase the limit to obtain more documents
documents = loader.load_data(query=query_space, limit=10)
```
At this point, we have a set of documents at our disposal, which will serve as the basis for creating synthetic Question/Context/Answer triplets.
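To make the goal concrete, each generated sample pairs a question with the context it was drawn from and a reference answer. The dictionary below is a purely illustrative, hand-written example of that shape; the actual records are produced by Ragas, and its internal representation may differ.

```python
# Hypothetical example of a single Question/Context/Answer triplet;
# the real samples come from the generator, not hand-written dicts.
sample = {
    "question": "What data are large language models trained on?",
    "context": "Large language models are trained on large text corpora...",
    "answer": "They are trained on large corpora of text.",
}

# Every sample carries all three fields
for field in ("question", "context", "answer"):
    print(field, "->", sample[field][:40])
```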
Data Generation
We will now import Ragas' TestsetGenerator and use it to generate a synthetic test set from the loaded documents.
```python
from ragas.testset import TestsetGenerator

# Build a generator with the default configuration
testsetgenerator = TestsetGenerator.from_default()

# Number of Question/Context/Answer samples to generate
test_size = 10
testset = testsetgenerator.generate(documents, test_size=test_size)
```
Subsequently, we can export the results to a pandas DataFrame.
```python
testset.to_pandas()
```
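To reuse the generated samples in later evaluation runs, the DataFrame can be written to disk and reloaded. A minimal sketch, assuming pandas is installed; the small hand-built DataFrame and the file name `testset.csv` are stand-ins for the DataFrame returned by `testset.to_pandas()` and whatever path you prefer.

```python
import pandas as pd

# Hypothetical stand-in for testset.to_pandas(); in the tutorial you
# would use the DataFrame returned by that call instead.
df = pd.DataFrame(
    {
        "question": ["What are large language models?"],
        "context": ["Large language models are neural networks..."],
        "answer": ["They are neural networks trained on text."],
    }
)

df.to_csv("testset.csv", index=False)  # persist for later evaluation runs
reloaded = pd.read_csv("testset.csv")  # reload in a later session
print(reloaded.shape)
```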