Synthetic test data generation
This tutorial shows how to create a synthetic evaluation dataset for assessing your RAG pipeline. We will use OpenAI models for this, so make sure your OpenAI API key is available in your environment.
import os

os.environ["OPENAI_API_KEY"] = "your-openai-key"
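Hard-coding the key in a script is convenient for a tutorial, but it can easily end up in version control. A minimal sketch of an alternative, assuming the key was already exported in your shell: `require_env` is a hypothetical helper written for this example, not part of Ragas or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, failing early if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running this tutorial")
    return value


# Usage (commented out so the sketch runs without a real key):
# api_key = require_env("OPENAI_API_KEY")
```

Failing at startup gives a clearer error than a rejected API request partway through generation.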
To begin, we need a collection of documents from which to generate synthetic Question/Context/Answer samples. Here, we use a LlamaIndex document loader, SemanticScholarReader, to retrieve them.
from llama_index import download_loader

SemanticScholarReader = download_loader("SemanticScholarReader")
loader = SemanticScholarReader()

# Narrow down the search space
query_space = "large language models"

# Increase the limit to obtain more documents
documents = loader.load_data(query=query_space, limit=10)
At this point, we have a set of documents at our disposal, which will serve as the basis for creating synthetic Question/Context/Answer triplets.
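To make the target shape concrete, one Question/Context/Answer triplet can be pictured as a small record. This is only an illustrative sketch: the `QCASample` class and its field names are assumptions made for this example, not the types Ragas uses internally, and the values are placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass
class QCASample:
    """One synthetic evaluation sample (hypothetical shape, for illustration only)."""

    question: str  # question generated from a document
    context: str   # passage the question was derived from
    answer: str    # answer grounded in that passage


# Placeholder values, not real generated output.
sample = QCASample(
    question="placeholder question",
    context="placeholder passage from a loaded document",
    answer="placeholder answer grounded in the passage",
)
print(asdict(sample))
```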
We will now import and use Ragas' TestsetGenerator to quickly generate a synthetic test set from the loaded documents.
from ragas.testset import TestsetGenerator

testsetgenerator = TestsetGenerator.from_default()
test_size = 10
testset = testsetgenerator.generate(documents, test_size=test_size)
Finally, we can export the results to a pandas DataFrame.
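The export step can be sketched as follows. With the Ragas release this tutorial targets, the call is most likely `testset.to_pandas()`, but treat that method name as an assumption and check it against your installed version. Because running the generator needs an API key, the sketch below builds the DataFrame from hand-written placeholder rows instead; the column names are illustrative and may differ from what Ragas emits.

```python
import pandas as pd

# Placeholder rows standing in for generated samples (not real output).
# Column names are assumptions made for this example.
rows = [
    {
        "question": "placeholder question 1",
        "context": "placeholder context 1",
        "answer": "placeholder answer 1",
    },
    {
        "question": "placeholder question 2",
        "context": "placeholder context 2",
        "answer": "placeholder answer 2",
    },
]

df = pd.DataFrame(rows)
print(df.head())

# With a real test set, the assumed equivalent would be:
# df = testset.to_pandas()
```

Once the samples are in a DataFrame, `df.to_csv("testset.csv", index=False)` would persist them for later evaluation runs; the file name here is arbitrary.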