Evaluate a simple LLM application
The purpose of this guide is to illustrate a simple workflow for testing and evaluating an LLM application with ragas. It assumes minimal knowledge of AI application building and evaluation. Please refer to our installation instructions for installing ragas.
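If you have not installed it yet, ragas is available on PyPI; a typical installation (assuming a pip-based environment) looks like this:

pip install ragas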
Evaluation
In this guide, you will evaluate a text summarization pipeline. The goal is to ensure that the output summary accurately captures all the key details specified in the text, such as growth figures, market insights, and other essential information.
ragas offers a variety of methods for analyzing the performance of LLM applications, referred to as metrics. Each metric requires a predefined set of data points, which it uses to calculate scores that indicate performance.
Evaluating using a Non-LLM Metric
Here is a simple example that uses the BleuScore metric to score a summary:
from ragas import SingleTurnSample
from ragas.metrics import BleuScore
test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
    "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter."
}
metric = BleuScore()
test_data = SingleTurnSample(**test_data)
metric.single_turn_score(test_data)
Output
Here we used:
- A test sample containing user_input, response (the output from the LLM), and reference (the expected output from the LLM) as data points to evaluate the summary.
- A non-LLM metric called BleuScore to score the summary.
As you may observe, this approach has two key limitations:
- Time-consuming preparation: Evaluating the application requires preparing the expected output (reference) for each input, which can be both time-consuming and challenging.
- Inaccurate scoring: Even though the response and reference are similar, the output score was low. This is a known limitation of non-LLM metrics like BleuScore.
Info
A non-LLM metric refers to a metric that does not rely on an LLM for evaluation.
To address these issues, let's try an LLM-based metric.
Evaluating using an LLM-based Metric
Choose your LLM
Install the langchain-openai package
Ensure you have your OpenAI key ready and available in your environment.
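For example, you can install the package and expose the key through the standard OPENAI_API_KEY environment variable; this is a minimal sketch with a placeholder value:

pip install langchain-openai

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"  # placeholder; use your real key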
Wrap the LLMs in LangchainLLMWrapper so that they can be used with ragas.
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
Install the langchain-aws package
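A typical installation (assuming a pip-based environment) looks like this:

pip install langchain-aws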
Then set your AWS credentials and configuration:
config = {
    "credentials_profile_name": "your-profile-name",  # E.g. "default"
    "region_name": "your-region-name",  # E.g. "us-east-1"
    "llm": "your-llm-model-id",  # E.g. "anthropic.claude-3-5-sonnet-20241022-v2:0"
    "embeddings": "your-embedding-model-id",  # E.g. "amazon.titan-embed-text-v2:0"
    "temperature": 0.4,
}
Define your LLMs and wrap them in LangchainLLMWrapper so that they can be used with ragas.
from langchain_aws import ChatBedrockConverse
from langchain_aws import BedrockEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_llm = LangchainLLMWrapper(ChatBedrockConverse(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    base_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
    model=config["llm"],
    temperature=config["temperature"],
))
evaluator_embeddings = LangchainEmbeddingsWrapper(BedrockEmbeddings(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    model_id=config["embeddings"],
))
If you want more information on how to use other AWS services, please refer to the langchain-aws documentation.
Install the langchain-openai package
Ensure you have your Azure OpenAI key ready and available in your environment.
import os
os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-openai-key"
# other configuration
azure_config = {
    "base_url": "",  # your endpoint
    "model_deployment": "",  # your model deployment name
    "model_name": "",  # your model name
    "embedding_deployment": "",  # your embedding deployment name
    "embedding_name": "",  # your embedding name
}
Define your LLMs and wrap them in LangchainLLMWrapper so that they can be used with ragas.
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    validate_base_url=False,
))

# init the embeddings for answer_relevancy, answer_correctness and answer_similarity
evaluator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
))
If you want more information on how to use other Azure services, please refer to the langchain-azure documentation.
If you are using a different LLM provider and Langchain to interact with it, you can wrap your LLM in LangchainLLMWrapper so that it can be used with ragas.
For a more detailed guide, check out the guide on customizing models.
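As an illustration, a hypothetical Anthropic setup through langchain-anthropic might look like this sketch (the package and model name here are assumptions, not part of this guide):

from ragas.llms import LangchainLLMWrapper
from langchain_anthropic import ChatAnthropic

# any Langchain chat model can be wrapped the same way
evaluator_llm = LangchainLLMWrapper(ChatAnthropic(model="claude-3-5-sonnet-20241022"))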
If you are using LlamaIndex, you can use the LlamaIndexLLMWrapper to wrap your LLM so that it can be used with ragas.
For more information on how to use LlamaIndex, please refer to the LlamaIndex Integration guide.
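As a rough sketch, assuming an OpenAI model served through LlamaIndex (adjust the imports and model to your setup):

from ragas.llms import LlamaIndexLLMWrapper
from llama_index.llms.openai import OpenAI

# wrap the LlamaIndex LLM so ragas metrics can use it as the evaluator
evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))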
If you're still not able to use Ragas with your favorite LLM provider, please let us know by commenting on this issue and we'll add support for it 🙂.
Evaluation
Here we will use AspectCritic, which is an LLM-based metric that outputs pass/fail based on the given evaluation criteria.
from ragas import SingleTurnSample
from ragas.metrics import AspectCritic
test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
}

metric = AspectCritic(name="summary_accuracy", llm=evaluator_llm, definition="Verify if the summary is accurate.")
test_data = SingleTurnSample(**test_data)
await metric.single_turn_ascore(test_data)
Output
Success! Here, 1 means pass and 0 means fail.
Info
There are many other types of metrics available in ragas (with and without reference), and you may also create your own metrics if none of them fits your case. To explore this further, check out more on metrics.
Evaluating on a Dataset
In the examples above, we used only a single sample to evaluate our application. However, evaluating on just one sample is not robust enough to trust the results. To ensure the evaluation is reliable, you should add more test samples to your test data.
Here, we’ll load a dataset from Hugging Face Hub, but you can load data from any source, such as production logs or other datasets. Just ensure that each sample includes all the required attributes for the chosen metric.
In our case, the required attributes are:
- user_input: The input provided to the application (here, the input text report).
- response: The output generated by the application (here, the generated summary).
For example:
[
    # Sample 1
    {
        "user_input": "summarise given text\nThe Q2 earnings report revealed a significant 15% increase in revenue, ...",
        "response": "The Q2 earnings report showed a 15% revenue increase, ...",
    },
    # Additional samples in the dataset
    ....,
    # Sample N
    {
        "user_input": "summarise given text\nIn 2023, North American sales experienced a 5% decline, ...",
        "response": "Companies are strategizing to adapt to market challenges and ...",
    }
]
from datasets import load_dataset
from ragas import EvaluationDataset
eval_dataset = load_dataset("explodinggradients/earning_report_summary", split="train")
eval_dataset = EvaluationDataset.from_hf_dataset(eval_dataset)
print("Features in dataset:", eval_dataset.features())
print("Total samples in dataset:", len(eval_dataset))
Output
Evaluate using dataset
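For instance, the run might look like the following sketch, assuming the evaluator_llm and the summary_accuracy metric defined above:

from ragas import evaluate

# score every sample in the evaluation dataset with the chosen metric
results = evaluate(eval_dataset, metrics=[metric])
results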
Output
This score shows that, out of all the samples in our test data, only 84% of the summaries pass the given evaluation criteria. Now it's important to see why this is the case.
Export the sample-level scores to a pandas DataFrame.
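A minimal sketch, assuming results is the object returned by evaluate() above:

results.to_pandas()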
Output
user_input response summary_accuracy
0 summarise given text\nThe Q2 earnings report r... The Q2 earnings report showed a 15% revenue in... 1
1 summarise given text\nIn 2023, North American ... Companies are strategizing to adapt to market ... 1
2 summarise given text\nIn 2022, European expans... Many companies experienced a notable 15% growt... 1
3 summarise given text\nSupply chain challenges ... Supply chain challenges in North America, caus... 1
Viewing the sample-level results in a DataFrame, as shown above, is fine for quick checks, but it is not ideal for detailed analysis or for comparing results across evaluation runs. For a better experience, use app.ragas.io to view, analyze, and compare evaluation results interactively.
Analyzing Results
For this, you may sign up and set up app.ragas.io easily. If not, you may use any alternative tools available to you.
In order to use the app.ragas.io dashboard, you need to have an account on app.ragas.io. If you don't have one, you can sign up for one here. You will also need to generate a Ragas APP token.
Once you have the app token, you can use the upload() method to export the results to the dashboard.
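A minimal sketch, assuming the token is made available through the RAGAS_APP_TOKEN environment variable (placeholder value shown) and results is the object returned by evaluate():

import os

os.environ["RAGAS_APP_TOKEN"] = "your-app-token"  # placeholder; use your real Ragas app token

# push the evaluation results to the app.ragas.io dashboard
results.upload()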
Now you can view the results in the dashboard by following the link in the output of the upload()
method.
Aligning Metrics
In the example above, we can see that the LLM-based metric mistakenly marks some summaries as accurate even though they miss critical details, such as growth numbers and the market concerned. Such mistakes can occur when the metric does not align with your specific evaluation preferences.
To fix these results, ragas provides a way to align the metric with your preferences, allowing it to learn like a machine learning model. Here's how you can do this in three simple steps:
- Annotate: Accept, reject, or edit evaluation results to create training data (at least 15-20 samples).
- Download: Save the annotated data using the Annotated JSON button in app.ragas.io.
- Train: Use the annotated data to train your custom metric.
To learn more about this, refer to the how to train your own metric guide.
Download sample annotated JSON
from ragas.config import InstructionConfig, DemonstrationConfig
demo_config = DemonstrationConfig(embedding=evaluator_embeddings)
inst_config = InstructionConfig(llm=evaluator_llm)
metric.train(path="<your-annotated-json.json>", demonstration_config=demo_config, instruction_config=inst_config)
Once trained, you can re-evaluate the same or different test datasets. You should notice that the metric now aligns with your preferences and makes fewer mistakes, improving its accuracy.