Evaluation Sample
An evaluation sample is a single structured data instance used to assess and measure the performance of your LLM application in a specific scenario. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the SingleTurnSample and MultiTurnSample classes.
SingleTurnSample
SingleTurnSample represents a single-turn interaction between a user and an LLM, together with the expected results for evaluation. It is suitable for evaluations that involve a single question-and-answer pair, possibly with additional context or reference information.
Example
The following example demonstrates how to create a SingleTurnSample instance for evaluating a single-turn interaction in a RAG-based application. In this scenario, a user asks a question, and the AI provides an answer. We’ll create a SingleTurnSample instance to represent this interaction, including any retrieved contexts, reference answers, and evaluation rubrics.
from ragas import SingleTurnSample
# User's question
user_input = "What is the capital of France?"
# Retrieved contexts (e.g., from a knowledge base or search engine)
retrieved_contexts = ["Paris is the capital and most populous city of France."]
# AI's response
response = "The capital of France is Paris."
# Reference answer (ground truth)
reference = "Paris"
# Evaluation rubric
rubric = {
    "accuracy": "Correct",
    "completeness": "High",
    "fluency": "Excellent"
}
# Create the SingleTurnSample instance
sample = SingleTurnSample(
    user_input=user_input,
    retrieved_contexts=retrieved_contexts,
    response=response,
    reference=reference,
    rubric=rubric
)
MultiTurnSample
MultiTurnSample represents a multi-turn interaction between a human, an AI, and optionally a tool, together with the expected results for evaluation. It is suitable for evaluating conversational agents in more complex interactions. In MultiTurnSample, the user_input attribute is a sequence of messages that collectively form a multi-turn conversation between a human user and an AI system. These messages are instances of the classes HumanMessage, AIMessage, and ToolMessage.
Example
The following example demonstrates how to create a MultiTurnSample instance for evaluating a multi-turn interaction. In this scenario, a user wants to know the current weather in New York City. The AI assistant will use a weather API tool to fetch the information and respond to the user.
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
# User asks about the weather in New York City
user_message = HumanMessage(content="What's the weather like in New York City today?")
# AI decides to use a weather API tool to fetch the information
ai_initial_response = AIMessage(
    content="Let me check the current weather in New York City for you.",
    tool_calls=[ToolCall(name="WeatherAPI", args={"location": "New York City"})]
)
# Tool provides the weather information
tool_response = ToolMessage(content="It's sunny with a temperature of 75°F in New York City.")
# AI delivers the final response to the user
ai_final_response = AIMessage(content="It's sunny and 75 degrees Fahrenheit in New York City today.")
# Combine all messages into a list to represent the conversation
conversation = [
    user_message,
    ai_initial_response,
    tool_response,
    ai_final_response
]
Now, use the conversation to create a MultiTurnSample object, including any reference responses and evaluation rubrics.