Migration from v0.3 to v0.4
Ragas v0.4 introduces a fundamental shift towards an experiment-based architecture. This represents the most significant change since v0.2, moving from isolated metric evaluations to a cohesive experimentation framework where evaluation, analysis, and iteration are tightly integrated.
This architectural change led to several concrete improvements:
- Collections-Based Metrics System - A standardized approach to metrics that work seamlessly within experiments
- Unified LLM Factory System - Simplified LLM initialization with universal provider support
- Modern Prompt System - Function-based prompts that are more composable and reusable
This guide will walk you through the key changes and provide step-by-step migration instructions.
Overview of Major Changes
The shift to experiment-based architecture focuses on three core improvements:
- Experiment-Centric Design - Move from one-off metric runs to structured experimentation workflows with integrated analysis
- Collections-Based Metrics - Metrics designed to work within experiments, returning structured results for better analysis and tracking
- Enhanced LLM & Prompt System - Universal provider support and modern prompt patterns enabling better experimentation
Key Statistics
- Metrics Migrated: 20+ core metrics to the new collections system
- Breaking Changes: 7+ major API changes
- Deprecations: Legacy wrapper classes and old prompt definitions
- New Features: GPT-5/o-series support, automatic constraint handling, universal provider support
Understanding the Experiment-Based Architecture
Before migrating, it helps to understand the shift in thinking:
v0.3 (Metric-Centric):
Each metric run was relatively isolated. You'd run a metric, get a float score, and handle tracking/analysis externally.
v0.4 (Experiment-Centric):
Metrics now work within an experimentation context where evaluation, analysis, and iteration are integrated. This enables:
- Better tracking of metric results with explanations
- Easier comparison across experiment runs
- Built-in support for analyzing metric behavior
- Cleaner workflows for iterating on your system
Migration Path
We recommend migrating in this order:
- Update evaluation approach (Section: Evaluation to Experiment) - Switch from
evaluate()toexperiment() - Update your LLM setup (Section: LLM Initialization)
- Migrate metrics (Section: Metrics Migration)
- Migrate embeddings (Section: Embeddings Migration)
- Update prompts (Section: Prompt System Migration) - If you're customizing prompts
- Update data schemas (Section: Data Schema Changes)
- Refactor custom metrics (Section: Custom Metrics)
Evaluation to Experiment
v0.4 replaces the evaluate() function with an experiment()-based approach to better support iterative evaluation workflows and structured result tracking.
What Changed
The key shift: move from a simple evaluation function (evaluate()) that returns scores to an experiment decorator (@experiment()) that supports structured workflows with built-in tracking and versioning.
Before (v0.3)
from ragas import evaluate
from ragas.metrics.collections import Faithfulness, AnswerRelevancy
# Setup
dataset = ... # Your dataset
metrics = [Faithfulness(llm=llm), AnswerRelevancy(llm=llm)]
# Simple evaluation
result = evaluate(
dataset=dataset,
metrics=metrics,
llm=llm,
embeddings=embeddings
)
print(result) # Returns EvaluationResult with scores
After (v0.4)
from ragas import experiment
from ragas.metrics.collections import Faithfulness, AnswerRelevancy
from pydantic import BaseModel
# Define experiment result structure
class ExperimentResult(BaseModel):
faithfulness: float
answer_relevancy: float
# Create experiment function
@experiment(ExperimentResult)
async def run_evaluation(row):
faithfulness = Faithfulness(llm=llm)
answer_relevancy = AnswerRelevancy(llm=llm)
faith_result = await faithfulness.ascore(
response=row.response,
retrieved_contexts=row.contexts
)
relevancy_result = await answer_relevancy.ascore(
user_input=row.user_input,
response=row.response
)
return ExperimentResult(
faithfulness=faith_result.value,
answer_relevancy=relevancy_result.value
)
# Run experiment
exp_results = await run_evaluation(dataset)
Benefits of Using experiment()
- Structured Results - Define exactly what you want to track
- Per-Row Control - Customize evaluation per sample if needed
- Version Tracking - Optional git integration via
version_experiment() - Iterative Workflows - Easy to modify and re-run experiments
- Better Integration - Works seamlessly with modern metrics and datasets
LLM Initialization
What Changed
The v0.3 system required different factory functions depending on your use case:
instructor_llm_factory()for metrics requiring instructorllm_factory()for general LLM operations- Various wrapper classes for LangChain and LlamaIndex
v0.4 consolidates everything into a single unified factory:
This factory:
- Returns
InstructorBaseRagasLLMwith guaranteed structured outputs - Automatically detects and configures provider-specific constraints
- Supports GPT-5 and o-series models with automatic
temperatureandtop_pconstraints - Works with all major providers: OpenAI, Anthropic, Cohere, Google, Azure, Bedrock, etc.
Before (v0.3)
from ragas.llms import instructor_llm_factory, llm_factory
from openai import AsyncOpenAI
# For metrics that need instructor
llm = instructor_llm_factory("openai", model="gpt-4o-mini", client=AsyncOpenAI(api_key="..."))
# Or, the old way (not recommended, still supported in 0.3)
client = AsyncOpenAI(api_key="sk-...")
llm = llm_factory("openai", model="gpt-4o-mini", client=client)
After (v0.4)
from ragas.llms import llm_factory
from openai import AsyncOpenAI
# Single unified approach - works everywhere
client = AsyncOpenAI(api_key="sk-...")
llm = llm_factory("gpt-4o-mini", client=client)
Key differences:
| Aspect | v0.3 | v0.4 |
|---|---|---|
| Factory function | instructor_llm_factory() or llm_factory() |
llm_factory() |
| Provider detection | Manual via provider string | Automatic from model name |
| Return type | BaseRagasLLM (various) |
InstructorBaseRagasLLM |
| Constraint handling | Manual configuration | Automatic for GPT-5/o-series |
| Async client required | Yes | Yes |
Migration Steps
-
Update imports:
-
Replace factory calls:
-
Update with other providers (model name detection works automatically):
LLM Wrapper Classes (Deprecated)
If you were using wrapper classes, they are now deprecated and will be removed in the future:
Migration: Replace wrapper initialization with direct llm_factory() calls. The factory now handles provider detection automatically.
Metrics Migration
Why Metrics Changed
The shift to experiment-based architecture required metrics to integrate better with the experimentation workflow:
- Structured Results: Metrics now return
MetricResultobjects (with score + reasoning) instead of raw floats, enabling richer analysis and tracking within experiments - Keyword Arguments: Moving from sample objects to direct keyword arguments makes metrics easier to compose and integrate with experimental pipelines
- Standardized Input/Output: Collections-based metrics follow a consistent pattern, making it easier to build meta-analysis and experimentation features on top
Architectural Changes
The metrics system has been completely redesigned to support experiment workflows. Here are the core differences:
Base Class Changes
| Aspect | v0.3 | v0.4 |
|---|---|---|
| Import | from ragas.metrics import Metric |
from ragas.metrics.collections import Metric |
| Base Class | MetricWithLLM, SingleTurnMetric |
BaseMetric (from collections) |
| Scoring Method | async def single_turn_ascore(sample: SingleTurnSample) |
async def ascore(**kwargs) |
| Input Type | SingleTurnSample objects |
Individual keyword arguments |
| Output Type | float score |
MetricResult (with .value and optional .reason) |
| LLM Parameter | Required at initialization | Required at initialization |
Scoring Workflow
v0.3 Approach:
# 1. Create a sample object containing all data
sample = SingleTurnSample(
user_input="What is AI?",
response="AI is artificial intelligence...",
retrieved_contexts=["Context 1", "Context 2"],
ground_truths=["AI definition"]
)
# 2. Call metric with the sample
metric = Faithfulness(llm=llm)
score = await metric.single_turn_ascore(sample) # Returns: 0.85
v0.4 Approach:
# 1. Call metric with individual arguments
metric = Faithfulness(llm=llm)
result = await metric.ascore(
user_input="What is AI?",
response="AI is artificial intelligence...",
retrieved_contexts=["Context 1", "Context 2"]
)
# 2. Access result properties
print(result.value) # Score: 0.85 (float)
print(result.reason) # Optional explanation
Available Metrics in v0.4
The following metrics have been successfully migrated to the collections system in v0.4:
RAG Evaluation Metrics
- Faithfulness - Is the response grounded in retrieved context? (v0.3.9+)
- AnswerRelevancy - Is the response relevant to the user query? (v0.3.9+)
- AnswerCorrectness - Does the response match the reference answer? (v0.3.9+)
- AnswerAccuracy - Is the answer factually accurate?
- ContextPrecision - Are retrieved contexts ranked by relevance? (v0.3.9+)
- With reference:
ContextPrecisionWithReference - Without reference:
ContextPrecisionWithoutReference - Legacy name:
ContextUtilization(now a wrapper for ContextPrecisionWithoutReference) - ContextRecall - Are all relevant contexts successfully retrieved? (v0.3.9+)
- ContextRelevance - What percentage of retrieved context is relevant? (v0.3.9+)
- ContextEntityRecall - Are important entities from reference in context? (v0.3.9+)
- NoiseSensitivity - How robust is the metric to irrelevant context? (v0.3.9+)
- ResponseGroundedness - Are all claims grounded in retrieved context?
Text Comparison Metrics
- SemanticSimilarity - Do two texts have similar semantic meaning? (v0.3.9+)
- FactualCorrectness - Are factual claims verified correctly? (v0.3.9+)
- BleuScore - Bilingual evaluation understudy score (v0.3.9+)
- RougeScore - Recall-oriented understudy for gisting evaluation (v0.3.9+)
String-Based Metrics (Non-LLM)
- ExactMatch - Exact string matching
- StringPresence - Substring presence checking
- LevenshteinDistance - Edit distance similarity
- MatchingSubstrings - Count of matching substrings
- NonLLMStringSimilarity - Various string similarity algorithms
Summary Metrics
- SummaryScore - Overall summary quality assessment (v0.3.9+)
Removed Metrics (No Longer Available)
- AspectCritic - Use
@discrete_metric()decorator instead - SimpleCriteria - Use
@discrete_metric()decorator instead - AnswerSimilarity - Use
SemanticSimilarityinstead
Agent & Tool Metrics (Migrated)
- ToolCallAccuracy -
ragas.metrics.collections.ToolCallAccuracy - ToolCallF1 -
ragas.metrics.collections.ToolCallF1 - TopicAdherence -
ragas.metrics.collections.TopicAdherence - AgentGoalAccuracy -
ragas.metrics.collections.AgentGoalAccuracy
SQL & Data Metrics (Migrated)
- DataCompy Score -
ragas.metrics.collections.DataCompyScore - SQL Query Equivalence -
ragas.metrics.collections.SQLSemanticEquivalence
Rubric Metrics (Migrated)
- DomainSpecificRubrics -
ragas.metrics.collections.DomainSpecificRubrics - InstanceSpecificRubrics -
ragas.metrics.collections.InstanceSpecificRubrics
String & NLP Metrics (Migrated)
- CHRF Score -
ragas.metrics.collections.CHRFScore(character n-gram F-score) - Quoted Spans Alignment -
ragas.metrics.collections.QuotedSpansAlignment(citation verification)
Specialized Metrics (Not Yet Migrated)
- Multi-Modal Faithfulness - Still on old architecture (Pending migration)
- Multi-Modal Relevance - Still on old architecture (Pending migration)
Migration Status
Most core metrics have been migrated to the collections system. Only multi-modal metrics remain on the legacy architecture.
The remaining metrics will be migrated in future v0.4.x releases. You can still use legacy metrics with the old API, though they will show deprecation warnings.
Step-by-Step Migration
Step 1: Update Imports
# v0.4
from ragas.metrics.collections import (
Faithfulness,
AnswerRelevancy,
ContextPrecision,
ContextRecall
)
Step 2: Initialize Metrics (No Change Required)
Step 3: Update Metric Scoring Calls
Replace single_turn_ascore(sample) with ascore(**kwargs):
# v0.3
sample = SingleTurnSample(
user_input="What is AI?",
response="AI is artificial intelligence.",
retrieved_contexts=["AI is a technology..."],
ground_truths=["AI definition"]
)
score = await metric.single_turn_ascore(sample)
print(score) # Output: 0.85
# v0.4
result = await metric.ascore(
user_input="What is AI?",
response="AI is artificial intelligence.",
retrieved_contexts=["AI is a technology..."]
)
print(result.value) # Output: 0.85
print(result.reason) # Optional: "Response is faithful to context"
Step 4: Handle MetricResult Objects
In v0.4, metrics return MetricResult objects instead of raw floats:
from ragas.metrics.collections.base import MetricResult
result = await metric.ascore(...)
# Access the score
score_value = result.value # float between 0 and 1
# Access the explanation (if available)
if result.reason:
print(f"Reason: {result.reason}")
# Convert to float for compatibility
score_float = float(result.value)
Metric-Specific Migrations
Faithfulness
Before (v0.3):
sample = SingleTurnSample(
user_input="What is machine learning?",
response="ML is a subset of AI.",
retrieved_contexts=["ML involves algorithms..."]
)
score = await metric.single_turn_ascore(sample)
After (v0.4):
result = await metric.ascore(
user_input="What is machine learning?",
response="ML is a subset of AI.",
retrieved_contexts=["ML involves algorithms..."]
)
score = result.value
AnswerRelevancy
Before (v0.3):
sample = SingleTurnSample(
user_input="What is Python?",
response="Python is a programming language..."
)
score = await metric.single_turn_ascore(sample)
After (v0.4):
result = await metric.ascore(
user_input="What is Python?",
response="Python is a programming language..."
)
score = result.value
AnswerCorrectness
Note: This metric now uses reference instead of ground_truths:
Before (v0.3):
sample = SingleTurnSample(
user_input="What is AI?",
response="AI is artificial intelligence.",
ground_truths=["AI is artificial intelligence and machine learning."]
)
score = await metric.single_turn_ascore(sample)
After (v0.4):
result = await metric.ascore(
user_input="What is AI?",
response="AI is artificial intelligence.",
reference="AI is artificial intelligence and machine learning."
)
score = result.value
ContextPrecision
Before (v0.3):
sample = SingleTurnSample(
user_input="What is RAG?",
response="RAG improves LLM accuracy.",
retrieved_contexts=["RAG = Retrieval Augmented Generation...", "..."],
ground_truths=["RAG definition"]
)
score = await metric.single_turn_ascore(sample)
After (v0.4):
result = await metric.ascore(
user_input="What is RAG?",
response="RAG improves LLM accuracy.",
retrieved_contexts=["RAG = Retrieval Augmented Generation...", "..."],
reference="RAG definition"
)
score = result.value
Prompt System Migration
Why Prompts Changed
The shift to a modular architecture means prompts are now first-class components that can be:
- Customized per metric - Each metric has a well-defined prompt interface
- Type-safe - Input/Output models define exact structure expected
- Reusable - Prompt classes follow a consistent pattern across metrics
- Testable - Prompts can be generated and inspected independently
v0.3 used simple string-based or dataclass prompts scattered throughout metrics. v0.4 consolidates them into a unified BasePrompt architecture with dedicated input/output models.
Architectural Changes
Base Prompt System
| Aspect | v0.3 | v0.4 |
|---|---|---|
| Prompt Definition | PydanticPrompt dataclasses or strings |
BasePrompt classes with to_string() method |
| Input/Output Types | Generic Pydantic models | Metric-specific Input/Output models |
| Access Method | Scatter across metric code | Centralized in metric's util.py module |
| Customization | Difficult, requires deep changes | Simple subclassing with instruction and examples properties |
| Organization | Mixed in metric files | Organized in separate util.py files |
Available Metric Prompts in v0.4
The following metrics now have well-defined, customizable prompts:
- Faithfulness -
FaithfulnessPrompt,FaithfulnessInput,FaithfulnessOutput - Context Recall -
ContextRecallPrompt,ContextRecallInput,ContextRecallOutput - Context Precision -
ContextPrecisionPrompt,ContextPrecisionInput,ContextPrecisionOutput - Answer Relevancy -
AnswerRelevancyPrompt,AnswerRelevancyInput,AnswerRelevancyOutput - Answer Correctness -
AnswerCorrectnessPrompt,AnswerCorrectnessInput,AnswerCorrectnessOutput - Response Groundedness -
ResponseGroundednessPrompt,ResponseGroundednessInput,ResponseGroundednessOutput - Answer Accuracy -
AnswerAccuracyPrompt,AnswerAccuracyInput,AnswerAccuracyOutput - Context Relevance -
ContextRelevancePrompt,ContextRelevanceInput,ContextRelevanceOutput - Context Entity Recall -
ContextEntityRecallPrompt,ContextEntityRecallInput,ContextEntityRecallOutput - Factual Correctness -
ClaimDecompositionPrompt,VerificationPrompt, with associated Input/Output models - Noise Sensitivity -
NoiseAugmentationPromptwith associated models - Summary Score -
SummaryScorePrompt,SummaryScoreInput,SummaryScoreOutput
Step-by-Step Migration
Step 1: Access Prompts in Your Metrics
from ragas.metrics.collections import Faithfulness
from ragas.llms import llm_factory
# Create metric instance
metric = Faithfulness(llm=llm)
# Access the prompt object
print(metric.prompt) # <ragas.metrics.collections.faithfulness.util.FaithfulnessPrompt>
Step 2: View Prompt Strings
from ragas.metrics.collections.faithfulness.util import FaithfulnessInput
# Create sample input
sample_input = FaithfulnessInput(
response="The Eiffel Tower is in Paris.",
context="The Eiffel Tower is located in Paris, France."
)
# Generate prompt string
prompt_string = metric.prompt.to_string(sample_input)
print(prompt_string)
Step 3: Customize Prompts (If Needed)
Option A: Subclass the default prompt
from ragas.metrics.collections import Faithfulness
from ragas.metrics.collections.faithfulness.util import FaithfulnessPrompt
# Create custom prompt by subclassing
class CustomFaithfulnessPrompt(FaithfulnessPrompt):
@property
def instruction(self):
return """Your custom instruction here."""
# Apply to metric
metric = Faithfulness(llm=llm)
metric.prompt = CustomFaithfulnessPrompt()
Option B: Customize examples for domain-specific evaluation
from ragas.metrics.collections.faithfulness.util import (
FaithfulnessInput,
FaithfulnessOutput,
FaithfulnessPrompt,
StatementFaithfulnessAnswer,
)
class DomainSpecificPrompt(FaithfulnessPrompt):
examples = [
(
FaithfulnessInput(
response="ML uses statistical techniques.",
context="Machine learning is a field that uses algorithms to learn from data.",
),
FaithfulnessOutput(
statements=[
StatementFaithfulnessAnswer(
statement="ML uses statistical techniques.",
reason="Related to learning from data, but context doesn't explicitly mention statistical techniques.",
verdict=0
),
]
),
),
]
# Apply custom prompt
metric = Faithfulness(llm=llm)
metric.prompt = DomainSpecificPrompt()
Common Prompt Customizations
Changing Instructions
Most metrics allow overriding the instruction property:
class StrictFaithfulnessPrompt(FaithfulnessPrompt):
@property
def instruction(self):
return """Be very strict when judging faithfulness.
Only mark statements as faithful (verdict=1) if they are directly stated or strongly implied."""
Adding Domain Examples
Domain-specific examples significantly improve metric accuracy (10-20% improvement):
Changing Output Format
For advanced customization, subclass the prompt and override the to_string() method:
class CustomPrompt(FaithfulnessPrompt):
def to_string(self, input: FaithfulnessInput) -> str:
# Custom prompt generation logic
return "..."
Verifying Custom Prompts
Always verify your custom prompts before using them:
# Test prompt generation
sample_input = FaithfulnessInput(
response="Test response.",
context="Test context."
)
custom_metric = Faithfulness(llm=llm)
custom_metric.prompt = MyCustomPrompt()
# View the generated prompt
prompt_string = custom_metric.prompt.to_string(sample_input)
print(prompt_string)
# Then use it for evaluation
result = await custom_metric.ascore(
response="Test response.",
context="Test context."
)
Migration from v0.3 Custom Prompts
If you had custom prompts in v0.3 using PydanticPrompt:
Before (v0.3) - Dataclass approach:
from ragas.prompt.pydantic_prompt import PydanticPrompt
from pydantic import BaseModel
class MyInput(BaseModel):
response: str
context: str
class MyOutput(BaseModel):
is_faithful: bool
class MyPrompt(PydanticPrompt[MyInput, MyOutput]):
instruction = "Check if response is faithful to context"
input_model = MyInput
output_model = MyOutput
examples = [...]
After (v0.4) - BasePrompt approach:
from ragas.metrics.collections.base import BasePrompt
from pydantic import BaseModel
class MyInput(BaseModel):
response: str
context: str
class MyOutput(BaseModel):
is_faithful: bool
class MyPrompt(BasePrompt):
@property
def instruction(self):
return "Check if response is faithful to context"
@property
def input_model(self):
return MyInput
@property
def output_model(self):
return MyOutput
@property
def examples(self):
return [...]
def to_string(self, input: MyInput) -> str:
# Generate prompt string from input
return f"Check if this is faithful: {input.response}"
Language Adaptation with BasePrompt.adapt()
v0.4 introduces the adapt() method on BasePrompt instances for language translation, replacing the deprecated PromptMixin.adapt_prompts() approach.
Before (v0.3) - PromptMixin Approach
from ragas.prompt.mixin import PromptMixin
from ragas.metrics import Faithfulness
# Metrics inherited from PromptMixin to use adapt_prompts
class MyFaithfulness(Faithfulness, PromptMixin):
pass
metric = MyFaithfulness(llm=llm)
# Adapt ALL prompts to another language
adapted_prompts = await metric.adapt_prompts(
language="spanish",
llm=llm,
adapt_instruction=True
)
# Apply all adapted prompts
metric.set_prompts(**adapted_prompts)
Issues with v0.3 approach: - Required mixin inheritance (tightly coupled) - All prompts adapted together (inflexible) - Mixin methods scattered across codebase
After (v0.4) - BasePrompt.adapt() Method
from ragas.metrics.collections import Faithfulness
# Create metric with default prompt
metric = Faithfulness(llm=llm)
# Adapt individual prompt to another language
adapted_prompt = await metric.prompt.adapt(
target_language="spanish",
llm=llm,
adapt_instruction=True
)
# Apply adapted prompt
metric.prompt = adapted_prompt
# Use metric with adapted language
result = await metric.ascore(
response="...",
retrieved_contexts=[...]
)
Save and load prompts will be available in a future version of v0.4.x using BasePrompt. Currently, PromptMixin only has it.
Language Adaptation Examples
Adapt without instruction text (lightweight):
from ragas.metrics.collections import AnswerRelevancy
metric = AnswerRelevancy(llm=llm)
# Only update language field, keep instruction in English
adapted_prompt = await metric.prompt.adapt(
target_language="french",
llm=llm,
adapt_instruction=False # Default - just updates language
)
metric.prompt = adapted_prompt
print(metric.prompt.language) # "french"
Adapt with instruction translation (full translation):
# Translate both instruction and examples
adapted_prompt = await metric.prompt.adapt(
target_language="german",
llm=llm,
adapt_instruction=True # Translate instruction text too
)
metric.prompt = adapted_prompt
# Examples are also automatically translated
# Both instruction and examples in German now
Adapt custom prompts:
from ragas.metrics.collections.faithfulness.util import FaithfulnessPrompt
class CustomFaithfulnessPrompt(FaithfulnessPrompt):
@property
def instruction(self):
return "Custom instruction in English"
prompt = CustomFaithfulnessPrompt(language="english")
# Adapt to Italian
adapted = await prompt.adapt(
target_language="italian",
llm=llm,
adapt_instruction=True
)
# Check language was updated
assert adapted.language == "italian"
Migration from v0.3 to v0.4
Step 1: Remove PromptMixin inheritance
# v0.3
from ragas.prompt.mixin import PromptMixin
from ragas.metrics import Faithfulness
class MyMetric(Faithfulness, PromptMixin): # ← Remove PromptMixin
pass
# v0.4
from ragas.metrics.collections import Faithfulness
# No mixin needed - just use the metric directly
metric = Faithfulness(llm=llm)
Step 2: Replace adapt_prompts() with adapt()
# v0.3
adapted_prompts = await metric.adapt_prompts(
language="spanish",
llm=llm,
adapt_instruction=True
)
metric.set_prompts(**adapted_prompts)
# v0.4
adapted_prompt = await metric.prompt.adapt(
target_language="spanish",
llm=llm,
adapt_instruction=True
)
metric.prompt = adapted_prompt
Complete Migration Example
Before (v0.3):
from ragas.prompt.mixin import PromptMixin
from ragas.metrics import Faithfulness, AnswerRelevancy
class MyMetrics(Faithfulness, AnswerRelevancy, PromptMixin):
pass
# Setup
metrics = MyMetrics(llm=llm)
# Adapt multiple metrics to Spanish
adapted = await metrics.adapt_prompts(
language="spanish",
llm=best_llm,
adapt_instruction=True
)
metrics.set_prompts(**adapted)
metrics.save_prompts("./spanish_prompts")
After (v0.4):
from ragas.metrics.collections import Faithfulness, AnswerRelevancy
# Setup individual metrics
faith_metric = Faithfulness(llm=llm)
answer_metric = AnswerRelevancy(llm=llm)
# Adapt each metric's prompt independently
faith_adapted = await faith_metric.prompt.adapt(
target_language="spanish",
llm=best_llm,
adapt_instruction=True
)
faith_metric.prompt = faith_adapted
answer_adapted = await answer_metric.prompt.adapt(
target_language="spanish",
llm=best_llm,
adapt_instruction=True
)
answer_metric.prompt = answer_adapted
# Use metrics with adapted prompts
faith_result = await faith_metric.ascore(...)
answer_result = await answer_metric.ascore(...)
Data Schema Changes
SingleTurnSample Updates
The SingleTurnSample schema has been updated with breaking changes:
ground_truths → reference
The ground_truths parameter has been renamed to reference across the board:
Before (v0.3):
sample = SingleTurnSample(
user_input="...",
response="...",
ground_truths=["correct answer"] # List of strings
)
After (v0.4):
sample = SingleTurnSample(
user_input="...",
response="...",
reference="correct answer" # Single string
)
- v0.3 used
ground_truthsas a list - v0.4 uses
referenceas a single string - For multiple references, use separate evaluation runs
Updated Schema
from ragas import SingleTurnSample
# v0.4 complete sample
sample = SingleTurnSample(
user_input="What is AI?", # Required
response="AI is artificial intelligence.", # Required
retrieved_contexts=["Context 1", "Context 2"], # Optional
reference="Correct definition of AI" # Optional (was ground_truths)
)
EvaluationDataset Updates
If you're using EvaluationDataset, update your data loading:
Before (v0.3):
dataset = EvaluationDataset(
samples=[
SingleTurnSample(
user_input="Q1",
response="A1",
ground_truths=["correct"]
)
]
)
After (v0.4):
dataset = EvaluationDataset(
samples=[
SingleTurnSample(
user_input="Q1",
response="A1",
reference="correct"
)
]
)
If loading from CSV/JSON, update your data files:
Before (v0.3) CSV format:
After (v0.4) CSV format:
Custom Metrics
For Metrics Using Collections-Based Architecture
If you've already written custom metrics extending BaseMetric from collections, minimal changes are needed:
from ragas.metrics.collections.base import BaseMetric, MetricResult
from pydantic import BaseModel
class MyCustomMetric(BaseMetric):
name: str = "my_metric"
dimensions: list[str] = ["my_dimension"]
async def ascore(self, **kwargs) -> MetricResult:
# Your metric logic
score = 0.85
reason = "Explanation of the score"
return MetricResult(value=score, reason=reason)
Key considerations:
- Extend
BaseMetric, not oldMetricWithLLM - Implement
async def ascore(**kwargs)instead ofsingle_turn_ascore(sample) - Return
MetricResultobjects, not raw floats - Use keyword arguments instead of
SingleTurnSample
For Metrics Using Legacy Architecture
If you have custom metrics extending SingleTurnMetric or MetricWithLLM:
# v0.3 - Legacy approach
from ragas.metrics.base import MetricWithLLM
class MyMetric(MetricWithLLM):
async def single_turn_ascore(self, sample: SingleTurnSample) -> float:
# Extract values from sample
user_input = sample.user_input
response = sample.response
contexts = sample.retrieved_contexts or []
# Your logic
return 0.85
Migration path:
- Extend
BaseMetricfrom collections instead - Change method signature to use keyword arguments
- Return
MetricResultinstead of float - Add
dimensionsproperty if not present
# v0.4 - Collections approach
from ragas.metrics.collections.base import BaseMetric, MetricResult
class MyMetric(BaseMetric):
name: str = "my_metric"
dimensions: list[str] = ["quality"]
async def ascore(self,
user_input: str,
response: str,
retrieved_contexts: list[str] | None = None,
**kwargs) -> MetricResult:
# Use keyword arguments directly
contexts = retrieved_contexts or []
# Your logic
score = 0.85
return MetricResult(value=score, reason="Optional explanation")
Prompt System Updates
v0.3 - Dataclass-Based Prompts
from ragas.prompt.pydantic_prompt import PydanticPrompt
from pydantic import BaseModel
class Input(BaseModel):
query: str
document: str
class Output(BaseModel):
is_relevant: bool
class RelevancePrompt(PydanticPrompt[Input, Output]):
instruction = "Is the document relevant to the query?"
input_model = Input
output_model = Output
examples = [...]
v0.4 - Function-Based Prompts
The new approach uses simple functions:
def relevance_prompt(query: str, document: str) -> str:
return f"""Determine if the document is relevant to the query.
Query: {query}
Document: {document}
Respond with YES or NO."""
Benefits:
- Simpler and more composable
- No boilerplate class definitions
- Easier to test and modify
- Native Python type hints
Migration:
- Identify where you define prompts in custom metrics
- Convert dataclass definitions to functions
- Update metric to use the function directly
Removed Features
The following features have been completely removed from v0.4 and will cause errors if used:
Functions
instructor_llm_factory() - Removed entirely
- Merged into:
llm_factory()function - Migration: Replace all calls to
instructor_llm_factory()withllm_factory() - Impact: Direct breaking change, no fallback
Before (v0.3) - No longer works:
After (v0.4) - Use this instead:
Metrics
Three metrics have been completely removed from the collections API. They are no longer available and have no direct replacement:
1. AspectCritic - Removed
- Reason: Replaced by more flexible discrete metric pattern
- Alternative: Use
@discrete_metric()decorator for custom aspect evaluation - Usage:
2. SimpleCriteria - Removed
- Reason: Replaced by more flexible discrete metric pattern
- Alternative: Use
@discrete_metric()decorator for custom criteria - Usage:
3. AnswerSimilarity - Removed (Redundant)
- Reason: Functionality fully covered by
SemanticSimilarity - Direct replacement:
SemanticSimilarity - Usage:
Deprecated Methods (Removed in v0.4)
Metric.ascore() and Metric.score() - Removed
- When removed: Marked for removal in v0.3, removed in v0.4
- Why: Replaced by collections-based
ascore(**kwargs)pattern - Migration: Use collections metrics instead
Legacy sample-based methods - Removed
single_turn_ascore(sample: SingleTurnSample)- Only on legacy metrics- Replace with: Collections metrics using
ascore(**kwargs)
Deprecated Features
These features still work but show deprecation warnings. They will be removed in a future release.
evaluate() Function - Deprecated
- Status: Still works but discouraged
- Reason: Replaced by
@experiment()decorator for better structured workflows - Migration: See Evaluation to Experiment section
Before (v0.3) - Deprecated:
from ragas import evaluate
result = evaluate(dataset=dataset, metrics=metrics, llm=llm, embeddings=embeddings)
After (v0.4) - Recommended:
from ragas import experiment
from pydantic import BaseModel
class Results(BaseModel):
score: float
@experiment(Results)
async def run(row):
result = await metric.ascore(**row.dict())
return Results(score=result.value)
result = await run(dataset)
LLM Wrapper Classes
LangchainLLMWrapper - Deprecated
- Status: Still works but discouraged
- Deprecation warning:
- Migration: Use
llm_factory()with native client instead
Before (v0.3) - Deprecated:
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
langchain_llm = ChatOpenAI(model="gpt-4o")
ragas_llm = LangchainLLMWrapper(langchain_llm)
After (v0.4) - Recommended:
from ragas.llms import llm_factory
from openai import AsyncOpenAI
client = AsyncOpenAI(api_key="...")
ragas_llm = llm_factory("gpt-4o", client=client)
LlamaIndexLLMWrapper - Deprecated
- Status: Still works but discouraged
- Similar warning as LangchainLLMWrapper
- Migration: Use
llm_factory()with native client
Before (v0.3) - Deprecated:
from ragas.llms import LlamaIndexLLMWrapper
from llama_index.llms.openai import OpenAI
llamaindex_llm = OpenAI(model="gpt-4o")
ragas_llm = LlamaIndexLLMWrapper(llamaindex_llm)
After (v0.4) - Recommended:
from ragas.llms import llm_factory
from openai import AsyncOpenAI
client = AsyncOpenAI(api_key="...")
ragas_llm = llm_factory("gpt-4o", client=client)
Embeddings Migration
LangchainEmbeddingsWrapper & LlamaIndexEmbeddingsWrapper - Deprecated
- Status: Still work but show deprecation warnings
- Reason: Replaced by native embedding providers with direct client integration
- Migration: See Embeddings Migration section
v0.4 replaces wrapper classes with native embedding providers that integrate directly with client libraries instead of using LangChain wrappers.
What Changed
| Aspect | v0.3 | v0.4 |
|---|---|---|
| Class | LangchainEmbeddingsWrapper, LlamaIndexEmbeddingsWrapper |
OpenAIEmbeddings, GoogleEmbeddings, HuggingFaceEmbeddings |
| Client | LangChain/LlamaIndex wrapper | Native client (OpenAI, Google, etc.) |
| Methods | embed_query(), embed_documents() |
embed_text(), embed_texts() |
| Setup | Wrap existing LangChain object | Pass native client directly |
OpenAI Migration
Before (v0.3):
from langchain_openai import OpenAIEmbeddings as LangChainEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
embeddings = LangchainEmbeddingsWrapper(
LangChainEmbeddings(api_key="sk-...")
)
embedding = embeddings.embed_query("text")
After (v0.4):
from openai import AsyncOpenAI
from ragas.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(
client=AsyncOpenAI(api_key="sk-..."),
model="text-embedding-3-small"
)
embedding = embeddings.embed_text("text") # Different method name
Google Embeddings Migration
Before (v0.3):
from langchain_community.embeddings import VertexAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
embeddings = LangchainEmbeddingsWrapper(
VertexAIEmbeddings(model_name="textembedding-gecko@001", project="my-project")
)
After (v0.4):
from ragas.embeddings import GoogleEmbeddings
embeddings = GoogleEmbeddings(
model="text-embedding-004",
use_vertex=True,
project_id="my-project"
)
HuggingFace Migration
Before (v0.3):
from ragas.embeddings import HuggingfaceEmbeddings
embeddings = HuggingfaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
After (v0.4):
from ragas.embeddings import HuggingFaceEmbeddings # Capitalization changed
embeddings = HuggingFaceEmbeddings(
model="sentence-transformers/all-MiniLM-L6-v2",
device="cuda" # Optional GPU acceleration
)
Using embedding_factory()
Before (v0.3):
from ragas.embeddings import embedding_factory
embeddings = embedding_factory() # Defaults to OpenAI
After (v0.4):
from ragas.embeddings import embedding_factory
from openai import AsyncOpenAI
embeddings = embedding_factory(
provider="openai",
model="text-embedding-3-small",
client=AsyncOpenAI(api_key="sk-...")
)
Prompt System
Dataclass-based prompts (PydanticPrompt) - Deprecated
- Status: Legacy prompts still work but discouraged
- Deprecation: Modular BasePrompt architecture is now preferred
- Migration: See Prompt System Migration section
Before (v0.3) - Deprecated approach:
from ragas.prompt.pydantic_prompt import PydanticPrompt
from pydantic import BaseModel
class Input(BaseModel):
query: str
class Output(BaseModel):
is_relevant: bool
class RelevancePrompt(PydanticPrompt[Input, Output]):
instruction = "Is this relevant?"
input_model = Input
output_model = Output
After (v0.4) - Recommended approach:
# Use BasePrompt classes instead - see Prompt System Migration section
from ragas.metrics.collections.faithfulness.util import FaithfulnessPrompt
class CustomPrompt(FaithfulnessPrompt):
@property
def instruction(self):
return "Your custom instruction here"
Legacy Metric Methods
single_turn_ascore(sample) - Deprecated
- Status: Only on legacy (non-collections) metrics
- Deprecation: Use collections metrics with
ascore()instead - Timeline: Will be removed in future releases when all metrics migrate
Before (v0.3) - Deprecated:
sample = SingleTurnSample(user_input="...", response="...", ...)
score = await metric.single_turn_ascore(sample)
After (v0.4) - Recommended:
ContextUtilization
ContextUtilization is now a wrapper around ContextPrecisionWithoutReference for backward compatibility:
Before (v0.3):
from ragas.metrics import ContextUtilization
metric = ContextUtilization(llm=llm)
score = await metric.single_turn_ascore(sample)
After (v0.4):
from ragas.metrics.collections import ContextUtilization
# or use the modern name directly:
from ragas.metrics.collections import ContextPrecisionWithoutReference
metric = ContextUtilization(llm=llm) # Still works (wrapper)
# or
metric = ContextPrecisionWithoutReference(llm=llm) # Preferred
result = await metric.ascore(
user_input="...",
response="...",
retrieved_contexts=[...]
)
score = result.value
Breaking Changes Summary
Here's a complete list of breaking changes between v0.3 and v0.4:
| Change | v0.3 | v0.4 | Migration |
|---|---|---|---|
| Evaluation approach | evaluate() function |
@experiment() decorator |
See Evaluation to Experiment |
| Metrics location | ragas.metrics |
ragas.metrics.collections |
Update import paths |
| Scoring method | single_turn_ascore(sample) |
ascore(**kwargs) |
Change method calls |
| Score return type | float |
MetricResult |
Use .value property |
| LLM factory | instructor_llm_factory() |
llm_factory() |
Use unified factory |
| Embeddings approach | Wrapper classes (LangChain) | Native providers | See Embeddings Migration |
| Embedding methods | embed_query(), embed_documents() |
embed_text(), embed_texts() |
Update method calls |
| ground_truths param | ground_truths: list[str] |
reference: str |
Rename, change type |
| Sample type | SingleTurnSample |
SingleTurnSample (updated) |
Update sample creation |
| Prompt system | Dataclass-based | Function-based | Refactor custom prompts |
Deprecations and Removals
Removed in v0.4
These features have been completely removed and will cause errors:
instructor_llm_factory()- Usellm_factory()instead- AspectCritic from collections - No direct replacement
- SimpleCriteriaScore from collections - No direct replacement
- AnswerSimilarity - Use
SemanticSimilarityinstead
Deprecated (Will be removed in future releases)
These features still work but show deprecation warnings:
LangchainLLMWrapper- Usellm_factory()directlyLlamaIndexLLMWrapper- Usellm_factory()directly- Legacy prompt classes - Migrate to function-based prompts
single_turn_ascore()on legacy metrics - Use collections metrics withascore()
New Features in v0.4 (Reference)
v0.4 introduces several new capabilities beyond the migration requirements. While not necessary for migrating from v0.3, these features may be useful for your upgrade:
- GPT-5 and o-Series Support - Automatic constraint handling for latest OpenAI models
- Universal Provider Support - Single
llm_factory()works with all major providers (Anthropic, Google, Azure, etc.) - Function-Based Prompts - More flexible and composable prompt definitions
- Metric Decorators - Simplified custom metric creation with
@discrete_metric,@numeric_metric,@ranking_metric - MetricResult with Reasoning - Structured results with optional explanations
- Enhanced Metric Save/Load - Easy serialization of metric configurations
- Better Embeddings Support - Both sync and async embedding operations
For detailed information on new features, see the v0.4 Release Notes.
Custom Metrics Migration
If you were using removed metrics like AspectCritic or SimpleCriteria, v0.4 provides decorator-based alternatives to replace them. You can also use the new simplified metric system for other custom metrics:
Discrete Metrics (Categorical Outputs)
Before (v0.3) - AspectCritic:
from ragas.metrics import AspectCritic
metric = AspectCritic(name="clarity", allowed_values=["clear", "unclear"])
result = await metric.single_turn_ascore(sample)
After (v0.4) - @discrete_metric decorator:
from ragas.metrics import discrete_metric
@discrete_metric(name="clarity", allowed_values=["clear", "unclear"])
def clarity(response: str) -> str:
return "clear" if len(response) > 50 else "unclear"
metric = clarity()
result = await metric.ascore(response="...")
print(result.value) # "clear" or "unclear"
Use discrete metrics for any categorical classification. All removed metrics (AspectCritic, SimpleCriteria) can be replaced this way.
Numeric Metrics (Continuous Values)
Use @numeric_metric for any scoring on a numerical scale:
from ragas.metrics import numeric_metric
@numeric_metric(name="length_score", allowed_values=(0.0, 1.0))
def length_score(response: str) -> float:
return min(len(response) / 500, 1.0)
# Custom range
@numeric_metric(name="quality_score", allowed_values=(0.0, 10.0))
def quality_score(response: str) -> float:
return 7.5
metric = length_score()
result = await metric.ascore(response="...")
print(result.value) # float between 0 and 1
Ranking Metrics (Ordered Lists)
Use @ranking_metric to rank or order multiple items:
from ragas.metrics import ranking_metric
@ranking_metric(name="context_rank", allowed_values=5)
def context_ranking(question: str, contexts: list[str]) -> list[str]:
"""Rank contexts by relevance."""
scored = [(len(set(question.split()) & set(c.split())), c) for c in contexts]
return [c for _, c in sorted(scored, reverse=True)]
metric = context_ranking()
result = await metric.ascore(question="...", contexts=[...])
print(result.value) # Ranked list
Summary
These decorators provide automatic validation, type safety, error handling, and result wrapping - reducing custom metric code from 50+ lines in v0.3 to just 5-10 lines in v0.4.
Common Issues and Solutions
Issue: ImportError for instructor_llm_factory
Error:
Solution:
# Instead of this
from ragas.llms import instructor_llm_factory
# Use this
from ragas.llms import llm_factory
Issue: Metric Returns MetricResult Instead of Float
Error:
Solution:
result = await metric.ascore(...)
score = result.value # Access the float value
print(score) # Prints: 0.85
Issue: SingleTurnSample Missing ground_truths
Error:
Solution:
# Change from
sample = SingleTurnSample(..., ground_truths=["correct"])
# To
sample = SingleTurnSample(..., reference="correct")
Getting Help
If you encounter issues during migration:
-
Check the Documentation
-
GitHub Issues
- Search existing issues
- Create a new issue with migration-specific details
-
Community Support
- Join our Discord community
- Schedule a call with the maintainers
Summary
v0.4 represents a fundamental shift towards experiment-based architecture, enabling better integration of evaluation, analysis, and iteration workflows. While there are breaking changes, they all serve the goal of making Ragas a better experimentation platform.
The migration path is straightforward:
- Update LLM initialization to use
llm_factory() - Import metrics from
ragas.metrics.collections - Replace
single_turn_ascore()withascore() - Rename
ground_truthstoreference - Handle
MetricResultobjects instead of floats
These technical changes enable:
- Better Experimentation - Structured metric results with reasoning for deeper analysis
- Cleaner API - Keyword arguments instead of sample objects make composition easier
- Integrated Workflows - Metrics designed to work seamlessly within experiment pipelines
- Enhanced Functionality - Universal provider support and automatic constraints
- Future-proof - Built on industry standards (instructor library, standardized patterns)
The experiment-based architecture will continue to improve in future releases, with more features for managing, analyzing, and iterating on your evaluations.
Good luck with your migration! We're here to help if you get stuck. 🎉