Adapting Metrics to Target Language

When evaluating LLM applications in languages other than English, adapt your metrics to the target language. Ragas uses an LLM to translate the few-shot examples in prompts.

Setup

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness

client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

metric = Faithfulness(llm=llm)

Adapt Prompts to Target Language

Collections metrics have prompts as direct attributes. Use the adapt() method to translate the few-shot examples:

# Check original language
print(metric.statement_generator_prompt.language)
# english

# Adapt prompts to Hindi
metric.statement_generator_prompt = await metric.statement_generator_prompt.adapt(
    target_language="hindi", llm=llm
)
metric.nli_statement_prompt = await metric.nli_statement_prompt.adapt(
    target_language="hindi", llm=llm
)

# Verify adaptation
print(metric.statement_generator_prompt.language)
# hindi

# See translated example
print(metric.statement_generator_prompt.examples[0][0].question)
# अल्बर्ट आइंस्टीन कौन थे और वे किस चीज़ के लिए सबसे अधिक जाने जाते हैं?

Note

By default, only few-shot examples are translated. Instructions remain in English. To also translate instructions, set adapt_instruction=True.

Evaluate with Adapted Metric

result = await metric.ascore(
    user_input="भारत की राजधानी क्या है?",
    response="भारत की राजधानी नई दिल्ली है।",
    retrieved_contexts=["भारत की राजधानी नई दिल्ली है, जो देश का सबसे बड़ा शहर भी है।"],
)

print(f"Faithfulness: {result.value}")
# Faithfulness: 1.0

Adapting Other Metrics

The same pattern works for any collections metric with prompts:

from ragas.metrics.collections import AnswerRelevancy
from ragas.embeddings.base import embedding_factory

embeddings = embedding_factory("openai", client=client)
relevancy = AnswerRelevancy(llm=llm, embeddings=embeddings)

# Adapt the prompt
relevancy.prompt = await relevancy.prompt.adapt(
    target_language="spanish", llm=llm
)

# See translated example
print(relevancy.prompt.examples[0][0].response)
# Albert Einstein nació en Alemania.

Adapting FactualCorrectness

FactualCorrectness has two prompts that both need to be adapted:

from ragas.metrics.collections import FactualCorrectness

metric = FactualCorrectness(llm=llm)

# Adapt both prompts to German
metric.prompt = await metric.prompt.adapt(
    target_language="german", llm=llm
)
metric.nli_prompt = await metric.nli_prompt.adapt(
    target_language="german", llm=llm
)

# Verify adaptation
print(metric.prompt.language)  # german
print(metric.nli_prompt.language)  # german

# Now use the adapted metric
result = await metric.ascore(
    response="Einstein wurde 1879 in Deutschland geboren.",
    reference="Albert Einstein wurde am 14. März 1879 in Ulm, Deutschland geboren."
)

print(f"Factual Correctness: {result.value}")

Tip

Like Faithfulness, FactualCorrectness uses two prompts internally: - prompt - ClaimDecompositionPrompt for breaking text into claims - nli_prompt - NLIStatementPrompt for verifying claims

Both prompts should be adapted when evaluating in non-English languages.