Nvidia Metrics
Answer Accuracy
Answer Accuracy measures the agreement between a model’s response and a reference ground truth for a given question. This is done via two distinct "LLM-as-a-judge" prompts that each return a rating (0, 2, or 4). The metric converts these ratings into a [0,1] scale and then takes the average of the two scores from the judges. Higher scores indicate that the model’s answer closely matches the reference.
- 0 → The response is inaccurate or does not address the same question as the reference.
- 2 → The response partially aligns with the reference.
- 4 → The response exactly aligns with the reference.
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AnswerAccuracy

sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Albert Einstein was born in 1879.",
    reference="Albert Einstein was born in 1879."
)

scorer = AnswerAccuracy(llm=evaluator_llm)  # evaluator_llm wrapped with ragas LLM Wrapper
score = await scorer.single_turn_ascore(sample)
print(score)
```
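The snippet above assumes an async context (such as a notebook) and an `evaluator_llm` that has already been wrapped for Ragas. A minimal sketch of that setup, assuming a LangChain OpenAI chat model behind Ragas' `LangchainLLMWrapper` (the model name is only an example), and driving the coroutine from a plain script:

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AnswerAccuracy

# Assumption: any LangChain chat model can be wrapped this way; gpt-4o-mini is just an example.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Albert Einstein was born in 1879.",
    reference="Albert Einstein was born in 1879.",
)

scorer = AnswerAccuracy(llm=evaluator_llm)

# single_turn_ascore is a coroutine; outside a notebook, drive it with asyncio.run.
score = asyncio.run(scorer.single_turn_ascore(sample))
print(score)
```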
How It’s Calculated
Step 1: The LLM generates ratings using two distinct templates to ensure robustness:
- Template 1: The LLM compares the response with the reference and rates it on a scale of 0, 2, or 4.
- Template 2: The LLM evaluates the same question again, but this time the roles of the response and the reference are swapped.
This dual-perspective approach reduces the influence of prompt phrasing and of the order in which the two texts are presented, making the rating more robust.
Step 2: Each rating is converted to the [0,1] scale (rating / 4). If both ratings are valid, the final score is their average; if only one is valid, that rating is used on its own (see the sketch after the worked example below).
Example Calculation:
- User Input: "When was Einstein born?"
- Response: "Albert Einstein was born in 1879."
- Reference: "Albert Einstein was born in 1879."
Assuming both templates return a rating of 4 (indicating an exact match), the conversion is as follows:
- A rating of 4 corresponds to 1 on the [0,1] scale.
- Averaging the two scores: (1 + 1) / 2 = 1.
Thus, the final Answer Accuracy score is 1.
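The rating-to-score conversion can be pictured with a small helper. This is only an illustrative sketch of the arithmetic described above, not the metric's internal implementation; `combine_judge_ratings` is a hypothetical name and the ratings are hard-coded stand-ins for what the two judge prompts would return.

```python
from statistics import mean

def combine_judge_ratings(ratings, max_rating):
    """Normalize judge ratings to [0, 1] and average whichever ones are valid."""
    valid = [r / max_rating for r in ratings if r is not None]
    return mean(valid) if valid else None

# Answer Accuracy: two judges, ratings in {0, 2, 4}, normalized by dividing by 4.
print(combine_judge_ratings([4, 4], max_rating=4))     # 1.0  -> exact match from both judges
print(combine_judge_ratings([4, 2], max_rating=4))     # 0.75 -> the judges disagree
print(combine_judge_ratings([2, None], max_rating=4))  # 0.5  -> one rating invalid, so the other is used alone
```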
Similar Ragas Metrics
- Answer Correctness: This metric gauges the accuracy of the generated answer compared to the ground truth by considering both semantic and factual similarity.
- Rubric Score: The Rubric-Based Criteria Scoring Metric allows evaluations based on user-defined rubrics, where each rubric outlines specific scoring criteria. The LLM assesses responses according to these customized descriptions, ensuring a consistent and objective evaluation process.
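As an illustration of what such a user-defined rubric can look like, here is a hedged sketch using `RubricsScore`, the rubric-based metric exposed by recent Ragas releases; the class name and rubric field names have varied across versions, and the rubric text itself is made up for this example.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore  # assumption: available under this name in recent Ragas releases

# Hypothetical rubric: each key maps a score level to a user-defined criterion.
rubrics = {
    "score1_description": "The response contradicts the reference or does not address the question.",
    "score2_description": "The response is mostly inaccurate, with only minor overlap with the reference.",
    "score3_description": "The response partially matches the reference.",
    "score4_description": "The response is mostly accurate, with small omissions.",
    "score5_description": "The response exactly matches the reference.",
}

sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Albert Einstein was born in 1879.",
    reference="Albert Einstein was born in 1879.",
)

scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)  # same evaluator_llm wrapper as above
score = await scorer.single_turn_ascore(sample)
print(score)  # a score on the rubric's own 1-5 scale, with the judge's reasoning generated internally
```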
Comparison of Metrics
Answer Correctness vs. Answer Accuracy
- LLM Calls: Answer Correctness requires three LLM calls (two for decomposing the response and reference into standalone statements and one for classifying them), while Answer Accuracy uses two independent LLM judgments.
- Token Usage: Answer Correctness consumes considerably more tokens due to its detailed breakdown and classification process.
- Explainability: Answer Correctness offers high explainability by providing detailed insights into factual correctness and semantic similarity, whereas Answer Accuracy provides a straightforward raw score.
- Robust Evaluation: Answer Accuracy ensures consistency through dual LLM evaluations, while Answer Correctness offers a holistic view by deeply assessing the quality of the response.
Answer Accuracy vs. Rubric Score
- LLM Calls: Answer Accuracy makes two calls (one per LLM judge), while Rubric Score requires only one.
- Token Usage: Answer Accuracy's token usage is minimal since it outputs just a score, whereas Rubric Score generates reasoning, increasing token consumption.
- Explainability: Answer Accuracy provides a raw score without justification, while Rubric Score offers reasoning along with its verdict.
- Efficiency: Answer Accuracy is lightweight and works very well with smaller models.
Context Relevance
Context Relevance evaluates whether the retrieved_contexts (chunks or passages) are pertinent to the user_input. This is done via two independent "LLM-as-a-judge" prompt calls that each rate the relevance on a scale of 0, 1, or 2. The ratings are then converted to a [0,1] scale and averaged to produce the final score. Higher scores indicate that the contexts are more closely aligned with the user's query.
- 0 → The retrieved contexts are not relevant to the user’s query at all.
- 1 → The contexts are partially relevant.
- 2 → The contexts are completely relevant.
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ContextRelevance

sample = SingleTurnSample(
    user_input="When and Where Albert Einstein was born?",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in WĂĽrttemberg, Germany.",
    ]
)

scorer = ContextRelevance(llm=evaluator_llm)
score = await scorer.single_turn_ascore(sample)
print(score)
```
How It’s Calculated
Step 1: The LLM is prompted with two distinct templates (template_relevance1 and template_relevance2) to evaluate the relevance of the retrieved contexts concerning the user's query. Each prompt returns a relevance rating of 0, 1, or 2.
Step 2: Each rating is normalized to a [0,1] scale by dividing by 2. If both ratings are valid, the final score is the average of these normalized values; if only one is valid, that score is used.
Example Calculation:
- User Input: "When and Where Albert Einstein was born?"
- Retrieved Contexts:
- "Albert Einstein was born March 14, 1879."
- "Albert Einstein was born at Ulm, in WĂĽrttemberg, Germany."
In this example, the two retrieved contexts together fully address the user's query by providing both the birth date and location of Albert Einstein. Consequently, both prompts would rate the combined contexts as 2 (fully relevant). Normalizing each score yields 1.0 (2/2), and averaging the two results maintains the final Context Relevance score at 1.
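When scoring more than a handful of samples, the same metrics can be run in batch through Ragas' `evaluate` helper over an `EvaluationDataset`. The sketch below assumes the standard `evaluate`/`EvaluationDataset` API and the same `evaluator_llm` wrapper as earlier; the sample data is illustrative.

```python
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
from ragas.metrics import AnswerAccuracy, ContextRelevance

samples = [
    SingleTurnSample(
        user_input="When and Where Albert Einstein was born?",
        response="Albert Einstein was born in 1879 in Ulm, Germany.",
        reference="Albert Einstein was born on March 14, 1879, in Ulm, Germany.",
        retrieved_contexts=[
            "Albert Einstein was born March 14, 1879.",
            "Albert Einstein was born at Ulm, in WĂĽrttemberg, Germany.",
        ],
    ),
]

dataset = EvaluationDataset(samples=samples)

result = evaluate(
    dataset=dataset,
    metrics=[
        AnswerAccuracy(llm=evaluator_llm),
        ContextRelevance(llm=evaluator_llm),
    ],
)
print(result)  # aggregate score per metric; result.to_pandas() gives per-sample rows
```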
Similar Ragas Metrics
- Context Precision: It measures the proportion of retrieved contexts that are relevant to answering a user's query. It is computed as the mean precision@k across all retrieved chunks, indicating how accurately the retrieval system ranks relevant information.
- Context Recall: It quantifies the extent to which the relevant information is successfully retrieved. It is calculated as the ratio of the number of relevant claims (or contexts) found in the retrieved results to the total number of relevant claims in the reference, ensuring that important information is not missed.
- Rubric Score: The Rubric-Based Criteria Scoring Metric evaluates responses based on user-defined rubrics with customizable scoring criteria, ensuring consistent and objective assessments. The scoring scale is flexible to suit user needs.
Comparison of Metrics
Context Precision and Context Recall vs. Context Relevance
- LLM Calls: Context Precision and Context Recall each require one LLM call: one verifies whether each retrieved context is useful for arriving at the reference (verdict "1" or "0"), and the other classifies each answer sentence as attributable to the context (a binary 'Yes' (1) or 'No' (0)). Context Relevance instead uses two LLM calls for increased robustness.
- Token Usage: Context Precision and Context Recall consume considerably more tokens, whereas Context Relevance is more token-efficient.
- Explainability: Context Precision and Context Recall offer high explainability with detailed reasoning, while Context Relevance provides a raw score without explanations.
- Robust Evaluation: Context Relevance delivers a more robust evaluation through dual LLM judgments compared to the single-call approach of Context Precision and Context Recall.
Response Groundedness
Response Groundedness measures how well a response is supported or "grounded" by the retrieved contexts. It assesses whether each claim in the response can be found, either wholly or partially, in the provided contexts.
- 0 → The response is not grounded in the context at all.
- 1 → The response is partially grounded.
- 2 → The response is fully grounded (every statement can be found or inferred from the retrieved context).
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ResponseGroundedness

sample = SingleTurnSample(
    response="Albert Einstein was born in 1879.",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in WĂĽrttemberg, Germany.",
    ]
)

scorer = ResponseGroundedness(llm=evaluator_llm)
score = await scorer.single_turn_ascore(sample)
print(score)
```
How It’s Calculated
Step 1: The LLM is prompted with two distinct templates to evaluate the grounding of the response with respect to the retrieved contexts. Each prompt returns a grounding rating of 0, 1, or 2.
Step 2: Each rating is normalized to a [0,1] scale by dividing by 2 (i.e., 0 becomes 0.0, 1 becomes 0.5, and 2 becomes 1.0). If both ratings are valid, the final score is computed as the average of these normalized values; if only one is valid, that score is used.
Example Calculation:
- Response: "Albert Einstein was born in 1879."
- Retrieved Contexts:
- "Albert Einstein was born March 14, 1879."
- "Albert Einstein was born at Ulm, in WĂĽrttemberg, Germany."
In this example, the retrieved contexts provide both the birth date and location of Albert Einstein. The response's single claim is supported by the context (the year 1879 is contained in the full date given), so both prompts would likely rate the grounding as 2 (fully grounded). Normalizing a score of 2 gives 1.0 (2/2), and averaging the two normalized ratings keeps the final Response Groundedness score at 1.
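To see how the score drops when the response adds a claim the contexts do not support, consider a variant of the sample above. This is a sketch under the same setup as before; the expected value in the final comment is an illustration of the rubric (a partially grounded response maps to 1/2 = 0.5), not a guaranteed judge output.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ResponseGroundedness

# The second claim (the Nobel Prize year) does not appear in the retrieved contexts.
sample = SingleTurnSample(
    response="Albert Einstein was born in 1879 and won the Nobel Prize in 1921.",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in WĂĽrttemberg, Germany.",
    ]
)

scorer = ResponseGroundedness(llm=evaluator_llm)  # same evaluator_llm wrapper as before
score = await scorer.single_turn_ascore(sample)
print(score)  # likely around 0.5: judge ratings of 1 (partially grounded) normalize to 1/2 = 0.5
```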
Similar Ragas Metrics
- Faithfulness: This metric measures how factually consistent a response is with the retrieved context, ensuring that every claim in the response is supported by the provided information. The Faithfulness score ranges from 0 to 1, with higher scores indicating better consistency.
- Rubric Score: This is a general-purpose metric that evaluates responses based on user-defined criteria and can be adapted to assess Answer Accuracy, Context Relevance, or Response Groundedness by aligning the rubric with the requirements.
Comparison of Metrics
Faithfulness vs. Response Groundedness
- LLM Calls: Faithfulness requires two LLM calls, one to break the response into claims and one to issue a verdict on each claim, while Response Groundedness uses two independent LLM judgments.
- Token Usage: Faithfulness consumes more tokens, whereas Response Groundedness is more token-efficient.
- Explainability: Faithfulness provides transparent reasoning for each claim, while Response Groundedness provides only a raw score.
- Robust Evaluation: Faithfulness incorporates user input for a comprehensive assessment, whereas Response Groundedness ensures consistency through dual LLM evaluations.