SQL

Execution based metrics

In these metrics the resulting SQL is compared after executing the SQL query on the database and then comparing the response with the expected results.

DataCompy Score

DataCompyScore metric uses DataCompy, a python library that compares two pandas DataFrames. It provides a simple interface to compare two DataFrames and provides a detailed report of the differences. In this metric the response is executed on the database and the resulting data is compared with the expected data, i.e. reference. To enable comparison both response and reference should be in the form of a Comma-Separated Values as shown in the example.

DataFrames can be compared across rows or columns. This can be configured using mode parameter.

If mode is row then the comparison is done row-wise. If mode is column then the comparison is done column-wise.

\[ \text{Precision } = {|\text{Number of matching rows in response and reference}| \over |\text{Total number of rows in response}|} \]

\[ \text{Recall } = {|\text{Number of matching rows in response and reference}| \over |\text{Total number of rows in reference}|} \]

By default, the mode is set to row, and metric is F1 score which is the harmonic mean of precision and recall.

from ragas.metrics.collections import DataCompyScore

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

metric = DataCompyScore()
result = await metric.ascore(response=data1, reference=data2)
print(f"F1 Score: {result.value}")
print(f"Details: {result.reason}")

To change the mode to column-wise comparison, set the mode parameter to column.

metric = DataCompyScore(mode="columns", metric="recall")
result = await metric.ascore(response=data1, reference=data2)

DataCompyScore (Legacy)

Deprecated

DataCompyScore from ragas.metrics is deprecated and will be removed in a future version. Please use DataCompyScore from ragas.metrics.collections as shown above.

The legacy DataCompyScore uses the SingleTurnSample schema:

from ragas.metrics import DataCompyScore
from ragas.dataset_schema import SingleTurnSample

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""
sample = SingleTurnSample(response=data1, reference=data2)
scorer = DataCompyScore()
await scorer.single_turn_ascore(sample)

To change the mode to column-wise comparison, set the mode parameter to column.

scorer = DataCompyScore(mode="column", metric="recall")

Non Execution based metrics

Executing SQL queries on the database can be time-consuming and sometimes not feasible. In such cases, we can use non-execution based metrics to evaluate the SQL queries. These metrics compare the SQL queries directly without executing them on the database.

SQL Semantic Equivalence

SQLSemanticEquivalence is a metric that evaluates whether a generated SQL query is semantically equivalent to a reference query. The metric uses an LLM to analyze both queries in the context of the provided database schema and determine if they would produce the same results.

This is a binary metric: - 1.0: The SQL queries are semantically equivalent - 0.0: The SQL queries are not equivalent

The metric considers the database schema context to make accurate equivalence judgments, accounting for syntactic differences that don't affect semantics (e.g., active = 1 vs active = true).

from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import SQLSemanticEquivalence

# Initialize the LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create the metric
metric = SQLSemanticEquivalence(llm=llm)

# Evaluate SQL equivalence
result = await metric.ascore(
    response="""
        SELECT p.product_name, SUM(oi.quantity) AS total_quantity
        FROM order_items oi
        JOIN products p ON oi.product_id = p.product_id
        GROUP BY p.product_name;
    """,
    reference="""
        SELECT products.product_name, SUM(order_items.quantity) AS total_quantity
        FROM order_items
        INNER JOIN products ON order_items.product_id = products.product_id
        GROUP BY products.product_name;
    """,
    reference_contexts=[
        """
        Table order_items:
        - order_item_id: INT
        - order_id: INT
        - product_id: INT
        - quantity: INT
        """,
        """
        Table products:
        - product_id: INT
        - product_name: VARCHAR
        - price: DECIMAL
        """
    ]
)

print(f"Equivalent: {result.value == 1.0}")
print(f"Explanation: {result.reason}")

The result includes explanations of both queries and the reasoning for the equivalence determination.

LLMSQLEquivalence (Legacy)

Deprecated

LLMSQLEquivalence is deprecated and will be removed in a future version. Please use SQLSemanticEquivalence from ragas.metrics.collections as shown above.

LLMSQLEquivalence is the legacy metric for SQL semantic equivalence evaluation. It uses the SingleTurnSample schema and requires setting the LLM separately.

from ragas.metrics import LLMSQLEquivalence
from ragas.dataset_schema import SingleTurnSample

sample = SingleTurnSample(
    response="""
        SELECT p.product_name, SUM(oi.quantity) AS total_quantity
        FROM order_items oi
        JOIN products p ON oi.product_id = p.product_id
        GROUP BY p.product_name;
    """,
    reference="""
        SELECT p.product_name, COUNT(oi.quantity) AS total_quantity
        FROM order_items oi
        JOIN products p ON oi.product_id = p.product_id
        GROUP BY p.product_name;
    """,
    reference_contexts=[
        """
        Table order_items:
        - order_item_id: INT
        - order_id: INT
        - product_id: INT
        - quantity: INT
        """,
        """
        Table products:
        - product_id: INT
        - product_name: VARCHAR
        - price: DECIMAL
        """
    ]
)

scorer = LLMSQLEquivalence()
scorer.llm = openai_model
await scorer.single_turn_ascore(sample)