Benchmarking Gemini Models using Ragas
In this tutorial, we'll benchmark Gemini models on AllenAI's QASPER dataset using Ragas metrics for the academic question-answering task.
About the Dataset
QASPER (Question Answering over Scientific Papers) is a dataset consisting of 5,049 questions based on 1,585 NLP research papers. Annotators created these questions from the papers' titles and abstracts, and a different set of annotators answered them from the full paper texts.
Data Collection Process:
- Paper Selection: NLP domain papers from arXiv (LaTeX format) were selected from the S2ORC corpus.
- Question Writing: Annotators wrote realistic, information-seeking questions based only on paper titles and abstracts.
- Answer Annotation: Different annotators reviewed the entire paper to identify answers, selecting minimal relevant evidence (texts, tables, figures).
A link to the dataset and further details about QASPER can be found here.
Loading Dataset
For demonstration purposes, we'll use a subset of 10 examples from the validation split:
from datasets import load_dataset
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
dataset = load_dataset("allenai/qasper", split="validation[:10]")
dataset
Dataset({
features: ['id', 'title', 'abstract', 'full_text', 'qas', 'figures_and_tables'],
num_rows: 10
})
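Before processing, it helps to peek at the nested structure of a single sample. The snippet below only prints fields that are used later in this tutorial; exact key sets may differ slightly across dataset versions:
sample = dataset[0]
print(sample["full_text"].keys())                       # includes 'section_name' and 'paragraphs'
print(sample["qas"]["question"][0])                     # first question written for this paper
print(sample["qas"]["answers"][0]["answer"][0].keys())  # answer fields such as 'extractive_spans', 'yes_no', 'free_form_answer', 'unanswerable'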
Processing Dataset
Since our goal is to benchmark the models' performance on academic question answering, we need answers generated by the LLMs based on the entire text of each research paper. We extract the full text from the dataset's "full_text" column and format it as markdown, clearly organized into sections and paragraphs for improved readability and context.
To create question-answer pairs for evaluation, we use the dataset's "qas" column. This column provides questions paired with answers in one of three formats: extractive spans, yes/no responses, or free-form answers. We then combine these into a single "golden response" column, which serves as the ground truth for assessing model performance.
def convert_full_text_to_markdown(full_text_dict):
"""
Converts a full_text dictionary into a markdown-formatted string.
Expected keys:
- "section_name": list of section titles.
- "paragraphs": list of lists of paragraphs corresponding to each section.
Each section becomes a markdown header (##) followed by its paragraphs.
"""
sections = full_text_dict.get("section_name", [])
paragraphs = full_text_dict.get("paragraphs", [])
markdown_lines = []
for section, paragraph in zip(sections, paragraphs):
markdown_lines.append(f"## {section}")
markdown_lines.append("") # Blank line
markdown_lines.append("\n".join(map(str, paragraph)))
markdown_lines.append("") # End of section
markdown_lines.append("") # Extra blank line for separation
return "\n".join(markdown_lines)
def combine_responses(row):
"""
Combines 'extractive_spans', 'yes_no', and 'free_form_answer'
into one single string. Skips components that are missing.
"""
responses = []
if pd.notna(row.get("extractive_spans")):
if isinstance(row["extractive_spans"], list):
responses.append(" ".join(map(str, row["extractive_spans"])))
else:
responses.append(str(row["extractive_spans"]))
if pd.notna(row.get("yes_no")):
responses.append(str(row["yes_no"]))
if pd.notna(row.get("free_form_answer")):
responses.append(str(row["free_form_answer"]))
return "\n".join(responses) if responses else np.nan
def preprocess_hf_dataset(hf_ds):
"""
Processes a HuggingFace dataset split into a cleaned Pandas DataFrame.
Steps:
1. For each sample, convert 'full_text' to a markdown string.
2. For every QA pair in the sample, extract the question and first answer.
3. Build lists for answers, questions, and full_text (duplicated per question).
4. Create a DataFrame from the collected data.
5. Clean columns by replacing empty lists/strings with NaN and joining lists.
6. Combine the answer components into a single 'golden response'.
The function uses nested tqdm progress bars for real-time feedback.
Returns:
pd.DataFrame: The preprocessed DataFrame.
"""
answers_list = [] # Stores the first answer for each question
questions_list = [] # Stores each question text
full_text_list = [] # Stores the formatted full text per QA pair
# Outer loop: iterate over samples with progress bar
for sample in tqdm(hf_ds, desc="Processing samples", unit="sample"):
# Convert full text once per sample
formatted_text = convert_full_text_to_markdown(sample["full_text"])
# Create a list of QA pairs
qa_pairs = list(zip(sample["qas"]["question"], sample["qas"]["answers"]))
# Inner loop: iterate over each QA pair with its own progress bar
for question, answer_set in tqdm(
qa_pairs, desc="Processing QAs", total=len(qa_pairs), leave=False, unit="qa"
):
answers_list.append(answer_set["answer"][0])
questions_list.append(question)
full_text_list.append(formatted_text)
# Create DataFrame from the collected data
df = pd.DataFrame(answers_list)
df["question"] = questions_list
df["full_text"] = full_text_list
# Data Cleaning: Replace empty lists/strings with NaN and join lists if needed
df["extractive_spans"] = df["extractive_spans"].apply(
lambda x: np.nan if isinstance(x, list) and len(x) == 0 else x
)
df["free_form_answer"] = df["free_form_answer"].apply(
lambda x: np.nan if isinstance(x, str) and x.strip() == "" else x
)
df["yes_no"] = df["yes_no"].apply(lambda x: np.nan if x is None else x)
df["extractive_spans"] = df["extractive_spans"].apply(
lambda x: "\n".join(x) if isinstance(x, list) else x
)
# Combine the answer components into a single 'golden response'
df["golden response"] = df.apply(lambda row: combine_responses(row), axis=1)
return df
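With these helpers defined, we convert the loaded split into a DataFrame. The resulting processed_dataset is used throughout the rest of the tutorial, and its first few rows are shown below:
processed_dataset = preprocess_hf_dataset(dataset)
processed_dataset.head()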
| | unanswerable | extractive_spans | yes_no | free_form_answer | evidence | highlighted_evidence | question | full_text | golden response |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | BIBREF19\nBIBREF20 | NaN | NaN | [Table TABREF19 and TABREF26 report zero-shot ... | [We compare our approaches with related approa... | which multilingual approaches do they compare ... | ## Introduction\n\nAlthough Neural Machine Tra... | BIBREF19\nBIBREF20 |
| 1 | False | pivoting\npivoting$_{\rm m}$ | NaN | NaN | [Table TABREF19 and TABREF26 report zero-shot ... | [We compare our approaches with related approa... | what are the pivot-based baselines? | ## Introduction\n\nAlthough Neural Machine Tra... | pivoting\npivoting$_{\rm m}$ |
| 2 | False | Europarl\nMultiUN | NaN | NaN | [We evaluate our cross-lingual pre-training ba... | [We evaluate our cross-lingual pre-training ba... | which datasets did they experiment with? | ## Introduction\n\nAlthough Neural Machine Tra... | Europarl\nMultiUN |
| 3 | False | NaN | NaN | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... | [For MultiUN corpus, we use four languages: En... | [For MultiUN corpus, we use four languages: En... | what language pairs are explored? | ## Introduction\n\nAlthough Neural Machine Tra... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... |
| 4 | False | Stanford NER\nspaCy 2.0 \nrecurrent model with... | NaN | NaN | [In this section we describe a number of exper... | [In this section we describe a number of exper... | what ner models were evaluated? | ## Introduction\n\nNamed entity recognition is... | Stanford NER\nspaCy 2.0 \nrecurrent model with... |
Generating Responses from Gemini Models
To generate responses using the Gemini models, we'll first need to instantiate the Google GenAI client. We'll also define a prompt template to use when generating responses.
import os
from google import genai
from dotenv import load_dotenv
load_dotenv()
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))
qa_prompt = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "If you cannot find an answer to the query, just say that it cannot be answered.\n"
    "Query: {query_str}\n"
    "Answer: "
)
Gemini 2.0 Flash
Code for AsyncExecutor
# async_executor.py
from __future__ import annotations
import asyncio
import time
import logging
from typing import Callable, Any, List, Tuple
from dataclasses import dataclass, field
import nest_asyncio
from tqdm import tqdm
# Apply nest_asyncio to allow nested event loops (e.g., in Jupyter)
nest_asyncio.apply()
logger = logging.getLogger(__name__)
def is_event_loop_running() -> bool:
try:
loop = asyncio.get_running_loop()
except RuntimeError:
return False
else:
return loop.is_running()
class RateLimiter:
"""
An asynchronous rate limiter that enforces a minimum interval between calls.
For example, with max_calls_per_minute=1250, it ensures that calls are spaced by ~0.048 seconds.
"""
def __init__(self, max_calls_per_minute: int):
self.interval = 60.0 / max_calls_per_minute
self.last_call = 0.0
self.lock = asyncio.Lock()
async def acquire(self):
async with self.lock:
now = time.monotonic()
elapsed = now - self.last_call
wait_time = self.interval - elapsed
if wait_time > 0:
await asyncio.sleep(wait_time)
self.last_call = time.monotonic()
@dataclass
class AsyncExecutor:
"""
An asynchronous executor similar in usage to the one in the evaluate function.
Attributes:
desc: Description for the progress bar.
show_progress: Whether to display a progress bar.
raise_exceptions: Whether to propagate exceptions.
max_calls_per_minute: API rate limit to enforce.
"""
desc: str = "Evaluating"
show_progress: bool = True
raise_exceptions: bool = False
max_calls_per_minute: int = 1250
jobs: List[Tuple[Callable[..., Any], tuple, dict, int]] = field(
default_factory=list, repr=False
)
job_counter: int = 0
rate_limiter: RateLimiter = field(init=False)
def __post_init__(self):
self.rate_limiter = RateLimiter(self.max_calls_per_minute)
def wrap_callable_with_index(
self, func: Callable[..., Any], index: int
) -> Callable[..., Any]:
"""
Wraps an asynchronous callable so that it enforces the rate limit,
and if an error occurs, it waits for an increasing delay (fallback)
before retrying the function call indefinitely.
"""
async def wrapped(*args, **kwargs) -> Tuple[int, Any]:
retry_delay = 10 # initial delay in seconds
while True:
try:
# Enforce the API rate limit before executing the function
await self.rate_limiter.acquire()
result = await func(*args, **kwargs)
return index, result
except Exception as e:
if self.raise_exceptions:
raise e
else:
logger.error(
"Error in job %d: %s. Retrying in %d seconds...",
index, e, retry_delay
)
# Wait asynchronously before retrying
await asyncio.sleep(retry_delay)
retry_delay += 5 # Increase delay for subsequent retries
return wrapped
def submit(self, func: Callable[..., Any], *args, **kwargs):
"""
Submit an asynchronous job to the executor.
"""
wrapped_func = self.wrap_callable_with_index(func, self.job_counter)
self.jobs.append((wrapped_func, args, kwargs, self.job_counter))
self.job_counter += 1
async def _run_jobs(self) -> List[Any]:
tasks = []
# Create asyncio tasks for each job
for wrapped_func, args, kwargs, index in self.jobs:
tasks.append(asyncio.create_task(wrapped_func(*args, **kwargs)))
results = [None] * len(tasks)
if self.show_progress:
pbar = tqdm(total=len(tasks), desc=self.desc)
for completed in asyncio.as_completed(tasks):
index, result = await completed
results[index] = result
pbar.update(1)
pbar.close()
else:
for completed in asyncio.as_completed(tasks):
index, result = await completed
results[index] = result
return results
def results(self) -> List[Any]:
"""
Execute all submitted asynchronous jobs and return their results
in the order they were submitted.
Thanks to nest_asyncio, this method can be used inside a Jupyter Notebook.
"""
# If an event loop is already running, nest_asyncio allows asyncio.run() to work.
return asyncio.run(self._run_jobs())
from async_executor import AsyncExecutor
async def query_gemini_2(query_str: str, context_str: str):
formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
response = await client.aio.models.generate_content(
model="gemini-2.0-flash", contents=formatted_prompt
)
return response.text
# Create an instance of the asynchronous executor
executor = AsyncExecutor(
desc="LLM Processing",
show_progress=True,
raise_exceptions=False,
)
for idx in range(processed_dataset.shape[0]):
query = processed_dataset.iloc[idx]["question"]
context = processed_dataset.iloc[idx]["full_text"]
executor.submit(query_gemini_2, query, context)
processed_dataset["gemini_2_flash_responses"] = executor.results()
Gemini 1.5 Flash
from async_executor import AsyncExecutor
async def query_gemini_1_5(query_str: str, context_str: str):
formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
response = await client.aio.models.generate_content(
model="gemini-1.5-flash", contents=formatted_prompt
)
return response.text
# Create a new instance of the asynchronous executor
executor = AsyncExecutor(
desc="LLM Processing",
show_progress=True,
raise_exceptions=False,
)
for idx in range(processed_dataset.shape[0]):
query = processed_dataset.iloc[idx]["question"]
context = processed_dataset.iloc[idx]["full_text"]
executor.submit(query_gemini_1_5, query, context)
processed_dataset["gemini_1_5_flash_responses"] = executor.results()
| | unanswerable | extractive_spans | yes_no | free_form_answer | evidence | highlighted_evidence | question | full_text | golden response | gemini_2_flash_responses | gemini_1_5_flash_responses |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | BIBREF19\nBIBREF20 | NaN | NaN | [Table TABREF19 and TABREF26 report zero-shot ... | [We compare our approaches with related approa... | which multilingual approaches do they compare ... | ## Introduction\n\nAlthough Neural Machine Tra... | BIBREF19\nBIBREF20 | The text mentions comparison with Multilingual... | The paper compares its approach with multiling... |
| 1 | False | pivoting\npivoting$_{\rm m}$ | NaN | NaN | [Table TABREF19 and TABREF26 report zero-shot ... | [We compare our approaches with related approa... | what are the pivot-based baselines? | ## Introduction\n\nAlthough Neural Machine Tra... | pivoting\npivoting$_{\rm m}$ | The pivot-based baselines are pivoting and piv... | The provided text mentions two types of pivot-... |
| 2 | False | Europarl\nMultiUN | NaN | NaN | [We evaluate our cross-lingual pre-training ba... | [We evaluate our cross-lingual pre-training ba... | which datasets did they experiment with? | ## Introduction\n\nAlthough Neural Machine Tra... | Europarl\nMultiUN | They experimented with the Europarl and MultiU... | The experiments used two public datasets: Euro... |
| 3 | False | NaN | NaN | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... | [For MultiUN corpus, we use four languages: En... | [For MultiUN corpus, we use four languages: En... | what language pairs are explored? | ## Introduction\n\nAlthough Neural Machine Tra... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... | The language pairs explored in this paper are:... | The paper explores the following language pair... |
| 4 | False | Stanford NER\nspaCy 2.0 \nrecurrent model with... | NaN | NaN | [In this section we describe a number of exper... | [In this section we describe a number of exper... | what ner models were evaluated? | ## Introduction\n\nNamed entity recognition is... | Stanford NER\nspaCy 2.0 \nrecurrent model with... | Based on the provided text, the following NER ... | Stanford NER, spaCy 2.0, and a recurrent model... |
Defining Metrics For Evaluation
We are benchmarking a question-answering task, and we want to ensure that each question is answered properly and accurately. To achieve this, we use the following metrics from Ragas (you can find the complete list of metrics available in Ragas here):
- Answer Accuracy: Measures how closely a response matches the reference answer.
- Answer Correctness: Assesses the alignment between the generated answer and the reference answer.
- Factual Correctness: Checks if all statements in a response are supported by the reference answer.
For each question, we also know whether it can be answered from the provided context, and we want to see whether the model correctly identifies this. For this purpose, we define a custom binary metric using AspectCritic.
from ragas.metrics import AnswerAccuracy, AnswerCorrectness, FactualCorrectness, AspectCritic
import getpass
import os
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
aspect_critic = AspectCritic(
name="unanswerable",
definition="Return 1 if the query cannot be answered by the provided context, otherwise return 0.",
llm=evaluator_llm,
)
metrics = [
AnswerAccuracy(llm=evaluator_llm),
AnswerCorrectness(llm=evaluator_llm, weights=[1, 0]),
aspect_critic,
FactualCorrectness(llm=evaluator_llm),
]
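If you want to sanity-check a metric before running the full evaluation, you can score a single sample directly. The sketch below assumes the SingleTurnSample schema and the single_turn_ascore API of your ragas version; the sample values are adapted from the processed data shown earlier:
from ragas.dataset_schema import SingleTurnSample

test_sample = SingleTurnSample(
    user_input="which datasets did they experiment with?",
    response="They experimented with the Europarl and MultiUN corpora.",
    reference="Europarl\nMultiUN",
)
# Run inside a notebook or another async context
await aspect_critic.single_turn_ascore(test_sample)  # expected to return 0 or 1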
Benchmarking on Ragas Metrics
We format the processed data into a Ragas EvaluationDataset and then apply the metrics to evaluate model performance; more information on EvaluationDataset can be found here. We'll construct the EvaluationDataset using the questions, the golden responses, and the answers generated by the Gemini models from our processed dataset.
Gemini 2.0 Flash
We'll create an EvaluationDataset for Gemini 2.0 Flash.
from ragas.dataset_schema import EvaluationDataset
dataset_list = []
for i in range(processed_dataset.shape[0]):
sample = {
"user_input": (
"" if pd.isna(processed_dataset.iloc[i].get("question")) else processed_dataset.iloc[i].get("question")
),
"reference": (
""
if pd.isna(processed_dataset.iloc[i].get("golden response"))
else processed_dataset.iloc[i].get("golden response")
),
"response": (
""
if pd.isna(processed_dataset["gemini_2_flash_responses"].iloc[i])
else processed_dataset["gemini_2_flash_responses"].iloc[i]
),
}
dataset_list.append(sample)
gemini_2_dataset = EvaluationDataset.from_list(dataset_list)
gemini_2_dataset.to_pandas().head()
| | user_input | response | reference |
|---|---|---|---|
| 0 | which multilingual approaches do they compare ... | The text mentions comparison with Multilingual... | BIBREF19\nBIBREF20 |
| 1 | what are the pivot-based baselines? | The pivot-based baselines are pivoting and piv... | pivoting\npivoting$_{\rm m}$ |
| 2 | which datasets did they experiment with? | They experimented with the Europarl and MultiU... | Europarl\nMultiUN |
| 3 | what language pairs are explored? | The language pairs explored in this paper are:... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... |
| 4 | what ner models were evaluated? | Based on the provided text, the following NER ... | Stanford NER\nspaCy 2.0 \nrecurrent model with... |
Now, let's evaluate the responses of Gemini 2.0 Flash.
from ragas import evaluate
gemini_2_flash_score = evaluate(dataset=gemini_2_dataset, metrics=metrics)
gemini_2_flash_score.to_pandas().head()
| | user_input | response | reference | nv_accuracy | answer_correctness | unanswerable | factual_correctness(mode=f1) |
|---|---|---|---|---|---|---|---|
| 0 | which multilingual approaches do they compare ... | The text mentions comparison with Multilingual... | BIBREF19\nBIBREF20 | 0.25 | 0.400000 | 0 | 0.5 |
| 1 | what are the pivot-based baselines? | The pivot-based baselines are pivoting and piv... | pivoting\npivoting$_{\rm m}$ | 0.25 | 0.000000 | 0 | 0.0 |
| 2 | which datasets did they experiment with? | They experimented with the Europarl and MultiU... | Europarl\nMultiUN | 1.00 | 1.000000 | 0 | 0.0 |
| 3 | what language pairs are explored? | The language pairs explored in this paper are:... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... | 0.25 | 0.545455 | 0 | 0.0 |
| 4 | what ner models were evaluated? | Based on the provided text, the following NER ... | Stanford NER\nspaCy 2.0 \nrecurrent model with... | 0.50 | 0.600000 | 0 | 0.0 |
As a completely optional step, if you want to upload the evaluation results to the Ragas app, you can run the command below. You can learn more about the Ragas app here.
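A minimal sketch, assuming your installed ragas version exposes the upload() helper on the evaluation result and that your Ragas app token is configured in the environment:
# Optional: push the scored results to the Ragas app (requires a Ragas app account/token)
gemini_2_flash_score.upload()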
Gemini 1.5 Flash
Next, we'll follow similar steps for Gemini 1.5 Flash.
from ragas.dataset_schema import EvaluationDataset
dataset_list = []
for i in range(processed_dataset.shape[0]):
sample = {
"user_input": (
"" if pd.isna(processed_dataset.iloc[i].get("question")) else processed_dataset.iloc[i].get("question")
),
"reference": (
""
if pd.isna(processed_dataset.iloc[i].get("golden response"))
else processed_dataset.iloc[i].get("golden response")
),
"response": (
""
if pd.isna(processed_dataset["gemini_1_5_flash_responses"].iloc[i])
else processed_dataset["gemini_1_5_flash_responses"].iloc[i]
),
}
dataset_list.append(sample)
gemini_1_5_dataset = EvaluationDataset.from_list(dataset_list)
gemini_1_5_dataset.to_pandas().head()
| | user_input | response | reference |
|---|---|---|---|
| 0 | which multilingual approaches do they compare ... | The paper compares its approach with multiling... | BIBREF19\nBIBREF20 |
| 1 | what are the pivot-based baselines? | The provided text mentions two types of pivot-... | pivoting\npivoting$_{\rm m}$ |
| 2 | which datasets did they experiment with? | The experiments used two public datasets: Euro... | Europarl\nMultiUN |
| 3 | what language pairs are explored? | The paper explores the following language pair... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... |
| 4 | what ner models were evaluated? | Stanford NER, spaCy 2.0, and a recurrent model... | Stanford NER\nspaCy 2.0 \nrecurrent model with... |
from ragas import evaluate
gemini_1_5_flash_score = evaluate(dataset=gemini_1_5_dataset, metrics=metrics)
gemini_1_5_flash_score.to_pandas().head()
| | user_input | response | reference | nv_accuracy | answer_correctness | unanswerable | factual_correctness(mode=f1) |
|---|---|---|---|---|---|---|---|
| 0 | which multilingual approaches do they compare ... | The paper compares its approach with multiling... | BIBREF19\nBIBREF20 | 0.25 | 0.400000 | 0 | 0.00 |
| 1 | what are the pivot-based baselines? | The provided text mentions two types of pivot-... | pivoting\npivoting$_{\rm m}$ | 0.25 | 0.181818 | 0 | 0.18 |
| 2 | which datasets did they experiment with? | The experiments used two public datasets: Euro... | Europarl\nMultiUN | 1.00 | 0.800000 | 0 | 0.00 |
| 3 | what language pairs are explored? | The paper explores the following language pair... | De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... | 0.00 | 0.533333 | 0 | 0.00 |
| 4 | what ner models were evaluated? | Stanford NER, spaCy 2.0, and a recurrent model... | Stanford NER\nspaCy 2.0 \nrecurrent model with... | 0.50 | 0.571429 | 0 | 0.00 |
Comparing the Results
Now that we have completed our evaluations, let's compare how both models performed on academic question answering.
def print_results(result):
result = result._repr_dict
print("Response Accuracy:", result.get("nv_accuracy"))
print("Answer Correctness:", result.get("answer_correctness"))
print("Factual Correctness:", result.get("factual_correctness(mode=f1)"))
print_results(gemini_1_5_flash_score)
Response Accuracy: 0.5416666666666666
Answer Correctness: 0.47723550201811066
Factual Correctness: 0.2533333333333333
print_results(gemini_2_flash_score)
Response Accuracy: 0.5666666666666667
Answer Correctness: 0.48055486996663466
Factual Correctness: 0.23633333333333334
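To view both models side by side, you can also collect the aggregate scores into a single DataFrame. This sketch reuses the same private _repr_dict attribute as the helper above, so it may change between ragas versions:
comparison_df = pd.DataFrame(
    {
        "Gemini 1.5 Flash": gemini_1_5_flash_score._repr_dict,
        "Gemini 2.0 Flash": gemini_2_flash_score._repr_dict,
    }
)
print(comparison_df.round(3))  # metrics as rows, one column per model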
Gemini 2.0 Flash performs slightly better overall.
Let's see how well the models performed at classifying whether a given question can be answered from the provided text.
For this, we'll use the result from the "unanswerable" metric and compare it with the original ground truth from the "unanswerable" column in our pre-processed dataset.
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
def print_metrics(actuals, preds, model_name="Model", zero_division_value=0):
"""
Prints common classification metrics for a given set of actual and predicted values.
Parameters:
actuals (array-like): Ground truth labels.
preds (array-like): Predicted labels.
model_name (str): Name of the model for display purposes.
zero_division_value (int or str): Sets the value to return when there is a zero division.
Options: 0, 1, or "warn" (default is 0 here).
"""
print(f"Metrics for {model_name}:")
print("Accuracy:", accuracy_score(actuals, preds))
print(
"Precision:", precision_score(actuals, preds, zero_division=zero_division_value)
)
print("Recall:", recall_score(actuals, preds, zero_division=zero_division_value))
print("F1 Score:", f1_score(actuals, preds, zero_division=zero_division_value))
print("\nClassification Report:")
print(classification_report(actuals, preds, zero_division=zero_division_value))
gemini_1_5_flash_prediction = gemini_1_5_flash_score["unanswerable"]
gemini_2_flash_prediction = gemini_2_flash_score["unanswerable"]
groundtruth = processed_dataset["unanswerable"].astype(int)
print_metrics(groundtruth, gemini_2_flash_prediction, model_name="Gemini 2 Flash")
Output
Metrics for Gemini 2 Flash:
Accuracy: 0.9333333333333333
Precision: 0.5
Recall: 1.0
F1 Score: 0.6666666666666666
Classification Report:
precision recall f1-score support
0 1.00 0.93 0.96 28
1 0.50 1.00 0.67 2
accuracy 0.93 30
macro avg 0.75 0.96 0.81 30
weighted avg 0.97 0.93 0.94 30
print_metrics(groundtruth, gemini_1_5_flash_prediction, model_name="Gemini 1.5 Flash")
Output
Metrics for Gemini 1.5 Flash:
Accuracy: 0.9
Precision: 0.3333333333333333
Recall: 0.5
F1 Score: 0.4
Classification Report:
precision recall f1-score support
0 0.96 0.93 0.95 28
1 0.33 0.50 0.40 2
accuracy 0.90 30
macro avg 0.65 0.71 0.67 30
weighted avg 0.92 0.90 0.91 30
Gemini 2.0 Flash also outperforms Gemini 1.5 Flash in identifying unanswerable questions.
What's Next
You can benchmark your models on any dataset using Ragas metrics, as long as the data is formatted into a Ragas EvaluationDataset. Try benchmarking your models on a variety of established benchmarking datasets.
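As a quick reminder of the required format, here is a minimal sketch showing how an arbitrary dataset can be shaped into a Ragas EvaluationDataset, mirroring the construction used earlier in this tutorial (the sample values are hypothetical placeholders):
from ragas.dataset_schema import EvaluationDataset

# Each sample needs the question, the model's answer, and the ground-truth reference.
samples = [
    {
        "user_input": "What optimizer was used in the experiments?",  # hypothetical question
        "response": "The paper reports using Adam.",                  # model output
        "reference": "Adam",                                          # golden answer
    },
]
my_eval_dataset = EvaluationDataset.from_list(samples)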