Tokenizers
Ragas supports multiple tokenizer implementations for text splitting during knowledge graph operations and test data generation.
Overview
When extracting properties from knowledge graph nodes, text is split into chunks based on token limits. By default, Ragas uses tiktoken (OpenAI's tokenizer), but you can also use HuggingFace tokenizers for better compatibility with open-source models.
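The splitting step described above can be sketched as follows. This is a minimal illustration, not Ragas's actual implementation: `WhitespaceTokenizer` and `split_by_token_limit` are hypothetical names introduced here, and the toy tokenizer simply treats each whitespace-separated word as one token so the example runs without extra packages.

```python
from typing import Dict, List


class WhitespaceTokenizer:
    """Toy stand-in tokenizer (hypothetical): one token per word."""

    def __init__(self) -> None:
        self._vocab: List[str] = []
        self._ids: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Assign a new integer ID to each previously unseen word.
        ids = []
        for word in text.split():
            if word not in self._ids:
                self._ids[word] = len(self._vocab)
                self._vocab.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, tokens: List[int]) -> str:
        # Map IDs back to words and join them with single spaces.
        return " ".join(self._vocab[t] for t in tokens)


def split_by_token_limit(text: str, tokenizer, max_tokens: int) -> List[str]:
    """Split text into chunks of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenizer.encode(text)
    # Slice the token sequence into windows, then decode each window.
    return [
        tokenizer.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


chunks = split_by_token_limit("a b c d e f g", WhitespaceTokenizer(), max_tokens=3)
print(chunks)  # ['a b c', 'd e f', 'g']
```

Because the chunk boundaries are computed on token IDs, swapping in a different tokenizer changes where the text is cut, which is why matching the tokenizer to your model can matter.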
Available Tokenizers
TiktokenWrapper
Wrapper for OpenAI's tiktoken tokenizers. This is the default tokenizer.
from ragas import TiktokenWrapper
# Using default encoding (o200k_base)
tokenizer = TiktokenWrapper()
# Using a specific encoding
tokenizer = TiktokenWrapper(encoding_name="cl100k_base")
# Using encoding for a specific model
tokenizer = TiktokenWrapper(model_name="gpt-4")
HuggingFaceTokenizer
Wrapper for HuggingFace transformers tokenizers. Use this when working with open-source models.
from ragas import HuggingFaceTokenizer
# Load tokenizer for a specific model
tokenizer = HuggingFaceTokenizer(model_name="meta-llama/Llama-2-7b-hf")
# Use a pre-initialized tokenizer
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = HuggingFaceTokenizer(tokenizer=hf_tokenizer)
Note: HuggingFace tokenizers require the transformers package. Install it with pip install transformers.
Factory Function
Use get_tokenizer() to create a tokenizer from a type name and an optional model or encoding name:
from ragas import get_tokenizer
# Default tiktoken tokenizer
tokenizer = get_tokenizer()
# Tiktoken for a specific model
tokenizer = get_tokenizer("tiktoken", model_name="gpt-4")
# HuggingFace tokenizer
tokenizer = get_tokenizer("huggingface", model_name="meta-llama/Llama-2-7b-hf")
Using Custom Tokenizers
With LLM-based Extractors
All LLM-based extractors accept a tokenizer parameter:
from ragas import HuggingFaceTokenizer
from ragas.testset.transforms import (
    SummaryExtractor,
    KeyphrasesExtractor,
    HeadlinesExtractor,
)
# Create a HuggingFace tokenizer for your model
tokenizer = HuggingFaceTokenizer(model_name="meta-llama/Llama-2-7b-hf")
# Use it with extractors (your_llm is a placeholder for an LLM you have already configured)
summary_extractor = SummaryExtractor(llm=your_llm, tokenizer=tokenizer)
keyphrase_extractor = KeyphrasesExtractor(llm=your_llm, tokenizer=tokenizer)
headlines_extractor = HeadlinesExtractor(llm=your_llm, tokenizer=tokenizer)
Custom Tokenizer Implementation
You can create your own tokenizer by extending BaseTokenizer:
from ragas.tokenizers import BaseTokenizer


class MyCustomTokenizer(BaseTokenizer):
    def __init__(self):
        # Initialize your tokenizer (for example, load a vocabulary)
        pass

    def encode(self, text: str) -> list[int]:
        # Return the token IDs for the given text
        raise NotImplementedError

    def decode(self, tokens: list[int]) -> str:
        # Return the text decoded from the given token IDs
        raise NotImplementedError
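As a concrete illustration of the interface above, here is a small sketch of a tokenizer that treats every UTF-8 byte as one token. `Utf8ByteTokenizer` is a hypothetical example, and the `BaseTokenizer` defined in the block is a local stand-in that mirrors the two documented abstract methods so the snippet runs on its own; in real code you would import `BaseTokenizer` from `ragas.tokenizers` instead.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseTokenizer(ABC):
    """Local stand-in for the documented interface (assumption)."""

    @abstractmethod
    def encode(self, text: str) -> List[int]: ...

    @abstractmethod
    def decode(self, tokens: List[int]) -> str: ...


class Utf8ByteTokenizer(BaseTokenizer):
    """Hypothetical tokenizer: each UTF-8 byte is one token."""

    def encode(self, text: str) -> List[int]:
        # Each byte value (0-255) becomes one token ID.
        return list(text.encode("utf-8"))

    def decode(self, tokens: List[int]) -> str:
        # errors="replace" keeps decoding from raising if a chunk
        # boundary falls inside a multi-byte character.
        return bytes(tokens).decode("utf-8", errors="replace")


tokenizer = Utf8ByteTokenizer()
ids = tokenizer.encode("héllo")
print(len(ids))               # 6, because "é" takes two bytes
print(tokenizer.decode(ids))  # héllo
```

A byte-level scheme like this overestimates token counts compared with a subword tokenizer, so it is mainly useful as a template for wiring in your own encode and decode logic.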
API Reference
Tokenizer abstractions for Ragas.
This module provides a unified interface for different tokenizer implementations, supporting both tiktoken (OpenAI) and HuggingFace tokenizers.
BaseTokenizer
Bases: ABC
Abstract base class for tokenizers.
encode
abstractmethod
decode
abstractmethod
TiktokenWrapper
TiktokenWrapper(encoding: Optional[Encoding] = None, model_name: Optional[str] = None, encoding_name: Optional[str] = None)
Bases: BaseTokenizer
Wrapper for tiktoken encodings (OpenAI tokenizers).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encoding | Encoding | A pre-initialized tiktoken encoding. | None |
| model_name | str | Model name to get encoding for (e.g., "gpt-4", "gpt-3.5-turbo"). | None |
| encoding_name | str | Encoding name (e.g., "cl100k_base", "o200k_base"). | None |
Source code in src/ragas/tokenizers.py
HuggingFaceTokenizer
Bases: BaseTokenizer
Wrapper for HuggingFace tokenizers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | PreTrainedTokenizer or PreTrainedTokenizerFast | A pre-initialized HuggingFace tokenizer. | None |
| model_name | str | Model name or path to load tokenizer from (e.g., "meta-llama/Llama-2-7b-hf"). | None |
get_default_tokenizer
get_default_tokenizer() -> TiktokenWrapper
Get the default tokenizer, creating it lazily on first access.
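The "created lazily on first access" behavior can be pictured with the sketch below. It is an assumption about the general pattern, not a copy of the Ragas source: `DummyTokenizer` and `get_default_tokenizer_sketch` are hypothetical names, and the dummy class only counts how many times it is constructed.

```python
from typing import Optional


class DummyTokenizer:
    """Hypothetical stand-in for a tokenizer that is costly to build."""

    instances_created = 0

    def __init__(self) -> None:
        DummyTokenizer.instances_created += 1


# Module-level cache; stays None until the first call.
_default_tokenizer: Optional[DummyTokenizer] = None


def get_default_tokenizer_sketch() -> DummyTokenizer:
    """Build the tokenizer on first use, then reuse the same instance."""
    global _default_tokenizer
    if _default_tokenizer is None:
        _default_tokenizer = DummyTokenizer()
    return _default_tokenizer


first = get_default_tokenizer_sketch()
second = get_default_tokenizer_sketch()
print(first is second)                   # True
print(DummyTokenizer.instances_created)  # 1
```

Deferring construction this way means that importing the module costs nothing until a tokenizer is actually needed.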
get_tokenizer
get_tokenizer(tokenizer_type: str = 'tiktoken', model_name: Optional[str] = None, encoding_name: Optional[str] = None) -> BaseTokenizer
Factory function to get a tokenizer instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer_type | str | Type of tokenizer: "tiktoken" or "huggingface". | 'tiktoken' |
| model_name | str | Model name for the tokenizer. | None |
| encoding_name | str | Encoding name (only for tiktoken). | None |
Returns:
| Type | Description |
|---|---|
| BaseTokenizer | A tokenizer instance. |
Examples:
>>> # Get tiktoken for a specific model
>>> tokenizer = get_tokenizer("tiktoken", model_name="gpt-4")
>>> # Get HuggingFace tokenizer
>>> tokenizer = get_tokenizer("huggingface", model_name="meta-llama/Llama-2-7b-hf")