Tokenizers#
Megatron Core provides a unified tokenizer system with a HuggingFace-style API for easy tokenizer management and configuration.
Overview#
The `MegatronTokenizer` class offers a simple, familiar API for loading and managing tokenizers:
- Automatic detection - Load any tokenizer type without specifying the library
- Metadata-based configuration - Store tokenizer settings in JSON for easy reuse
- HuggingFace-compatible API - Familiar `.from_pretrained()` interface
- Custom tokenizer support - Extend with model-specific tokenization logic
Key Features#
Unified API#
Use the same API regardless of tokenizer backend (SentencePiece, HuggingFace, TikToken, etc.):
```python
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer")
```
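Once loaded, encoding and decoding use the same calls regardless of backend. A minimal sketch, assuming the `encode`/`decode` method names shown in the custom tokenizer example later on this page:

```python
ids = tokenizer.encode("Hello world")  # same call for any backend
text = tokenizer.decode(ids)           # round-trips back to text
```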
Tokenizer Metadata#
Configuration is stored in a JSON metadata file containing:
- Tokenizer library (HuggingFace, SentencePiece, TikToken, etc.)
- Chat templates
- Custom tokenizer class
- Special token configurations

Benefits:

- Set configuration once, reuse everywhere
- No repeated CLI arguments
- Easy sharing - just copy the tokenizer directory
Automatic Library Detection#
The correct tokenizer implementation is automatically selected:
- No need to specify `SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.
- Library type detected from metadata
- Seamless switching between tokenizer backends
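For example, the same loading call works whether the path points at a SentencePiece model file or a HuggingFace tokenizer directory (illustrative paths; the backend is resolved from each tokenizer's metadata):

```python
from megatron.core.tokenizers import MegatronTokenizer

# Both calls use the same API; the library is picked up from the metadata.
sp_tok = MegatronTokenizer.from_pretrained("/path/to/spm/tokenizer.model")
hf_tok = MegatronTokenizer.from_pretrained("/path/to/hf/tokenizer_dir")
```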
Basic Usage#
Creating Tokenizer Metadata#
Save tokenizer configuration for reuse:
```python
from megatron.core.tokenizers import MegatronTokenizer

# Create metadata for a SentencePiece tokenizer
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="{% for message in messages %}{{ message.content }}{% endfor %}",
)
```
The metadata is saved as `tokenizer_metadata.json` in the tokenizer directory.
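For the call above, the generated file might look roughly like this. This is an illustrative sketch only: the `library` key matches the inline-metadata examples below, while the other field names are assumptions:

```json
{
  "library": "sentencepiece",
  "class": null,
  "chat_template": "{% for message in messages %}{{ message.content }}{% endfor %}"
}
```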
Loading a Tokenizer#
Load from a directory with metadata:
```python
from megatron.core.tokenizers import MegatronTokenizer

# Load with auto-detected configuration
tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")
```
Loading with Custom Metadata Path#
If metadata is stored separately:
```python
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/custom/metadata.json",
)
```
Loading with Inline Metadata#
Pass metadata as a dictionary:
```python
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)
```
Advanced Usage#
Custom Tokenizer Classes#
Create model-specific tokenization logic:
```python
from megatron.core.tokenizers import MegatronTokenizer
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    def encode(self, text):
        # Custom encoding logic (e.g., input pre-processing) goes here
        return super().encode(text)

    def decode(self, tokens):
        # Custom decoding logic (e.g., output post-processing) goes here
        return super().decode(tokens)

# Save metadata that records the custom class
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    tokenizer_class=CustomTokenizer,
)
```
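Because the metadata records the custom class, loading from this path should return the custom tokenizer. A minimal check, assuming `from_pretrained` resolves `tokenizer_class` from the metadata:

```python
tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")
assert isinstance(tokenizer, CustomTokenizer)  # class restored from metadata
```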
TikToken Tokenizers#
Configure TikToken-based tokenizers:
```python
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)
```
Null Tokenizer#
Use a null tokenizer for testing or non-text models:
```python
tokenizer = MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
```
Integration with Megatron-LM#
Using with Training Scripts#
The tokenizer system integrates seamlessly with Megatron-LM training:
```bash
# Null tokenizer for testing
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072 \
    ...
```

```bash
# HuggingFace tokenizer with metadata
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json \
    ...
```
Auto-Generated Metadata#
If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically based on the tokenizer type.
Legacy Tokenizer Support#
The old tokenizer system is still supported for backward compatibility:
```bash
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --legacy-tokenizer \
    ...
```
Supported Tokenizer Libraries#
| Library | Description | Use Case |
|---|---|---|
| HuggingFace | Transformers tokenizers | Most modern LLMs (LLaMA, Mistral, etc.) |
| SentencePiece | Google’s tokenizer | GPT-style models, custom vocabularies |
| TikToken | OpenAI’s tokenizer | GPT-3.5/GPT-4 style tokenization |
| Megatron | Built-in tokenizers | Legacy GPT-2 BPE |
| Null | No-op tokenizer | Testing, non-text modalities |
Common Tokenizer Types#
LLaMA / Mistral#
```python
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/llama/tokenizer.model",
    tokenizer_library="sentencepiece",
)
```
GPT-2#
```python
MegatronTokenizer.write_metadata(
    tokenizer_path="GPT2BPETokenizer",
    tokenizer_library="megatron",
    vocab_file="/path/to/gpt2-vocab.json",
    merge_file="/path/to/gpt2-merges.txt",
)
```
Best Practices#
- Always save metadata - Create metadata once, reuse across training runs
- Use HuggingFace tokenizers - When possible, for modern LLM compatibility
- Test tokenization - Verify encode/decode before starting training (see the sketch after this list)
- Version control metadata - Include `tokenizer_metadata.json` in your experiment configs
- Share tokenizer directories - Include both model files and metadata for reproducibility
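A minimal round-trip check for the "Test tokenization" item above (illustrative path; assumes the `encode`/`decode` methods shown earlier):

```python
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")

sample = "Hello, Megatron!"
ids = tokenizer.encode(sample)
print(f"{len(ids)} tokens: {ids[:10]} ...")
# Round-trips may not be byte-exact for every tokenizer; compare accordingly.
assert tokenizer.decode(ids).strip() == sample
```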
Next Steps#
- Prepare Data: See Data Preparation for preprocessing with tokenizers
- Train Models: Use tokenizers in Training Examples
- Supported Models: Check Language Models for model-specific tokenizers