Tokenizers#
Megatron Core provides a unified tokenizer system with a Hugging Face-style API for configuration and loading.
Overview#
The MegatronTokenizer class uses the same entry points as many Hugging Face workflows for loading and managing tokenizers:
Automatic detection - Load tokenizer types without naming the backing library in code
Metadata-based configuration - Store tokenizer settings in JSON for reuse across runs
Hugging Face-compatible API - .from_pretrained()-style loading
Custom tokenizer support - Extend with model-specific tokenization logic
Key Features#
Unified API#
Use the same API regardless of tokenizer backend (SentencePiece, Hugging Face, TikToken, and so on):
from megatron.core.tokenizers import MegatronTokenizer
tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer")
Tokenizer Metadata#
Configuration is stored in a JSON metadata file containing:
Tokenizer library (Hugging Face, SentencePiece, TikToken, and so on)
Chat templates
Custom tokenizer class
Special token configurations
Benefits
Set configuration once, reuse everywhere
No repeated CLI arguments
Share setups by copying the tokenizer directory
Automatic Library Detection#
The correct tokenizer implementation is selected automatically:
Avoids hard-coding SentencePieceTokenizer, HuggingFaceTokenizer, and related class names in user code
Library type is read from metadata
Change tokenizer backends by updating metadata and paths
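Because the backend is resolved from metadata, the loading code stays identical across libraries. A minimal sketch, with illustrative paths:
from megatron.core.tokenizers import MegatronTokenizer

# Each directory carries its own tokenizer_metadata.json, so the same call
# resolves to a SentencePiece or Hugging Face backend without code changes.
sp_tokenizer = MegatronTokenizer.from_pretrained("/path/to/sentencepiece/tokenizer.model")
hf_tokenizer = MegatronTokenizer.from_pretrained("/path/to/hf/tokenizer")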
Basic Usage#
Creating Tokenizer Metadata#
Save tokenizer configuration for reuse:
from megatron.core.tokenizers import MegatronTokenizer
# Create metadata for a SentencePiece tokenizer
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="{% for message in messages %}{{ message.content }}{% endfor %}",
)
The metadata is saved as tokenizer_metadata.json in the tokenizer directory.
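To confirm what was written, the file can be inspected like any other JSON document. A minimal sketch, assuming the path above; field names follow the inline metadata examples on this page, and the full schema may vary by release:
import json

# tokenizer_metadata.json sits alongside the tokenizer model file
with open("/path/to/tokenizer_metadata.json") as f:
    metadata = json.load(f)

print(metadata.get("library"))        # e.g. "sentencepiece"
print(metadata.get("chat_template"))  # the Jinja template saved above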
Loading a Tokenizer#
Load from a directory with metadata:
from megatron.core.tokenizers import MegatronTokenizer
# Load with auto-detected configuration
tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")
Loading with Custom Metadata Path#
If metadata is stored separately:
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/custom/metadata.json",
)
Loading with Inline Metadata#
Pass metadata as a dictionary:
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)
Advanced Usage#
Custom Tokenizer Classes#
Create model-specific tokenization logic:
from megatron.core.tokenizers.text import MegatronTokenizerText
class CustomTokenizer(MegatronTokenizerText):
    def encode(self, text):
        # Custom encoding logic
        return super().encode(text)

    def decode(self, tokens):
        # Custom decoding logic
        return super().decode(tokens)

# Save metadata with custom class
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    tokenizer_class=CustomTokenizer,
)
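Once the metadata records the custom class, loading follows the standard flow. A sketch assuming the encode and decode methods defined above:
# The custom class stored in the metadata is instantiated automatically.
tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")

tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)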
TikToken Tokenizers#
Configure TikToken-based tokenizers:
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)
Null Tokenizer#
The Null tokenizer is a lightweight, zero-I/O tokenizer that requires no model files. It is useful in three scenarios:
Performance benchmarking with --mock-data where real tokenization is unnecessary.
Testing in functional tests and CI pipelines where tokenizer model files may not be available. The Null tokenizer removes the dependency on external files, making tests self-contained and portable.
Pretraining with pretokenized data where all data is already tokenized into .bin/.idx files. In this case the tokenizer is only needed for metadata (vocab_size, eod, pad), not for actual tokenization. Using the Null tokenizer avoids redundant filesystem access at scale, which is particularly beneficial on shared filesystems like Lustre where thousands of ranks would otherwise all load the same tokenizer files.
Properties derived from --vocab-size N:
vocab_size = N (the exact value passed)
eod = N - 1 (last token in the vocabulary)
pad = 0
tokenizer = MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null-text"},
    vocab_size=131072,
)
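For example, with vocab_size=131072 the derived values work out as follows. A quick check, assuming the attribute names match the metadata fields listed above:
# Illustrative check of the derived properties
assert tokenizer.vocab_size == 131072
assert tokenizer.eod == 131071   # N - 1, last token in the vocabulary
assert tokenizer.pad == 0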
Integration with Megatron-LM#
Using with Training Scripts#
The tokenizer system works with Megatron-LM training scripts:
# Null tokenizer for benchmarking with mock data
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072 \
    --mock-data \
    ...

# Null tokenizer for pretraining with pretokenized data (no tokenizer files needed)
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type NullTokenizer \
    --vocab-size 128256 \
    --data-path /path/to/pretokenized_data \
    ...

# Hugging Face tokenizer with metadata
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json \
    ...
Auto-Generated Metadata#
If --tokenizer-metadata is not specified, a default metadata file is generated automatically based on the tokenizer type.
Supported Tokenizer Libraries#
The following table lists supported tokenizer backends:
| Library | Description | Use Case |
|---|---|---|
| Hugging Face | Transformers tokenizers | Most modern LLMs, such as LLaMA and Mistral |
| SentencePiece | Google’s tokenizer | GPT-style models, custom vocabularies |
| TikToken | OpenAI’s tokenizer | GPT-3.5/GPT-4 style tokenization |
| Megatron | Built-in tokenizers | Legacy GPT-2 BPE |
| Null | Zero-I/O tokenizer | Benchmarking, pretokenized data |
Common Tokenizer Types#
LLaMA / Mistral#
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/llama/tokenizer.model",
    tokenizer_library="sentencepiece",
)
GPT-2#
MegatronTokenizer.write_metadata(
    tokenizer_path="GPT2BPETokenizer",
    tokenizer_library="megatron",
    vocab_file="/path/to/gpt2-vocab.json",
    merge_file="/path/to/gpt2-merges.txt",
)
Recommendations#
Save metadata - Create metadata once, then reuse across training runs
Prefer Hugging Face tokenizers - When the model ships one, it reduces integration work
Test tokenization - Verify encode and decode before long training jobs (see the sketch after this list)
Version control metadata - Track tokenizer_metadata.json with experiment configs
Share tokenizer directories - Ship model files and metadata together for reproducibility
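Testing tokenization can be as simple as a round-trip check before launching a long job. A minimal sketch, assuming the encode and decode methods shown in the custom tokenizer example:
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")

# Round-trip a short sample and inspect the result
sample = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(sample)
print(len(tokens), tokenizer.decode(tokens))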
Next Steps#
Prepare data: Refer to Data Preparation for preprocessing with tokenizers
Train models: Refer to Training Examples
Supported models: Refer to Language Models for model-specific tokenizers