New Tokenizer System#
Key Differences from the Old Tokenizer System#
1. Hugging Face–style API#
We now have a MegatronTokenizer class that provides a familiar, simple API similar to Hugging Face’s:
.from_pretrained() – Load a tokenizer from a directory or file, automatically detecting the type and settings.
.write_metadata() – Save tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.
This eliminates the need for long initialization arguments and hard-coded settings in training scripts.
2. Tokenizer Metadata#
A metadata file (JSON) now stores all essential tokenizer configuration in one place:
Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken)
Chat templates
Tokenizer class
Benefits:
You only need to set these parameters once.
No more passing multiple CLI arguments for tokenizer settings.
Easy sharing — just copy the tokenizer directory with its metadata file.
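As a rough illustration of the idea, the sketch below writes a small metadata dictionary to tokenizer_metadata.json inside a tokenizer directory and reads it back, using only the Python standard library. It is not the library's implementation. The "library" and "tokenizer_class" keys are named elsewhere in this document; the "chat_template" key name, the template string, and the helper functions save_metadata and load_metadata are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metadata contents. "library" and "tokenizer_class" appear in
# this document; the "chat_template" key name is an assumption.
metadata = {
    "library": "sentencepiece",
    "tokenizer_class": None,
    "chat_template": "{% for m in messages %}{{ m['content'] }}{% endfor %}",
}


def save_metadata(tokenizer_dir: Path, metadata: dict) -> Path:
    """Write the metadata next to the tokenizer files and return its path."""
    path = tokenizer_dir / "tokenizer_metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


def load_metadata(tokenizer_dir: Path) -> dict:
    """Read the metadata file back from the tokenizer directory."""
    path = tokenizer_dir / "tokenizer_metadata.json"
    return json.loads(path.read_text(encoding="utf-8"))


# A temporary directory stands in for the real tokenizer directory.
with tempfile.TemporaryDirectory() as tmp:
    tokenizer_dir = Path(tmp)
    save_metadata(tokenizer_dir, metadata)
    restored = load_metadata(tokenizer_dir)
```

Because the settings travel with the directory, copying that directory is enough to reproduce the same tokenizer configuration elsewhere.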
3. Library Classes Are Now Internal#
In the old system, you had to know which tokenizer library to use (SentencePieceTokenizer, HuggingFaceTokenizer, etc.) and instantiate it manually.
In the new system:
The library is automatically detected from the metadata.
The correct tokenizer implementation is chosen under the hood.
Users don’t need to manually manage tokenizer classes.
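One common way to implement this kind of selection is a registry keyed by the library name stored in the metadata. The sketch below shows that pattern in isolation; it is an assumption about the general approach, not the actual Megatron Core code. The backend classes, the BACKENDS registry, the build_backend function, and the "huggingface" library name are all hypothetical stand-ins.

```python
from typing import Callable, Dict


class SentencePieceBackend:
    """Stand-in for a SentencePiece-based implementation (hypothetical)."""

    def __init__(self, path: str) -> None:
        self.path = path


class HuggingFaceBackend:
    """Stand-in for a Hugging Face-based implementation (hypothetical)."""

    def __init__(self, path: str) -> None:
        self.path = path


# Registry mapping the metadata's "library" value to an implementation.
BACKENDS: Dict[str, Callable[[str], object]] = {
    "sentencepiece": SentencePieceBackend,
    "huggingface": HuggingFaceBackend,
}


def build_backend(tokenizer_path: str, metadata: dict) -> object:
    """Pick the implementation named in the metadata and instantiate it."""
    library = metadata["library"]
    try:
        backend_cls = BACKENDS[library]
    except KeyError:
        raise ValueError(f"Unsupported tokenizer library: {library!r}") from None
    return backend_cls(tokenizer_path)


backend = build_backend("/path/to/tokenizer.model", {"library": "sentencepiece"})
```

With this pattern the caller only supplies a path and metadata; the concrete class never appears in user code.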
4. Support for Model-specific Tokenizer Classes#
The system now supports:
Built-in LLM-specific tokenizers.
Custom tokenizers: You can create your own tokenizer class by inheriting from MegatronTokenizerText and specifying it in the tokenizer_class field in the metadata file.
This allows advanced customization while keeping defaults simple for most users.
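To make the customization pattern concrete, here is a minimal, self-contained sketch of overriding one method in a subclass while inheriting the rest. TextTokenizerBase is a stand-in written for this example, not MegatronTokenizerText; the real base class may expose different methods and constructor arguments. The whitespace tokenization, the toy vocabulary, and the LowercasingTokenizer class are likewise assumptions.

```python
from typing import Dict, List


class TextTokenizerBase:
    """Stand-in base class; the real MegatronTokenizerText may differ."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab
        self.inverse_vocab = {idx: tok for tok, idx in vocab.items()}

    def tokenize(self, text: str) -> List[int]:
        # Toy behavior: split on whitespace and look up each token.
        return [self.vocab[tok] for tok in text.split()]

    def detokenize(self, ids: List[int]) -> str:
        return " ".join(self.inverse_vocab[i] for i in ids)


class LowercasingTokenizer(TextTokenizerBase):
    """Custom subclass: normalize case, then reuse the default behavior."""

    def tokenize(self, text: str) -> List[int]:
        return super().tokenize(text.lower())


tokenizer = LowercasingTokenizer({"hello": 0, "world": 1})
ids = tokenizer.tokenize("Hello WORLD")
text = tokenizer.detokenize(ids)
```

Only the behavior that differs is overridden; everything else comes from the base class, which is what keeps the default path simple.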
5. Usage#
Creating and Saving Metadata
from megatron.core.tokenizers import MegatronTokenizer
# The metadata will be stored as a file named tokenizer_metadata.json inside the tokenizer’s directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    ...

MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save metadata to another dir
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)
Restoring the tokenizer
from megatron.core.tokenizers import MegatronTokenizer
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If metadata is not in tokenizer’s dir
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass metadata as dict
MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional params
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
6. Megatron-LM pretraining compatibility#
The new tokenizer system is compatible with the Megatron-LM pretraining script. If --tokenizer-metadata is not specified, a default metadata file is generated automatically.
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json
The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, add the --legacy-tokenizer flag.
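The interaction between these flags can be summarized as a three-way decision. The sketch below mirrors the flag names used in the commands in this section with a standalone argparse parser. It is an illustration of the described behavior, not the pretraining script's actual argument handling; build_parser, select_tokenizer_path, and the returned labels are hypothetical.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Standalone parser that mirrors the tokenizer flags shown above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer-type", required=True)
    parser.add_argument("--tokenizer-model", default=None)
    parser.add_argument("--tokenizer-metadata", default=None)
    parser.add_argument("--vocab-size", type=int, default=None)
    parser.add_argument("--legacy-tokenizer", action="store_true")
    return parser


def select_tokenizer_path(argv: List[str]) -> str:
    """Return a label for the code path the given flags would select."""
    args = build_parser().parse_args(argv)
    if args.legacy_tokenizer:
        return "legacy"  # old tokenizer system
    if args.tokenizer_metadata is None:
        return "new-with-default-metadata"  # metadata generated automatically
    return "new-with-user-metadata"  # metadata file supplied by the user


mode = select_tokenizer_path(
    ["--tokenizer-type", "NullTokenizer", "--vocab-size", "131072"]
)
```

In this sketch the legacy flag takes precedence over any metadata argument, which is an assumption; the source only states that the flag enables the legacy system.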