# New Tokenizer System

## Key Differences from the Old Tokenizer System

### 1. Hugging Face–style API

We now have a `MegatronTokenizer` class that provides a familiar, simple API similar to Hugging Face’s:

- `.from_pretrained()` – loads a tokenizer from a directory or file, automatically detecting its type and settings.
- `.write_metadata()` – saves the tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.

This eliminates the need for long lists of initialization arguments and hard-coded settings in training scripts.
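The whole round trip collapses to two calls, using only arguments shown in the Usage section below:

```python
from megatron.core.tokenizers import MegatronTokenizer

# Record the tokenizer configuration once...
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
)

# ...then restore the tokenizer anywhere, with no extra arguments.
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)
```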

### 2. Tokenizer Metadata

A metadata file (JSON) now stores all essential tokenizer configuration in one place:

- Tokenizer library (e.g., Hugging Face, SentencePiece, TikToken)
- Chat template
- Tokenizer class

Benefits:

- You only need to set these parameters once.
- No more passing multiple CLI arguments for tokenizer settings.
- Easy sharing: just copy the tokenizer directory with its metadata file (a sketch of the file’s contents follows below).
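As a rough sketch of what lives in that file: the exact schema is not documented here, so apart from the `library` key (which appears in the `from_pretrained()` examples below) the field names are assumptions:

```python
import json

# Hypothetical tokenizer_metadata.json contents; only the "library"
# key is confirmed by the examples below, the other keys are assumed.
with open("/path/to/tokenizer_metadata.json") as f:
    metadata = json.load(f)

print(metadata)
# {"library": "sentencepiece",
#  "chat_template": "...jinja template...",
#  "tokenizer_class": "CustomTokenizer"}
```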

### 3. Library Classes Are Now Internal

In the old system, you had to know which tokenizer library to use (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) and instantiate it manually.

In the new system:

- The library is automatically detected from the metadata.
- The correct tokenizer implementation is chosen under the hood.
- Users don’t need to manage tokenizer classes manually; see the sketch after this list.
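In practice this reduces loading to a single call. A minimal sketch (the `tokenize`/`detokenize` methods are assumptions for illustration, not a confirmed interface):

```python
from megatron.core.tokenizers import MegatronTokenizer

# No library class is named anywhere: the metadata stored next to the
# model file tells from_pretrained() which implementation to build.
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# Hypothetical usage; the method names are assumed for illustration.
ids = tokenizer.tokenize("Hello, world!")
text = tokenizer.detokenize(ids)
```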

### 4. Support for Model-specific Tokenizer Classes

The system now supports:

- Built-in LLM-specific tokenizers.
- Custom tokenizers: create your own tokenizer class by inheriting from `MegatronTokenizerText` and specifying it in the `tokenizer_class` field of the metadata file (a sketch follows below).

This allows advanced customization while keeping the defaults simple for most users.
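A minimal sketch of such a subclass, assuming the base class exposes an overridable `tokenize()` method (the override point is an assumption, not a confirmed interface):

```python
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    """Hypothetical custom tokenizer with extra pre-processing."""

    def tokenize(self, text):
        # Assumed override point: normalize the input, then delegate
        # to the parent implementation.
        return super().tokenize(text.lower())
```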

### 5. Usage

#### Creating and Saving Metadata

```python
from megatron.core.tokenizers import MegatronTokenizer

# The metadata will be stored as a file named tokenizer_metadata.json
# inside the tokenizer's directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use a custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    ...

MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save the metadata to a different directory
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)
```

#### Restoring the Tokenizer

```python
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If the metadata is not in the tokenizer's directory
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass the metadata as a dict
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional params
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
tokenizer = MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
```

### 6. Megatron-LM Pretraining Compatibility

The new tokenizer system is compatible with the Megatron-LM pretraining script. If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically.

```bash
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json
```

The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, add the `--legacy-tokenizer` flag, as shown below.
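For example (the remaining flags mirror the examples above):

```bash
# Fall back to the legacy tokenizer system
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --legacy-tokenizer \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B
```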