Tokenizers#

Megatron Core provides a unified tokenizer system with a Hugging Face-style API for configuration and loading.

Overview#

The MegatronTokenizer class uses the same entry points as many Hugging Face workflows for loading and managing tokenizers:

  • Automatic detection - Load tokenizer types without naming the backing library in code

  • Metadata-based configuration - Store tokenizer settings in JSON for reuse across runs

  • Hugging Face-compatible API - .from_pretrained()-style loading

  • Custom tokenizer support - Extend with model-specific tokenization logic

Key Features#

Unified API#

Use the same API regardless of tokenizer backend (SentencePiece, Hugging Face, TikToken, and so on):

from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer")
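
Once loaded, the tokenizer can be used directly. A minimal sketch, assuming the text encode/decode methods shown in the custom-tokenizer example later in this guide:

token_ids = tokenizer.encode("Hello, Megatron!")   # text -> token IDs
text = tokenizer.decode(token_ids)                 # token IDs -> text, should closely match the input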

Tokenizer Metadata#

Configuration is stored in a JSON metadata file containing:

  • Tokenizer library (Hugging Face, SentencePiece, TikToken, and so on)

  • Chat templates

  • Custom tokenizer class

  • Special token configurations

Benefits

  • Set configuration once, reuse everywhere

  • No repeated CLI arguments

  • Share setups by copying the tokenizer directory

Automatic Library Detection#

The correct tokenizer implementation is selected automatically:

  • Avoids hard-coding SentencePieceTokenizer, HuggingFaceTokenizer, and related class names in user code

  • Library type is read from metadata

  • Change tokenizer backends by updating metadata and paths
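
For example, the loading call stays the same across backends; only the tokenizer directory (and the library recorded in its metadata) changes. The paths below are illustrative:

# Identical call, different backends; the library is read from each directory's metadata
spm_tokenizer = MegatronTokenizer.from_pretrained("/path/to/sentencepiece/tokenizer.model")
hf_tokenizer = MegatronTokenizer.from_pretrained("/path/to/huggingface/tokenizer")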

Basic Usage#

Creating Tokenizer Metadata#

Save tokenizer configuration for reuse:

from megatron.core.tokenizers import MegatronTokenizer

# Create metadata for a SentencePiece tokenizer
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="{% for message in messages %}{{ message.content }}{% endfor %}",
)

The metadata is saved as tokenizer_metadata.json in the tokenizer directory.
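
To confirm what was written, the metadata file can be inspected like any other JSON document. A sketch, assuming the path from the example above; the exact keys depend on the Megatron Core version:

import json

with open("/path/to/tokenizer_metadata.json") as f:
    metadata = json.load(f)
print(metadata)  # library, chat template, special tokens, custom class, and so on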

Loading a Tokenizer#

Load from a directory with metadata:

from megatron.core.tokenizers import MegatronTokenizer

# Load with auto-detected configuration
tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")

Loading with Custom Metadata Path#

If metadata is stored separately:

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/custom/metadata.json",
)

Loading with Inline Metadata#

Pass metadata as a dictionary:

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/gpt2-vocab.json",
    merge_file="/path/to/gpt2-merges.txt",
)

Advanced Usage#

Custom Tokenizer Classes#

Create model-specific tokenization logic:

from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    def encode(self, text):
        # Custom encoding logic
        return super().encode(text)

    def decode(self, tokens):
        # Custom decoding logic
        return super().decode(tokens)

# Save metadata with custom class
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    tokenizer_class=CustomTokenizer,
)

TikToken Tokenizers#

Configure TikToken-based tokenizers:

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

Null Tokenizer#

The Null tokenizer is a lightweight, zero-I/O tokenizer that requires no model files. It is useful in three scenarios:

  1. Performance benchmarking with --mock-data where real tokenization is unnecessary.

  2. Testing in functional tests and CI pipelines where tokenizer model files may not be available. The Null tokenizer removes the dependency on external files, making tests self-contained and portable.

  3. Pretraining with pretokenized data where all data is already tokenized into .bin/.idx files. In this case the tokenizer is only needed for metadata (vocab_size, eod, pad) — not for actual tokenization. Using the Null tokenizer avoids redundant filesystem access at scale, which is particularly beneficial on shared filesystems like Lustre where thousands of ranks would otherwise all load the same tokenizer files.

Properties derived from --vocab-size N:

  • vocab_size = N (the exact value passed)

  • eod = N - 1 (last token in the vocabulary)

  • pad = 0

tokenizer = MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null-text"},
    vocab_size=131072,
)
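
Given the rules above, a quick sanity check on the loaded tokenizer (attribute names follow the property list):

assert tokenizer.vocab_size == 131072
assert tokenizer.eod == 131072 - 1  # last token in the vocabulary
assert tokenizer.pad == 0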

Integration with Megatron-LM#

Using with Training Scripts#

The tokenizer system works with Megatron-LM training scripts:

# Null tokenizer for benchmarking with mock data
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072 \
    --mock-data \
    ...

# Null tokenizer for pretraining with pretokenized data (no tokenizer files needed)
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type NullTokenizer \
    --vocab-size 128256 \
    --data-path /path/to/pretokenized_data \
    ...

# Hugging Face tokenizer with metadata
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json \
    ...

Auto-Generated Metadata#

If --tokenizer-metadata is not specified, a default metadata file is generated automatically based on the tokenizer type.

Supported Tokenizer Libraries#

The following table lists supported tokenizer backends:

Library         Description                Use Case
Hugging Face    Transformers tokenizers    Most modern LLMs, such as LLaMA and Mistral
SentencePiece   Google’s tokenizer         GPT-style models, custom vocabularies
TikToken        OpenAI’s tokenizer         GPT-3.5/GPT-4 style tokenization
Megatron        Built-in tokenizers        Legacy GPT-2 BPE
Null            Zero-I/O tokenizer         Benchmarking, pretokenized data

Common Tokenizer Types#

LLaMA / Mistral#

MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/llama/tokenizer.model",
    tokenizer_library="sentencepiece",
)

GPT-2#

MegatronTokenizer.write_metadata(
    tokenizer_path="GPT2BPETokenizer",
    tokenizer_library="megatron",
    vocab_file="/path/to/gpt2-vocab.json",
    merge_file="/path/to/gpt2-merges.txt",
)

Recommendations#

  1. Save metadata - Create metadata once, then reuse across training runs

  2. Prefer Hugging Face tokenizers - When the model ships one, it reduces integration work

  3. Test tokenization - Verify encode and decode before long training jobs (see the sketch after this list)

  4. Version control metadata - Track tokenizer_metadata.json with experiment configs

  5. Share tokenizer directories - Ship model files and metadata together for reproducibility
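
For recommendation 3, a minimal round-trip check might look like the following sketch (the path is illustrative, and it assumes the encode/decode methods shown earlier):

tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")

sample = "The quick brown fox jumps over the lazy dog."
token_ids = tokenizer.encode(sample)
roundtrip = tokenizer.decode(token_ids)
print(token_ids[:10], roundtrip)  # decode(encode(x)) should closely match x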

Next Steps#