Megatron Core User Guide

New Tokenizer System

1. Hugging Face–style API

The new MegatronTokenizer class provides a simple API modeled on Hugging Face’s, built around two methods:

.from_pretrained() – Load a tokenizer from a directory or file, automatically detecting the type and settings.

.write_metadata() – Save tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.

This eliminates the need for long initialization arguments and hard-coded settings in training scripts.
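The pattern behind these two methods can be illustrated with a toy class. The sketch below is not Megatron Core's implementation: ToyTokenizer, its whitespace tokenization, and the toy_metadata.json file name are hypothetical stand-ins, chosen only to show how saving the configuration once lets later loads take a single path argument.

```python
import json
import tempfile
from pathlib import Path


class ToyTokenizer:
    """Hypothetical stand-in; NOT the real MegatronTokenizer."""

    METADATA_NAME = "toy_metadata.json"  # assumed file name, for illustration only

    def __init__(self, lowercase: bool):
        self.lowercase = lowercase

    @classmethod
    def write_metadata(cls, tokenizer_dir: str, **settings) -> Path:
        # Persist the settings next to the tokenizer files.
        path = Path(tokenizer_dir) / cls.METADATA_NAME
        path.write_text(json.dumps(settings, indent=2))
        return path

    @classmethod
    def from_pretrained(cls, tokenizer_dir: str) -> "ToyTokenizer":
        # Read the settings back, so the caller passes only a path.
        settings = json.loads((Path(tokenizer_dir) / cls.METADATA_NAME).read_text())
        return cls(**settings)

    def tokenize(self, text: str) -> list:
        return (text.lower() if self.lowercase else text).split()


with tempfile.TemporaryDirectory() as tokenizer_dir:
    ToyTokenizer.write_metadata(tokenizer_dir, lowercase=True)  # done once
    tokenizer = ToyTokenizer.from_pretrained(tokenizer_dir)     # no settings needed
    tokens = tokenizer.tokenize("Hello Megatron")

print(tokens)
```

The training script only ever sees the from_pretrained call, which is what removes the long argument lists.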

2. Tokenizer Metadata

A metadata file (JSON) now stores all essential tokenizer configuration in one place:

  • Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken)

  • Chat templates

  • Tokenizer class

Benefits:

  • You only need to set these parameters once.

  • No more passing multiple CLI arguments for tokenizer settings.

  • Easy sharing: just copy the tokenizer directory with its metadata file.
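As a rough illustration, a metadata file might resemble the output of the sketch below. The "library" key matches the dictionary form that from_pretrained() accepts, and tokenizer_class is the field named in this guide. The "chat_template" key name, the choice to store the class as a name string, and all values are assumptions; the real file may contain other fields.

```python
import json

# Illustrative metadata only; key names other than "library" and
# "tokenizer_class" are assumptions, not a documented schema.
metadata = {
    "library": "sentencepiece",
    "chat_template": "{% for message in messages %}{{ message['content'] }}{% endfor %}",
    "tokenizer_class": "CustomTokenizer",  # assumed to be stored as a class name
}

serialized = json.dumps(metadata, indent=2)
print(serialized)

# The file is plain JSON, so it round-trips without loss.
restored = json.loads(serialized)
```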

3. Library Classes Are Now Internal

In the old system, you had to know which tokenizer library to use (SentencePieceTokenizer, HuggingFaceTokenizer, etc.) and instantiate it manually.

In the new system:

  • The library is automatically detected from the metadata.

  • The correct tokenizer implementation is chosen under the hood.

  • Users don’t need to manually manage tokenizer classes.
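One common way to implement this kind of dispatch is a registry keyed by the library name. The sketch below assumes that approach; the _REGISTRY mapping, the register decorator, and the two stand-in classes are hypothetical and do not describe Megatron Core's internals.

```python
from typing import Callable, Dict, List, Type

# Hypothetical registry mapping a library name to an implementation class.
_REGISTRY: Dict[str, Type] = {}


def register(library: str) -> Callable[[Type], Type]:
    """Class decorator that files an implementation under a library name."""
    def decorator(cls: Type) -> Type:
        _REGISTRY[library] = cls
        return cls
    return decorator


@register("whitespace")
class WhitespaceImpl:
    def tokenize(self, text: str) -> List[str]:
        return text.split()


@register("char")
class CharImpl:
    def tokenize(self, text: str) -> List[str]:
        return list(text)


def build_tokenizer(metadata: dict):
    """Pick the implementation named by metadata["library"]."""
    library = metadata["library"]
    if library not in _REGISTRY:
        raise ValueError(f"Unknown tokenizer library: {library!r}")
    return _REGISTRY[library]()


tokenizer = build_tokenizer({"library": "char"})
print(type(tokenizer).__name__, tokenizer.tokenize("abc"))
```

With this design, callers depend only on the metadata value, and adding a new library means registering one more class.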

4. Support for Model-Specific Tokenizer Classes

The system now supports:

  • Built-in LLM-specific tokenizers.

  • Custom tokenizers: You can create your own tokenizer class by inheriting from MegatronTokenizerText and specify it in the tokenizer_class field in the metadata file.

This allows advanced customization while keeping the defaults simple for most users.
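The inheritance pattern can be sketched without Megatron Core installed. In the example below, BaseTextTokenizer is a hypothetical stand-in for MegatronTokenizerText, and the tokenize and detokenize method names are assumptions about the interface, so check the real base class before overriding anything.

```python
from typing import List


class BaseTextTokenizer:
    """Hypothetical stand-in for MegatronTokenizerText (byte-level toy)."""

    def tokenize(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def detokenize(self, ids: List[int]) -> str:
        return bytes(ids).decode("utf-8")


class CustomTokenizer(BaseTextTokenizer):
    """Adds a beginning-of-sequence id around the inherited behavior."""

    BOS_ID = 256  # outside the 0-255 byte range, so it cannot collide

    def tokenize(self, text: str) -> List[int]:
        return [self.BOS_ID] + super().tokenize(text)

    def detokenize(self, ids: List[int]) -> str:
        # Drop the BOS id before delegating to the base class.
        return super().detokenize([i for i in ids if i != self.BOS_ID])


tokenizer = CustomTokenizer()
ids = tokenizer.tokenize("hi")
print(ids, tokenizer.detokenize(ids))
```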

5. Usage

Creating and Saving Metadata

    from megatron.core.tokenizers import MegatronTokenizer

    # The metadata will be stored as a file named tokenizer_metadata.json
    # inside the tokenizer's directory.
    MegatronTokenizer.write_metadata(
        tokenizer_path="/path/to/tokenizer.model",
        tokenizer_library="sentencepiece",
        chat_template="chat template in jinja format",
    )

    # To use custom tokenizer class
    from megatron.core.tokenizers.text import MegatronTokenizerText

    class CustomTokenizer(MegatronTokenizerText):
        ...

    MegatronTokenizer.write_metadata(
        tokenizer_path="/path/to/tokenizer.model",
        tokenizer_library="sentencepiece",
        chat_template="chat template in jinja format",
        tokenizer_class=CustomTokenizer,
    )

    # To save metadata to another dir
    MegatronTokenizer.write_metadata(
        tokenizer_path="/path/to/tokenizer.model",
        tokenizer_library="sentencepiece",
        metadata_path="/path/to/save/metadata.json",
    )

Restoring the Tokenizer

    from megatron.core.tokenizers import MegatronTokenizer

    MegatronTokenizer.from_pretrained(
        tokenizer_path="/path/to/tokenizer.model",
    )

    # If metadata is not in the tokenizer's dir
    MegatronTokenizer.from_pretrained(
        tokenizer_path="/path/to/tokenizer.model",
        metadata_path="/path/to/metadata.json",
    )

    # Pass metadata as dict
    MegatronTokenizer.from_pretrained(
        tokenizer_path="GPT2BPETokenizer",
        metadata_path={"library": "megatron"},
        vocab_file="/path/to/vocab.txt",
    )

    # Pass additional params
    MegatronTokenizer.from_pretrained(
        tokenizer_path="/path/to/tokenizer/model.json",
        metadata_path={"library": "tiktoken"},
        pattern="v2",
        num_special_tokens=1000,
    )

    # Null tokenizer
    MegatronTokenizer.from_pretrained(
        metadata_path={"library": "null"},
        vocab_size=131072,
    )

6. Megatron-LM Pretraining Compatibility

The new tokenizer system is compatible with the Megatron-LM pretraining script. If --tokenizer-metadata is not specified, a default metadata file is generated automatically.

    # Null tokenizer
    torchrun --nproc_per_node=1 pretrain_gpt.py \
        ... \
        --tokenizer-type NullTokenizer \
        --vocab-size 131072

    # HuggingFace tokenizer with specified metadata
    torchrun --nproc_per_node=1 pretrain_gpt.py \
        ... \
        --tokenizer-type HuggingFaceTokenizer \
        --tokenizer-model meta-llama/Meta-Llama-3-8B \
        --tokenizer-metadata /path/to/metadata.json
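The fallback to a default metadata file can be pictured as a small mapping from command-line arguments to a metadata dictionary. This is only a sketch under assumptions: the default_metadata helper and the TYPE_TO_LIBRARY table are hypothetical, the "huggingface" library string is a guess, and the actual logic in Megatron-LM may differ. The flag names are the ones used in the commands above.

```python
import argparse
import json
from typing import List

# Assumed mapping from --tokenizer-type to a metadata "library" value.
TYPE_TO_LIBRARY = {
    "NullTokenizer": "null",
    "HuggingFaceTokenizer": "huggingface",  # guessed library string
    "GPT2BPETokenizer": "megatron",
}


def default_metadata(argv: List[str]) -> dict:
    """Load --tokenizer-metadata if given; otherwise derive a default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer-type", required=True)
    parser.add_argument("--tokenizer-metadata", default=None)
    args, _unknown = parser.parse_known_args(argv)  # ignore other training flags

    if args.tokenizer_metadata is not None:
        with open(args.tokenizer_metadata) as f:
            return json.load(f)
    return {"library": TYPE_TO_LIBRARY[args.tokenizer_type]}


metadata = default_metadata(
    ["--tokenizer-type", "NullTokenizer", "--vocab-size", "131072"]
)
print(metadata)
```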

The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, simply add the --legacy-tokenizer flag.

© Copyright 2022-2025, NVIDIA. Last updated on Sep 16, 2025.