Tokenizers#

NeMo 1.0 (Previous Release)#

In NeMo 1.0, tokenizers were configured in the tokenizer section of the YAML configuration file.
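For example, a GPT-2 BPE tokenizer was typically described like this (a representative snippet modeled on the NeMo 1.0 Megatron GPT config; exact field names vary across model configs):

model:
  tokenizer:
    library: 'megatron'
    type: 'GPT2BPETokenizer'
    model: null
    vocab_file: /path/to/vocab.json
    merge_file: /path/to/merges.txt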

NeMo 2.0 (New Release)#

In NeMo 2.0, tokenizers can be initialized directly in Python. get_nmt_tokenizer is a utility function that NeMo uses to instantiate many of the common tokenizers for LLM and multimodal training. For example, the following code will construct a GPT2BPETokenizer:

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="megatron",
    model_name="GPT2BPETokenizer",
    vocab_file="/path/to/vocab",    # GPT-2 vocabulary file
    merges_file="/path/to/merges",  # GPT-2 BPE merges file
)

The following will construct a SentencePiece tokenizer:

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="sentencepiece",
    tokenizer_model="/path/to/sentencepiece/model",
)

The following will construct a Hugging Face tokenizer:

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="huggingface",
    model_name="nvidia/Minitron-4B-Base",
    use_fast=True,
)

Refer to the get_nmt_tokenizer code for a full list of supported arguments.
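Tokenizers returned by get_nmt_tokenizer implement NeMo's common TokenizerSpec interface, so downstream code can use them interchangeably. A minimal usage sketch (the input string is illustrative):

# Round-trip a string through the tokenizer
ids = tokenizer.text_to_ids("Hello, NeMo!")
text = tokenizer.ids_to_text(ids)

print(ids)                   # list of token IDs
print(text)                  # decoded text
print(tokenizer.vocab_size)  # vocabulary size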

To set up the tokenizer using nemo_run, use the following code:

import nemo_run as run
from nemo.collections.common.tokenizers import SentencePieceTokenizer
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer

# Set up SentencePiece tokenizer
tokenizer = run.Config(SentencePieceTokenizer, model_path="/path/to/tokenizer.model")

# Set up Hugging Face tokenizer
tokenizer = run.Config(AutoTokenizer, pretrained_model_name="/path/to/tokenizer/model")

Refer to the SentencePieceTokenizer or AutoTokenizer code for a full list of supported arguments.
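Note that run.Config does not instantiate the tokenizer immediately; it records the target class and its arguments, and nemo_run builds the object when the task runs. Until then, fields can be adjusted attribute-style (the path below is a placeholder):

# Override a config field before the run is launched
tokenizer.model_path = "/path/to/another/tokenizer.model"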

To change the tokenizer path for a model recipe, use the following code:

from functools import partial

from nemo.collections import llm

recipe = partial(llm.llama3_8b)()

# Change path for Hugging Face tokenizer
recipe.data.tokenizer.pretrained_model_name = "/path/to/tokenizer/model"

# Change tokenizer path for SentencePiece tokenizer
recipe.data.tokenizer.model_path = "/path/to/tokenizer.model"
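If the recipe's predefined tokenizer is of a different type than the one you need, you can also replace the tokenizer config wholesale instead of editing a single path (a sketch using the classes introduced above):

import nemo_run as run
from nemo.collections.common.tokenizers import SentencePieceTokenizer

# Swap in a SentencePiece tokenizer config for the recipe's data module
recipe.data.tokenizer = run.Config(SentencePieceTokenizer, model_path="/path/to/tokenizer.model")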

Basic NeMo 2.0 recipes can contain predefined tokenizers. Refer to the NeMo recipe documentation for an example of setting up the tokenizer in a recipe; one way to inspect a recipe's default tokenizer is sketched below.
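A sketch of inspecting a recipe's default tokenizer before training (the pretrain_recipe arguments shown are assumptions based on the NeMo 2.0 recipe API):

from nemo.collections import llm

# Build the default pre-training recipe and inspect its tokenizer config
recipe = llm.llama3_8b.pretrain_recipe(name="llama3_8b_pretrain", dir="/path/to/checkpoints")
print(recipe.data.tokenizer)  # the run.Config describing the predefined tokenizer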