Important

NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Tokenizers#

NeMo 1.0 (Previous Release)#

In NeMo 1.0, tokenizers were configured in the tokenizer section of the YAML configuration file.
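For example, a GPT-2 BPE tokenizer was typically declared with a fragment along these lines (an illustrative sketch; exact field names and defaults can vary across NeMo 1.0 model configs):

```yaml
# tokenizer section of a NeMo 1.0 model YAML config (illustrative)
tokenizer:
  library: megatron
  type: GPT2BPETokenizer
  vocab_file: /path/to/vocab
  merge_file: /path/to/merges
```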

NeMo 2.0 (New Release)#

In NeMo 2.0, tokenizers can be initialized directly in Python. get_nmt_tokenizer is a utility function used throughout NeMo to instantiate many of the common tokenizers used for LLM and multimodal training. For example, the following code constructs a GPT2BPETokenizer:

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="megatron",
    model_name="GPT2BPETokenizer",
    vocab_file="/path/to/vocab",
    merges_file="/path/to/merges",
)

The following constructs a SentencePiece tokenizer:

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="sentencepiece",
    tokenizer_model="/path/to/sentencepiece/model",
)

Refer to the get_nmt_tokenizer source code for a full list of supported arguments.