Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Tokenizers

NeMo 1.0 (Previous Release)

In NeMo 1.0, tokenizers were configured in the tokenizer section of the YAML configuration file.
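For reference, a NeMo 1.0 tokenizer section looked roughly like the following sketch. The exact keys varied by model and recipe, and the values below are placeholders rather than a working configuration.

tokenizer:
  library: 'megatron'
  type: 'GPT2BPETokenizer'
  model: null        # e.g. a SentencePiece model file
  vocab_file: null   # e.g. a GPT-2 BPE vocabulary file
  merge_file: null   # e.g. a GPT-2 BPE merges file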

NeMo 2.0 (New Release)

In NeMo 2.0, tokenizers can be initialized directly in Python. get_nmt_tokenizer is a utility function used throughout NeMo to instantiate the tokenizers most commonly used for LLM and multimodal training. For example, the following code constructs a GPT2BPETokenizer.

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="megatron",
    model_name="GPT2BPETokenizer",
    vocab_file="/path/to/vocab",
    merges_file="/path/to/merges",
)

Similarly, the following constructs a SentencePiece tokenizer.

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="sentencepiece",
    tokenizer_model="/path/to/sentencepiece/model",
)

Refer to the get_nmt_tokenizer source code for the full list of supported arguments.
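Other tokenizer libraries are selected through the same library argument. As an illustrative sketch, a Hugging Face tokenizer is typically constructed as shown below; the model name "gpt2" and the use_fast flag are placeholders here, so check the function signature for the exact options available in your NeMo version.

from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

# Illustrative only: load a Hugging Face tokenizer by model name.
tokenizer = get_nmt_tokenizer(
    library="huggingface",
    model_name="gpt2",
    use_fast=True,
)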