Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Tokenizers#
NeMo 1.0 (Previous Release)#
In NeMo 1.0, tokenizers were configured in the tokenizer section of the YAML configuration file.
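For reference, a typical tokenizer section resembled the fragment below. This is a hedged sketch: the exact field names and values vary by recipe, and the paths are placeholders.

```yaml
tokenizer:
  library: megatron
  type: GPT2BPETokenizer
  model: null
  vocab_file: /path/to/vocab
  merge_file: /path/to/merges
```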
NeMo 2.0 (New Release)#
In NeMo 2.0, tokenizers can be initialized directly in Python. get_nmt_tokenizer
is a utility function used in NeMo to instantiate many of the common tokenizers used for LLM and multimodal training. For example, the following code constructs a GPT2BPETokenizer:
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="megatron",
    model_name="GPT2BPETokenizer",
    vocab_file="/path/to/vocab",
    merges_file="/path/to/merges",
)
The following constructs a SentencePiece tokenizer:
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="sentencepiece",
    tokenizer_model="/path/to/sentencepiece/model",
)
Refer to the get_nmt_tokenizer source code for a full list of supported arguments.
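Tokenizers constructed this way expose a common interface with methods such as text_to_tokens, text_to_ids, and ids_to_text. The toy whitespace tokenizer below is a minimal, self-contained sketch of that interface; it is illustrative only and not part of NeMo, which builds real tokenizers via get_nmt_tokenizer as shown above.

```python
# Toy whitespace tokenizer sketching the shared tokenizer interface
# (text_to_tokens / text_to_ids / ids_to_text). Illustrative only;
# real NeMo tokenizers wrap Megatron, SentencePiece, etc.
class ToyWhitespaceTokenizer:
    def __init__(self, corpus):
        # Build a vocabulary from whitespace-separated tokens.
        vocab = sorted(set(corpus.split()))
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    @property
    def vocab_size(self):
        return len(self.token_to_id)

    def text_to_tokens(self, text):
        return text.split()

    def text_to_ids(self, text):
        return [self.token_to_id[tok] for tok in self.text_to_tokens(text)]

    def ids_to_text(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


tokenizer = ToyWhitespaceTokenizer("hello world from nemo")
ids = tokenizer.text_to_ids("hello nemo")
text = tokenizer.ids_to_text(ids)
```

A round trip through text_to_ids and ids_to_text recovers the original string, which is the basic contract downstream training code relies on.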