Tokenizers#
NeMo 1.0 (Previous Release)#
In NeMo 1.0, tokenizers were configured in the tokenizer section of the YAML configuration file.
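For reference, a NeMo 1.0 tokenizer section looked roughly like the following. This is a sketch based on the 1.0 GPT config files, not an exhaustive schema; the exact fields varied by model config:

```yaml
model:
  tokenizer:
    library: megatron
    type: GPT2BPETokenizer
    model: null
    vocab_file: /path/to/vocab
    merge_file: /path/to/merges
```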
NeMo 2.0 (New Release)#
In NeMo 2.0, tokenizers can be initialized directly in Python. get_nmt_tokenizer is a utility function used throughout NeMo to instantiate many of the common tokenizers used for LLM and multimodal training. For example, the following code constructs a GPT2BPETokenizer:
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
tokenizer = get_nmt_tokenizer(
library="megatron",
model_name="GPT2BPETokenizer",
vocab_file="/path/to/vocab",
merges_file="/path/to/merges",
)
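To clarify what the vocab and merges files encode, here is a toy illustration of byte-pair-encoding merge rules, written from scratch for this example (it is not NeMo code, and real GPT-2 BPE additionally performs byte-level encoding and regex pre-tokenization):

```python
# Illustrative only: a toy byte-pair-encoding merger. The merges file of a
# GPT-2 style tokenizer is essentially an ordered list of pairs like this.

def bpe_tokenize(word, merges):
    """Greedily apply merge rules (earliest rule in the list first) to a word."""
    tokens = list(word)
    while True:
        # Collect adjacent pairs that have a merge rule, ranked by rule order.
        pairs = {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}
        ranked = [(merges.index(p), p) for p in pairs if p in merges]
        if not ranked:
            return tokens
        _, (a, b) = min(ranked)
        # Merge every occurrence of the highest-priority pair.
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_tokenize("lower", merges))  # ['low', 'er']
```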
The following will construct a SentencePiece tokenizer:
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
tokenizer = get_nmt_tokenizer(
library="sentencepiece",
tokenizer_model="/path/to/sentencepiece/model",
)
The following will construct a Hugging Face tokenizer:
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
tokenizer = get_nmt_tokenizer(
library="huggingface",
model_name='nvidia/Minitron-4B-Base',
use_fast=True,
)
Refer to the get_nmt_tokenizer code for a full list of supported arguments.
To set up the tokenizer using nemo_run, use the following code:
import nemo_run as run

from nemo.collections.common.tokenizers import SentencePieceTokenizer
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer

# Set up SentencePiece tokenizer
tokenizer = run.Config(SentencePieceTokenizer, model_path="/path/to/tokenizer.model")

# Set up Hugging Face tokenizer
tokenizer = run.Config(AutoTokenizer, pretrained_model_name="/path/to/tokenizer/model")
Refer to the SentencePieceTokenizer or AutoTokenizer code for a full list of supported arguments.
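Note that run.Config does not build the tokenizer immediately; it records the target class and its arguments so that nemo_run can construct the object later, when the experiment actually runs. Conceptually, this deferred construction resembles functools.partial. The sketch below uses a hypothetical stand-in class purely for illustration, not nemo_run's actual implementation:

```python
from functools import partial

# Hypothetical stand-in class for illustration; not a real NeMo tokenizer.
class FakeTokenizer:
    def __init__(self, model_path):
        self.model_path = model_path

# Like run.Config(FakeTokenizer, model_path=...): nothing is built yet,
# only the class and its arguments are captured.
deferred = partial(FakeTokenizer, model_path="/path/to/tokenizer.model")

# Construction happens only when the deferred config is finally invoked.
tokenizer = deferred()
print(tokenizer.model_path)  # /path/to/tokenizer.model
```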
To change the tokenizer path for a model recipe, use the following code:
from functools import partial

from nemo.collections import llm

recipe = partial(llm.llama3_8b.pretrain_recipe)()
# Change path for Hugging Face tokenizer
recipe.data.tokenizer.pretrained_model_name = "/path/to/tokenizer/model"
# Change tokenizer path for Sentence Piece tokenizer
recipe.data.tokenizer.model_path = "/path/to/tokenizer.model"
Basic NeMo 2.0 recipes can contain predefined tokenizers. Visit this page to see an example of setting up the tokenizer in the recipe.