Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Tokenizers#

class nemo.collections.common.tokenizers.AutoTokenizer(
pretrained_model_name: str,
vocab_file: str | None = None,
merges_file: str | None = None,
mask_token: str | None = None,
bos_token: str | None = None,
eos_token: str | None = None,
pad_token: str | None = None,
sep_token: str | None = None,
cls_token: str | None = None,
unk_token: str | None = None,
use_fast: bool | None = False,
trust_remote_code: bool | None = False,
)#

Wrapper around the HuggingFace AutoTokenizer (https://huggingface.co/transformers/model_doc/auto.html#autotokenizer).

__init__(
pretrained_model_name: str,
vocab_file: str | None = None,
merges_file: str | None = None,
mask_token: str | None = None,
bos_token: str | None = None,
eos_token: str | None = None,
pad_token: str | None = None,
sep_token: str | None = None,
cls_token: str | None = None,
unk_token: str | None = None,
use_fast: bool | None = False,
trust_remote_code: bool | None = False,
)#
Parameters:
  • pretrained_model_name – corresponds to the HuggingFace AutoTokenizer's pretrained_model_name_or_path argument. For more details, refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP

  • vocab_file – path to a file with the vocabulary, consisting of entries separated by newlines.

  • merges_file – path to the merges file used by BPE-based tokenizers

  • mask_token – mask token

  • bos_token – the beginning of sequence token

  • eos_token – the end of sequence token. Usually equal to sep_token

  • pad_token – token to use for padding

  • sep_token – token used for separating sequences

  • cls_token – class token. Usually equal to bos_token

  • unk_token – token to use for unknown tokens

  • use_fast – whether to use the fast HuggingFace tokenizer implementation

  • trust_remote_code – whether to trust and execute custom code from the model repository when loading the tokenizer (passed through to HuggingFace from_pretrained)
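
A minimal usage sketch (the model name is only an example; text_to_ids and ids_to_text are assumed from the common tokenizer interface shared by NeMo tokenizers):

from nemo.collections.common.tokenizers import AutoTokenizer

# Load a HuggingFace tokenizer by name ("bert-base-uncased" is an example).
tokenizer = AutoTokenizer(pretrained_model_name="bert-base-uncased")

# Round-trip a sentence through the common tokenizer interface.
ids = tokenizer.text_to_ids("NeMo wraps HuggingFace tokenizers.")
print(ids)
print(tokenizer.ids_to_text(ids))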

class nemo.collections.common.tokenizers.SentencePieceTokenizer(
model_path: str,
special_tokens: Dict[str, str] | List[str] | None = None,
legacy: bool = False,
chat_template: Dict | None = None,
)#

Wrapper around the SentencePiece tokenizer (google/sentencepiece).

Parameters:
  • model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()

  • special_tokens – either a list of special tokens or a dictionary mapping token names to token values

  • legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the ability to add special tokens inside the wrapper.

__init__(
model_path: str,
special_tokens: Dict[str, str] | List[str] | None = None,
legacy: bool = False,
chat_template: Dict | None = None,
)#
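
A minimal usage sketch (the model file path is hypothetical; text_to_ids and ids_to_text are assumed from the common tokenizer interface):

from nemo.collections.common.tokenizers import SentencePieceTokenizer

# "tokenizer.model" is a placeholder path to a trained SentencePiece model,
# e.g. one produced by create_spt_model().
tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")

ids = tokenizer.text_to_ids("Hello world")
print(tokenizer.ids_to_text(ids))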
class nemo.collections.common.tokenizers.TokenizerSpec#

Inherit this class to implement a new tokenizer.

__init__()#
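
As an illustration, here is a toy subclass sketch, assuming the conventional TokenizerSpec methods used by the wrappers above (text_to_tokens, tokens_to_text, tokens_to_ids, ids_to_tokens, text_to_ids, ids_to_text); the whitespace splitting and fixed vocabulary are illustrative only:

from typing import List

from nemo.collections.common.tokenizers import TokenizerSpec


class WhitespaceTokenizer(TokenizerSpec):
    """Toy tokenizer: splits on whitespace and maps tokens through a fixed vocabulary."""

    def __init__(self, vocab: List[str]):
        super().__init__()
        self.vocab = vocab
        self.token_to_id = {t: i for i, t in enumerate(vocab)}

    def text_to_tokens(self, text):
        return text.split()

    def tokens_to_text(self, tokens):
        return " ".join(tokens)

    def tokens_to_ids(self, tokens):
        return [self.token_to_id[t] for t in tokens]

    def ids_to_tokens(self, ids):
        return [self.vocab[i] for i in ids]

    def text_to_ids(self, text):
        return self.tokens_to_ids(self.text_to_tokens(text))

    def ids_to_text(self, ids):
        return self.tokens_to_text(self.ids_to_tokens(ids))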