Tokenizers

class nemo.collections.common.tokenizers.AutoTokenizer(
pretrained_model_name: str,
vocab_file: str | None = None,
merges_file: str | None = None,
mask_token: str | None = None,
bos_token: str | None = None,
eos_token: str | None = None,
pad_token: str | None = None,
sep_token: str | None = None,
cls_token: str | None = None,
unk_token: str | None = None,
additional_special_tokens: List | None = [],
use_fast: bool | None = False,
trust_remote_code: bool | None = False,
)

Wrapper around the HuggingFace AutoTokenizer (https://huggingface.co/transformers/model_doc/auto.html#autotokenizer).

__init__(
pretrained_model_name: str,
vocab_file: str | None = None,
merges_file: str | None = None,
mask_token: str | None = None,
bos_token: str | None = None,
eos_token: str | None = None,
pad_token: str | None = None,
sep_token: str | None = None,
cls_token: str | None = None,
unk_token: str | None = None,
additional_special_tokens: List | None = [],
use_fast: bool | None = False,
trust_remote_code: bool | None = False,
)
Parameters:
  • pretrained_model_name – corresponds to the HuggingFace AutoTokenizer's 'pretrained_model_name_or_path' input argument. For more details, please refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP

  • vocab_file – path to a file containing the vocabulary, with entries separated by newlines.

  • merges_file – path to the merges file used by BPE-based tokenizers (e.g., GPT-2).

  • mask_token – mask token

  • bos_token – the beginning-of-sequence token

  • eos_token – the end-of-sequence token; usually equal to sep_token

  • pad_token – token to use for padding

  • sep_token – token used for separating sequences

  • cls_token – class token; usually equal to bos_token

  • unk_token – token to use for unknown tokens

  • additional_special_tokens – list of tokens besides the standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

  • use_fast – whether to use the fast HuggingFace tokenizer implementation

  • trust_remote_code – whether to allow loading custom tokenizer code from a remote HuggingFace Hub repository (passed through to AutoTokenizer.from_pretrained)
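
A minimal usage sketch, assuming NeMo is installed and the model name resolves on the HuggingFace Hub ("gpt2" is only an illustrative choice):

    from nemo.collections.common.tokenizers import AutoTokenizer

    # Wrap the HuggingFace "gpt2" tokenizer; the tokenizer files are
    # downloaded from the Hub on first use.
    tokenizer = AutoTokenizer(pretrained_model_name="gpt2", use_fast=True)

    ids = tokenizer.text_to_ids("Hello world!")   # text -> token ids
    text = tokenizer.ids_to_text(ids)             # token ids -> text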

class nemo.collections.common.tokenizers.SentencePieceTokenizer(
model_path: str,
special_tokens: Dict[str, str] | List[str] | None = None,
legacy: bool = False,
ignore_extra_whitespaces: bool = True,
chat_template: Dict | None = None,
)

Wrapper around the SentencePiece tokenizer (google/sentencepiece).

Parameters:
  • model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()

  • special_tokens – either a list of special tokens or a dictionary mapping token names to token values

  • legacy – when set to True, restores the previous behavior of the SentencePiece wrapper, including the ability to add special tokens inside the wrapper.

  • ignore_extra_whitespaces – whether to ignore extra whitespace in the input text while encoding. Note: this exists for current models whose tokenizers learned to ignore extra whitespace by default. To check whether a tokenizer ignores extra whitespace by default, refer to its self.removed_extra_spaces attribute. A parameter was added to process_asr_tokenizer.py so that upcoming models handle this natively.

__init__(
model_path: str,
special_tokens: Dict[str, str] | List[str] | None = None,
legacy: bool = False,
ignore_extra_whitespaces: bool = True,
chat_template: Dict | None = None,
)
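
A minimal usage sketch, assuming a trained SentencePiece model file already exists on disk (the path below is hypothetical):

    from nemo.collections.common.tokenizers import SentencePieceTokenizer

    # Load a trained SentencePiece model (hypothetical path; the file can be
    # produced with create_spt_model() or the sentencepiece trainer itself).
    tokenizer = SentencePieceTokenizer(model_path="/path/to/tokenizer.model")

    tokens = tokenizer.text_to_tokens("Hello world!")  # text -> subword pieces
    ids = tokenizer.text_to_ids("Hello world!")        # text -> token ids
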
class nemo.collections.common.tokenizers.TokenizerSpec

Inherit from this class to implement a new tokenizer.

__init__()
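
To illustrate the interface a subclass is expected to provide, here is a minimal sketch of a toy whitespace tokenizer. The six conversion methods implemented below (text_to_tokens, tokens_to_text, tokens_to_ids, ids_to_tokens, text_to_ids, ids_to_text) follow the convention used by the tokenizers above; the whitespace splitting and fixed vocabulary are purely illustrative:

    from typing import Dict, List

    from nemo.collections.common.tokenizers import TokenizerSpec

    class WhitespaceTokenizer(TokenizerSpec):
        """Toy tokenizer that splits text on whitespace (illustrative only)."""

        def __init__(self, vocab: List[str]):
            self.vocab = vocab
            self._tok2id: Dict[str, int] = {t: i for i, t in enumerate(vocab)}

        def text_to_tokens(self, text: str) -> List[str]:
            return text.split()

        def tokens_to_text(self, tokens: List[str]) -> str:
            return " ".join(tokens)

        def tokens_to_ids(self, tokens: List[str]) -> List[int]:
            return [self._tok2id[t] for t in tokens]

        def ids_to_tokens(self, ids: List[int]) -> List[str]:
            return [self.vocab[i] for i in ids]

        def text_to_ids(self, text: str) -> List[int]:
            return self.tokens_to_ids(self.text_to_tokens(text))

        def ids_to_text(self, ids: List[int]) -> str:
            return self.tokens_to_text(self.ids_to_tokens(ids))

    # Usage (illustrative): round-trip a short string through the tokenizer.
    tok = WhitespaceTokenizer(vocab=["Hello", "world!"])
    assert tok.ids_to_text(tok.text_to_ids("Hello world!")) == "Hello world!"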