Tokenizers
- class nemo.collections.common.tokenizers.AutoTokenizer(
- pretrained_model_name: str,
- vocab_file: str | None = None,
- merges_file: str | None = None,
- mask_token: str | None = None,
- bos_token: str | None = None,
- eos_token: str | None = None,
- pad_token: str | None = None,
- sep_token: str | None = None,
- cls_token: str | None = None,
- unk_token: str | None = None,
- additional_special_tokens: List | None = [],
- use_fast: bool | None = False,
- trust_remote_code: bool | None = False,
- )
Wrapper of the HuggingFace AutoTokenizer: https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.
- __init__(
- pretrained_model_name: str,
- vocab_file: str | None = None,
- merges_file: str | None = None,
- mask_token: str | None = None,
- bos_token: str | None = None,
- eos_token: str | None = None,
- pad_token: str | None = None,
- sep_token: str | None = None,
- cls_token: str | None = None,
- unk_token: str | None = None,
- additional_special_tokens: List | None = [],
- use_fast: bool | None = False,
- trust_remote_code: bool | None = False,
- )
- Parameters:
pretrained_model_name – corresponds to the HuggingFace AutoTokenizer’s ‘pretrained_model_name_or_path’ input argument. For more details, please refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP
vocab_file – path to a file with the vocabulary, one entry per line
mask_token – mask token
bos_token – the beginning of sequence token
eos_token – the end of sequence token. Usually equal to sep_token
pad_token – token to use for padding
sep_token – token used for separating sequences
cls_token – class token. Usually equal to bos_token
unk_token – token to use for unknown tokens
additional_special_tokens – list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)
use_fast – whether to use the fast HuggingFace tokenizer
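A minimal usage sketch (not part of the original reference): it assumes the illustrative checkpoint name "bert-base-uncased" can be resolved by HuggingFace, and uses the common TokenizerSpec methods (text_to_ids, ids_to_tokens, ids_to_text) that NeMo tokenizers expose.

from nemo.collections.common.tokenizers import AutoTokenizer

# Wrap the pretrained HuggingFace tokenizer
# ("bert-base-uncased" is an illustrative checkpoint name).
tokenizer = AutoTokenizer(pretrained_model_name="bert-base-uncased", use_fast=True)

# Round-trip a piece of text through the tokenizer.
ids = tokenizer.text_to_ids("NeMo wraps HuggingFace tokenizers.")
tokens = tokenizer.ids_to_tokens(ids)
text = tokenizer.ids_to_text(ids)

print(tokens)
print(text)
print(tokenizer.vocab_size)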
- class nemo.collections.common.tokenizers.SentencePieceTokenizer(
- model_path: str,
- special_tokens: Dict[str, str] | List[str] | None = None,
- legacy: bool = False,
- ignore_extra_whitespaces: bool = True,
- chat_template: Dict | None = None,
- )
Wrapper of the Google SentencePiece tokenizer: https://github.com/google/sentencepiece.
- Parameters:
model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()
special_tokens – either a list of special tokens or a dictionary mapping token names to token values
legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the ability to add special tokens inside the wrapper.
ignore_extra_whitespaces – whether to ignore extra whitespaces in the input text while encoding. Note: this is intended for current model tokenizers that do not handle extra whitespaces, since by default the tokenizer learned to ignore them. To check whether a tokenizer ignores extra whitespaces by default, refer to its self.removed_extra_spaces attribute. A parameter was added to process_asr_tokenizer.py so that upcoming models handle this natively.
- __init__(
- model_path: str,
- special_tokens: Dict[str, str] | List[str] | None = None,
- legacy: bool = False,
- ignore_extra_whitespaces: bool = True,
- chat_template: Dict | None = None,
- )
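A minimal usage sketch (not part of the original reference): the model path and sample text are placeholders, and the sketch assumes a trained SentencePiece model file already exists, for example one produced by create_spt_model().

from nemo.collections.common.tokenizers import SentencePieceTokenizer

# Load an existing SentencePiece model (placeholder path).
tokenizer = SentencePieceTokenizer(model_path="/path/to/tokenizer.model")

# Encode and decode text through the common TokenizerSpec interface.
ids = tokenizer.text_to_ids("Hello world!")
tokens = tokenizer.text_to_tokens("Hello world!")
restored = tokenizer.ids_to_text(ids)

print(ids)
print(tokens)
print(restored)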