Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Tokenizers
- class nemo.collections.common.tokenizers.AutoTokenizer(
    pretrained_model_name: str,
    vocab_file: str | None = None,
    merges_file: str | None = None,
    mask_token: str | None = None,
    bos_token: str | None = None,
    eos_token: str | None = None,
    pad_token: str | None = None,
    sep_token: str | None = None,
    cls_token: str | None = None,
    unk_token: str | None = None,
    use_fast: bool | None = False,
    trust_remote_code: bool | None = False,
  )
Wrapper of the HuggingFace AutoTokenizer: https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.
- __init__(
    pretrained_model_name: str,
    vocab_file: str | None = None,
    merges_file: str | None = None,
    mask_token: str | None = None,
    bos_token: str | None = None,
    eos_token: str | None = None,
    pad_token: str | None = None,
    sep_token: str | None = None,
    cls_token: str | None = None,
    unk_token: str | None = None,
    use_fast: bool | None = False,
    trust_remote_code: bool | None = False,
  )
- Parameters:
  - pretrained_model_name – corresponds to the HuggingFace AutoTokenizer’s ‘pretrained_model_name_or_path’ argument. For more details please refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP
  - vocab_file – path to a file with the vocabulary, which consists of characters separated by newlines
  - mask_token – mask token
  - bos_token – beginning-of-sequence token
  - eos_token – end-of-sequence token; usually equal to sep_token
  - pad_token – token to use for padding
  - sep_token – token used for separating sequences
  - cls_token – class token; usually equal to bos_token
  - unk_token – token to use for unknown tokens
  - use_fast – whether to use the fast HuggingFace tokenizer
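A minimal usage sketch (not from the source docs): it assumes the "bert-base-uncased" weights are reachable on the HuggingFace Hub, and uses the text_to_ids/ids_to_text methods from NeMo's common tokenizer (TokenizerSpec) interface.

```python
from nemo.collections.common.tokenizers import AutoTokenizer

# Download the matching HuggingFace tokenizer and wrap it for NeMo.
tokenizer = AutoTokenizer(pretrained_model_name="bert-base-uncased", use_fast=True)

ids = tokenizer.text_to_ids("Hello NeMo!")   # encode text to a list of token IDs
text = tokenizer.ids_to_text(ids)            # round-trip back to text
```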
- class nemo.collections.common.tokenizers.SentencePieceTokenizer(
    model_path: str,
    special_tokens: Dict[str, str] | List[str] | None = None,
    legacy: bool = False,
    chat_template: Dict | None = None,
  )
Wrapper of the SentencePiece tokenizer (google/sentencepiece).
- Parameters:
  - model_path – path to the SentencePiece tokenizer model. To create the model use create_spt_model()
  - special_tokens – either a list of special tokens or a dictionary mapping token name to token value
  - legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the possibility to add special tokens inside the wrapper.
- __init__(
    model_path: str,
    special_tokens: Dict[str, str] | List[str] | None = None,
    legacy: bool = False,
    chat_template: Dict | None = None,
  )
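A minimal usage sketch (not from the source docs): the path "tokenizer.model" is a placeholder for an existing SentencePiece model file, for example one produced by create_spt_model(); the encode/decode calls follow NeMo's common tokenizer (TokenizerSpec) interface.

```python
from nemo.collections.common.tokenizers import SentencePieceTokenizer

# Load a pre-trained SentencePiece model from disk (placeholder path).
tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")

ids = tokenizer.text_to_ids("Hello NeMo!")   # encode text to token IDs
text = tokenizer.ids_to_text(ids)            # decode token IDs back to text
```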