core.tokenizers.text.text_tokenizer#

Module Contents#

Classes#

MegatronTokenizerText

Base class for Megatron text tokenizers.

Data#

API#

core.tokenizers.text.text_tokenizer.TOKENIZER_MAPPING_LIBRARIES#

‘OrderedDict(…)’

class core.tokenizers.text.text_tokenizer.MegatronTokenizerText(path: str, config: dict, **kwargs)#

Bases: megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase

Base class for Megatron text tokenizers.

Initialization

Parameters:
  • path (str) – path to the tokenizer model.

  • config (dict) – tokenizer parameters:
      • library (str): tokenizer library.
      • class_name (str): name of the tokenizer class.
      • class_path (str): path to the tokenizer class.
      • model_type (str): type of the model to be used with the tokenizer.
      • chat_template (str): tokenizer chat template.
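
For orientation, a minimal construction sketch follows. The model path and all config values are hypothetical placeholders, and instantiating this base class directly (rather than a library-specific subclass) is an assumption made only for illustration.

```python
from megatron.core.tokenizers.text.text_tokenizer import MegatronTokenizerText

# All values below are hypothetical placeholders; the valid set of
# libraries and model types depends on the Megatron-Core installation.
config = {
    "library": "sentencepiece",  # hypothetical tokenizer library
    "model_type": "gpt",         # hypothetical model type
}
tokenizer = MegatronTokenizerText(
    path="/path/to/tokenizer.model",  # hypothetical model file
    config=config,
)
```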

_restore_model(
**kwargs,
) → megatron.core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract#

Returns tokenizer library object.

tokenize(text: str) → List[int]#

Text tokenization.

Parameters:

text (str) – text to be tokenized.

Returns:

list of token ids.

Return type:

list
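
A usage sketch, reusing the tokenizer instance constructed above; the resulting ids depend entirely on the loaded vocabulary.

```python
ids = tokenizer.tokenize("Hello, Megatron!")
print(ids)  # list of ints; actual values depend on the vocabulary
```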

detokenize(ids: List[int]) → str#

Text detokenization.

Parameters:

ids (list) – token ids to be detokenized.

Returns:

detokenized text.

Return type:

str
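
A round-trip sketch, continuing from the instance above. Exact round-trip equality is an assumption: some tokenizer libraries normalize whitespace or special characters during detokenization.

```python
text = "Hello, Megatron!"
ids = tokenizer.tokenize(text)
restored = tokenizer.detokenize(ids)
# Treat equality as a sanity check, not an invariant; some libraries
# normalize whitespace during detokenization.
print(restored == text)
```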

apply_chat_template(
conversation: List[Dict[str, str]],
chat_template: Optional[str] = None,
**kwargs,
) → Union[str, list]#

Applies chat template to the conversation.

Parameters:
  • conversation (list) – list of messages to which the chat template is applied.

  • chat_template (Optional[str]) – chat template to apply. If not specified, the tokenizer’s chat template will be used.

Returns:

the conversation with the chat template applied, or a list of token ids.

Return type:

Union[str, list]
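
An illustrative call, assuming the common role/content message schema; the exact message format accepted is determined by the underlying chat template.

```python
conversation = [
    {"role": "user", "content": "What does this tokenizer do?"},
    {"role": "assistant", "content": "It maps text to token ids and back."},
]
# chat_template is omitted, so the tokenizer's own template is used.
prompt = tokenizer.apply_chat_template(conversation)
```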

save_pretrained(path: str) → None#

Saves HF tokenizer files.

Parameters:

path (str) – path where to save tokenizer files.
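
Since this method saves HF tokenizer files, the sketch below assumes a Hugging Face-backed tokenizer; the output directory is a placeholder.

```python
tokenizer.save_pretrained("/tmp/exported_tokenizer")  # hypothetical output path
```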

add_special_tokens(special_tokens: Union[list, dict]) → None#

Adds a dictionary of special tokens (eos, pad, cls, …). Tokens are only added if they are not already in the vocabulary; new tokens are indexed starting from the last index of the current vocabulary.

Parameters:

special_tokens (Union[list, dict]) – dict of strings. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens].
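
An illustrative call using keys from the predefined attribute list above; the token strings themselves are placeholders.

```python
tokenizer.add_special_tokens(
    {
        "pad_token": "<pad>",                    # placeholder token string
        "additional_special_tokens": ["<ctx>"],  # placeholder extra tokens
    }
)
# Tokens already in the vocabulary are skipped; new ones are appended
# after the current last vocabulary index.
```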

property additional_special_tokens_ids: list#

Returns a list of the ids of the additional special tokens.

property vocab_size: int#

Returns vocabulary size.

property vocab#

Returns tokenizer vocabulary.

property unique_identifiers: collections.OrderedDict#

Returns a dictionary of unique identifiers.

property pad: int#

Returns id of padding token.

property pad_id: int#

Returns id of padding token. Needed for NeMo.

property eod: int#

Returns id of end of document token.

property bos: int#

Returns id of beginning of sentence token.

property bos_id: int#

Returns id of beginning of sentence token. Needed for NeMo.

property eos_id: int#

Returns id of end of sentence token.

property eos: int#

Returns id of end of sentence token. Needed for legacy support.

property unk: int#

Returns id of unknown token.

property unk_id: int#

Returns id of unknown token. Needed for NeMo.

property mask: int#

Returns id of mask token.

property mask_id: int#

Returns id of mask token. Needed for NeMo.

property cls: int#

Returns id of classification token.

property cls_id: int#

Returns id of classification token. Needed for NeMo.

property sep: int#

Returns id of SEP token.

property sep_id: int#

Returns id of SEP token. Needed for NeMo.

property vocab_file: str#

Returns vocabulary file path if specified.

property merges_file: str#

Returns merges file path if specified.

property inv_vocab: dict#

Returns tokenizer vocab with reversed keys and values.
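
To close, a sketch exercising several of the properties documented above; all printed values depend on the loaded tokenizer.

```python
print(tokenizer.vocab_size)                # size of the vocabulary
print(tokenizer.eos_id, tokenizer.pad_id)  # special-token ids
inv = tokenizer.inv_vocab                  # id -> token mapping
print(inv[tokenizer.eos_id])               # token string for the EOS id
```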