core.tokenizers.text.text_tokenizer#

Module Contents#

Classes#

MegatronTokenizerText

Base class for Megatron text tokenizers.

Data#

API#

core.tokenizers.text.text_tokenizer.TOKENIZER_MAPPING_LIBRARIES#

‘OrderedDict(…)’

class core.tokenizers.text.text_tokenizer.MegatronTokenizerText(path: str, config: dict, **kwargs)#

Bases: megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase

Base class for Megatron text tokenizers.

Initialization

Parameters:
  • path (str) – path to the tokenizer model.

  • config (dict) – tokenizer parameters:
      • library (str): tokenizer library.
      • class_name (str): name of the tokenizer class.
      • class_path (str): path to the tokenizer class.
      • model_type (str): type of the model to be used with the tokenizer.
      • chat_template (str): tokenizer chat template.
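
For orientation, a minimal construction sketch follows. The model path and all config values are hypothetical placeholders, and instantiating this base class directly (rather than a library-specific subclass) is an assumption made only for illustration.

```python
from megatron.core.tokenizers.text.text_tokenizer import MegatronTokenizerText

# All values below are hypothetical placeholders; the valid set of
# libraries and model types depends on the Megatron-Core installation.
config = {
    "library": "sentencepiece",  # hypothetical tokenizer library
    "model_type": "gpt",         # hypothetical model type
}
tokenizer = MegatronTokenizerText(
    path="/path/to/tokenizer.model",  # hypothetical model file
    config=config,
)
```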

_restore_model(
**kwargs,
) → megatron.core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract#

Returns tokenizer library object.

tokenize(text: str) → List[int]#

Text tokenization.

Parameters:

text (str) – text to be tokenized.

Returns:

list of token ids.

Return type:

list
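
A usage sketch, reusing the tokenizer instance constructed above; the resulting ids depend entirely on the loaded vocabulary.

```python
ids = tokenizer.tokenize("Hello, Megatron!")
print(ids)  # list of ints; actual values depend on the vocabulary
```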

detokenize(ids: List[int]) → str#

Text detokenization.

Parameters:

ids (list) – token ids to be detokenized.

Returns:

detokenized text.

Return type:

str
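
A round-trip sketch, continuing from the instance above. Exact round-trip equality is an assumption: some tokenizer libraries normalize whitespace or special characters during detokenization.

```python
text = "Hello, Megatron!"
ids = tokenizer.tokenize(text)
restored = tokenizer.detokenize(ids)
# Treat equality as a sanity check, not an invariant; some libraries
# normalize whitespace during detokenization.
print(restored == text)
```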

apply_chat_template(
conversation: List[Dict[str, str]],
chat_template: Optional[str] = None,
**kwargs,
) → Union[str, list]#

Applies chat template to the conversation.

Parameters:
  • conversation (list) – list of messages to which the chat template is applied.

  • chat_template (Optional[str]) – chat template to apply. If not specified, the tokenizer’s chat template will be used.

Returns:

the conversation with the chat template applied, or a list of token ids.

Return type:

Union[str, list]
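
An illustrative call, assuming the common role/content message schema; the exact message format accepted is determined by the underlying chat template.

```python
conversation = [
    {"role": "user", "content": "What does this tokenizer do?"},
    {"role": "assistant", "content": "It maps text to token ids and back."},
]
# chat_template is omitted, so the tokenizer's own template is used.
prompt = tokenizer.apply_chat_template(conversation)
```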

save_pretrained(path: str) → None#

Saves HF tokenizer files.

Parameters:

path (str) – path where to save tokenizer files.
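
Since this method saves HF tokenizer files, the sketch below assumes a Hugging Face-backed tokenizer; the output directory is a placeholder.

```python
tokenizer.save_pretrained("/tmp/exported_tokenizer")  # hypothetical output path
```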

add_special_tokens(special_tokens: Union[list, dict]) → None#

Adds a dictionary of special tokens (eos, pad, cls, …). Tokens are only added if they are not already in the vocabulary; new tokens are indexed starting from the last index of the current vocabulary.

Parameters:

special_tokens (Union[list, dict]) – dict of strings. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens].
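
An illustrative call using keys from the predefined attribute list above; the token strings themselves are placeholders.

```python
tokenizer.add_special_tokens(
    {
        "pad_token": "<pad>",                    # placeholder token string
        "additional_special_tokens": ["<ctx>"],  # placeholder extra tokens
    }
)
# Tokens already in the vocabulary are skipped; new ones are appended
# after the current last vocabulary index.
```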

property additional_special_tokens_ids: list#

Returns a list of the ids of the additional special tokens.

property vocab_size: int#

Returns vocabulary size.

property vocab#

Returns tokenizer vocabulary.

property unique_identifiers: collections.OrderedDict#

Returns a dictionary of unique identifiers.

property pad: int#

Returns id of padding token.

property pad_id: int#

Returns id of padding token. Needed for NeMo.

property eod: int#

Returns id of end of document token.

property bos: int#

Returns id of beginning of sentence token.

property bos_id: int#

Returns id of beginning of sentence token. Needed for NeMo.

property eos_id: int#

Returns id of end of sentence token.

property eos: int#

Returns id of end of sentence token. Needed for legacy support.

property unk: int#

Returns id of unknown token.

property unk_id: int#

Returns id of unknown token. Needed for NeMo.

property mask: int#

Returns id of mask token.

property mask_id: int#

Returns id of mask token. Needed for NeMo.

property cls: int#

Returns id of classification token.

property cls_id: int#

Returns id of classification token. Needed for NeMo.

property sep: int#

Returns id of SEP token.

property sep_id: int#

Returns id of SEP token. Needed for NeMo.

property vocab_file: str#

Returns vocabulary file path if specified.

property merges_file: str#

Returns merges file path if specified.

property inv_vocab: dict#

Returns tokenizer vocab with reversed keys and values.
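
To close, a sketch exercising several of the properties documented above; all printed values depend on the loaded tokenizer.

```python
print(tokenizer.vocab_size)                # size of the vocabulary
print(tokenizer.eos_id, tokenizer.pad_id)  # special-token ids
inv = tokenizer.inv_vocab                  # id -> token mapping
print(inv[tokenizer.eos_id])               # token string for the EOS id
```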