core.tokenizers.base_tokenizer#

Module Contents#

Classes#

MegatronTokenizerBase

Abstract class for Megatron tokenizers.

API#

class core.tokenizers.base_tokenizer.MegatronTokenizerBase(path: str, config: dict, **kwargs)#

Bases: abc.ABC

Abstract class for Megatron tokenizers.

Initialization

Parameters:
  • path (str) – path to the tokenizer model.

  • config (dict) – tokenizer parameters. library (str): tokenizer library. class_name (str): name of tokenizer class. class_path (str): path to tokenizer class. model_type (str): type of the model to be used with tokenizer. chat_template (str): tokenizer chat template.

abstractmethod tokenize()#

Encoding function.

abstractmethod detokenize()#

Decoding function.

abstractmethod vocab()#

Returns tokenizer vocab.

abstractmethod vocab_size()#

Returns tokenizer vocab size.

abstractmethod apply_chat_template()#

Applies tokenizer’s chat template.