core.tokenizers.base_tokenizer#
Module Contents#
Classes#
Abstract class for Megatron tokenizers. |
API#
- class core.tokenizers.base_tokenizer.MegatronTokenizerBase(path: str, config: dict, **kwargs)#
Bases:
abc.ABCAbstract class for Megatron tokenizers.
Initialization
- Parameters:
path (str) – path to the tokenizer model.
config (dict) – tokenizer parameters. library (str): tokenizer library. class_name (str): name of tokenizer class. class_path (str): path to tokenizer class. model_type (str): type of the model to be used with tokenizer. chat_template (str): tokenizer chat template.
- abstractmethod tokenize()#
Encoding function.
- abstractmethod detokenize()#
Decoding function.
- abstractmethod vocab()#
Returns tokenizer vocab.
- abstractmethod vocab_size()#
Returns tokenizer vocab size.
- abstractmethod apply_chat_template()#
Applies tokenizer’s chat template.