core.tokenizers.megatron_tokenizer#
Module Contents#
Classes#
MegatronTokenizer | Restores model tokenizer.
Functions#
_get_metadata_path | Returns metadata file path.
Data#
TOKENIZER_MAPPING_NAMES
TOKENIZER_LIBRARIES
logger
API#
- core.tokenizers.megatron_tokenizer.TOKENIZER_MAPPING_NAMES#
‘OrderedDict(…)’
- core.tokenizers.megatron_tokenizer.TOKENIZER_LIBRARIES#
['sentencepiece', 'huggingface', 'megatron', 'tiktoken', 'byte-level', 'null']
- core.tokenizers.megatron_tokenizer.logger#
‘getLogger(…)’
- class core.tokenizers.megatron_tokenizer.MegatronTokenizer#
Restores model tokenizer.
Initialization
- from_pretrained(
- metadata_path: Optional[Union[str, dict]] = None,
- **kwargs,
- Parameters:
tokenizer_path (str) – path to the tokenizer file; the folder that contains it is expected to hold metadata.json.
metadata_path (Optional[Union[str, dict]]) – path to the tokenizer metadata. Must be specified when loading the tokenizer from Hugging Face (HF).
- Returns:
tokenizer object.
- Return type:
Usage: MegatronTokenizer.from_pretrained(tokenizer_path='/path/to/tokenizer')
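The signature above accepts `metadata_path` either as a string or as a dict, and the parameter description mentions a `metadata.json` file in the tokenizer folder. The following is a minimal sketch of how such an argument could be resolved into a metadata dictionary. It is not the library's implementation: the helper name `resolve_metadata`, the constant `METADATA_FILE_NAME`, and the fallback to the tokenizer folder are assumptions made for illustration.

```python
import json
import os
from typing import Optional, Union

# Assumed file name, taken from the "metadata.json" mention in the
# parameter description; the real library may use a different name.
METADATA_FILE_NAME = "metadata.json"


def resolve_metadata(
    tokenizer_path: str,
    metadata_path: Optional[Union[str, dict]] = None,
) -> dict:
    """Hypothetical helper: return tokenizer metadata as a dict."""
    if isinstance(metadata_path, dict):
        # Metadata was supplied in memory, so nothing is read from disk.
        return metadata_path
    if metadata_path is None:
        # Assumption: fall back to the folder that holds the tokenizer.
        if os.path.isdir(tokenizer_path):
            folder = tokenizer_path
        else:
            folder = os.path.dirname(tokenizer_path)
        metadata_path = os.path.join(folder, METADATA_FILE_NAME)
    with open(metadata_path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Under these assumptions, passing a dict skips the file system entirely, which may be convenient when the tokenizer itself is fetched from a remote hub and no local metadata file exists.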
- write_metadata(
- tokenizer_path: str,
- tokenizer_library: str,
- model_type: Optional[str] = None,
- tokenizer_class: Optional[megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase] = None,
- chat_template: Optional[str] = None,
- overwrite: Optional[bool] = False,
- metadata_path: Optional[str] = None,
- )
Creates metadata file for tokenizer.
- Parameters:
tokenizer_path (str) – path to tokenizer model.
tokenizer_library (str) – tokenizer model library.
model_type (str) – type of the model to be used with the tokenizer. Available model types: [gpt, bert, t5, mamba, retro, default]. DefaultTokenizerText will be used if model_type is not specified.
tokenizer_class (MegatronTokenizerBase) – pre-defined tokenizer class.
chat_template (str) – tokenizer chat template in Jinja format.
overwrite (bool) – overwrites existing metadata file if set to True.
metadata_path (Optional[str]) – path where metadata file will be saved. If not specified, the metadata file will be stored in the same directory as the tokenizer.
Usage: MegatronTokenizer.write_metadata(tokenizer_path='/path/to/tokenizer/model', tokenizer_library='sentencepiece', model_type='llama')
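The parameter descriptions above can be illustrated with a small standalone sketch that uses only the Python standard library. It is not the library's implementation. The function name `write_metadata_sketch`, the JSON keys, the `metadata.json` file name, and the choice to raise `FileExistsError` when `overwrite` is False are all assumptions. The list of libraries is copied from `TOKENIZER_LIBRARIES` above.

```python
import json
import os
from typing import Optional

# Library names copied from TOKENIZER_LIBRARIES in this module's docs.
TOKENIZER_LIBRARIES = [
    "sentencepiece", "huggingface", "megatron", "tiktoken", "byte-level", "null",
]


def write_metadata_sketch(
    tokenizer_path: str,
    tokenizer_library: str,
    model_type: Optional[str] = None,
    chat_template: Optional[str] = None,
    overwrite: bool = False,
    metadata_path: Optional[str] = None,
) -> str:
    """Hypothetical helper: write a metadata file and return its path."""
    if tokenizer_library not in TOKENIZER_LIBRARIES:
        raise ValueError(f"Unknown tokenizer library: {tokenizer_library!r}")
    if metadata_path is None:
        # Default location: the same directory as the tokenizer.
        if os.path.isdir(tokenizer_path):
            folder = tokenizer_path
        else:
            folder = os.path.dirname(tokenizer_path)
        metadata_path = os.path.join(folder, "metadata.json")  # assumed name
    if os.path.exists(metadata_path) and not overwrite:
        # Assumed behavior: refuse to replace an existing file.
        raise FileExistsError(metadata_path)
    metadata = {  # assumed key names, for illustration only
        "library": tokenizer_library,
        "model_type": model_type,
        "chat_template": chat_template,
    }
    with open(metadata_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return metadata_path
```

In this sketch a second call with the same target fails unless `overwrite=True` is passed, which mirrors the intent of the `overwrite` flag described above.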
- core.tokenizers.megatron_tokenizer._get_metadata_path(tokenizer_path: str) → str#
Returns metadata file path.
- Parameters:
tokenizer_path (str) – path to the tokenizer model.
- Returns:
path to the metadata file.
- Return type:
str