core.tokenizers.megatron_tokenizer#

Module Contents#

Classes#

MegatronTokenizer

Restores model tokenizer.

Functions#

_get_metadata_path

Returns metadata file path.

Data#

API#

core.tokenizers.megatron_tokenizer.TOKENIZER_MAPPING_NAMES#

'OrderedDict(…)'

core.tokenizers.megatron_tokenizer.TOKENIZER_LIBRARIES#

['sentencepiece', 'huggingface', 'megatron', 'tiktoken', 'byte-level', 'null']
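The `TOKENIZER_MAPPING_NAMES` mapping and `TOKENIZER_LIBRARIES` list suggest a name-based registry: a library string selects the concrete tokenizer class. The sketch below illustrates that pattern with hypothetical stand-in classes (the real mapping's entries are not shown in this page); `resolve_tokenizer_class` is an invented helper name, not part of the module.

```python
from collections import OrderedDict

# Hypothetical stand-ins; the real module maps library names to its
# concrete tokenizer wrapper classes.
class SentencePieceTokenizer: ...
class HuggingFaceTokenizer: ...

# Sketch of a registry like TOKENIZER_MAPPING_NAMES (entries are assumed).
TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("sentencepiece", SentencePieceTokenizer),
        ("huggingface", HuggingFaceTokenizer),
    ]
)

# Matches the documented TOKENIZER_LIBRARIES value.
TOKENIZER_LIBRARIES = list(TOKENIZER_MAPPING_NAMES) + [
    "megatron", "tiktoken", "byte-level", "null",
]

def resolve_tokenizer_class(library: str):
    # Reject names outside the registry before lookup.
    if library not in TOKENIZER_MAPPING_NAMES:
        raise ValueError(
            f"Unknown tokenizer library: {library!r}; expected one of {TOKENIZER_LIBRARIES}"
        )
    return TOKENIZER_MAPPING_NAMES[library]
```

An `OrderedDict` keeps registration order stable, which is useful when the first matching entry should win or when the list of supported libraries is derived from the mapping's keys.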

core.tokenizers.megatron_tokenizer.logger#

'getLogger(…)'

class core.tokenizers.megatron_tokenizer.MegatronTokenizer#

Restores model tokenizer.

Initialization

from_pretrained(
metadata_path: Optional[Union[str, dict]] = None,
**kwargs,
) → megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase#
Parameters:
  • path (str) – path to the tokenizer file; a metadata.json must be present in the same folder.

  • metadata_path (Optional[str]) – path to the tokenizer metadata. Must be specified when loading the tokenizer from Hugging Face.

Returns:

tokenizer object.

Return type:

MegatronTokenizerBase

Usage: MegatronTokenizer.from_pretrained(tokenizer_path='/path/to/tokenizer')
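Internally, `from_pretrained` presumably reads the tokenizer's metadata to decide which library-specific class to instantiate. The sketch below illustrates that flow under stated assumptions: `NullTokenizer`, `REGISTRY`, and `from_pretrained_sketch` are hypothetical names, and the metadata.json field name `library` is assumed, not confirmed by this page.

```python
import json
from pathlib import Path

class NullTokenizer:
    """Hypothetical stand-in for a concrete MegatronTokenizerBase subclass."""
    def __init__(self, tokenizer_path, **kwargs):
        self.tokenizer_path = tokenizer_path
        self.kwargs = kwargs

# Hypothetical registry, playing the role of TOKENIZER_MAPPING_NAMES.
REGISTRY = {"null": NullTokenizer}

def from_pretrained_sketch(tokenizer_path: str, **kwargs):
    # Read metadata.json from the tokenizer's folder to learn which
    # library the model was built with, then dispatch to that class.
    metadata_file = Path(tokenizer_path).parent / "metadata.json"
    metadata = json.loads(metadata_file.read_text())
    tokenizer_cls = REGISTRY[metadata["library"]]
    return tokenizer_cls(tokenizer_path=tokenizer_path, **kwargs)
```

This is why the documentation requires metadata.json to sit next to the tokenizer file (or requires an explicit `metadata_path` when loading from Hugging Face): without it, there is no way to pick the right wrapper class.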

write_metadata(
tokenizer_path: str,
tokenizer_library: str,
model_type: Optional[str] = None,
tokenizer_class: Optional[megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase] = None,
chat_template: Optional[str] = None,
overwrite: Optional[bool] = False,
metadata_path: Optional[str] = None,
) → None#

Creates metadata file for tokenizer.

Parameters:
  • tokenizer_path (str) – path to tokenizer model.

  • tokenizer_library (str) – tokenizer model library.

  • model_type (str) – type of the model the tokenizer will be used with. Available model types: [gpt, bert, t5, mamba, retro, default]. If model_type is not specified, DefaultTokenizerText will be used.

  • tokenizer_class (MegatronTokenizerBase) – pre-defined tokenizer class.

  • chat_template (str) – tokenizer chat template in jinja format.

  • overwrite (bool) – overwrites existing metadata file if set to True.

  • metadata_path (Optional[str]) – path where metadata file will be saved. If not specified, the metadata file will be stored in the same directory as the tokenizer.

Usage: MegatronTokenizer.write_metadata(tokenizer_path='/path/to/tokenizer/model', tokenizer_library='sentencepiece', model_type='llama')
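The parameters above describe what ends up in the metadata file: the library, the model type (with a default), an optional chat template, an overwrite guard, and a save location defaulting to the tokenizer's own directory. A minimal re-implementation of that behavior might look like the following; the function name and the exact JSON field names are assumptions for illustration, not the library's actual format.

```python
import json
from pathlib import Path
from typing import Optional

def write_metadata_sketch(
    tokenizer_path: str,
    tokenizer_library: str,
    model_type: Optional[str] = None,
    chat_template: Optional[str] = None,
    overwrite: bool = False,
    metadata_path: Optional[str] = None,
) -> None:
    """Illustrative sketch: persist tokenizer settings as metadata.json."""
    # Default: store metadata.json in the same directory as the tokenizer model.
    if metadata_path is not None:
        target = Path(metadata_path)
    else:
        target = Path(tokenizer_path).parent / "metadata.json"
    # Honor the overwrite flag rather than silently clobbering an existing file.
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} exists; pass overwrite=True to replace it")
    metadata = {
        "library": tokenizer_library,
        "model_type": model_type or "default",  # fall back to the default type
        "chat_template": chat_template,
    }
    target.write_text(json.dumps(metadata, indent=2))
```

The overwrite check mirrors the documented `overwrite` parameter: an existing metadata file is only replaced when explicitly requested.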

core.tokenizers.megatron_tokenizer._get_metadata_path(tokenizer_path: str) → str#

Returns metadata file path.

Parameters:

tokenizer_path (str) – path to the tokenizer model.

Returns:

path to the metadata file.

Return type:

str
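Given the convention stated earlier (metadata.json lives alongside the tokenizer file), a helper like `_get_metadata_path` can be sketched as below. This is an assumption about its behavior, not the actual implementation; in particular, handling a directory argument is a guess.

```python
from pathlib import Path

def get_metadata_path_sketch(tokenizer_path: str) -> str:
    """Illustrative sketch: locate metadata.json for a tokenizer model.

    Assumption: metadata.json sits in the same directory as the tokenizer
    file (or directly inside tokenizer_path when it is a directory).
    """
    p = Path(tokenizer_path)
    folder = p if p.is_dir() else p.parent
    return str(folder / "metadata.json")
```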