bridge.training.tokenizers.tokenizer#

Megatron tokenizers.

Module Contents#

Functions#

build_tokenizer

Initialize tokenizer from megatron.core.tokenizers based on the provided configuration.

Data#

API#

bridge.training.tokenizers.tokenizer.MEGATRON_TOKENIZERS#

[‘BertWordPieceLowerCase’, ‘BertWordPieceCase’, ‘GPT2BPETokenizer’]

bridge.training.tokenizers.tokenizer.SP_TOKENIZERS#

[‘SentencePieceTokenizer’, ‘GPTSentencePieceTokenizer’, ‘Llama2Tokenizer’]

bridge.training.tokenizers.tokenizer.build_tokenizer(
config: megatron.bridge.training.tokenizers.config.TokenizerConfig,
**kwargs,
) megatron.core.tokenizers.MegatronTokenizer#

Initialize tokenizer from megatron.core.tokenizers based on the provided configuration.

Parameters:

config (TokenizerConfig) – Configuration object specifying the tokenizer type, paths to vocab/model files, and other tokenizer-specific settings.

Returns:

An instance of the initialized tokenizer.

Return type:

MegatronTokenizer