bridge.training.tokenizers.tokenizer
Megatron tokenizers.
Module Contents
Functions
build_tokenizer | Initialize tokenizer from megatron.core.tokenizers based on the provided configuration.
Data
MEGATRON_TOKENIZERS
SP_TOKENIZERS
API
- bridge.training.tokenizers.tokenizer.MEGATRON_TOKENIZERS
['BertWordPieceLowerCase', 'BertWordPieceCase', 'GPT2BPETokenizer']
- bridge.training.tokenizers.tokenizer.SP_TOKENIZERS
['SentencePieceTokenizer', 'GPTSentencePieceTokenizer', 'Llama2Tokenizer']
- bridge.training.tokenizers.tokenizer.build_tokenizer(config: megatron.bridge.training.tokenizers.config.TokenizerConfig, **kwargs)
Initialize tokenizer from megatron.core.tokenizers based on the provided configuration.
- Parameters:
config (TokenizerConfig) – Configuration object specifying the tokenizer type, paths to vocab/model files, and other tokenizer-specific settings.
- Returns:
An instance of the initialized tokenizer.
- Return type:
MegatronTokenizer
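The following is a minimal, self-contained sketch of how a caller might check a configured tokenizer type against the two documented lists before calling build_tokenizer. It does not import the real package. SketchTokenizerConfig, its field names (tokenizer_type, vocab_file, tokenizer_model), and the tokenizer_family helper are hypothetical and introduced only for illustration. The list values are copied from the constants documented above. How build_tokenizer actually uses these lists is not stated on this page, so the grouping shown here is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Values copied from the constants documented above.
MEGATRON_TOKENIZERS = ["BertWordPieceLowerCase", "BertWordPieceCase", "GPT2BPETokenizer"]
SP_TOKENIZERS = ["SentencePieceTokenizer", "GPTSentencePieceTokenizer", "Llama2Tokenizer"]


@dataclass
class SketchTokenizerConfig:
    """Hypothetical stand-in for TokenizerConfig; the field names are assumptions."""

    tokenizer_type: str
    vocab_file: Optional[str] = None
    tokenizer_model: Optional[str] = None


def tokenizer_family(config: SketchTokenizerConfig) -> str:
    """Return which documented list (if any) contains the configured type."""
    if config.tokenizer_type in MEGATRON_TOKENIZERS:
        return "megatron"
    if config.tokenizer_type in SP_TOKENIZERS:
        return "sentencepiece"
    return "other"


config = SketchTokenizerConfig(tokenizer_type="GPT2BPETokenizer")
print(tokenizer_family(config))

# With the real package installed and a real TokenizerConfig instance,
# the documented call would be:
#     tokenizer = build_tokenizer(config)
```

A type name that appears in neither list falls through to "other" rather than raising, because this page does not say whether build_tokenizer accepts additional tokenizer types.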