bridge.training.tokenizers.config#
Module Contents#
Classes#
TokenizerConfig | Configuration settings for tokenizers.
API#
- class bridge.training.tokenizers.config.TokenizerConfig#
Bases: megatron.training.config.TokenizerConfig
Configuration settings for tokenizers.
- make_vocab_size_divisible_by: int#
1
Defaults to 1 so that the MCore tokenizer adds no vocab padding of its own; model providers apply vocab padding instead.
- tensor_model_parallel_size: int#
1
Tensor parallel size used by MCore tokenizer padded vocab-size calculation.
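The two fields above feed a padded vocab-size calculation. A minimal sketch of that arithmetic, assuming the common Megatron-style rule of rounding the vocabulary up to a multiple of `make_vocab_size_divisible_by * tensor_model_parallel_size` (the function name `padded_vocab_size` is hypothetical and is not part of this module):

```python
def padded_vocab_size(
    orig_vocab_size: int,
    make_vocab_size_divisible_by: int = 1,
    tensor_model_parallel_size: int = 1,
) -> int:
    """Round orig_vocab_size up to the nearest multiple of the product.

    Assumption: this mirrors the usual Megatron-style padding rule; the
    actual MCore helper may differ in detail.
    """
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    # Ceiling division using integers only, then scale back up.
    return -(-orig_vocab_size // multiple) * multiple


# With both defaults at 1, every size is already a multiple of 1, so the
# tokenizer leaves the vocab size unchanged ("padding neutral").
unchanged = padded_vocab_size(50257)

# With non-default values the size is rounded up.
rounded = padded_vocab_size(50257, make_vocab_size_divisible_by=128)
```

Under this assumed rule, the defaults of 1 mean the tokenizer reports the raw vocabulary size, which leaves padding decisions to the model provider.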
- rank: int#
0
Distributed rank used by MCore tokenizer helper logging.
- hf_tokenizer_kwargs: dict[str, Any] | None#
‘field(…)’
Additional keyword arguments to pass to HuggingFace AutoTokenizer.from_pretrained.
Common options include:
  - use_fast (bool): Whether to use fast tokenizer implementation
  - trust_remote_code (bool): Whether to trust remote code when loading tokenizer
  - include_special_tokens (bool): Whether to include special tokens when converting text to ids
Example:
hf_tokenizer_kwargs = {"use_fast": True, "trust_remote_code": True, "include_special_tokens": True}
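Because the field is typed `dict[str, Any] | None`, code that forwards it has to handle `None` before unpacking. The sketch below illustrates one way to do that. Everything in it is a stand-in: `SketchTokenizerConfig`, `stand_in_from_pretrained`, and `load_tokenizer` are hypothetical names, the `default_factory=dict` default is an assumption (the real default is shown only as `field(…)`), and no HuggingFace code is called.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SketchTokenizerConfig:
    # Hypothetical stand-in for TokenizerConfig; the default is an assumption.
    hf_tokenizer_kwargs: Optional[dict[str, Any]] = field(default_factory=dict)


def stand_in_from_pretrained(name: str, **kwargs: Any) -> dict[str, Any]:
    # Stand-in for a tokenizer loader: it only records what it received.
    return {"name": name, **kwargs}


def load_tokenizer(name: str, cfg: SketchTokenizerConfig) -> dict[str, Any]:
    # Treat None the same as an empty dict so ** unpacking never fails.
    extra = cfg.hf_tokenizer_kwargs or {}
    return stand_in_from_pretrained(name, **extra)


with_kwargs = load_tokenizer(
    "some-model",
    SketchTokenizerConfig(hf_tokenizer_kwargs={"use_fast": True}),
)
without_kwargs = load_tokenizer(
    "some-model",
    SketchTokenizerConfig(hf_tokenizer_kwargs=None),
)
```

The `or {}` guard keeps the call site uniform whether the caller supplied a dict, an empty dict, or `None`.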
- sp_tokenizer_kwargs: dict[str, Any] | None#
‘field(…)’
Additional keyword arguments to pass to SentencePiece tokenizer.
Common options include:
  - legacy (bool): Whether to use legacy format of sentencepiece tokenizer
Example:
sp_tokenizer_kwargs = {"legacy": True}
- tokenizer_prompt_format: Optional[str]#
None
Prompt format for the tokenizer.
- image_tag_type: Optional[str]#
None
Image tag to apply, if any.
- force_system_message: Optional[bool]#
False
- __post_init__() → None#
Sync with MCore values.