bridge.training.tokenizers.config#

Module Contents#

Classes#

TokenizerConfig

Configuration settings for tokenizers.

API#

class bridge.training.tokenizers.config.TokenizerConfig#

Bases: megatron.training.config.TokenizerConfig

Configuration settings for tokenizers.

make_vocab_size_divisible_by: int#

1

Keep MCore tokenizer padding neutral; model providers apply vocab padding.

tensor_model_parallel_size: int#

1

Tensor parallel size used by MCore tokenizer padded vocab-size calculation.
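The way these two fields interact can be illustrated with a small sketch. It assumes the padded vocab size is the original size rounded up to the nearest multiple of `make_vocab_size_divisible_by * tensor_model_parallel_size`, which is our understanding of the MCore helper rather than something stated on this page. The function name `padded_vocab_size` is hypothetical.

```python
def padded_vocab_size(
    orig_vocab_size: int,
    make_vocab_size_divisible_by: int = 1,
    tensor_model_parallel_size: int = 1,
) -> int:
    """Round the vocab size up to a multiple of the two factors' product.

    Assumed rule for illustration; not copied from the library.
    """
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    # Ceiling division using integers only, then scale back up.
    return -(-orig_vocab_size // multiple) * multiple


# With the defaults documented here (1 and 1) the size is left unchanged.
neutral = padded_vocab_size(50257)

# With larger factors the size is rounded up to the next multiple of 1024.
padded = padded_vocab_size(50257, make_vocab_size_divisible_by=128,
                           tensor_model_parallel_size=8)
```

Under this assumption, defaults of 1 make the tokenizer-side calculation a no-op, which is consistent with the note that model providers apply vocab padding themselves.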

rank: int#

0

Distributed rank used by MCore tokenizer helper logging.

hf_tokenizer_kwargs: dict[str, Any] | None#

‘field(…)’

Additional keyword arguments to pass to HuggingFace AutoTokenizer.from_pretrained.

Common options include:

- use_fast (bool): Whether to use the fast tokenizer implementation.
- trust_remote_code (bool): Whether to trust remote code when loading the tokenizer.
- include_special_tokens (bool): Whether to include special tokens when converting text to ids.

Example:

    hf_tokenizer_kwargs = {
        "use_fast": True,
        "trust_remote_code": True,
        "include_special_tokens": True,
    }
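Because the field may be `None`, code that forwards it has to normalize it first. The following is a minimal sketch of that step, not the library's implementation: `forwardable_kwargs` is a hypothetical helper and `"some/model-id"` is a placeholder.

```python
from __future__ import annotations

from typing import Any


def forwardable_kwargs(hf_tokenizer_kwargs: dict[str, Any] | None) -> dict[str, Any]:
    """Return a fresh dict that is safe to unpack with ``**``.

    None is treated as "no extra options". Copying means later edits to the
    result never mutate the dict stored on the config.
    """
    return dict(hf_tokenizer_kwargs or {})


kwargs = forwardable_kwargs({"use_fast": True, "trust_remote_code": True})

# With transformers installed, the forwarding step would look roughly like:
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("some/model-id", **kwargs)
```

Copying the dict is a defensive choice here; whether the real code copies or passes the dict through directly is not stated on this page.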

sp_tokenizer_kwargs: dict[str, Any] | None#

‘field(…)’

Additional keyword arguments to pass to SentencePiece tokenizer.

Common options include:

- legacy (bool): Whether to use the legacy format of the SentencePiece tokenizer.

Example:

    sp_tokenizer_kwargs = {
        "legacy": True,
    }
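Both kwargs fields show their default only as `field(…)`, which is how dataclasses declare defaults for mutable values. The sketch below uses a hypothetical stand-in class, not the real `TokenizerConfig`, and assumes `default_factory=dict` purely for illustration, since the actual default is not shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class StandInTokenizerConfig:
    """Hypothetical stand-in that mirrors only the two kwargs fields."""

    # Assumed defaults for illustration; the real ones are elided as field(…).
    hf_tokenizer_kwargs: dict[str, Any] | None = field(default_factory=dict)
    sp_tokenizer_kwargs: dict[str, Any] | None = field(default_factory=dict)


with_legacy = StandInTokenizerConfig(sp_tokenizer_kwargs={"legacy": True})
plain = StandInTokenizerConfig()

# Each instance gets its own dict, so mutating one never leaks into another.
with_legacy.hf_tokenizer_kwargs["use_fast"] = True
```

A plain `= {}` default would be rejected by `dataclass` for exactly this reason: a single shared dict would be visible to every instance.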

tokenizer_prompt_format: Optional[str]#

None

Prompt format for the tokenizer.

image_tag_type: Optional[str]#

None

Image tag to apply, if any.

force_system_message: Optional[bool]#

False

__post_init__() → None#

Sync with MCore values.