bridge.training.tokenizers.config#

Module Contents#

Classes#

TokenizerConfig

Configuration settings for the tokenizer.

API#

class bridge.training.tokenizers.config.TokenizerConfig#

Configuration settings for the tokenizer.

legacy_tokenizer: Optional[bool]#

False

Whether to use the Megatron-Bridge legacy tokenizer system.

metadata_path: Optional[Union[str | dict]]#

None

Path to the tokenizer metadata file.

vocab_size: Optional[int]#

None

Size of vocab before EOD or padding.

vocab_file: Optional[str]#

None

Path to the vocab file.

merge_file: Optional[str]#

None

Path to the BPE merge file.

vocab_extra_ids: int#

0

Number of additional vocabulary tokens. They are used for span masking in the T5 model.

tokenizer_type: Optional[Literal[BertWordPieceLowerCase, BertWordPieceCase, GPT2BPETokenizer, SentencePieceTokenizer, GPTSentencePieceTokenizer, HuggingFaceTokenizer, Llama2Tokenizer, TikTokenizer, MultimodalTokenizer, NullTokenizer]]#

None

What type of tokenizer to use.

tokenizer_model: Optional[Union[str, pathlib.Path]]#

None

SentencePiece tokenizer model file, or the pretrained_model_name_or_path for a HuggingFace tokenizer.
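As an illustration of how these two fields pair up, here is a minimal sketch assuming TokenizerConfig is constructed directly as a dataclass; the repository id and file path below are placeholders, not values defined by this API.

    from bridge.training.tokenizers.config import TokenizerConfig

    # HuggingFace tokenizer: tokenizer_model carries the pretrained_model_name_or_path.
    hf_config = TokenizerConfig(
        tokenizer_type="HuggingFaceTokenizer",
        tokenizer_model="org/model-name",  # placeholder repo id
    )

    # SentencePiece tokenizer: tokenizer_model points at the trained .model file.
    sp_config = TokenizerConfig(
        tokenizer_type="SentencePieceTokenizer",
        tokenizer_model="/path/to/tokenizer.model",  # placeholder path
    )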

special_tokens: Optional[list[str]]#

None

List of special tokens. For TikToken, this needs to include ["&lt;unk&gt;", "&lt;s&gt;", "&lt;/s&gt;"].

chat_template: Optional[str]#

None

Custom chat template in Jinja format for conversation formatting.

tiktoken_pattern: Optional[str]#

None

Which tiktoken pattern to use. Options: [v1, v2]

tiktoken_num_special_tokens: int#

1000

Number of special tokens in the TikToken tokenizer.

tiktoken_special_tokens: Optional[list[str]]#

None

List of TikToken special tokens; needs to include ["&lt;unk&gt;", "&lt;s&gt;", "&lt;/s&gt;"].
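For the TikToken-specific fields, a hedged sketch under the same direct-construction assumption; the tokenizer file path is a placeholder and the special-token list mirrors the requirement stated above.

    from bridge.training.tokenizers.config import TokenizerConfig

    # TikToken setup: vocabulary file, pattern version, and the required special tokens.
    tiktoken_config = TokenizerConfig(
        tokenizer_type="TikTokenizer",
        tokenizer_model="/path/to/tokenizer.json",  # placeholder path
        tiktoken_pattern="v2",
        tiktoken_num_special_tokens=1000,
        tiktoken_special_tokens=["<unk>", "<s>", "</s>"],
    )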

tokenizer_prompt_format: Optional[str]#

None

image_tag_type: Optional[str]#

None

hf_tokenizer_kwargs: dict[str, Any] | None#

'field(...)'

Additional keyword arguments to pass to HuggingFace AutoTokenizer.from_pretrained.

Common options include:

- use_fast (bool): Whether to use the fast tokenizer implementation
- trust_remote_code (bool): Whether to trust remote code when loading the tokenizer
- include_special_tokens (bool): Whether to include special tokens when converting text to ids

Example:

    hf_tokenizer_kwargs = {
        "use_fast": True,
        "trust_remote_code": True,
        "include_special_tokens": True,
    }

sp_tokenizer_kwargs: dict[str, Any] | None#

'field(...)'

Additional keyword arguments to pass to SentencePiece tokenizer.

Common options include:

- legacy (bool): Whether to use the legacy format of the SentencePiece tokenizer
- ignore_extra_whitespaces (bool): Whether to ignore extra whitespaces in the input text while encoding

Example:

    sp_tokenizer_kwargs = {
        "legacy": True,
        "ignore_extra_whitespaces": False,
    }
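To tie the two kwargs fields back into a full configuration, a minimal sketch assuming direct dataclass construction; the repository id and file path are placeholders, not values defined by this API.

    from bridge.training.tokenizers.config import TokenizerConfig

    # HuggingFace tokenizer with loader kwargs forwarded to AutoTokenizer.from_pretrained.
    hf_config = TokenizerConfig(
        tokenizer_type="HuggingFaceTokenizer",
        tokenizer_model="org/model-name",  # placeholder repo id
        hf_tokenizer_kwargs={"use_fast": True, "trust_remote_code": True},
    )

    # SentencePiece tokenizer with kwargs forwarded to the SentencePiece wrapper.
    sp_config = TokenizerConfig(
        tokenizer_type="SentencePieceTokenizer",
        tokenizer_model="/path/to/tokenizer.model",  # placeholder path
        sp_tokenizer_kwargs={"legacy": True, "ignore_extra_whitespaces": False},
    )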