bridge.training.tokenizers.config#
Module Contents#
Classes#
TokenizerConfig: Configuration settings for the tokenizer.
API#
- class bridge.training.tokenizers.config.TokenizerConfig#
Configuration settings for the tokenizer.
- legacy_tokenizer: Optional[bool]#
False
Whether to use the Megatron-Bridge legacy tokenizer system.
- metadata_path: Optional[Union[str, dict]]#
None
Path to the tokenizer metadata file, or the metadata itself as a dict.
- vocab_size: Optional[int]#
None
Size of the vocabulary before EOD or padding.
- vocab_file: Optional[str]#
None
Path to the vocab file.
- merge_file: Optional[str]#
None
Path to the BPE merge file.
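For illustration, a GPT-2 BPE setup pairs these two files. This is a minimal sketch assuming `TokenizerConfig` is constructed directly; the file paths are placeholders.

```python
from bridge.training.tokenizers.config import TokenizerConfig

# GPT-2 BPE tokenizer: pairs a vocabulary file with a BPE merge file.
# Both paths below are placeholders.
config = TokenizerConfig(
    tokenizer_type="GPT2BPETokenizer",
    vocab_file="/data/gpt2-vocab.json",
    merge_file="/data/gpt2-merges.txt",
)
```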
- vocab_extra_ids: int#
0
Number of additional vocabulary tokens. They are used for span masking in the T5 model.
- tokenizer_type: Optional[Literal["BertWordPieceLowerCase", "BertWordPieceCase", "GPT2BPETokenizer", "SentencePieceTokenizer", "GPTSentencePieceTokenizer", "HuggingFaceTokenizer", "Llama2Tokenizer", "TikTokenizer", "MultimodalTokenizer", "NullTokenizer"]]#
None
What type of tokenizer to use.
- tokenizer_model: Optional[Union[str, pathlib.Path]]#
None
SentencePiece tokenizer model, or the pretrained_model_name_or_path for a HuggingFace tokenizer.
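For example, selecting a HuggingFace tokenizer is a two-field change; a minimal sketch, where the model identifier is illustrative:

```python
from bridge.training.tokenizers.config import TokenizerConfig

# For HuggingFaceTokenizer, tokenizer_model is interpreted as
# pretrained_model_name_or_path (the identifier below is illustrative).
config = TokenizerConfig(
    tokenizer_type="HuggingFaceTokenizer",
    tokenizer_model="meta-llama/Llama-2-7b-hf",
)
```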
- special_tokens: Optional[list[str]]#
None
List of special tokens. For TikToken, needs to have ["<unk>", "<s>", "</s>"].
- chat_template: Optional[str]#
None
Custom chat template in Jinja format for conversation formatting.
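As a sketch, a toy Jinja template that renders each message as "role: content"; production templates are model-specific, and the tokenizer settings below are illustrative.

```python
from bridge.training.tokenizers.config import TokenizerConfig

# Toy Jinja chat template: renders "role: content" per message.
# Real templates are model-specific; this one is only illustrative.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
)

config = TokenizerConfig(
    tokenizer_type="HuggingFaceTokenizer",
    tokenizer_model="gpt2",  # illustrative model name
    chat_template=CHAT_TEMPLATE,
)
```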
- tiktoken_pattern: Optional[str]#
None
Which tiktoken pattern to use. Options: [v1, v2]
- tiktoken_num_special_tokens: int#
1000
Number of special tokens in the tiktoken tokenizer.
- tiktoken_special_tokens: Optional[list[str]]#
None
List of tiktoken special tokens; needs to have ["<unk>", "<s>", "</s>"].
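Putting the tiktoken fields together, a minimal sketch; the model path is a placeholder.

```python
from bridge.training.tokenizers.config import TokenizerConfig

# TikTokenizer setup: the special-token list must include
# "<unk>", "<s>", and "</s>". The model path is a placeholder.
config = TokenizerConfig(
    tokenizer_type="TikTokenizer",
    tokenizer_model="/data/tokenizer.json",
    tiktoken_pattern="v2",
    tiktoken_num_special_tokens=1000,
    tiktoken_special_tokens=["<unk>", "<s>", "</s>"],
)
```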
- tokenizer_prompt_format: Optional[str]#
None
- image_tag_type: Optional[str]#
None
- hf_tokenizer_kwargs: dict[str, Any] | None#
field(...)
Additional keyword arguments to pass to HuggingFace AutoTokenizer.from_pretrained.
Common options include:
- use_fast (bool): Whether to use the fast tokenizer implementation
- trust_remote_code (bool): Whether to trust remote code when loading the tokenizer
- include_special_tokens (bool): Whether to include special tokens when converting text to ids
.. rubric:: Example
hf_tokenizer_kwargs = {
    "use_fast": True,
    "trust_remote_code": True,
    "include_special_tokens": True,
}
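Passed through the config, the same kwargs reach AutoTokenizer.from_pretrained; a hedged sketch, with an illustrative model name:

```python
from bridge.training.tokenizers.config import TokenizerConfig

# Forward kwargs to AutoTokenizer.from_pretrained via the config
# (model name illustrative).
config = TokenizerConfig(
    tokenizer_type="HuggingFaceTokenizer",
    tokenizer_model="gpt2",
    hf_tokenizer_kwargs={
        "use_fast": True,
        "trust_remote_code": True,
        "include_special_tokens": True,
    },
)
```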
- sp_tokenizer_kwargs: dict[str, Any] | None#
field(...)
Additional keyword arguments to pass to SentencePiece tokenizer.
Common options include:
- legacy (bool): Whether to use the legacy format of the SentencePiece tokenizer
- ignore_extra_whitespaces (bool): Whether to ignore extra whitespace in the input text while encoding
.. rubric:: Example
sp_tokenizer_kwargs = {
    "legacy": True,
    "ignore_extra_whitespaces": False,
}
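In context, a minimal sketch of a SentencePiece configuration; the model path is a placeholder.

```python
from bridge.training.tokenizers.config import TokenizerConfig

# SentencePiece tokenizer with extra keyword arguments.
# The model path below is a placeholder.
config = TokenizerConfig(
    tokenizer_type="SentencePieceTokenizer",
    tokenizer_model="/data/tokenizer.model",
    sp_tokenizer_kwargs={
        "legacy": True,
        "ignore_extra_whitespaces": False,
    },
)
```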