bridge.training.tokenizers.tokenizer#
Megatron tokenizers.
Module Contents#
Classes#
- MegatronTokenizer: Base tokenizer class, extending the MegatronTokenizer from Megatron Core.
- _BertWordPieceTokenizer: Original BERT wordpiece tokenizer adapted for Megatron.
- _GPT2BPETokenizer: Original GPT-2 BPE tokenizer adapted for Megatron.
- _SentencePieceTokenizer: A wrapper for SentencePiece tokenizers used with Megatron.
- _GPTSentencePieceTokenizer: A specialized SentencePiece tokenizer for GPT-style models.
- _Llama2Tokenizer: A tokenizer for Llama-2 style models, using SentencePiece; inherits from _SentencePieceTokenizer.
- CustomTikTokenizer: A custom tokenizer using the Tiktoken library with a NeMo-style vocabulary file (processed by reload_mergeable_ranks).
- _NullTokenizer: A simple tokenizer that splits text by spaces and converts tokens to integers, intended for testing or placeholder use where real linguistic tokenization is not required.
Functions#
- build_tokenizer: Initialize a tokenizer based on the provided configuration.
- _vocab_size_with_padding: Pad the vocabulary size so that it is divisible by the model parallel size while remaining GPU friendly.
- reload_mergeable_ranks: Reload a tokenizer vocabulary from a JSON file (NeMo format) and convert it into the mergeable-ranks format required by Tiktoken. The input JSON file is expected to be a list of dictionaries, each with “rank”, “token_bytes” (base64 encoded), and “token_str” keys.
API#
- class bridge.training.tokenizers.tokenizer.MegatronTokenizer#
Bases:
megatron.core.datasets.megatron_tokenizer.MegatronTokenizer
Base tokenizer class, extending the MegatronTokenizer from megatron core.
This class provides a common interface for various tokenizers used within the NeMo framework.
- __call__(*args, **kwargs)#
Makes the tokenizer instance callable; a synonym for tokenize().
- text_to_ids(text: str) → list[int]#
Converts text to a list of token IDs.
- property eod_id#
ID for the end-of-document token.
- property bos_id#
ID for the beginning-of-sentence token.
- property eos_id#
ID for the end-of-sentence token.
- property mask_id#
ID for the mask token.
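A minimal usage sketch of the common interface (hedged: the import path follows this module's documented location, and _NullTokenizer is used here only because it needs no vocabulary files; any concrete subclass exposes the same interface):

```python
from megatron.bridge.training.tokenizers.tokenizer import _NullTokenizer

tok = _NullTokenizer(vocab_size=16)

print(tok("1 2 3"))             # __call__ ...
print(tok.tokenize("1 2 3"))    # ... is a synonym for tokenize
print(tok.text_to_ids("1 2 3")) # also returns a list of token IDs
```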
- bridge.training.tokenizers.tokenizer.build_tokenizer(
- tokenizer_config: megatron.bridge.training.tokenizers.config.TokenizerConfig,
- make_vocab_size_divisible_by: int,
- tensor_model_parallel_size: int,
- **kwargs,
)#
Initialize tokenizer based on the provided configuration.
This function serves as a factory to instantiate various tokenizer types supported by NeMo, such as BERT, GPT2, SentencePiece, HuggingFace, etc. It also handles padding the vocabulary size to be GPU-friendly.
- Parameters:
tokenizer_config (TokenizerConfig) – Configuration object specifying the tokenizer type, paths to vocab/model files, and other tokenizer-specific settings.
make_vocab_size_divisible_by (int) – Ensures the vocabulary size is a multiple of this value.
tensor_model_parallel_size (int) – The tensor model parallel size, used for further adjusting vocabulary size for distributed training.
**kwargs – Additional keyword arguments that might be specific to certain tokenizers (e.g., passed to HuggingFace AutoTokenizer).
- Returns:
An instance of the initialized tokenizer.
- Return type:
MegatronTokenizer
- Raises:
NotImplementedError – If the specified tokenizer_type in tokenizer_config is not supported.
ImportError – If a required library (e.g., transformers for MultimodalTokenizer) is not installed.
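A minimal sketch of calling the factory (hedged: the TokenizerConfig field names tokenizer_type, vocab_file, and merge_file are assumptions based on common Megatron conventions; check the config class for the exact attributes):

```python
from megatron.bridge.training.tokenizers.config import TokenizerConfig
from megatron.bridge.training.tokenizers.tokenizer import build_tokenizer

# Hypothetical field values; any supported tokenizer_type works the same way.
config = TokenizerConfig(
    tokenizer_type="GPT2BPETokenizer",
    vocab_file="/path/to/vocab.json",
    merge_file="/path/to/merges.txt",
)

tokenizer = build_tokenizer(
    tokenizer_config=config,
    make_vocab_size_divisible_by=128,
    tensor_model_parallel_size=2,
)

ids = tokenizer.tokenize("Hello world")
print(ids, tokenizer.detokenize(ids))
```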
- bridge.training.tokenizers.tokenizer._vocab_size_with_padding(
- orig_vocab_size: int,
- make_vocab_size_divisible_by: int,
- tensor_model_parallel_size: int,
- logging_enabled: bool = True,
)#
Pad the vocabulary size so that it is divisible by the model parallel size while remaining GPU friendly.
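The padding amounts to rounding up to the nearest GPU-friendly multiple; a sketch of the standard Megatron behavior (the actual helper may differ in details such as logging):

```python
def padded_vocab_size(orig_vocab_size: int,
                      make_vocab_size_divisible_by: int,
                      tensor_model_parallel_size: int) -> int:
    # Round up to the nearest multiple of
    # make_vocab_size_divisible_by * tensor_model_parallel_size.
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple

# Example: GPT-2's 50,257-token vocab, divisible-by 128, tensor parallel size 2
# -> multiples of 256 -> padded to 50,432.
print(padded_vocab_size(50257, 128, 2))  # 50432
```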
- class bridge.training.tokenizers.tokenizer._HuggingFaceTokenizer(pretrained_model_name_or_path, **kwargs)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
- property vocab_size#
Returns the size of the vocabulary.
- property vocab#
Returns the vocabulary (token to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token mapping).
- property decoder#
Alias for inv_vocab, for compatibility.
- tokenize(text, **kwargs)#
Tokenizes a string of text into a list of token IDs.
- detokenize(token_ids, **kwargs)#
Converts a list of token IDs back into a string.
- offsets(ids: list[int], text: str) → list[int]#
Calculates the character offsets for each token ID in the given text.
- property eod#
Returns the end-of-document token ID.
- property bos#
Returns the beginning-of-sentence token ID.
- property eos#
Returns the end-of-sentence token ID.
- property mask#
Returns the mask token ID.
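A usage sketch (hedged: the import path is assumed from this module's documented location, and "gpt2" is just an example model name; any pretrained name or local path accepted by the HuggingFace AutoTokenizer should work):

```python
from megatron.bridge.training.tokenizers.tokenizer import _HuggingFaceTokenizer

tok = _HuggingFaceTokenizer("gpt2")

ids = tok.tokenize("Megatron tokenizers")
print(ids)
print(tok.detokenize(ids))
print(tok.vocab_size, tok.eod)
```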
- class bridge.training.tokenizers.tokenizer._BertWordPieceTokenizer(
- vocab_file,
- lower_case=True,
- vocab_extra_ids=0,
)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
Original BERT wordpiece tokenizer adapted for Megatron.
This tokenizer uses the FullBertTokenizer from bert_tokenization. It handles lower/upper casing and adds special tokens like [CLS], [SEP], [PAD], [MASK], [BOS], and [EOS]. It also supports adding extra vocabulary IDs.
- Parameters:
vocab_file (str) – Path to the BERT vocabulary file.
lower_case (bool, optional) – Whether to convert text to lower case. Defaults to True.
vocab_extra_ids (int, optional) – Number of extra IDs to add to the vocabulary, often used for sentinel tokens in T5-style models. Defaults to 0.
Initialization
- add_token(token)#
Adds a single token to the vocabulary if it doesn’t already exist.
- add_additional_special_tokens(tokens_list)#
Adds a list of special tokens to the vocabulary.
- property vocab_size#
Returns the current size of the vocabulary.
- property vocab#
Returns the vocabulary (token to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token mapping).
- tokenize(text)#
Tokenizes a string of text into a list of token IDs.
- decode(ids)#
Converts a list of token IDs back to a string, cleaning up ## prefixes.
- detokenize(token_ids)#
Converts a list of token IDs back to a string. Alias for decode().
- decode_token_ids(token_ids)#
Converts token IDs to a string, excluding [PAD] and [CLS] and handling ## prefixes.
- property cls#
Returns the [CLS] token ID.
- property sep#
Returns the [SEP] token ID.
- property pad#
Returns the [PAD] token ID.
- property mask#
Returns the [MASK] token ID.
- property bos#
Returns the beginning-of-sentence ([BOS]) token ID.
- property eos#
Returns the end-of-sentence token ID.
- property eod#
Alias for eos, as BERT models typically use EOS for end-of-document.
- property bos_token#
Returns the beginning-of-sentence token string ([BOS]).
- property eos_token#
Returns the end-of-sentence token string ([EOS]).
- property additional_special_tokens#
Returns a list of additional special token strings added to the tokenizer.
- property additional_special_tokens_ids#
Returns a list of IDs for the additional special tokens.
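A usage sketch (hedged: the import path and vocabulary path are placeholders):

```python
from megatron.bridge.training.tokenizers.tokenizer import _BertWordPieceTokenizer

tok = _BertWordPieceTokenizer("/path/to/bert-vocab.txt", lower_case=True)

ids = tok.tokenize("tokenization example")
print(ids)
print(tok.detokenize(ids))         # "##" wordpiece prefixes are cleaned up
print(tok.cls, tok.sep, tok.mask)  # IDs of [CLS], [SEP], [MASK]
```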
- class bridge.training.tokenizers.tokenizer._GPT2BPETokenizer(vocab_file, merge_file)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
Original GPT-2 BPE tokenizer adapted for Megatron.
This tokenizer uses the GPT2Tokenizer from gpt2_tokenization. It handles BPE tokenization based on a vocabulary file and a merges file. The primary special token is <|endoftext|>.
- Parameters:
vocab_file (str) – Path to the GPT-2 vocabulary file (e.g., vocab.json).
merge_file (str) – Path to the GPT-2 merges file (e.g., merges.txt).
Initialization
- property vocab_size#
Returns the size of the vocabulary.
- property vocab#
Returns the vocabulary (token to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token mapping).
- tokenize(text)#
Tokenizes a string of text into a list of token IDs.
- detokenize(token_ids)#
Converts a list of token IDs back into a string.
- property eod#
Returns the end-of-document (<|endoftext|>) token ID.
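A usage sketch (hedged: the import path and file paths are placeholders):

```python
from megatron.bridge.training.tokenizers.tokenizer import _GPT2BPETokenizer

tok = _GPT2BPETokenizer("/path/to/vocab.json", "/path/to/merges.txt")

ids = tok.tokenize("Hello world")
print(ids)
print(tok.detokenize(ids))
print(tok.eod)  # ID of <|endoftext|>
```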
- class bridge.training.tokenizers.tokenizer._SentencePieceTokenizer(model_file, vocab_extra_ids=0)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
A wrapper for SentencePiece tokenizers used with Megatron.
This class interfaces with a pre-trained SentencePiece model. It defines and manages several special tokens such as <CLS>, <SEP>, <EOD>, <MASK>, <PAD>, <BOS>, and <EOS>. It also supports adding extra vocabulary IDs, typically for T5-style sentinel tokens.
- Parameters:
model_file (str) – Path to the SentencePiece model file (e.g., tokenizer.model).
vocab_extra_ids (int, optional) – Number of extra IDs to add to the vocabulary. Defaults to 0.
Initialization
- _populate_vocab()#
- _initalize(vocab_extra_ids)#
- property vocab_size#
Returns the current size of the vocabulary, including added special tokens.
- property vocab#
Returns the vocabulary (token to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token mapping).
- property decoder#
Alias for inv_vocab.
- property encoder#
Alias for vocab.
- tokenize(text)#
Tokenizes a string, handling special tokens separately.
This method first finds occurrences of special tokens (defined during initialization) and tokenizes the text segments around them using the SentencePiece model. Special tokens are inserted as their pre-defined IDs.
- Parameters:
text (str) – The input string to tokenize.
- Returns:
A list of token IDs.
- Return type:
list[int]
- detokenize(ids)#
Converts a list of token IDs back to a string, handling special tokens.
This method reconstructs the text by decoding segments of regular token IDs using the SentencePiece model and inserting the string representations of special tokens where their IDs appear.
- Parameters:
ids (list[int]) – A list of token IDs.
- Returns:
The detokenized string.
- Return type:
str
- offsets(ids: list[int], text: str) → list[int]#
Calculates the character starting offsets for each token ID.
- property cls#
Returns the <CLS> token ID.
- property sep#
Returns the <SEP> token ID.
- property pad#
Returns the padding token ID (e.g., <PAD>).
- property bos#
Returns the beginning-of-sentence token ID (e.g., <BOS>).
- property eod#
Returns the end-of-document (<EOD>) token ID.
- property eos#
Returns the end-of-sentence token ID (e.g., <EOS>).
- property mask#
Returns the <MASK> token ID.
- property additional_special_tokens_ids#
Returns a list of IDs for T5-style <extra_id_*> sentinel tokens.
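A usage sketch of the special-token-aware round trip (hedged: the model path is a placeholder and the <EOD> spelling assumes the special tokens named above):

```python
from megatron.bridge.training.tokenizers.tokenizer import _SentencePieceTokenizer

tok = _SentencePieceTokenizer("/path/to/tokenizer.model", vocab_extra_ids=0)

# Segments around a special token are encoded with the SentencePiece model,
# while the special token itself is inserted as its reserved ID.
ids = tok.tokenize("first document<EOD>second document")
print(ids)
print(tok.detokenize(ids))  # the special token's string form is restored
```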
- class bridge.training.tokenizers.tokenizer._GPTSentencePieceTokenizer(model_file)#
Bases:
bridge.training.tokenizers.tokenizer._SentencePieceTokenizer
A specialized SentencePiece tokenizer for GPT-style models.
This class inherits from _SentencePieceTokenizer but simplifies the special token handling. It primarily uses the BOS, EOS, and PAD IDs defined by the SentencePiece model itself, without adding extra tokens like <CLS>, <SEP>, etc. The eod (end-of-document) token is mapped to the eos_id.
- Parameters:
model_file (str) – Path to the SentencePiece model file.
Initialization
- _initalize(vocab_extra_ids)#
- tokenize(text)#
Tokenizes a string of text directly using SentencePiece encode_as_ids.
- detokenize(ids)#
Converts a list of token IDs back to a string using SentencePiece decode_ids.
- property cls#
Returns -1 as [CLS] is not typically used in this tokenizer.
- property sep#
Returns -1 as [SEP] is not typically used in this tokenizer.
- property mask#
Returns -1 as [MASK] is not typically used in this tokenizer.
- property eod#
Returns the end-of-sentence token ID, used as end-of-document.
- property additional_special_tokens_ids#
Returns None as no additional special tokens are added by default.
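A usage sketch (hedged: the import path and model path are placeholders):

```python
from megatron.bridge.training.tokenizers.tokenizer import _GPTSentencePieceTokenizer

tok = _GPTSentencePieceTokenizer("/path/to/tokenizer.model")

ids = tok.tokenize("A GPT-style prompt")
print(ids)
print(tok.detokenize(ids))
print(tok.eod)  # the SentencePiece EOS ID doubles as end-of-document
```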
- class bridge.training.tokenizers.tokenizer._Llama2Tokenizer(model_file)#
Bases:
bridge.training.tokenizers.tokenizer._SentencePieceTokenizer
A tokenizer specifically for Llama-2 style models, using SentencePiece.
This class inherits from _SentencePieceTokenizer and is configured for Llama-2’s specific use of BOS and EOS tokens. It uses the BOS/EOS/PAD IDs directly from the SentencePiece model.
- Parameters:
model_file (str) – Path to the SentencePiece model file for Llama-2.
Initialization
- tokenize(s: str, bos=True, eos=False)#
Tokenizes a string, with options to add BOS and EOS tokens.
- Parameters:
s (str) – The input string to tokenize.
bos (bool, optional) – Whether to prepend the BOS token. Defaults to True.
eos (bool, optional) – Whether to append the EOS token. Defaults to False.
- Returns:
A list of token IDs.
- Return type:
list[int]
- detokenize(ids)#
Converts a list of token IDs back into a string.
- property cls#
Returns -1 as [CLS] is not typically used in this tokenizer.
- property sep#
Returns -1 as [SEP] is not typically used in this tokenizer.
- property mask#
Returns -1 as [MASK] is not typically used in this tokenizer.
- property eod#
Returns the end-of-sentence token ID, used as end-of-document.
- property additional_special_tokens_ids#
Returns None as no additional special tokens are added by default.
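A usage sketch of the BOS/EOS options (hedged: the import path and model path are placeholders):

```python
from megatron.bridge.training.tokenizers.tokenizer import _Llama2Tokenizer

tok = _Llama2Tokenizer("/path/to/llama2/tokenizer.model")

prompt_ids = tok.tokenize("Hello world")                    # BOS prepended by default
full_ids = tok.tokenize("Hello world", bos=True, eos=True)  # BOS and EOS
print(prompt_ids)
print(full_ids)
print(tok.detokenize(full_ids))
```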
- bridge.training.tokenizers.tokenizer.reload_mergeable_ranks(
- path: str,
- max_vocab: Optional[int] = None,
)#
Reloads a tokenizer vocabulary from a JSON file (NeMo format) and converts it into the mergeable ranks format required by Tiktoken. The input JSON file is expected to be a list of dictionaries, each with “rank”, “token_bytes” (base64 encoded), and “token_str” keys.
- Parameters:
path (str) – Path to the JSON vocabulary file.
max_vocab (Optional[int], optional) – If provided, truncates the vocabulary to this maximum size. Defaults to None.
- Returns:
A dictionary mapping token bytes to their ranks.
- Return type:
Dict[bytes, int]
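A conceptual sketch of the conversion (the real function may perform additional validation or logging; this only illustrates the documented input and output formats):

```python
import base64
import json
from typing import Dict, Optional


def reload_mergeable_ranks_sketch(path: str, max_vocab: Optional[int] = None) -> Dict[bytes, int]:
    # The file is a JSON list of {"rank": int, "token_bytes": base64 str, "token_str": str}.
    with open(path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    if max_vocab is not None:
        entries = entries[:max_vocab]  # truncate to the requested vocabulary size
    # Tiktoken expects a mapping from raw token bytes to rank.
    return {base64.b64decode(e["token_bytes"]): int(e["rank"]) for e in entries}
```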
- class bridge.training.tokenizers.tokenizer.CustomTikTokenizer(
- path: str,
- pattern: str,
- vocab_size: Optional[int],
- num_special_tokens: int,
- special_tokens: Optional[List[str]],
)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
A custom tokenizer using the Tiktoken library with a NeMo-style vocabulary file.
This tokenizer loads a vocabulary from a JSON file (processed by reload_mergeable_ranks) and uses it with Tiktoken for encoding and decoding. It supports a configurable number of special tokens, which are placed at the beginning of the vocabulary ID space.
- Parameters:
path (str) – Path to the JSON vocabulary file (NeMo format).
pattern (str) – The regex pattern string for Tiktoken.
vocab_size (Optional[int]) – The target vocabulary size. If None, defaults to 2^17.
num_special_tokens (int) – The total number of special tokens to reserve.
special_tokens (Optional[List[str]]) – A list of initial special token strings. Must include “<unk>”, “<s>”, and “</s>”. If shorter than num_special_tokens, it will be padded with “<SPECIAL_id>”.
Initialization
- property bos: int#
Returns the beginning-of-sentence (<s>) token ID.
- property eos: int#
Returns the end-of-sentence (</s>) token ID.
- property unk: int#
Returns the unknown (<unk>) token ID.
- property eod: int#
Returns the end-of-document token ID (same as EOS for this tokenizer).
- property vocab#
Returns the vocabulary (token string/bytes to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token string/bytes mapping).
- tokenize(
- s: str,
- bos: bool = False,
- eos: bool = False,
)#
Tokenizes a string, with options to add BOS and EOS tokens.
- Parameters:
s (str) – The input string to tokenize.
bos (bool, optional) – Whether to prepend the BOS token. Defaults to False.
eos (bool, optional) – Whether to append the EOS token. Defaults to False.
- Returns:
A list of token IDs.
- Return type:
List[int]
- detokenize(tokens: List[int]) → str#
Converts a list of token IDs back into a string.
- offsets(ids: list[int], text: str) → list[int]#
Calculates the character starting offsets for each token ID.
- property vocab_size: int#
Returns the total vocabulary size, including special tokens.
- property encoder#
Alias for vocab.
- property decoder#
Alias for inv_vocab.
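A usage sketch (hedged: the vocabulary path is a placeholder, the split pattern shown is a cl100k-style Tiktoken regex used only as an example, and the special-token spellings and counts are illustrative):

```python
from megatron.bridge.training.tokenizers.tokenizer import CustomTikTokenizer

PATTERN = (
    r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
)

tok = CustomTikTokenizer(
    path="/path/to/nemo_vocab.json",
    pattern=PATTERN,
    vocab_size=131072,                        # 2**17, the documented default
    num_special_tokens=1000,
    special_tokens=["<unk>", "<s>", "</s>"],
)

ids = tok.tokenize("Hello world", bos=True, eos=True)
print(ids)
print(tok.detokenize(ids))
```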
- class bridge.training.tokenizers.tokenizer._NullTokenizer(vocab_size)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
A simple tokenizer that splits text by spaces and converts tokens to integers. This tokenizer is primarily for testing or placeholder purposes where actual linguistic tokenization is not required. It assumes tokens are space-separated integers.
- Parameters:
vocab_size (int) – The vocabulary size, excluding the EOD token. The EOD token will be assigned vocab_size as its ID.
Initialization
- tokenize(text)#
Tokenizes by splitting the string by spaces and converting parts to integers.
- detokenize(ids)#
Converts a list of integer IDs back to a space-separated string.
- offsets(ids: list[int], text: str) → list[int]#
Calculates character offsets, assuming space-separated integer tokens.
- property vocab_size#
Returns the vocabulary size, including the EOD token.
- abstract property vocab#
Not implemented for NullTokenizer.
- abstract property inv_vocab#
Not implemented for NullTokenizer.
- property cls#
Returns -1 as [CLS] is not used.
- property sep#
Returns -1 as [SEP] is not used.
- property mask#
Returns -1 as [MASK] is not used.
- property eod#
Returns the end-of-document token ID.
- property additional_special_tokens_ids#
Returns None as no additional special tokens are used.
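A usage sketch (hedged: the import path is assumed from this module's documented location):

```python
from megatron.bridge.training.tokenizers.tokenizer import _NullTokenizer

tok = _NullTokenizer(vocab_size=1000)

ids = tok.tokenize("12 7 42")  # space-separated integers in ...
print(ids)                     # ... integer IDs out: [12, 7, 42]
print(tok.detokenize(ids))     # "12 7 42"
print(tok.eod)                 # 1000: the EOD token takes vocab_size as its ID
print(tok.vocab_size)          # 1001, including the EOD token
```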