bridge.training.tokenizers.tokenizer#
Megatron tokenizers.
Module Contents#
Classes#
- MegatronTokenizer: Base tokenizer class, extending the MegatronTokenizer from Megatron Core.
- _BertWordPieceTokenizer: Original BERT wordpiece tokenizer adapted for Megatron.
- _GPT2BPETokenizer: Original GPT-2 BPE tokenizer adapted for Megatron.
- _SentencePieceTokenizer: A wrapper for SentencePiece tokenizers used with Megatron.
- _GPTSentencePieceTokenizer: A specialized SentencePiece tokenizer for GPT-style models.
- _Llama2Tokenizer: A tokenizer for Llama-2 style models, using SentencePiece; inherits from _SentencePieceTokenizer.
- CustomTikTokenizer: A custom tokenizer using the Tiktoken library with a NeMo-style vocabulary file (processed by reload_mergeable_ranks).
- _NullTokenizer: A simple tokenizer that splits text by spaces and converts tokens to integers, intended for testing or placeholder use where real linguistic tokenization is not required.
Functions#
- build_tokenizer: Initialize a tokenizer based on the provided configuration.
- _vocab_size_with_padding: Pad the vocabulary size so that it is divisible by the model parallel size while remaining GPU friendly.
- reload_mergeable_ranks: Reload a tokenizer vocabulary from a JSON file (NeMo format) and convert it into the mergeable-ranks format required by Tiktoken. The input JSON file is expected to be a list of dictionaries, each with “rank”, “token_bytes” (base64 encoded), and “token_str” keys.
API#
- class bridge.training.tokenizers.tokenizer.MegatronTokenizer#
Bases:
megatron.core.datasets.megatron_tokenizer.MegatronTokenizer
Base tokenizer class, extending the MegatronTokenizer from megatron core.
This class provides a common interface for various tokenizers used within the NeMo framework.
- __call__(*args, **kwargs)#
Makes the tokenizer instance callable; a synonym for tokenize().
- text_to_ids(text: str) → list[int]#
Converts text to a list of token IDs.
- property eod_id#
ID for the end-of-document token.
- property bos_id#
ID for the beginning-of-sentence token.
- property eos_id#
ID for the end-of-sentence token.
- property mask_id#
ID for the mask token.
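A minimal usage sketch of the common interface (hedged: the import path follows this module's documented location, and _NullTokenizer is used here only because it needs no vocabulary files; any concrete subclass exposes the same interface):

```python
from megatron.bridge.training.tokenizers.tokenizer import _NullTokenizer

tok = _NullTokenizer(vocab_size=16)

print(tok("1 2 3"))             # __call__ ...
print(tok.tokenize("1 2 3"))    # ... is a synonym for tokenize
print(tok.text_to_ids("1 2 3")) # also returns a list of token IDs
```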
- bridge.training.tokenizers.tokenizer.build_tokenizer(
- tokenizer_config: megatron.bridge.training.tokenizers.config.TokenizerConfig,
- make_vocab_size_divisible_by: int,
- tensor_model_parallel_size: int,
- **kwargs,
)#
Initialize tokenizer based on the provided configuration.
This function serves as a factory to instantiate various tokenizer types supported by NeMo, such as BERT, GPT2, SentencePiece, HuggingFace, etc. It also handles padding the vocabulary size to be GPU-friendly.
- Parameters:
tokenizer_config (TokenizerConfig) – Configuration object specifying the tokenizer type, paths to vocab/model files, and other tokenizer-specific settings.
make_vocab_size_divisible_by (int) – Ensures the vocabulary size is a multiple of this value.
tensor_model_parallel_size (int) – The tensor model parallel size, used for further adjusting vocabulary size for distributed training.
**kwargs – Additional keyword arguments that might be specific to certain tokenizers (e.g., passed to HuggingFace AutoTokenizer).
- Returns:
An instance of the initialized tokenizer.
- Return type:
MegatronTokenizer
- Raises:
NotImplementedError – If the specified tokenizer_type in tokenizer_config is not supported.
ImportError – If a required library (e.g., transformers for MultimodalTokenizer) is not installed.
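A minimal sketch of calling the factory (hedged: the TokenizerConfig field names tokenizer_type, vocab_file, and merge_file are assumptions based on common Megatron conventions; check the config class for the exact attributes):

```python
from megatron.bridge.training.tokenizers.config import TokenizerConfig
from megatron.bridge.training.tokenizers.tokenizer import build_tokenizer

# Hypothetical field values; any supported tokenizer_type works the same way.
config = TokenizerConfig(
    tokenizer_type="GPT2BPETokenizer",
    vocab_file="/path/to/vocab.json",
    merge_file="/path/to/merges.txt",
)

tokenizer = build_tokenizer(
    tokenizer_config=config,
    make_vocab_size_divisible_by=128,
    tensor_model_parallel_size=2,
)

ids = tokenizer.tokenize("Hello world")
print(ids, tokenizer.detokenize(ids))
```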
- bridge.training.tokenizers.tokenizer._vocab_size_with_padding(
- orig_vocab_size: int,
- make_vocab_size_divisible_by: int,
- tensor_model_parallel_size: int,
- logging_enabled: bool = True,
)#
Pad the vocabulary size so that it is divisible by the model parallel size while remaining GPU friendly.
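The padding amounts to rounding up to the nearest GPU-friendly multiple; a sketch of the standard Megatron behavior (the actual helper may differ in details such as logging):

```python
def padded_vocab_size(orig_vocab_size: int,
                      make_vocab_size_divisible_by: int,
                      tensor_model_parallel_size: int) -> int:
    # Round up to the nearest multiple of
    # make_vocab_size_divisible_by * tensor_model_parallel_size.
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple

# Example: GPT-2's 50,257-token vocab, divisible-by 128, tensor parallel size 2
# -> multiples of 256 -> padded to 50,432.
print(padded_vocab_size(50257, 128, 2))  # 50432
```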
- class bridge.training.tokenizers.tokenizer._HuggingFaceTokenizer(pretrained_model_name_or_path, **kwargs)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
- property vocab_size#
Returns the size of the vocabulary.
- property vocab#
Returns the vocabulary (token to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token mapping).
- property decoder#
Alias for inv_vocab, for compatibility.
- tokenize(text, **kwargs)#
Tokenizes a string of text into a list of token IDs.
- detokenize(token_ids, **kwargs)#
Converts a list of token IDs back into a string.
- offsets(ids: list[int], text: str) → list[int]#
Calculates the character offsets for each token ID in the given text.
- property eod#
Returns the end-of-document token ID.
- property bos#
Returns the beginning-of-sentence token ID.
- property eos#
Returns the end-of-sentence token ID.
- property mask#
Returns the mask token ID.
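A usage sketch (hedged: the import path is assumed from this module's documented location, and "gpt2" is just an example model name; any pretrained name or local path accepted by the HuggingFace AutoTokenizer should work):

```python
from megatron.bridge.training.tokenizers.tokenizer import _HuggingFaceTokenizer

tok = _HuggingFaceTokenizer("gpt2")

ids = tok.tokenize("Megatron tokenizers")
print(ids)
print(tok.detokenize(ids))
print(tok.vocab_size, tok.eod)
```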
- class bridge.training.tokenizers.tokenizer._BertWordPieceTokenizer(
- vocab_file,
- lower_case=True,
- vocab_extra_ids=0,
)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
Original BERT wordpiece tokenizer adapted for Megatron.
This tokenizer uses the FullBertTokenizer from bert_tokenization. It handles lower/upper casing and adds special tokens like [CLS], [SEP], [PAD], [MASK], [BOS], and [EOS]. It also supports adding extra vocabulary IDs.
- Parameters:
vocab_file (str) – Path to the BERT vocabulary file.
lower_case (bool, optional) – Whether to convert text to lower case. Defaults to True.
vocab_extra_ids (int, optional) – Number of extra IDs to add to the vocabulary, often used for sentinel tokens in T5-style models. Defaults to 0.
Initialization
- add_token(token)#
Adds a single token to the vocabulary if it doesn’t already exist.
- add_additional_special_tokens(tokens_list)#
Adds a list of special tokens to the vocabulary.
- property vocab_size#
Returns the current size of the vocabulary.
- property vocab#
Returns the vocabulary (token to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token mapping).
- tokenize(text)#
Tokenizes a string of text into a list of token IDs.
- decode(ids)#
Converts a list of token IDs back to a string, cleaning up ## prefixes.
- detokenize(token_ids)#
Converts a list of token IDs back to a string. Alias for decode().
- decode_token_ids(token_ids)#
Converts token IDs to a string, excluding [PAD] and [CLS] and handling ## prefixes.
- property cls#
Returns the [CLS] token ID.
- property sep#
Returns the [SEP] token ID.
- property pad#
Returns the [PAD] token ID.
- property mask#
Returns the [MASK] token ID.
- property bos#
Returns the beginning-of-sentence ([BOS]) token ID.
- property eos#
Returns the end-of-sentence token ID.
- property eod#
Alias for eos, as BERT models typically use EOS for end-of-document.
- property bos_token#
Returns the beginning-of-sentence token string ([BOS]).
- property eos_token#
Returns the end-of-sentence token string ([EOS]).
- property additional_special_tokens#
Returns a list of additional special token strings added to the tokenizer.
- property additional_special_tokens_ids#
Returns a list of IDs for the additional special tokens.
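A usage sketch (hedged: the import path and vocabulary path are placeholders):

```python
from megatron.bridge.training.tokenizers.tokenizer import _BertWordPieceTokenizer

tok = _BertWordPieceTokenizer("/path/to/bert-vocab.txt", lower_case=True)

ids = tok.tokenize("tokenization example")
print(ids)
print(tok.detokenize(ids))         # "##" wordpiece prefixes are cleaned up
print(tok.cls, tok.sep, tok.mask)  # IDs of [CLS], [SEP], [MASK]
```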
- class bridge.training.tokenizers.tokenizer._GPT2BPETokenizer(vocab_file, merge_file)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
Original GPT-2 BPE tokenizer adapted for Megatron.
This tokenizer uses the GPT2Tokenizer from gpt2_tokenization. It handles BPE tokenization based on a vocabulary file and a merges file. The primary special token is <|endoftext|>.
- Parameters:
vocab_file (str) – Path to the GPT-2 vocabulary file (e.g., vocab.json).
merge_file (str) – Path to the GPT-2 merges file (e.g., merges.txt).
Initialization
- property vocab_size#
Returns the size of the vocabulary.
- property vocab#
Returns the vocabulary (token to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token mapping).
- tokenize(text)#
Tokenizes a string of text into a list of token IDs.
- detokenize(token_ids)#
Converts a list of token IDs back into a string.
- property eod#
Returns the end-of-document (<|endoftext|>) token ID.
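A usage sketch (hedged: the import path and file paths are placeholders):

```python
from megatron.bridge.training.tokenizers.tokenizer import _GPT2BPETokenizer

tok = _GPT2BPETokenizer("/path/to/vocab.json", "/path/to/merges.txt")

ids = tok.tokenize("Hello world")
print(ids)
print(tok.detokenize(ids))
print(tok.eod)  # ID of <|endoftext|>
```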
- class bridge.training.tokenizers.tokenizer._SentencePieceTokenizer(model_file, vocab_extra_ids=0)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
A wrapper for SentencePiece tokenizers used with Megatron.
This class interfaces with a pre-trained SentencePiece model. It defines and manages several special tokens such as <CLS>, <SEP>, <EOD>, <MASK>, <PAD>, <BOS>, and <EOS>. It also supports adding extra vocabulary IDs, typically for T5-style sentinel tokens.
- Parameters:
model_file (str) – Path to the SentencePiece model file (e.g., tokenizer.model).
vocab_extra_ids (int, optional) – Number of extra IDs to add to the vocabulary. Defaults to 0.
Initialization
- _populate_vocab()#
- _initalize(vocab_extra_ids)#
- property vocab_size#
Returns the current size of the vocabulary, including added special tokens.
- property vocab#
Returns the vocabulary (token to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token mapping).
- property decoder#
Alias for inv_vocab.
- property encoder#
Alias for vocab.
- tokenize(text)#
Tokenizes a string, handling special tokens separately.
This method first finds occurrences of special tokens (defined during initialization) and tokenizes the text segments around them using the SentencePiece model. Special tokens are inserted as their pre-defined IDs.
- Parameters:
text (str) – The input string to tokenize.
- Returns:
A list of token IDs.
- Return type:
list[int]
- detokenize(ids)#
Converts a list of token IDs back to a string, handling special tokens.
This method reconstructs the text by decoding segments of regular token IDs using the SentencePiece model and inserting the string representations of special tokens where their IDs appear.
- Parameters:
ids (list[int]) – A list of token IDs.
- Returns:
The detokenized string.
- Return type:
str
- offsets(ids: list[int], text: str) → list[int]#
Calculates the character starting offsets for each token ID.
- property cls#
Returns the <CLS> token ID.
- property sep#
Returns the <SEP> token ID.
- property pad#
Returns the padding token ID (e.g., <PAD>).
- property bos#
Returns the beginning-of-sentence token ID (e.g., <BOS>).
- property eod#
Returns the end-of-document (<EOD>) token ID.
- property eos#
Returns the end-of-sentence token ID (e.g., <EOS>).
- property mask#
Returns the <MASK> token ID.
- property additional_special_tokens_ids#
Returns a list of IDs for T5-style <extra_id_*> sentinel tokens.
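A usage sketch of the special-token-aware round trip (hedged: the model path is a placeholder and the <EOD> spelling assumes the special tokens named above):

```python
from megatron.bridge.training.tokenizers.tokenizer import _SentencePieceTokenizer

tok = _SentencePieceTokenizer("/path/to/tokenizer.model", vocab_extra_ids=0)

# Segments around a special token are encoded with the SentencePiece model,
# while the special token itself is inserted as its reserved ID.
ids = tok.tokenize("first document<EOD>second document")
print(ids)
print(tok.detokenize(ids))  # the special token's string form is restored
```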
- class bridge.training.tokenizers.tokenizer._GPTSentencePieceTokenizer(model_file)#
Bases:
bridge.training.tokenizers.tokenizer._SentencePieceTokenizer
A specialized SentencePiece tokenizer for GPT-style models.
This class inherits from _SentencePieceTokenizer but simplifies the special token handling. It primarily uses the BOS, EOS, and PAD IDs defined by the SentencePiece model itself, without adding extra tokens like <CLS>, <SEP>, etc. The eod (end-of-document) token is mapped to the eos_id.
- Parameters:
model_file (str) – Path to the SentencePiece model file.
Initialization
- _initalize(vocab_extra_ids)#
- tokenize(text)#
Tokenizes a string of text directly using SentencePiece encode_as_ids.
- detokenize(ids)#
Converts a list of token IDs back to a string using SentencePiece decode_ids.
- property cls#
Returns -1 as [CLS] is not typically used in this tokenizer.
- property sep#
Returns -1 as [SEP] is not typically used in this tokenizer.
- property mask#
Returns -1 as [MASK] is not typically used in this tokenizer.
- property eod#
Returns the end-of-sentence token ID, used as end-of-document.
- property additional_special_tokens_ids#
Returns None as no additional special tokens are added by default.
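A usage sketch (hedged: the import path and model path are placeholders):

```python
from megatron.bridge.training.tokenizers.tokenizer import _GPTSentencePieceTokenizer

tok = _GPTSentencePieceTokenizer("/path/to/tokenizer.model")

ids = tok.tokenize("A GPT-style prompt")
print(ids)
print(tok.detokenize(ids))
print(tok.eod)  # the SentencePiece EOS ID doubles as end-of-document
```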
- class bridge.training.tokenizers.tokenizer._Llama2Tokenizer(model_file)#
Bases:
bridge.training.tokenizers.tokenizer._SentencePieceTokenizer
A tokenizer specifically for Llama-2 style models, using SentencePiece.
This class inherits from _SentencePieceTokenizer and is configured for Llama-2’s specific use of BOS and EOS tokens. It uses the BOS/EOS/PAD IDs directly from the SentencePiece model.
- Parameters:
model_file (str) – Path to the SentencePiece model file for Llama-2.
Initialization
- tokenize(s: str, bos=True, eos=False)#
Tokenizes a string, with options to add BOS and EOS tokens.
- Parameters:
s (str) – The input string to tokenize.
bos (bool, optional) – Whether to prepend the BOS token. Defaults to True.
eos (bool, optional) – Whether to append the EOS token. Defaults to False.
- Returns:
A list of token IDs.
- Return type:
list[int]
- detokenize(ids)#
Converts a list of token IDs back into a string.
- property cls#
Returns -1 as [CLS] is not typically used in this tokenizer.
- property sep#
Returns -1 as [SEP] is not typically used in this tokenizer.
- property mask#
Returns -1 as [MASK] is not typically used in this tokenizer.
- property eod#
Returns the end-of-sentence token ID, used as end-of-document.
- property additional_special_tokens_ids#
Returns None as no additional special tokens are added by default.
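A usage sketch of the BOS/EOS options (hedged: the import path and model path are placeholders):

```python
from megatron.bridge.training.tokenizers.tokenizer import _Llama2Tokenizer

tok = _Llama2Tokenizer("/path/to/llama2/tokenizer.model")

prompt_ids = tok.tokenize("Hello world")                    # BOS prepended by default
full_ids = tok.tokenize("Hello world", bos=True, eos=True)  # BOS and EOS
print(prompt_ids)
print(full_ids)
print(tok.detokenize(full_ids))
```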
- bridge.training.tokenizers.tokenizer.reload_mergeable_ranks(
- path: str,
- max_vocab: Optional[int] = None,
)#
Reloads a tokenizer vocabulary from a JSON file (NeMo format) and converts it into the mergeable ranks format required by Tiktoken. The input JSON file is expected to be a list of dictionaries, each with “rank”, “token_bytes” (base64 encoded), and “token_str” keys.
- Parameters:
path (str) – Path to the JSON vocabulary file.
max_vocab (Optional[int], optional) – If provided, truncates the vocabulary to this maximum size. Defaults to None.
- Returns:
A dictionary mapping token bytes to their ranks.
- Return type:
Dict[bytes, int]
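A conceptual sketch of the conversion (the real function may perform additional validation or logging; this only illustrates the documented input and output formats):

```python
import base64
import json
from typing import Dict, Optional


def reload_mergeable_ranks_sketch(path: str, max_vocab: Optional[int] = None) -> Dict[bytes, int]:
    # The file is a JSON list of {"rank": int, "token_bytes": base64 str, "token_str": str}.
    with open(path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    if max_vocab is not None:
        entries = entries[:max_vocab]  # truncate to the requested vocabulary size
    # Tiktoken expects a mapping from raw token bytes to rank.
    return {base64.b64decode(e["token_bytes"]): int(e["rank"]) for e in entries}
```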
- class bridge.training.tokenizers.tokenizer.CustomTikTokenizer(
- path: str,
- pattern: str,
- vocab_size: Optional[int],
- num_special_tokens: int,
- special_tokens: Optional[List[str]],
)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
A custom tokenizer using the Tiktoken library with a NeMo-style vocabulary file.
This tokenizer loads a vocabulary from a JSON file (processed by reload_mergeable_ranks) and uses it with Tiktoken for encoding and decoding. It supports a configurable number of special tokens, which are placed at the beginning of the vocabulary ID space.
- Parameters:
path (str) – Path to the JSON vocabulary file (NeMo format).
pattern (str) – The regex pattern string for Tiktoken.
vocab_size (Optional[int]) – The target vocabulary size. If None, defaults to 2^17.
num_special_tokens (int) – The total number of special tokens to reserve.
special_tokens (Optional[List[str]]) – A list of initial special token strings. Must include “<unk>”, “<s>”, and “</s>”. If shorter than num_special_tokens, it will be padded with “<SPECIAL_id>”.
Initialization
- property bos: int#
Returns the beginning-of-sentence (<s>) token ID.
- property eos: int#
Returns the end-of-sentence (</s>) token ID.
- property unk: int#
Returns the unknown (<unk>) token ID.
- property eod: int#
Returns the end-of-document token ID (same as EOS for this tokenizer).
- property vocab#
Returns the vocabulary (token string/bytes to ID mapping).
- property inv_vocab#
Returns the inverse vocabulary (ID to token string/bytes mapping).
- tokenize(
- s: str,
- bos: bool = False,
- eos: bool = False,
)#
Tokenizes a string, with options to add BOS and EOS tokens.
- Parameters:
s (str) – The input string to tokenize.
bos (bool, optional) – Whether to prepend the BOS token. Defaults to False.
eos (bool, optional) – Whether to append the EOS token. Defaults to False.
- Returns:
A list of token IDs.
- Return type:
List[int]
- detokenize(tokens: List[int]) → str#
Converts a list of token IDs back into a string.
- offsets(ids: list[int], text: str) → list[int]#
Calculates the character starting offsets for each token ID.
- property vocab_size: int#
Returns the total vocabulary size, including special tokens.
- property encoder#
Alias for vocab.
- property decoder#
Alias for inv_vocab.
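A usage sketch (hedged: the vocabulary path is a placeholder, the split pattern shown is a cl100k-style Tiktoken regex used only as an example, and the special-token spellings and counts are illustrative):

```python
from megatron.bridge.training.tokenizers.tokenizer import CustomTikTokenizer

PATTERN = (
    r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
)

tok = CustomTikTokenizer(
    path="/path/to/nemo_vocab.json",
    pattern=PATTERN,
    vocab_size=131072,                        # 2**17, the documented default
    num_special_tokens=1000,
    special_tokens=["<unk>", "<s>", "</s>"],
)

ids = tok.tokenize("Hello world", bos=True, eos=True)
print(ids)
print(tok.detokenize(ids))
```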
- class bridge.training.tokenizers.tokenizer._NullTokenizer(vocab_size)#
Bases:
bridge.training.tokenizers.tokenizer.MegatronTokenizer
A simple tokenizer that splits text by spaces and converts tokens to integers. This tokenizer is primarily for testing or placeholder purposes where actual linguistic tokenization is not required. It assumes tokens are space-separated integers.
- Parameters:
vocab_size (int) – The vocabulary size, excluding the EOD token. The EOD token will be assigned vocab_size as its ID.
Initialization
- tokenize(text)#
Tokenizes by splitting the string by spaces and converting parts to integers.
- detokenize(ids)#
Converts a list of integer IDs back to a space-separated string.
- offsets(ids: list[int], text: str) → list[int]#
Calculates character offsets, assuming space-separated integer tokens.
- property vocab_size#
Returns the vocabulary size, including the EOD token.
- abstract property vocab#
Not implemented for NullTokenizer.
- abstract property inv_vocab#
Not implemented for NullTokenizer.
- property cls#
Returns -1 as [CLS] is not used.
- property sep#
Returns -1 as [SEP] is not used.
- property mask#
Returns -1 as [MASK] is not used.
- property eod#
Returns the end-of-document token ID.
- property additional_special_tokens_ids#
Returns None as no additional special tokens are used.
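A usage sketch (hedged: the import path is assumed from this module's documented location):

```python
from megatron.bridge.training.tokenizers.tokenizer import _NullTokenizer

tok = _NullTokenizer(vocab_size=1000)

ids = tok.tokenize("12 7 42")  # space-separated integers in ...
print(ids)                     # ... integer IDs out: [12, 7, 42]
print(tok.detokenize(ids))     # "12 7 42"
print(tok.eod)                 # 1000: the EOD token takes vocab_size as its ID
print(tok.vocab_size)          # 1001, including the EOD token
```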