core.tokenizers.text.libraries.null_tokenizer#

Module Contents#

Classes#

NullTokenizer

Synthetic tokenizer for performance benchmarking and debugging.

API#

class core.tokenizers.text.libraries.null_tokenizer.NullTokenizer(vocab_size, eod_id=None, pad_id=-1, **kwargs)#

Synthetic tokenizer for performance benchmarking and debugging.

Parameters:
  • vocab_size – size of the vocabulary, used to size the embedding table.

  • eod_id – id of the end-of-document token. Defaults to vocab_size - 1.

  • pad_id – id of the padding token. Defaults to -1 (no pad token).

Initialization
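The constructor's default handling can be illustrated with a minimal sketch. This is a hypothetical stand-in class (not the real `NullTokenizer`), written only to show the documented defaults: `eod_id` falls back to `vocab_size - 1`, and `pad_id` defaults to `-1`.

```python
# Hypothetical sketch of the documented constructor defaults.
# Not the real implementation; names other than the parameters are assumptions.
class NullTokenizerSketch:
    def __init__(self, vocab_size, eod_id=None, pad_id=-1, **kwargs):
        self._vocab_size = vocab_size
        # eod defaults to the last id in the vocabulary
        self.eod = eod_id if eod_id is not None else vocab_size - 1
        self.pad_id = pad_id

tok = NullTokenizerSketch(vocab_size=1000)
assert tok.eod == 999 and tok.pad_id == -1
```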

text_to_ids(text)#

Converts text to ids.

ids_to_text(ids)#

Converts ids to text.
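Since this is a synthetic tokenizer, a plausible (assumed, not confirmed by this page) behavior is that "text" is simply a space-separated string of integer ids, so the two conversions are inverses of each other:

```python
# Assumption: the null tokenizer treats text as space-separated integer ids.
# These are illustrative free functions, not the real methods.
def text_to_ids(text: str) -> list[int]:
    return [int(tok) for tok in text.split()]

def ids_to_text(ids: list[int]) -> str:
    return " ".join(str(i) for i in ids)

ids = text_to_ids("7 42 7")
assert ids == [7, 42, 7]
assert ids_to_text(ids) == "7 42 7"
```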

tokens_to_ids(tokens)#

Converts tokens to ids.

ids_to_tokens(ids)#

Converts ids to tokens.

offsets(ids: list[int], text: str) list[int]#

Returns the character offsets of the tokens represented by ids within text.
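One way such offsets could be computed, under the same space-separated-ids assumption as above, is the starting character position of each whitespace-separated token. This sketch matches the `offsets(ids, text) -> list[int]` signature but is an illustration, not the library's implementation:

```python
# Hypothetical sketch: starting character offset of each whitespace-separated
# token in text. `ids` is accepted to match the documented signature but is
# not needed under this assumption.
def offsets(ids: list[int], text: str) -> list[int]:
    out, pos = [], 0
    for tok in text.split():
        pos = text.index(tok, pos)  # locate token from the current position
        out.append(pos)
        pos += len(tok)
    return out

assert offsets([7, 42], "7 42") == [0, 2]
```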

property unique_identifiers: collections.OrderedDict#

Property required for use with megatron-core datasets.

property vocab_size#

Returns vocab size.

abstract property vocab#

abstract property inv_vocab#
property cls#

Returns cls token.

property sep#

Returns sep token.

property mask#

Returns mask token.

property eod#

Returns eod token.

property pad_id#

Returns the id of the padding token.

property additional_special_tokens_ids#