core.datasets.megatron_tokenizer#

Module Contents#

Classes#

MegatronLegacyTokenizer

Abstract class for tokenizer

Data#

API#

core.datasets.megatron_tokenizer.logger#

‘getLogger(…)’

class core.datasets.megatron_tokenizer.MegatronLegacyTokenizer(
*tokenizer_paths: str,
**tokenizer_options: Any,
)#

Bases: abc.ABC

Abstract class for tokenizer

In the absence of a config or class-specific tracking of which arguments uniquely identify a tokenizer, all keyword arguments are included as unique identifiers.

Parameters:
  • tokenizer_paths (Tuple[str]) – All tokenizer source paths or prefixes

  • tokenizer_options (Dict[str, Any]) – All tokenizer options

Initialization
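
The note about unique identifiers can be illustrated with a small sketch. It assumes that the constructor records the class name, the source paths, and every keyword option in an ordered mapping and then serializes it. The class `SketchTokenizer` and the attribute names `unique_identifiers` and `unique_description` are hypothetical and are not confirmed by this page.

```python
import json
from collections import OrderedDict
from typing import Any


class SketchTokenizer:
    """Hypothetical stand-in; not the library class."""

    def __init__(self, *tokenizer_paths: str, **tokenizer_options: Any) -> None:
        # Assumption: every path and every keyword option contributes to identity.
        self.unique_identifiers = OrderedDict()
        self.unique_identifiers["class"] = type(self).__name__
        self.unique_identifiers["tokenizer_path"] = list(tokenizer_paths)
        for name, value in tokenizer_options.items():
            self.unique_identifiers[name] = str(value)
        # A stable text form that could be compared or hashed.
        self.unique_description = json.dumps(self.unique_identifiers, indent=4)


a = SketchTokenizer("vocab.json", "merges.txt", lowercase=True)
b = SketchTokenizer("vocab.json", "merges.txt", lowercase=False)
print(a.unique_description == b.unique_description)  # False: one option differs
```

Under this assumption, two tokenizers built from the same paths but different options are treated as distinct, which is the conservative behaviour the note describes.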

abstractmethod tokenize(text: str) → numpy.ndarray#

Convert text to embedding ids

Parameters:

text (str) – The text to convert

Returns:

The converted embedding ids

Return type:

numpy.ndarray

detokenize(ids: numpy.ndarray) → str#

Convert embedding ids to text

Parameters:

ids (numpy.ndarray) – The ids to convert

Returns:

The converted text

Return type:

str

Raises:

NotImplementedError – Raised by the base class; this method is optional and non-abstract, so a subclass may leave it unimplemented
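
The difference between a required method and an optional one can be sketched with a stand-in base class. `TokenizerBase` and `CharTokenizer` are hypothetical names, and plain lists are used where the documented interface uses `numpy.ndarray` so that the sketch runs without third-party packages.

```python
from abc import ABC, abstractmethod


class TokenizerBase(ABC):
    """Hypothetical stand-in that mirrors the documented split."""

    @abstractmethod
    def tokenize(self, text: str) -> list:
        """Required: subclasses must implement this."""

    def detokenize(self, ids: list) -> str:
        # Optional: the base implementation only signals "not provided".
        raise NotImplementedError(f"{type(self).__name__} has no method 'detokenize'")


class CharTokenizer(TokenizerBase):
    """Toy tokenizer: one id per character (its Unicode code point)."""

    def tokenize(self, text: str) -> list:
        return [ord(ch) for ch in text]


tok = CharTokenizer()
print(tok.tokenize("hi"))  # [104, 105]

try:
    tok.detokenize([104, 105])
except NotImplementedError as err:
    print("detokenize is optional:", err)

try:
    TokenizerBase()  # the abstract tokenize is missing
except TypeError:
    print("cannot instantiate without tokenize")
```

A subclass that omits `tokenize` fails at instantiation, while one that omits `detokenize` fails only when that method is called.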

offsets(ids: list[int], text: str) → list[int]#

Convert embedding ids to text offsets

Parameters:
  • ids (list[int]) – The ids to convert

  • text (str) – The text to convert

Returns:

The converted offsets

Return type:

list[int]

Raises:

NotImplementedError – Raised by the base class; this method is optional and non-abstract, so a subclass may leave it unimplemented
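
This page does not define "offsets" precisely. One plausible reading, assumed here, is the starting character index of each token within `text`. The sketch below uses a hypothetical function `sketch_offsets`, a toy `inv_vocab` mapping, and a simple left-to-right search; a real tokenizer would likely rely on its own alignment data instead.

```python
def sketch_offsets(ids: list[int], text: str, inv_vocab: dict[int, str]) -> list[int]:
    """Return the start index in `text` of each token, scanning left to right."""
    offsets = []
    cursor = 0
    for token_id in ids:
        piece = inv_vocab[token_id]
        start = text.find(piece, cursor)
        if start == -1:
            # Token text not found (e.g. a special token): reuse the cursor.
            start = cursor
        else:
            cursor = start + len(piece)
        offsets.append(start)
    return offsets


inv_vocab = {0: "hello", 1: "world", 2: "!"}
print(sketch_offsets([0, 1, 2], "hello world!", inv_vocab))  # [0, 6, 11]
```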

abstract property vocab#

Dictionary from vocab text token to id token

abstract property inv_vocab#

Dictionary from vocab id token to text token

abstract property vocab_size#

The vocabulary size
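
The relationship between the three vocabulary properties can be sketched with a toy mapping. The tokens and ids are illustrative, and the assumption that the size equals the number of entries may not hold for tokenizers that pad their vocabulary.

```python
vocab = {"<pad>": 0, "hello": 1, "world": 2}  # toy text-to-id mapping

# Invert the mapping; this is well defined only if ids are unique.
inv_vocab = {token_id: token for token, token_id in vocab.items()}

# Assumption: size equals the number of entries; real tokenizers may pad it.
vocab_size = len(vocab)

print(inv_vocab[vocab["hello"]])  # hello
print(vocab_size)  # 3
```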

property cls#

The CLS token id

Raises:

NotImplementedError – If the subclass does not define this optional, non-abstract attribute

property sep#

The SEP token id

Raises:

NotImplementedError – If the subclass does not define this optional, non-abstract attribute

property pad#

The PAD token id

Raises:

NotImplementedError – If the subclass does not define this optional, non-abstract attribute

property eod#

The EOD token id

Raises:

NotImplementedError – If the subclass does not define this optional, non-abstract attribute

property bos#

The BOS token id

Raises:

NotImplementedError – If the subclass does not define this optional, non-abstract attribute

property eos#

The EOS token id

Raises:

NotImplementedError – If the subclass does not define this optional, non-abstract attribute

property mask#

The MASK token id

Raises:

NotImplementedError – If the subclass does not define this optional, non-abstract attribute
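
Because the special-token properties may raise NotImplementedError, calling code may want to probe them defensively. The sketch below shows one way to do so; `ToyTokenizer` and `special_token_or_none` are hypothetical names introduced for illustration.

```python
from typing import Optional


class ToyTokenizer:
    """Hypothetical tokenizer that defines eod but not mask."""

    @property
    def eod(self) -> int:
        return 0

    @property
    def mask(self) -> int:
        # Mirrors the documented behaviour for an undefined optional attribute.
        raise NotImplementedError(f"{type(self).__name__} has no attribute 'mask'")


def special_token_or_none(tokenizer: object, name: str) -> Optional[int]:
    """Return the id of the named special token, or None if it is undefined."""
    try:
        return getattr(tokenizer, name)
    except NotImplementedError:
        return None


tok = ToyTokenizer()
print(special_token_or_none(tok, "eod"))   # 0
print(special_token_or_none(tok, "mask"))  # None
```

Note that `hasattr` alone would not help here, since it suppresses only AttributeError and lets NotImplementedError propagate.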