core.datasets.megatron_tokenizer
Module Contents
Classes
MegatronLegacyTokenizer | Abstract class for tokenizer
Data
logger
API
- core.datasets.megatron_tokenizer.logger
‘getLogger(…)’
- class core.datasets.megatron_tokenizer.MegatronLegacyTokenizer(*tokenizer_paths: str, **tokenizer_options: Any)
Bases: abc.ABC
Abstract class for tokenizer
Absent a config or class-specific tracking of which objects are uniquely identifying, we must include all keyword arguments as unique identifiers.
- Parameters:
tokenizer_paths (Tuple[str]) – All tokenizer source paths or prefixes
tokenizer_options (Dict[str, Any]) – All tokenizer options
Initialization
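The note about unique identifiers can be illustrated with a minimal, dependency-free sketch. It assumes the class name, the source paths, and the stringified options are collected into an ordered mapping and serialized as JSON. The helper name `describe_tokenizer` and the key names are hypothetical and are not taken from this module.

```python
import json
from collections import OrderedDict
from typing import Any


def describe_tokenizer(class_name: str, *tokenizer_paths: str, **tokenizer_options: Any) -> str:
    """Build a description string from every argument (hypothetical helper)."""
    identifiers: "OrderedDict[str, Any]" = OrderedDict()
    identifiers["class"] = class_name
    identifiers["tokenizer_path"] = list(tokenizer_paths)
    # Nothing says which options matter, so every keyword argument is kept.
    for name, value in tokenizer_options.items():
        identifiers[name] = str(value)
    return json.dumps(identifiers, indent=4)


# Two configurations that differ in a single option get different descriptions.
a = describe_tokenizer("ToyTokenizer", "vocab.txt", lowercase=True)
b = describe_tokenizer("ToyTokenizer", "vocab.txt", lowercase=False)
```

Because every keyword argument takes part in the description, two tokenizers built with different options are never treated as the same object, at the cost of also distinguishing options that have no effect on tokenization.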
- abstractmethod tokenize(text: str) → numpy.ndarray
Convert text to embedding ids
- Parameters:
text (str) – The text to convert
- Returns:
The converted embedding ids
- Return type:
numpy.ndarray
- abstractmethod detokenize(ids: numpy.ndarray) → str
Convert embedding ids to text
- Parameters:
ids (numpy.ndarray) – The ids to convert
- Returns:
The converted text
- Return type:
str
- Raises:
NotImplementedError – Non-abstract, optional method
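The relationship between `tokenize` and `detokenize` can be sketched with a toy whitespace tokenizer. This is an illustration under stated assumptions, not the library's implementation: `TokenizerSketch`, `WhitespaceTokenizer`, and `TokenizeOnly` are hypothetical names, the base class is a stand-in rather than `MegatronLegacyTokenizer` itself, and ids are plain lists where the documented interface uses `numpy.ndarray`, to keep the example dependency-free.

```python
from abc import ABC, abstractmethod


class TokenizerSketch(ABC):
    """Hypothetical stand-in for the documented base class."""

    @abstractmethod
    def tokenize(self, text: str) -> list:
        """Required: every concrete tokenizer must implement this."""

    def detokenize(self, ids: list) -> str:
        # Optional: the default raises until a subclass overrides it.
        raise NotImplementedError(f"{type(self).__name__} has no method 'detokenize'")


class WhitespaceTokenizer(TokenizerSketch):
    """Toy tokenizer with a fixed word-level vocabulary."""

    def __init__(self, words):
        self._vocab = {word: i for i, word in enumerate(words)}
        self._inv_vocab = {i: word for word, i in self._vocab.items()}

    def tokenize(self, text):
        return [self._vocab[word] for word in text.split()]

    def detokenize(self, ids):
        return " ".join(self._inv_vocab[i] for i in ids)


class TokenizeOnly(TokenizerSketch):
    """Implements only the required method."""

    def tokenize(self, text):
        return [len(word) for word in text.split()]


tok = WhitespaceTokenizer(["hello", "world"])
ids = tok.tokenize("hello world")
text = tok.detokenize(ids)
```

In this sketch `TokenizeOnly` can still be instantiated, and calling its `detokenize` raises `NotImplementedError`, which is the behaviour the Raises entry describes for an optional method.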
- abstractmethod offsets(ids: list[int], text: str) → list[int]
Convert embedding ids to text offsets
- Parameters:
ids (list[int]) – The ids to convert
text (str) – The text to convert
- Returns:
The converted offsets
- Return type:
list[int]
- Raises:
NotImplementedError – Non-abstract, optional method
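One plausible reading of `offsets` is sketched below, assuming each offset is the starting character index of the corresponding token in `text`. That interpretation, the function name `offsets_sketch`, and the extra `inv_vocab` argument are assumptions made for illustration. A concrete tokenizer may define offsets differently.

```python
def offsets_sketch(ids: list[int], text: str, inv_vocab: dict[int, str]) -> list[int]:
    """Return the start index of each token in text (assumed meaning of offsets)."""
    result = []
    cursor = 0
    for token_id in ids:
        token = inv_vocab[token_id]
        # Search from the end of the previous token so repeated tokens
        # map to successive positions; raises ValueError if not found.
        start = text.index(token, cursor)
        result.append(start)
        cursor = start + len(token)
    return result


inv_vocab = {0: "hello", 1: "world"}
starts = offsets_sketch([0, 1, 0], "hello world  hello", inv_vocab)
```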
- abstract property vocab
Dictionary from vocab text token to id token
- abstract property inv_vocab
Dictionary from vocab id token to text token
- abstract property vocab_size
The vocabulary size
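The three vocabulary properties are closely related, as the following sketch shows with a toy vocabulary. The token strings and ids are made up for illustration, and the sketch assumes the size equals the number of vocabulary entries, which need not hold for every tokenizer (for example, one that pads its vocabulary).

```python
# Toy vocabulary: text token -> id token (illustrative values only).
vocab = {"[PAD]": 0, "hello": 1, "world": 2}

# The inverse mapping: id token -> text token.
inv_vocab = {token_id: token for token, token_id in vocab.items()}

# Assumed here: the vocabulary size is the number of entries.
vocab_size = len(vocab)

# The inversion is only lossless when every id is unique.
ids_are_unique = len(inv_vocab) == len(vocab)
```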
- abstract property cls
The CLS token id
- Raises:
NotImplementedError – Non-abstract, optional attribute
- abstract property sep
The SEP token id
- Raises:
NotImplementedError – Non-abstract, optional attribute
- abstract property pad
The PAD token id
- Raises:
NotImplementedError – Non-abstract, optional attribute
- abstract property eod
The EOD token id
- Raises:
NotImplementedError – Non-abstract, optional attribute
- abstract property bos
The BOS token id
- Raises:
NotImplementedError – Non-abstract, optional attribute
- abstract property eos
The EOS token id
- Raises:
NotImplementedError – Non-abstract, optional attribute
- abstract property mask
The MASK token id
- Raises:
NotImplementedError – Non-abstract, optional attribute
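Because the special-token properties are optional and raise `NotImplementedError` when a tokenizer does not define them, calling code may want to probe for them rather than assume they exist. The sketch below shows one way to do that. `SpecialTokensSketch` and `optional_token_id` are hypothetical names, and the class is a stand-in rather than a real tokenizer from this module.

```python
class SpecialTokensSketch:
    """Hypothetical stand-in: defines eod, leaves pad unimplemented."""

    @property
    def eod(self) -> int:
        return 0

    @property
    def pad(self) -> int:
        raise NotImplementedError(f"{type(self).__name__} has no attribute 'pad'")


def optional_token_id(tokenizer, name: str):
    """Return the special token id, or None if the tokenizer does not define it."""
    try:
        return getattr(tokenizer, name)
    except NotImplementedError:
        return None


tok = SpecialTokensSketch()
eod_id = optional_token_id(tok, "eod")
pad_id = optional_token_id(tok, "pad")
```

Returning `None` keeps a missing token distinguishable from a valid id of `0`, so callers should compare against `None` explicitly rather than rely on truthiness.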