nemo_curator.stages.text.models.tokenizer
Module Contents
Classes
API
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Tokenizer stage for Hugging Face models.
Parameters:
The identifier of the Hugging Face model.
The Hugging Face cache directory. Defaults to None.
Hugging Face token for downloading the model, if needed. Defaults to None.
The name of the text field in the input data. Defaults to “text”.
Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.
Maximum length, in tokens, of the sequence returned by the tokenizer. If None, the tokenizer’s model_max_length is used. Defaults to None.
The side on which to pad the input tokens. Defaults to “right”.
Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it places sequences of similar length next to each other, which improves the performance of the inference model. Defaults to True.
If True, set the pad_token to the tokenizer’s unk_token. Defaults to False.
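The interaction between character truncation, the maximum sequence length, the padding side, and length sorting can be illustrated with a minimal pure-Python sketch. This is not the stage's implementation: `toy_tokenize`, `prepare_batch`, and every argument name below are hypothetical and chosen only for illustration, and the whitespace tokenizer stands in for a real Hugging Face tokenizer.

```python
from typing import Optional


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical stand-in tokenizer: one id per whitespace-separated word.

    The id is simply the word length, so ids are always >= 1 and never
    collide with the pad id 0 used below.
    """
    return [len(word) for word in text.split()]


def prepare_batch(
    texts: list[str],
    max_chars: Optional[int] = None,
    max_seq_length: Optional[int] = None,
    padding_side: str = "right",
    sort_by_length: bool = True,
    model_max_length: int = 8,  # assumed fallback, mimicking a tokenizer default
    pad_id: int = 0,
) -> tuple[list[int], list[list[int]]]:
    """Return (row order, padded token ids) for a batch of texts."""
    # 1. Character-level truncation happens before tokenization.
    if max_chars is not None:
        texts = [t[:max_chars] for t in texts]

    # 2. Token-level truncation; fall back to the model limit when unset.
    limit = max_seq_length if max_seq_length is not None else model_max_length
    token_ids = [toy_tokenize(t)[:limit] for t in texts]

    # 3. Optionally reorder rows from shortest to longest token sequence.
    order = list(range(len(token_ids)))
    if sort_by_length:
        order.sort(key=lambda i: len(token_ids[i]))

    # 4. Pad every row to the longest sequence, on the requested side.
    width = max((len(ids) for ids in token_ids), default=0)
    padded = []
    for i in order:
        ids = token_ids[i]
        padding = [pad_id] * (width - len(ids))
        padded.append(ids + padding if padding_side == "right" else padding + ids)
    return order, padded


texts = ["a bb ccc dddd", "a", "a bb"]
order, padded = prepare_batch(texts, max_seq_length=3, padding_side="left")
print(order)   # original row indices, shortest sequence first
print(padded)  # left-padded token ids, each row of length 3
```

Returning the row order alongside the padded ids lets a caller restore the original document order after inference, which matters whenever sorting is enabled.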