nemo_curator.stages.text.classifiers.prompt_task_complexity
Module Contents
Classes
Data
PROMPT_TASK_COMPLEXITY_MODEL_IDENTIFIER
API
Bases: Module, PyTorchModelHubMixin
Bases: Module
Bases: Module
Bases: CompositeStage[DocumentBatch, DocumentBatch]
PromptTaskComplexityClassifier is a multi-headed model that classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions, which are ensembled to produce an overall complexity score. Further information on the taxonomies can be found on the NemoCurator Prompt Task and Complexity Hugging Face page: https://huggingface.co/nvidia/prompt-task-and-complexity-classifier. This class is optimized for multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
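To illustrate the "ensembled" step, the sketch below combines six per-dimension complexity scores into one overall score via a weighted mean. The dimension names and weights here are placeholders for illustration only; the actual taxonomy and weights are documented on the Hugging Face model card linked above.

```python
# Illustrative sketch of ensembling multi-head complexity outputs into a
# single score. Dimension names and weights are hypothetical placeholders,
# not the values used by the real model.

def ensemble_complexity(dimension_scores: dict[str, float],
                        weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension complexity scores in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(dimension_scores[d] * w for d, w in weights.items()) / total_weight

# Hypothetical weights summing to 1.0 over six dimensions.
weights = {
    "creativity": 0.3, "reasoning": 0.3, "contextual_knowledge": 0.1,
    "domain_knowledge": 0.1, "constraints": 0.1, "few_shots": 0.1,
}
scores = {d: 0.5 for d in weights}          # six scores from the model heads
overall = ensemble_complexity(scores, weights)  # uniform scores -> 0.5
```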
Parameters:
The Hugging Face cache directory. Defaults to None.
The name of the text field in the input data. Defaults to "text".
For categorical classifiers, the list of labels to filter the data by. Defaults to None. Not supported with PromptTaskComplexityClassifier (raises NotImplementedError).
Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.
Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.
The size of the batch for model inference. Defaults to 256.
Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.
Whether to keep the input tokens in the output dataframe. Defaults to False.
Whether to use the existing tokens from the input dataframe. If True, assume the relevant token fields are ["input_ids", "attention_mask"] and skip tokenization. Defaults to False.
Bases: ModelStage
Stage for Hugging Face model inference.
Parameters:
The Hugging Face cache directory. Defaults to None.
The size of the batch for model inference. Defaults to 256.
Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.
If provided, clips the input tokens before the forward pass. Defaults to None.
Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.
Whether to keep the input tokens in the output dataframe. Defaults to False.
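The sorting parameter above helps because batching sequences of similar length minimizes padding, so the model does less wasted work per batch. A minimal sketch of length-sorted batching, assuming a hypothetical helper (not the library's implementation):

```python
# Why sort_by_length improves throughput: sorting token sequences by length
# before batching keeps each batch's padding small. batch_by_length is a
# hypothetical illustration, not the nemo_curator implementation.

def batch_by_length(token_lists: list[list[int]], batch_size: int):
    """Sort sequences by length, then pad each batch only to its own max."""
    order = sorted(range(len(token_lists)), key=lambda i: len(token_lists[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        width = max(len(token_lists[i]) for i in idx)  # pad to batch max, not global max
        batches.append([token_lists[i] + [0] * (width - len(token_lists[i]))
                        for i in idx])
    return batches, order  # order allows restoring the original row order

tokens = [[1, 2, 3], [1], [1, 2]]
batches, order = batch_by_length(tokens, batch_size=2)
# order == [1, 2, 0]; batches == [[[1, 0], [1, 2]], [[1, 2, 3]]]
```

Returning the sort order alongside the batches matters in practice: after inference, predictions must be scattered back so the output rows line up with the input dataframe.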