stages.text.classifiers.prompt_task_complexity#

Module Contents#

Classes#

CustomDeberta

MeanPooling

MulticlassHead

PromptTaskComplexityClassifier

PromptTaskComplexityClassifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score. Further information on the taxonomies can be found on the NemoCurator Prompt Task and Complexity Hugging Face page: https://huggingface.co/nvidia/prompt-task-and-complexity-classifier. This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

PromptTaskComplexityModelStage

Stage for Hugging Face model inference.

Data#

API#

class stages.text.classifiers.prompt_task_complexity.CustomDeberta(config: dataclasses.dataclass)#

Bases: torch.nn.Module, huggingface_hub.PyTorchModelHubMixin

Initialization

compute_results(
preds: torch.Tensor,
target: str,
decimal: int = 4,
) -> tuple[list[str], list[str], list[float]]#

property device: torch.device#

forward(batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]#

process_logits(logits: list[torch.Tensor]) -> dict[str, torch.Tensor]#
stages.text.classifiers.prompt_task_complexity.MAX_SEQ_LENGTH#

512

class stages.text.classifiers.prompt_task_complexity.MeanPooling#

Bases: torch.nn.Module

Initialization

forward(
last_hidden_state: torch.Tensor,
attention_mask: torch.Tensor,
) -> torch.Tensor#
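MeanPooling averages the token embeddings of a sequence while ignoring padding positions, using the attention mask to decide which tokens count. A minimal pure-Python sketch of that computation (the actual module operates on batched torch.Tensor inputs; names here are illustrative):

```python
def mean_pool(last_hidden_state, attention_mask):
    """Masked mean over the token axis: average only the embeddings
    whose attention-mask entry is 1 (real tokens, not padding)."""
    dim = len(last_hidden_state[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(last_hidden_state, attention_mask):
        if mask:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # Guard against an all-padding sequence (the torch version clamps the divisor).
    count = max(count, 1)
    return [t / count for t in total]

# Two real tokens and one padding token: the pad vector is ignored.
pooled = mean_pool(
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [1, 1, 0],
)
# pooled == [2.0, 3.0]
```

The clamp on `count` mirrors the usual torch idiom of clamping the mask sum so an all-padding row divides by 1 instead of 0.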
class stages.text.classifiers.prompt_task_complexity.MulticlassHead(input_size: int, num_classes: int)#

Bases: torch.nn.Module

Initialization

forward(x: torch.Tensor) -> torch.Tensor#
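MulticlassHead is a linear projection from the pooled embedding (`input_size`) to per-class logits (`num_classes`); the multi-headed classifier attaches one such head per taxonomy. A pure-Python sketch of the forward computation (the weights below are made-up numbers, not trained parameters; `torch.nn.Linear` does the same on tensors):

```python
def multiclass_head(x, weight, bias):
    """Linear layer: logits[i] = dot(weight[i], x) + bias[i],
    producing one logit per class."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]

# input_size=2, num_classes=3, with illustrative weights and biases
logits = multiclass_head(
    [1.0, 2.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [0.0, 0.5, -1.0],
)
# logits == [1.0, 2.5, 2.0]
```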
stages.text.classifiers.prompt_task_complexity.OUTPUT_FIELDS#

['prompt_complexity_score', 'task_type_1', 'task_type_2', 'task_type_prob', 'creativity_scope', 'rea…

stages.text.classifiers.prompt_task_complexity.PROMPT_TASK_COMPLEXITY_MODEL_IDENTIFIER#

'nvidia/prompt-task-and-complexity-classifier'

class stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityClassifier#

Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

PromptTaskComplexityClassifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score. Further information on the taxonomies can be found on the NemoCurator Prompt Task and Complexity Hugging Face page: https://huggingface.co/nvidia/prompt-task-and-complexity-classifier. This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
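The overall prompt_complexity_score is an ensemble of the six per-dimension complexity scores. The sketch below shows one plausible ensembling scheme, a weighted average; the weights are illustrative placeholders, not the values used by the released model (see the Hugging Face model card linked above for the actual combination). The dimension names follow the model's output fields:

```python
def ensemble_complexity(dimension_scores, weights):
    """Weighted average of per-dimension complexity scores in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return round(
        sum(dimension_scores[name] * w for name, w in weights.items()), 4
    )

# Hypothetical weights over the six dimensions (placeholders only).
weights = {
    "creativity_scope": 0.2, "reasoning": 0.2, "constraint_ct": 0.15,
    "domain_knowledge": 0.15, "contextual_knowledge": 0.15,
    "number_of_few_shots": 0.15,
}
scores = {
    "creativity_scope": 0.1, "reasoning": 0.6, "constraint_ct": 0.3,
    "domain_knowledge": 0.5, "contextual_knowledge": 0.2,
    "number_of_few_shots": 0.0,
}
overall = ensemble_complexity(scores, weights)
# overall == 0.29
```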

Args:

cache_dir: The Hugging Face cache directory. Defaults to None.

text_field: The name of the text field in the input data. Defaults to "text".

filter_by: For categorical classifiers, the list of labels to filter the data by. Not supported by PromptTaskComplexityClassifier (raises NotImplementedError). Defaults to None.

max_chars: Limits the total number of characters that can be fed to the tokenizer. If None, the text is not truncated. Defaults to 2000.

sort_by_length: Whether to sort the input data by input token length. Sorting is encouraged because it improves inference throughput. Defaults to True.

model_inference_batch_size: The batch size for model inference. Defaults to 256.

autocast: Whether to use autocast, which trades a minor loss of accuracy for faster inference. Defaults to True.
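Why sort_by_length helps: sequences in a batch are padded to the length of the longest member, so batching similar-length sequences together wastes fewer padded positions. A small sketch comparing padding cost with and without sorting (the token lengths are made up):

```python
def padded_positions(lengths, batch_size):
    """Total token positions processed when each batch is padded
    to its longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

lengths = [8, 512, 16, 480, 24, 500]   # mixed short/long prompts
unsorted_cost = padded_positions(lengths, batch_size=2)
sorted_cost = padded_positions(sorted(lengths), batch_size=2)
# Sorting groups similar lengths, so far fewer padded positions are computed:
# unsorted_cost == 2984, sorted_cost == 2016
```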

Initialization

autocast: bool#

True

cache_dir: str | None#

None

decompose() -> list[nemo_curator.stages.base.ProcessingStage]#

Decompose into execution stages.

This method must be implemented by composite stages to define what low-level stages they represent.

Returns (list[ProcessingStage]): List of execution stages that will actually run

filter_by: list[str] | None#

None

inputs() -> tuple[list[str], list[str]]#

Get the inputs for this stage.

max_chars: int#

2000

model_inference_batch_size: int#

256

outputs() -> tuple[list[str], list[str]]#

Get the outputs for this stage.

sort_by_length: bool#

True

text_field: str#

'text'

class stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityModelStage(
cache_dir: str | None = None,
model_inference_batch_size: int = 256,
has_seq_order: bool = True,
autocast: bool = True,
)#

Bases: nemo_curator.stages.text.models.model.ModelStage

Stage for Hugging Face model inference.

Args:

cache_dir: The Hugging Face cache directory. Defaults to None.

model_inference_batch_size: The batch size for model inference. Defaults to 256.

has_seq_order: Whether the input data has been sorted by input token length. Sorting is encouraged because it improves inference throughput. Defaults to True.

autocast: Whether to use autocast, which trades a minor loss of accuracy for faster inference. Defaults to True.

Initialization

create_output_dataframe(
df_cpu: pandas.DataFrame,
collected_output: dict[str, numpy.ndarray],
) -> pandas.DataFrame#

outputs() -> tuple[list[str], list[str]]#

Define stage output specification.

Returns (tuple[list[str], list[str]]): Tuple of (output_attributes, output_columns) where:

- output_top_level_attributes: List of task attributes this stage adds/modifies
- output_data_attributes: List of attributes within the data that this stage adds/modifies

process_model_output(
outputs: torch.Tensor,
_: dict[str, torch.Tensor] | None = None,
) -> torch.Tensor#