nemo_curator.stages.text.classifiers.prompt_task_complexity


Module Contents

Classes

Name | Description
CustomDeberta |
MeanPooling |
MulticlassHead |
PromptTaskComplexityClassifier | Multi-headed model which classifies English text prompts across task types and complexity dimensions.
PromptTaskComplexityModelStage | Stage for Hugging Face model inference.

Data

MAX_SEQ_LENGTH

OUTPUT_FIELDS

PROMPT_TASK_COMPLEXITY_MODEL_IDENTIFIER

API

class nemo_curator.stages.text.classifiers.prompt_task_complexity.CustomDeberta(
config: dataclasses.dataclass
)

Bases: Module, PyTorchModelHubMixin

backbone = AutoModel.from_pretrained(config['base_model'])
device
divisor_map = config['divisor_map']
heads
pool = MeanPooling()
target_sizes = config['target_sizes'].values()
task_type_map = config['task_type_map']
weights_map = config['weights_map']
nemo_curator.stages.text.classifiers.prompt_task_complexity.CustomDeberta._forward(
input_ids: torch.Tensor,
attention_mask: torch.Tensor
) -> dict[str, torch.Tensor]
nemo_curator.stages.text.classifiers.prompt_task_complexity.CustomDeberta.compute_results(
preds: torch.Tensor,
target: str,
decimal: int = 4
) -> tuple[list[str], list[str], list[float]]
nemo_curator.stages.text.classifiers.prompt_task_complexity.CustomDeberta.forward(
batch: dict[str, torch.Tensor]
) -> dict[str, torch.Tensor]
nemo_curator.stages.text.classifiers.prompt_task_complexity.CustomDeberta.process_logits(
logits: list[torch.Tensor]
) -> dict[str, torch.Tensor]
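
The attributes above describe a multi-headed architecture: a shared DeBERTa backbone, a mean-pooling layer, and one classification head per target size, whose logits are then post-processed into task-type and complexity outputs. The sketch below illustrates that pattern only; the constructor arguments (hidden_size, the example target sizes) are assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn
from nemo_curator.stages.text.classifiers.prompt_task_complexity import (
    MeanPooling,
    MulticlassHead,
)

class MultiHeadSketch(nn.Module):
    """Illustrative sketch of the backbone -> pool -> heads pattern."""

    def __init__(self, backbone: nn.Module, hidden_size: int, target_sizes: list[int]):
        super().__init__()
        self.backbone = backbone  # shared transformer encoder
        self.pool = MeanPooling()  # masked mean over token embeddings
        self.heads = nn.ModuleList(
            [MulticlassHead(hidden_size, n) for n in target_sizes]
        )

    def forward(self, batch: dict[str, torch.Tensor]) -> list[torch.Tensor]:
        out = self.backbone(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        pooled = self.pool(out.last_hidden_state, batch["attention_mask"])
        # One logits tensor per head; CustomDeberta.process_logits converts these
        # into the named task-type and complexity outputs.
        return [head(pooled) for head in self.heads]
```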
class nemo_curator.stages.text.classifiers.prompt_task_complexity.MeanPooling()

Bases: Module

nemo_curator.stages.text.classifiers.prompt_task_complexity.MeanPooling.forward(
last_hidden_state: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor
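
For reference, masked mean pooling over a sequence of token embeddings is typically computed as follows. This is a generic sketch of the operation, not necessarily the module's exact implementation.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Expand the mask to (batch, seq_len, hidden) so padded tokens contribute zero.
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero for all-masked rows
    return summed / counts
```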
class nemo_curator.stages.text.classifiers.prompt_task_complexity.MulticlassHead(
input_size: int,
num_classes: int
)

Bases: Module

fc = nn.Linear(input_size, num_classes)
nemo_curator.stages.text.classifiers.prompt_task_complexity.MulticlassHead.forward(
x: torch.Tensor
) -> torch.Tensor
class nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityClassifier(
cache_dir: str | None = None,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
sort_by_length: bool = True,
model_inference_batch_size: int = 256,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)
Dataclass

Bases: CompositeStage[DocumentBatch, DocumentBatch]

PromptTaskComplexityClassifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score. Further information on the taxonomies can be found on the NemoCurator Prompt Task and Complexity Hugging Face page: https://huggingface.co/nvidia/prompt-task-and-complexity-classifier. This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

Parameters:

cache_dir
str | None, defaults to None

The Hugging Face cache directory. Defaults to None.

text_field
str, defaults to 'text'

The name of the text field in the input data. Defaults to "text".

filter_by
list[str] | None, defaults to None

For categorical classifiers, the list of labels to filter the data by. Defaults to None. Not supported with PromptTaskComplexityClassifier (raises NotImplementedError).

max_chars
int | None, defaults to None

Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.

sort_by_length
bool, defaults to True

Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.

model_inference_batch_size
int, defaults to 256

The size of the batch for model inference. Defaults to 256.

autocast
bool, defaults to True

Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.

keep_tokens
bool, defaults to False

Whether to keep the input tokens in the output dataframe. Defaults to False.

use_existing_tokens
bool, defaults to False

Whether to use existing tokens from the input dataframe. If True, the token fields are assumed to be ["input_ids", "attention_mask"] and tokenization is skipped. Defaults to False.

autocast: bool = True
cache_dir: str | None = None
filter_by: list[str] | None = None
keep_tokens: bool = False
max_chars: int | None = None
model_inference_batch_size: int = 256
sort_by_length: bool = True
text_field: str = 'text'
use_existing_tokens: bool = False
nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityClassifier.__post_init__() -> None
nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityClassifier.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityClassifier.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityClassifier.outputs() -> tuple[list[str], list[str]]
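
A minimal usage sketch of the composite stage, using only the constructor parameters and methods documented above. How the stage is attached to a pipeline and executed depends on the surrounding NeMo Curator pipeline API and is not shown here; the comment about which stages decompose() produces is an assumption based on the tokenization-related parameters.

```python
from nemo_curator.stages.text.classifiers.prompt_task_complexity import (
    PromptTaskComplexityClassifier,
)

classifier = PromptTaskComplexityClassifier(
    text_field="text",               # column containing the prompt text
    model_inference_batch_size=256,  # rows per model inference batch
    sort_by_length=True,             # sort by token length to improve inference performance
    autocast=True,                   # mixed-precision inference
)

# The composite stage expands into its underlying processing stages
# (typically a tokenizer stage followed by PromptTaskComplexityModelStage).
for stage in classifier.decompose():
    print(type(stage).__name__)

print(classifier.inputs())   # required input columns
print(classifier.outputs())  # columns added to the output (see OUTPUT_FIELDS)
```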
class nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityModelStage(
cache_dir: str | None = None,
model_inference_batch_size: int = 256,
has_seq_order: bool = True,
max_seq_length: int | None = None,
autocast: bool = True,
keep_tokens: bool = False
)

Bases: ModelStage

Stage for Hugging Face model inference.

Parameters:

cache_dir
str | None, defaults to None

The Hugging Face cache directory. Defaults to None.

model_inference_batch_size
int, defaults to 256

The size of the batch for model inference. Defaults to 256.

has_seq_order
bool, defaults to True

Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.

max_seq_length
int | None, defaults to None

If provided, input token sequences are truncated to this length before the forward pass. Defaults to None.

autocast
bool, defaults to True

Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.

keep_tokens
bool, defaults to False

Whether to keep the input tokens in the output dataframe. Defaults to False.

nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityModelStage._setup(
local_files_only: bool = True
) -> None
nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityModelStage.create_output_dataframe(
df_cpu: pandas.DataFrame,
collected_output: dict[str, numpy.ndarray]
) -> pandas.DataFrame
nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityModelStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.prompt_task_complexity.PromptTaskComplexityModelStage.process_model_output(
outputs: torch.Tensor,
_: dict[str, torch.Tensor] | None = None
) -> torch.Tensor
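
In normal use this stage is created by PromptTaskComplexityClassifier.decompose(); constructing it directly looks roughly like the sketch below, which uses only the parameters and module constant documented on this page. The argument values are examples, not required settings.

```python
from nemo_curator.stages.text.classifiers.prompt_task_complexity import (
    MAX_SEQ_LENGTH,
    PromptTaskComplexityModelStage,
)

stage = PromptTaskComplexityModelStage(
    model_inference_batch_size=256,
    has_seq_order=True,             # keep the default sequence-ordering behavior
    max_seq_length=MAX_SEQ_LENGTH,  # clip token sequences to the 512-token limit
    autocast=True,                  # mixed-precision inference
    keep_tokens=False,              # drop input token columns from the output dataframe
)
```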
nemo_curator.stages.text.classifiers.prompt_task_complexity.MAX_SEQ_LENGTH = 512
nemo_curator.stages.text.classifiers.prompt_task_complexity.OUTPUT_FIELDS = ['prompt_complexity_score', 'task_type_1', 'task_type_2', 'task_type_prob', 'cre...
nemo_curator.stages.text.classifiers.prompt_task_complexity.PROMPT_TASK_COMPLEXITY_MODEL_IDENTIFIER = 'nvidia/prompt-task-and-complexity-classifier'