morpheus.models.dfencoder.autoencoder.AutoEncoder

class AutoEncoder(*args, **kwargs)[source]

Bases: torch.nn.Module

Methods

__call__(*args, **kwargs) Call self as a function.
compute_baseline_performance(in_, out_) Computes baseline performance by scoring a strong identity-function prediction, made from a swapped (noisy) input, against the unaltered original data.
compute_loss_from_targets(num, bin, cat, ...) Computes the loss from targets.
decode_outputs_to_df(num, bin, cat) Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.
df_predict(df) Runs end-to-end model.
encode_input(df) Handles raw df inputs.
fit(training_data[, rank, world_size, ...]) Fit the model in a distributed or centralized fashion, depending on self.distributed_training, with early stopping based on validation loss.
get_anomaly_score(df) Returns a per-row loss of the input dataframe.
get_anomaly_score_losses(df) Run the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).
get_deep_stack_features(df) Records and outputs all internal representations of the input df as row-wise vectors.
get_representation(df[, layer]) Computes latent feature vector from hidden layer given input dataframe.
get_results_from_dataset(dataset, preloaded_df) Returns a pandas dataframe of inference results and losses for a given dataset.
prepare_df(df) Does data preparation on a copy of the input dataframe.
preprocess_data(df, shuffle_rows_in_batch, ...) Preprocesses a pandas dataframe df for input into the autoencoder model.
preprocess_training_data(df[, ...]) Wrapper function around self.preprocess_data that feeds in the args suitable for a training set.
preprocess_validation_data(df[, ...]) Wrapper function around self.preprocess_data that feeds in the args suitable for a validation set.
build_input_tensor
compute_loss
compute_targets
create_binary_col_max
create_categorical_col_max
create_numerical_col_max
get_anomaly_score_with_losses
get_feature_count
get_results
get_scaler
get_variable_importance
return_feature_names
scale_losses
compute_baseline_performance(in_, out_)[source]
Baseline performance is computed by generating a strong prediction for the identity function (predicting input == output) with a swapped (noisy) input, and computing the loss against the unaltered original data. This should be roughly the loss we expect when the encoder degenerates into the identity-function solution.

Returns the net loss of the baseline performance computation (the sum of all losses).
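
A minimal conceptual sketch of the idea, not the library's implementation: the "strong prediction" is the swapped input itself, scored against the unaltered original features. Plain MSE over stand-in numerical data is assumed here.

    import torch
    import torch.nn.functional as F

    # Stand-in numerical features; shapes and the row-swap noise below are
    # illustrative only.
    original = torch.randn(8, 4)
    swapped = original.clone()
    swapped[::2] = original[torch.randperm(8)][::2]  # corrupt half the rows

    # Loss incurred if the model degenerates into the identity function,
    # i.e. it simply echoes the (noisy) input back as its prediction.
    baseline_loss = F.mse_loss(swapped, original, reduction="sum")
    print(baseline_loss.item())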

compute_loss_from_targets(num, bin, cat, num_target, bin_target, cat_target, should_log=True, _id=False)[source]

Computes the loss from targets.

Parameters
num

numerical data tensor

bin

binary data tensor

cat

list of categorical data tensors

num_target

target numerical data tensor

bin_target

target binary data tensor

cat_target

list of target categorical data tensors

should_log

whether to log the loss in self.logger, by default True

_id

whether the current step is an id validation step (for logging), by default False

Returns
Tuple[Union[float, List[float]]]

A tuple containing the mean MSE and BCE losses, a list of mean CCE losses, and the mean net loss.
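
A hedged sketch of how the three loss families combine, inferred from the description above rather than taken from the library's code: MSE for numerical features, BCE for binary features, and one CCE (cross-entropy) term per categorical feature.

    import torch
    import torch.nn.functional as F

    batch = 8
    num, num_target = torch.randn(batch, 3), torch.randn(batch, 3)
    bin_out = torch.rand(batch, 2)                        # sigmoid outputs
    bin_target = torch.randint(0, 2, (batch, 2)).float()
    cat = [torch.randn(batch, 5), torch.randn(batch, 7)]  # one logit tensor per categorical feature
    cat_target = [torch.randint(0, 5, (batch,)), torch.randint(0, 7, (batch,))]

    mse = F.mse_loss(num, num_target)
    bce = F.binary_cross_entropy(bin_out, bin_target)
    cce = [F.cross_entropy(logits, tgt) for logits, tgt in zip(cat, cat_target)]
    net_loss = mse + bce + sum(cce)  # net loss across all feature types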

decode_outputs_to_df(num, bin, cat)[source]

Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.

df_predict(df)[source]

Runs the end-to-end model: interprets the output and creates a dataframe with the same shape as the input, containing the model predictions.
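
A hypothetical end-to-end sketch. The constructor kwargs and the ability to pass a pandas dataframe straight to fit are assumptions based on common dfencoder usage, not guarantees about this exact version.

    import pandas as pd
    from morpheus.models.dfencoder.autoencoder import AutoEncoder

    df = pd.DataFrame({
        "bytes_sent": [1024.0, 20.5, 3072.1, 18.2],
        "is_admin": [True, False, False, True],
        "protocol": ["tcp", "udp", "tcp", "icmp"],
    })

    # Constructor kwargs below are illustrative and may differ by version.
    model = AutoEncoder(encoder_layers=[16], decoder_layers=[16], progress_bar=False)
    model.fit(df, epochs=2)  # assumes centralized training accepts a dataframe

    predictions = model.df_predict(df)  # same shape as the input dataframe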

encode_input(df)[source]

Handles raw df inputs. Passes categories through embedding layers.

fit(training_data, rank=0, world_size=1, epochs=1, validation_data=None, run_validation=False, use_val_for_loss_stats=False)[source]

Fit the model in a distributed or centralized fashion, depending on self.distributed_training, with early stopping based on validation loss. If run_validation is True, validation_data will be used for validation during training, and early stopping will be applied based on the patience argument.

Parameters
training_data

data object of training data

rank

the rank of the current process

world_size

the total number of processes

epochs

the number of epochs to train for, by default 1

validation_data

the validation data object (with __iter__() that yields a batch at a time), by default None

run_validation

whether to perform validation during training, by default False

use_val_for_loss_stats

whether to populate loss stats in the main process (rank 0) for z-score calculation using the validation set. If False, loss stats are populated using the train_dataloader, which can be slow for large datasets. By default False, but using the validation set to populate loss stats is strongly recommended, for both efficiency and model efficacy.

Raises
ValueError

If run_validation or use_val_for_loss_stats is True but validation_data is not provided.
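
A sketch of a centralized fit with validation, reusing the model and dataframe from the df_predict example above. Note the docs describe validation_data as an iterable yielding batches; passing a raw dataframe split, as sketched here, is an assumption that may not hold in every version.

    train_df, val_df = df.iloc[:3], df.iloc[3:]

    # Early stopping and the z-score loss stats both come from the
    # validation set; use_val_for_loss_stats=True is the recommended path.
    model.fit(
        train_df,
        epochs=5,
        validation_data=val_df,
        run_validation=True,
        use_val_for_loss_stats=True,
    )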

get_anomaly_score(df)[source]

Returns a per-row loss of the input dataframe. Does not corrupt inputs.

get_anomaly_score_losses(df)[source]

Run the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).
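
A short scoring sketch using the fitted model from the earlier example; neither call corrupts the input.

    scores = model.get_anomaly_score(df)         # one loss value per row
    losses = model.get_anomaly_score_losses(df)  # recovery losses by feature type
    print(scores[:5])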

get_deep_stack_features(df)[source]

Records and outputs all internal representations of the input df as row-wise vectors. The output is a 2-d array with len() == len(df) (see the example under get_representation below).

get_representation(df, layer=0)[source]

Computes latent feature vector from hidden layer given input dataframe.

The layer argument (int) specifies which layer to get: by default (layer=0), the method returns the “encoding” layer; layer < 0 counts layers back from the encoding layer, and layer > 0 counts layers forward from it.
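
A sketch of extracting representations with the same fitted model; the layer indexing follows the description above.

    z = model.get_representation(df)                 # layer=0: the encoding layer
    z_back = model.get_representation(df, layer=-1)  # one layer before the encoding
    stack = model.get_deep_stack_features(df)        # 2-d array, len(stack) == len(df)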

get_results_from_dataset(dataset, preloaded_df, return_abs=False)[source]

Returns a pandas dataframe of inference results and losses for a given dataset. Note: this function requires the whole inference set to be loaded into memory as a pandas dataframe.

Parameters
dataset

dataset for inference

preloaded_df

a pandas dataframe that contains the original data

return_abs

whether the absolute value of the loss scalers should be returned, by default False

Returns
pd.DataFrame

inference result with losses of each feature

prepare_df(df)[source]

Does data preparation on a copy of the input dataframe.

Parameters
df

The pandas dataframe to process

Returns
pandas.DataFrame

A processed copy of df.

preprocess_data(df, shuffle_rows_in_batch, include_original_input_tensor, include_swapped_input_by_feature_type)[source]

Preprocesses a pandas dataframe df for input into the autoencoder model.

Parameters
df

The input dataframe to preprocess.

shuffle_rows_in_batch

Whether to shuffle the rows of the dataframe before processing.

include_original_input_tensor

Whether to process the df into an input tensor without swapping and include it in the returned data dict. Note: training requires only the swapped input tensor, while validation can use both.

include_swapped_input_by_feature_type

Whether to process the swapped df into num/bin/cat feature tensors and include them in the returned data dict. This is useful for baseline performance evaluation during validation.

Returns
Dict[str, Union[int, torch.Tensor]]

A dict containing the preprocessed input data and targets by feature type.

preprocess_training_data(df, shuffle_rows_in_batch=True)[source]

Wrapper function around self.preprocess_data that feeds in the args suitable for a training set.

preprocess_validation_data(df, shuffle_rows_in_batch=False)[source]

Wrapper function around self.preprocess_data that feeds in the args suitable for a validation set.
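
A sketch of the two preprocessing wrappers on the fitted model; the exact keys of the returned dict are version-specific and not assumed here.

    train_batch = model.preprocess_training_data(df)  # rows shuffled by default
    val_batch = model.preprocess_validation_data(df)  # keeps row order by default
    print(sorted(train_batch.keys()))                 # inspect the returned data dict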
