morpheus.models.dfencoder.autoencoder.AutoEncoder

class AutoEncoder(*args, **kwargs)[source]

Bases: torch.nn.Module

Methods

__call__(*args, **kwargs)

Call self as a function.

compute_baseline_performance(in_, out_)

Computes a baseline loss for the degenerate identity-function solution.

compute_loss_from_targets(num, bin, cat, ...)

Computes the loss from targets.

decode_outputs_to_df(num, bin, cat)

Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.

df_predict(df)

Runs the model end-to-end.

encode_input(df)

Handles raw df inputs.

fit(train_data[, epochs, val_data, ...])

Does training in the specified mode (indicated by self.distributed_training).

get_anomaly_score(df)

Returns a per-row loss of the input dataframe.

get_anomaly_score_losses(df)

Run the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).

get_deep_stack_features(df)

Records and outputs all internal representations of the input df as row-wise vectors.

get_representation(df[, layer])

Computes latent feature vector from hidden layer given input dataframe.

get_results_from_dataset(dataset, preloaded_df)

Returns a pandas dataframe of inference results and losses for a given dataset.

prepare_df(df)

Does data preparation on a copy of the input dataframe.

preprocess_data(df, shuffle_rows_in_batch, ...)

Preprocesses a pandas dataframe df for input into the autoencoder model.

preprocess_train_data(df[, ...])

Wrapper function around self.preprocess_data, feeding in the args suitable for a training set.

preprocess_validation_data(df[, ...])

Wrapper function around self.preprocess_data, feeding in the args suitable for a validation set.

train_epoch(n_updates, input_df, df[, pbar])

Run a regular epoch.

train_megabatch_epoch(n_updates, df)

Run an epoch doing 'megabatch' updates, preprocessing data in large chunks.

build_input_tensor

compute_loss

compute_targets

create_binary_col_max

create_categorical_col_max

create_numerical_col_max

do_backward

get_anomaly_score_with_losses

get_results

get_scaler

get_variable_importance

return_feature_names

scale_losses
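
Taken together, a typical end-to-end use of the class in centralized mode looks like the following minimal sketch. The tiny dataframes are purely illustrative, and the constructor is called with defaults; real options would be passed as keyword arguments per the class signature above.

    import pandas as pd
    from morpheus.models.dfencoder.autoencoder import AutoEncoder

    # Illustrative mixed-type data: one numerical and one categorical column.
    train_df = pd.DataFrame({"bytes": [100, 250, 90], "proto": ["tcp", "udp", "tcp"]})
    new_df = pd.DataFrame({"bytes": [5000], "proto": ["udp"]})

    model = AutoEncoder()  # constructor kwargs omitted; see the class signature above
    model.fit(train_df, epochs=5)  # centralized mode accepts a pandas dataframe
    scores = model.get_anomaly_score(new_df)  # per-row reconstruction loss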

compute_baseline_performance(in_, out_)[source]

Baseline performance is computed by generating a strong prediction for the identity function (predicting input==output) with a swapped (noisy) input, and computing the loss against the unaltered original data. This should be roughly the loss we expect when the encoder degenerates into the identity-function solution.

Returns the net loss of the baseline performance computation (the sum of all losses).

compute_loss_from_targets(num, bin, cat, num_target, bin_target, cat_target, should_log=True, _id=False)[source]

Computes the loss from targets.

Parameters:
num : torch.Tensor

numerical data tensor

bin : torch.Tensor

binary data tensor

cat : List[torch.Tensor]

list of categorical data tensors

num_target : torch.Tensor

target numerical data tensor

bin_target : torch.Tensor

target binary data tensor

cat_target : List[torch.Tensor]

list of target categorical data tensors

should_log : bool, optional

whether to log the loss in self.logger, by default True

_id : bool, optional

whether the current step is an id validation step (for logging), by default False

Returns:
Tuple[Union[float, List[float]]]

A tuple containing the mean mse/bce losses, list of mean cce losses, and mean net loss

decode_outputs_to_df(num, bin, cat)[source]

Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.

df_predict(df)[source]

Runs the model end-to-end: interprets the output and creates a dataframe with the same shape as the input, containing the model's predictions.
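
For instance, as a sketch (model and df follow the conventions of the sketch at the top of this page):

    preds = model.df_predict(df)
    assert preds.shape == df.shape  # predictions mirror the input layout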

encode_input(df)[source]

Handles raw df inputs. Passes categories through embedding layers.

fit(train_data, epochs=1, val_data=None, run_validation=False, use_val_for_loss_stats=False, rank=None, world_size=None)[source]

Does training in the specified mode (indicated by self.distributed_training).

Parameters:
train_data : pandas.DataFrame (centralized) or torch.utils.data.DataLoader (distributed)

Data for training.

epochs : int, optional

Number of epochs to run training, by default 1.

val_data : pandas.DataFrame (centralized) or torch.utils.data.DataLoader (distributed), optional

Data for validation and computing loss stats, by default None.

run_validation : bool, optional

Whether to collect validation loss for each epoch during training, by default False.

use_val_for_loss_stats : bool, optional

Whether to use the validation set for loss statistics collection (for z-score calculation), by default False.

rank : int, optional

The rank of the current process, by default None. Required for distributed training.

world_size : int, optional

The total number of processes, by default None. Required for distributed training.

Raises:
TypeError

If train_data is not a pandas dataframe in centralized training mode.

ValueError

If rank and world_size are not provided in distributed training mode.

TypeError

If train_data is not a pandas dataframe or a torch.utils.data.DataLoader or a torch.utils.data.Dataset in distributed training mode.
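
A sketch of both call patterns, assuming train_df and val_df are pandas dataframes (the distributed setup that would produce train_loader, rank, and world_size is elided):

    # Centralized mode: pandas dataframes, with per-epoch validation.
    model.fit(
        train_df,
        epochs=10,
        val_data=val_df,
        run_validation=True,
        use_val_for_loss_stats=True,
    )

    # Distributed mode (sketch): rank and world_size are required, and
    # train_data is a DataLoader/Dataset rather than a dataframe.
    # model.fit(train_loader, epochs=10, rank=rank, world_size=world_size)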

get_anomaly_score(df)[source]

Returns a per-row loss of the input dataframe. Does not corrupt inputs.
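
A common follow-up, sketched under the assumption that the returned scores align one-to-one with the rows of df; the threshold of 3.0 is an arbitrary example, not a library default:

    import numpy as np

    scores = np.asarray(model.get_anomaly_score(df))  # per-row loss
    z = (scores - scores.mean()) / scores.std()
    flagged = df[z > 3.0]  # rows whose loss is an outlier under this example threshold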

get_anomaly_score_losses(df)[source]

Run the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).

get_deep_stack_features(df)[source]

Records and outputs all internal representations of the input df as row-wise vectors. The output is a 2-D array with len() == len(df).
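
A sketch of the contract:

    feats = model.get_deep_stack_features(df)  # stacked hidden activations, one row per input row
    assert len(feats) == len(df)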

get_representation(df, layer=0)[source]

Computes latent feature vector from hidden layer given input dataframe.

The argument layer (int) specifies which layer to get. By default (layer=0), this returns the “encoding” layer. layer < 0 counts layers back from the encoding layer; layer > 0 counts layers forward from it.
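
For example (a sketch; the nonzero indices assume the network has layers on both sides of the encoding layer):

    z = model.get_representation(df)              # the encoding layer (layer=0)
    pre = model.get_representation(df, layer=-1)  # one layer before the encoding
    post = model.get_representation(df, layer=1)  # one layer after the encoding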

get_results_from_dataset(dataset, preloaded_df, return_abs=False)[source]

Returns a pandas dataframe of inference results and losses for a given dataset. Note: this function requires the whole inference set to be loaded into memory as a pandas dataframe.

Parameters:
dataset : torch.utils.data.Dataset

dataset for inference

preloaded_df : pd.DataFrame

a pandas dataframe that contains the original data

return_abs : bool, optional

whether the absolute value of the loss scalers should be returned, by default False

Returns:
pd.DataFrame

inference result with losses of each feature
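
As a sketch (construction of the dataset is elided; ds is assumed to be a torch.utils.data.Dataset covering the same rows held in df):

    results = model.get_results_from_dataset(ds, preloaded_df=df, return_abs=True)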

prepare_df(df)[source]

Does data preparation on a copy of the input dataframe.

Parameters:
df : pandas.DataFrame

The pandas dataframe to process

Returns:
pandas.DataFrame

A processed copy of df.

preprocess_data(df, shuffle_rows_in_batch, include_original_input_tensor, include_swapped_input_by_feature_type)[source]

Preprocesses a pandas dataframe df for input into the autoencoder model.

Parameters:
df : pandas.DataFrame

The input dataframe to preprocess.

shuffle_rows_in_batch : bool

Whether to shuffle the rows of the dataframe before processing.

include_original_input_tensor : bool

Whether to process the df into an input tensor without swapping and include it in the returned data dict. Note: training requires only the swapped input tensor, while validation can use both.

include_swapped_input_by_feature_type : bool

Whether to process the swapped df into num/bin/cat feature tensors and include them in the returned data dict. This is useful for baseline performance evaluation during validation.

Returns:
Dict[str, Union[int, torch.Tensor]]

A dict containing the preprocessed input data and targets by feature type.

preprocess_train_data(df, shuffle_rows_in_batch=True)[source]

Wrapper function around self.preprocess_data, feeding in the args suitable for a training set.

preprocess_validation_data(df, shuffle_rows_in_batch=False)[source]

Wrapper function around self.preprocess_data, feeding in the args suitable for a validation set.
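
A sketch of calling the two wrappers directly; each returns the dict described under preprocess_data above, with keys depending on the model's configured features:

    train_batch = model.preprocess_train_data(train_df)    # shuffles rows by default
    val_batch = model.preprocess_validation_data(val_df)   # no shuffling by default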

train_epoch(n_updates, input_df, df, pbar=None)[source]

Run a regular epoch.

train_megabatch_epoch(n_updates, df)[source]

Run an epoch doing ‘megabatch’ updates, preprocessing data in large chunks.
