morpheus.models.dfencoder

class AEModule(*args, **kwargs)[source]

Auto Encoder Pytorch Module.

Methods

__call__(*args, **kwargs)

Call self as a function.

build(numeric_fts, binary_fts, categorical_fts)

Constructs the autoencoder model.

decode(x[, layers])

Decodes the input using the decoder layers and computes the outputs.

encode(x[, layers])

Encodes the input using the encoder layers.

forward(input)

Passes the input through the model and returns the outputs.

build(numeric_fts, binary_fts, categorical_fts)[source]

Constructs the autoencoder model.

Parameters:
numeric_fts : List[str]

The names of the numeric features.

binary_fts : List[str]

The names of the binary features.

categorical_fts : Dict[str, Dict[str, List[str]]]

The dictionary mapping categorical feature names to dictionaries containing the categories of the feature.
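For illustration, a minimal sketch of the expected argument shapes. The feature names, category values, and the inner "cats" key are hypothetical, chosen to match the type hints above:

    numeric_fts = ["duration", "bytes_sent"]  # hypothetical numeric columns
    binary_fts = ["is_admin"]                 # hypothetical binary column
    categorical_fts = {                       # feature name -> dict of its categories
        "protocol": {"cats": ["tcp", "udp", "icmp"]},  # "cats" key is an assumption
    }
    # `module` is an AEModule instance built elsewhere.
    module.build(numeric_fts, binary_fts, categorical_fts)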

decode(x, layers=None)[source]

Decodes the input using the decoder layers and computes the outputs.

Parameters:
x : torch.Tensor

The encoded input tensor to decode.

layers : int, optional

The number of layers to use for decoding. Defaults to None, in which case all decoder layers are used.

Returns:
tuple of Union[torch.Tensor, List[torch.Tensor]]

A tuple containing the numeric (Tensor), binary (Tensor), and categorical outputs (List[torch.Tensor]) of the model.

encode(x, layers=None)[source]

Encodes the input using the encoder layers.

Parameters:
x : torch.Tensor

The input tensor to encode.

layers : int, optional

The number of layers to use for encoding. Defaults to None, in which case all encoder layers are used.

Returns:
torch.Tensor

The encoded output tensor.

forward(input)[source]

Passes the input through the model and returns the outputs.

Parameters:
input : torch.Tensor

The input tensor.

Returns:
tuple of Union[torch.Tensor, List[torch.Tensor]]

A tuple containing the numeric (Tensor), binary (Tensor), and categorical outputs (List[torch.Tensor]) of the model.
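A hedged sketch of the encode/decode round trip, assuming a built AEModule `module` and a preprocessed input batch `x` (a torch.Tensor):

    z = module.encode(x)                          # latent tensor from the full encoder stack
    num_out, bin_out, cat_out = module.decode(z)  # (Tensor, Tensor, List[Tensor])
    # forward() runs the same encode/decode pass end to end:
    num_out, bin_out, cat_out = module(x)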

class AutoEncoder(*args, **kwargs)[source]

Methods

__call__(*args, **kwargs)

Call self as a function.

compute_baseline_performance(in_, out_)

Computes a baseline loss from a strong identity-function prediction on a swapped (noisy) input.

compute_loss_from_targets(num, bin, cat, ...)

Computes the loss from targets.

decode_outputs_to_df(num, bin, cat)

Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.

df_predict(df)

Runs the end-to-end model.

encode_input(df)

Handles raw df inputs.

fit(train_data[, epochs, val_data, ...])

Runs training in the specified mode (indicated by self.distributed_training).

get_anomaly_score(df)

Returns a per-row loss of the input dataframe.

get_anomaly_score_losses(df)

Runs the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).

get_deep_stack_features(df)

Records and outputs all internal representations of the input df as row-wise vectors.

get_representation(df[, layer])

Computes latent feature vector from hidden layer given input dataframe.

get_results_from_dataset(dataset, preloaded_df)

Returns a pandas dataframe of inference results and losses for a given dataset.

prepare_df(df)

Performs data preparation on a copy of the input dataframe.

preprocess_data(df, shuffle_rows_in_batch, ...)

Preprocesses a pandas dataframe df for input into the autoencoder model.

preprocess_train_data(df[, ...])

Wrapper around self.preprocess_data that feeds in the arguments suitable for a training set.

preprocess_validation_data(df[, ...])

Wrapper around self.preprocess_data that feeds in the arguments suitable for a validation set.

train_epoch(n_updates, input_df, df[, pbar])

Runs a regular epoch.

train_megabatch_epoch(n_updates, df)

Runs an epoch of 'megabatch' updates, preprocessing data in large chunks.

build_input_tensor

compute_loss

compute_targets

create_binary_col_max

create_categorical_col_max

create_numerical_col_max

do_backward

get_anomaly_score_with_losses

get_results

get_scaler

get_variable_importance

return_feature_names

scale_losses

build_input_tensor(df)[source]

compute_baseline_performance(in_, out_)[source]

Baseline performance is computed by generating a strong prediction for the identity function (predicting input == output) with a swapped (noisy) input, and computing the loss against the unaltered original data. This should be roughly the loss we expect when the encoder degenerates into the identity-function solution.

Returns the net loss of the baseline performance computation (the sum of all losses).

compute_loss(num, bin, cat, target_df, should_log=True, _id=False)[source]

compute_loss_from_targets(num, bin, cat, num_target, bin_target, cat_target, should_log=True, _id=False)[source]

Computes the loss from targets.

Parameters:
num : torch.Tensor

Numerical data tensor.

bin : torch.Tensor

Binary data tensor.

cat : List[torch.Tensor]

List of categorical data tensors.

num_target : torch.Tensor

Target numerical data tensor.

bin_target : torch.Tensor

Target binary data tensor.

cat_target : List[torch.Tensor]

List of target categorical data tensors.

should_log : bool, optional

Whether to log the loss in self.logger, by default True.

_id : bool, optional

Whether the current step is an ID validation step (for logging), by default False.

Returns:
Tuple[Union[float, List[float]]]

A tuple containing the mean MSE/BCE losses, the list of mean CCE losses, and the mean net loss.

compute_targets(df)[source]

create_binary_col_max(bin_names, bce_loss)[source]

create_categorical_col_max(cat_names, cce_loss)[source]

create_numerical_col_max(num_names, mse_loss)[source]

decode_outputs_to_df(num, bin, cat)[source]

Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.

df_predict(df)[source]

Runs the end-to-end model. Interprets the output and creates a dataframe of model predictions with the same shape as the input.

do_backward(mse, bce, cce)[source]

encode_input(df)[source]

Handles raw dataframe inputs and passes categories through embedding layers.

fit(train_data, epochs=1, val_data=None, run_validation=False, use_val_for_loss_stats=False, rank=None, world_size=None)[source]

Runs training in the specified mode (indicated by self.distributed_training).

Parameters:
train_data : pandas.DataFrame (centralized) or torch.utils.data.DataLoader (distributed)

Data for training.

epochs : int, optional

Number of epochs to run training, by default 1.

val_data : pandas.DataFrame (centralized) or torch.utils.data.DataLoader (distributed), optional

Data for validation and computing loss stats, by default None.

run_validation : bool, optional

Whether to collect validation loss for each epoch during training, by default False.

use_val_for_loss_stats : bool, optional

Whether to use the validation set for loss statistics collection (for z-score calculation), by default False.

rank : int, optional

The rank of the current process, by default None. Required for distributed training.

world_size : int, optional

The total number of processes, by default None. Required for distributed training.

Raises:
TypeError

If train_data is not a pandas dataframe in centralized training mode.

ValueError

If rank and world_size are not provided in distributed training mode.

TypeError

If train_data is not a pandas dataframe, a torch.utils.data.DataLoader, or a torch.utils.data.Dataset in distributed training mode.
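A hedged end-to-end sketch of centralized training, using only methods documented on this page; the CSV paths are hypothetical, and the AutoEncoder is assumed to accept default construction (real constructor kwargs vary):

    import pandas as pd
    from morpheus.models.dfencoder import AutoEncoder

    train_df = pd.read_csv("train.csv")  # hypothetical training data
    val_df = pd.read_csv("val.csv")      # hypothetical validation data

    model = AutoEncoder()  # assumption: defaults are sufficient
    model.fit(train_df, epochs=5, val_data=val_df,
              run_validation=True, use_val_for_loss_stats=True)

    scores = model.get_anomaly_score(val_df)  # per-row loss
    preds = model.df_predict(val_df)          # predictions, same shape as val_df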

get_anomaly_score(df)[source]

Returns a per-row loss of the input dataframe. Does not corrupt inputs.

get_anomaly_score_losses(df)[source]

Runs the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).

get_anomaly_score_with_losses(df)[source]

get_deep_stack_features(df)[source]

Records and outputs all internal representations of the input df as row-wise vectors. The output is a 2-D array whose length equals len(df).

get_representation(df, layer=0)[source]

Computes latent feature vector from hidden layer given input dataframe.

The layer argument (int) specifies which layer to get. By default (layer=0), the “encoding” layer is returned. layer < 0 counts layers back from the encoding layer; layer > 0 counts layers forward from the encoding layer.
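A short sketch of the layer argument, assuming a trained AutoEncoder `model` and an input dataframe `df`:

    z = model.get_representation(df)                 # layer=0: the encoding layer
    z_prev = model.get_representation(df, layer=-1)  # one layer before the encoding layer
    z_next = model.get_representation(df, layer=1)   # one layer after the encoding layer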

get_results(df, return_abs=False)[source]

get_results_from_dataset(dataset, preloaded_df, return_abs=False)[source]

Returns a pandas dataframe of inference results and losses for a given dataset. Note: this function requires the whole inference set to be loaded into memory as a pandas dataframe.

Parameters:
dataset : torch.utils.data.Dataset

Dataset for inference.

preloaded_df : pd.DataFrame

A pandas dataframe that contains the original data.

return_abs : bool, optional

Whether the absolute value of the loss scalers should be returned, by default False.

Returns:
pd.DataFrame

Inference results with the losses of each feature.

get_scaler(name)[source]

get_variable_importance(num_names, cat_names, bin_names, mse_loss, bce_loss, cce_loss, cloudtrail_df)[source]

prepare_df(df)[source]

Performs data preparation on a copy of the input dataframe.

Parameters:
df : pandas.DataFrame

The pandas dataframe to process.

Returns:
pandas.DataFrame

A processed copy of df.

preprocess_data(df, shuffle_rows_in_batch, include_original_input_tensor, include_swapped_input_by_feature_type)[source]

Preprocesses a pandas dataframe df for input into the autoencoder model.

Parameters:
df : pandas.DataFrame

The input dataframe to preprocess.

shuffle_rows_in_batch : bool

Whether to shuffle the rows of the dataframe before processing.

include_original_input_tensor : bool

Whether to process the df into an input tensor without swapping and include it in the returned data dict. Note: training requires only the swapped input tensor, while validation can use both.

include_swapped_input_by_feature_type : bool

Whether to process the swapped df into num/bin/cat feature tensors and include them in the returned data dict. This is useful for baseline performance evaluation during validation.

Returns:
Dict[str, Union[int, torch.Tensor]]

A dict containing the preprocessed input data and targets by feature type.
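A hedged sketch of a validation-style call that keeps both the original and the swapped inputs (mirroring what preprocess_validation_data is described as doing below):

    data = model.preprocess_data(
        df,
        shuffle_rows_in_batch=False,
        include_original_input_tensor=True,
        include_swapped_input_by_feature_type=True,
    )
    # `data` is a dict of preprocessed inputs and targets keyed by feature type.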

preprocess_train_data(df, shuffle_rows_in_batch=True)[source]

Wrapper around self.preprocess_data that feeds in the arguments suitable for a training set.

preprocess_validation_data(df, shuffle_rows_in_batch=False)[source]

Wrapper around self.preprocess_data that feeds in the arguments suitable for a validation set.

return_feature_names()[source]

scale_losses(mse, bce, cce)[source]

train_epoch(n_updates, input_df, df, pbar=None)[source]

Runs a regular epoch.

train_megabatch_epoch(n_updates, df)[source]

Runs an epoch of ‘megabatch’ updates, preprocessing data in large chunks.

class BasicLogger(fts, baseline_loss=0.0)[source]

A minimal class for logging training progress.

Methods

end_epoch

id_val_step

training_step

val_step

end_epoch()[source]

id_val_step(losses)[source]

training_step(losses)[source]

val_step(losses)[source]

class CompleteLayer(*args, **kwargs)[source]

Implements a layer with a linear transformation and optional activation and dropout.

Methods

__call__(*args, **kwargs)

Call self as a function.

forward(x)

Performs a forward pass through the CompleteLayer object.

interpret_activation([act])

Interprets the name of the activation function and returns the appropriate PyTorch function.

forward(x)[source]

Performs a forward pass through the CompleteLayer object.

Parameters:
x : torch.Tensor

The input tensor to the CompleteLayer object.

Returns:
torch.Tensor

The output tensor of the CompleteLayer object after processing the input through all layers.

interpret_activation(act=None)[source]

Interprets the name of the activation function and returns the appropriate PyTorch function.

Parameters:
act : str, optional

The name of the activation function to interpret. Defaults to None if no activation function is desired.

Returns:
PyTorch function

The PyTorch activation function that corresponds to the given name.

Raises:
Exception

If the activation function name is not recognized.
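A minimal sketch, assuming a constructed CompleteLayer `layer` and that "relu" is among the recognized activation names (an assumption; the supported set is not documented here):

    import torch

    # "relu" is assumed to be a recognized name.
    relu = layer.interpret_activation("relu")
    print(relu(torch.tensor([-1.0, 2.0])))  # tensor([0., 2.])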

class DFEncoderDataLoader(*args, **kwargs)[source]

Methods

__call__(*args, **kwargs)

Call self as a function.

get_distributed_training_dataloader_from_dataset(...)

Returns a distributed training DataLoader given a dataset and other arguments.

get_distributed_training_dataloader_from_df(...)

A helper function to get a distributed training DataLoader given a pandas dataframe.

get_distributed_training_dataloader_from_path(...)

A helper function to get a distributed training DataLoader given a path to a folder containing data.

static get_distributed_training_dataloader_from_dataset(dataset, rank, world_size, pin_memory=False, num_workers=0)[source]

Returns a distributed training DataLoader given a dataset and other arguments.

Parameters:
dataset : Dataset

The dataset to load the data from.

rank : int

The rank of the current process.

world_size : int

The number of processes to distribute the data across.

pin_memory : bool, optional

Whether to pin memory when loading data, by default False.

num_workers : int, optional

The number of worker processes to use for loading data, by default 0.

Returns:
DataLoader

The training DataLoader with DistributedSampler for distributed training.

static get_distributed_training_dataloader_from_df(model, df, rank, world_size, pin_memory=False, num_workers=0)[source]

A helper function to get a distributed training DataLoader given a pandas dataframe.

Parameters:
model : AutoEncoder

The autoencoder model used to get relevant params and the preprocessing func.

df : pandas.DataFrame

The pandas dataframe containing the data.

rank : int

The rank of the current process.

world_size : int

The number of processes to distribute the data across.

pin_memory : bool, optional

Whether to pin memory when loading data, by default False.

num_workers : int, optional

The number of worker processes to use for loading data, by default 0.

Returns:
DFEncoderDataLoader

The training DataLoader with DistributedSampler for distributed training.

static get_distributed_training_dataloader_from_path(model, data_folder, rank, world_size, load_data_fn=pandas.read_csv, pin_memory=False, num_workers=0)[source]

A helper function to get a distributed training DataLoader given a path to a folder containing data.

Parameters:
model : AutoEncoder

The autoencoder model used to get relevant params and the preprocessing func.

data_folder : str

The path to the folder containing the data.

rank : int

The rank of the current process.

world_size : int

The number of processes to distribute the data across.

load_data_fn : function, optional

A function for loading data from a provided file path into a pandas.DataFrame, by default pd.read_csv.

pin_memory : bool, optional

Whether to pin memory when loading data, by default False.

num_workers : int, optional

The number of worker processes to use for loading data, by default 0.

Returns:
DFEncoderDataLoader

The training DataLoader with DistributedSampler for distributed training.
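A hedged sketch of one process in a distributed run. It assumes torch.distributed has already been initialized elsewhere, and that `model` and `train_df` are defined as in the earlier training sketch:

    rank, world_size = 0, 2  # hypothetical values for a two-process job

    loader = DFEncoderDataLoader.get_distributed_training_dataloader_from_df(
        model, train_df, rank, world_size, pin_memory=True, num_workers=2)

    # In distributed mode, fit() accepts a DataLoader and requires rank/world_size.
    model.fit(loader, epochs=5, rank=rank, world_size=world_size)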

class DatasetFromDataframe(*args, **kwargs)[source]
Attributes:
num_samples

Returns the number of samples in the dataset.

Methods

__call__(*args, **kwargs)

Call self as a function.

convert_to_validation(model)

Converts the dataset to validation mode by resetting instance variables.

get_train_dataset(model, df)

A helper function to get a train dataset with the provided parameters.

get_validation_dataset(model, df)

A helper function to get a validation dataset with the provided parameters.

convert_to_validation(model)[source]

Converts the dataset to validation mode by resetting instance variables.

Parameters:
model : AutoEncoder

The autoencoder model used to get relevant params and the preprocessing func.

static get_train_dataset(model, df)[source]

A helper function to get a train dataset with the provided parameters.

Parameters:
model : AutoEncoder

The autoencoder model used to get relevant params and the preprocessing func.

df : pandas.DataFrame

Input dataframe used for the dataset.

Returns:
DatasetFromDataframe

Training Dataset set up to load from the dataframe.

static get_validation_dataset(model, df)[source]

A helper function to get a validation dataset with the provided parameters.

Parameters:
model : AutoEncoder

The autoencoder model used to get relevant params and the preprocessing func.

df : pandas.DataFrame

Input dataframe used for the dataset.

Returns:
DatasetFromDataframe

Validation Dataset set up to load from the dataframe.

property num_samples

Returns the number of samples in the dataset.

class DatasetFromPath(*args, **kwargs)[source]

A dataset class that reads data in batches from a folder and applies preprocessing to each batch. This class assumes the data is saved as small CSV files in a single folder.

Attributes:
num_samples

Returns the number of samples in the dataset.

Methods

__call__(*args, **kwargs)

Call self as a function.

convert_to_validation(model)

Converts the dataset to validation mode by resetting instance variables.

get_preloaded_data()

Loads all data from the files into memory and returns it as a pandas.DataFrame.

get_train_dataset(model, data_folder[, ...])

A helper function to get a train dataset with the provided parameters.

get_validation_dataset(model, data_folder[, ...])

A helper function to get a validation dataset with the provided parameters.

convert_to_validation(model)[source]

Converts the dataset to validation mode by resetting instance variables.

Parameters:
model : AutoEncoder

The autoencoder model used to get relevant params and the preprocessing func.

get_preloaded_data()[source]

Loads all data from the files into memory and returns it as a pandas.DataFrame.

static get_train_dataset(model, data_folder, load_data_fn=pandas.read_csv, preload_data_into_memory=False)[source]

A helper function to get a train dataset with the provided parameters.

Parameters:
model : AutoEncoder

The autoencoder model used to get relevant params and the preprocessing func.

data_folder : str

The path to the folder containing the data.

load_data_fn : function, optional

A function for loading data from a provided file path into a pandas.DataFrame, by default pd.read_csv.

preload_data_into_memory : bool, optional

Whether to preload all the data into memory, by default False.

Returns:
DatasetFromPath

Training Dataset set up to load from the path.

static get_validation_dataset(model, data_folder, load_data_fn=pandas.read_csv, preload_data_into_memory=True)[source]

A helper function to get a validation dataset with the provided parameters.

Parameters:
model : AutoEncoder

The autoencoder model used to get relevant params and the preprocessing func.

data_folder : str

The path to the folder containing the data.

load_data_fn : function, optional

A function for loading data from a provided file path into a pandas.DataFrame, by default pd.read_csv.

preload_data_into_memory : bool, optional

Whether to preload all the data into memory, by default True (preloading can speed up data loading if the data fits into memory).

Returns:
DatasetFromPath

Validation Dataset set up to load from the path.

property num_samples

Returns the number of samples in the dataset.
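A hedged sketch tying a path-backed dataset to inference; it assumes `model` is a trained AutoEncoder and the folder paths are hypothetical:

    train_ds = DatasetFromPath.get_train_dataset(model, "data/train/")
    val_ds = DatasetFromPath.get_validation_dataset(model, "data/val/",
                                                    preload_data_into_memory=True)

    val_df = val_ds.get_preloaded_data()  # full validation set as a pandas DataFrame
    results = model.get_results_from_dataset(val_ds, val_df)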

class DistributedAutoEncoder(*args, **kwargs)[source]

Methods

__call__(*args, **kwargs)

Call self as a function.

class EncoderDataFrame(*args, **kwargs)[source]

Methods

__call__(*args, **kwargs)

Call self as a function.

swap([likelihood])

Performs random swapping of data.

swap(likelihood=0.15)[source]

Performs random swapping of data.

Parameters:
likelihood : float, optional

The probability of a value being randomly replaced with a value from a different row. By default 0.15.

Returns:
pandas.DataFrame

A copy of the dataframe of equal size with randomly swapped values.
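A brief sketch, assuming EncoderDataFrame can be constructed from an ordinary pandas dataframe (it is used here only to call swap):

    edf = EncoderDataFrame(df)          # df is an ordinary pandas DataFrame
    noisy = edf.swap(likelihood=0.15)   # ~15% of values replaced from other rows
    assert noisy.shape == df.shape      # same shape, corrupted values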

class GaussRankScaler[source]

So-called “Gauss Rank” scaling. Forces the data toward a Gaussian distribution and uses bins to perform the inverse mapping.

Built on sklearn's QuantileTransformer.

Methods

fit

fit_transform

inverse_transform

transform

fit(x)[source]

fit_transform(x)[source]

inverse_transform(x)[source]

transform(x)[source]
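A small usage sketch on synthetic skewed data; the exponential sample is illustrative only:

    import numpy as np

    scaler = GaussRankScaler()
    x = np.random.exponential(size=1000)  # skewed input
    z = scaler.fit_transform(x)           # approximately Gaussian output
    x_back = scaler.inverse_transform(z)  # approximate reconstruction via bins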

class IpynbLogger(*args, **kwargs)[source]

Plots logging data in a Jupyter notebook.

Methods

end_epoch

id_val_step

plot_progress

training_step

val_step

end_epoch(val_losses=None)[source]

plot_progress()[source]

class ModifiedScaler[source]

Implements scaling using the modified z-score. Reference: https://www.ibm.com/docs/el/cognos-analytics/11.1.0?topic=terms-modified-z-score

Methods

fit

fit_transform

inverse_transform

transform

MAD_SCALING_FACTOR = 1.486

MEANAD_SCALING_FACTOR = 1.253314

fit(x)[source]

fit_transform(x)[source]

inverse_transform(x)[source]

transform(x)[source]
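The scaling factors follow the modified z-score convention: for normal data, MAD * 1.486 and MeanAD * 1.253314 each approximate one standard deviation. A hedged sketch of the transform these constants suggest (an illustration, not necessarily the exact internal implementation):

    import numpy as np

    def modified_z(x):
        # Hypothetical helper showing the modified z-score the constants imply.
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        if mad != 0:
            return (x - med) / (ModifiedScaler.MAD_SCALING_FACTOR * mad)
        # Fall back to the mean absolute deviation when MAD is zero.
        mean_ad = np.mean(np.abs(x - med))
        return (x - med) / (ModifiedScaler.MEANAD_SCALING_FACTOR * mean_ad)

    scaler = ModifiedScaler()
    z = scaler.fit_transform(np.random.exponential(size=1000))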

class NullScaler[source]

Methods

fit

fit_transform

inverse_transform

transform

fit(x)[source]

fit_transform(x)[source]

inverse_transform(x)[source]

transform(x)[source]

class StandardScaler[source]

Implements standard (mean/std) scaling.

Methods

fit

fit_transform

inverse_transform

transform

fit(x)[source]

fit_transform(x)[source]

inverse_transform(x)[source]

transform(x)[source]

class TensorboardXLogger(logdir='logdir/', run=None, *args, **kwargs)[source]

Methods

end_epoch

id_val_step

show_embeddings

training_step

val_step

end_epoch(val_losses=None)[source]

id_val_step(losses)[source]

show_embeddings(categories)[source]

training_step(losses)[source]

val_step(losses)[source]

Modules

morpheus.models.dfencoder.ae_module

morpheus.models.dfencoder.autoencoder

morpheus.models.dfencoder.dataframe

morpheus.models.dfencoder.dataloader

morpheus.models.dfencoder.distributed_ae

morpheus.models.dfencoder.logging

morpheus.models.dfencoder.multiprocessing

morpheus.models.dfencoder.scalers
