morpheus.models.dfencoder

class AEModule(*args, **kwargs)[source]

Autoencoder PyTorch module.

Methods

__call__(*args, **kwargs) Call self as a function.
build(numeric_fts, binary_fts, categorical_fts) Constructs the autoencoder model.
decode(x[, layers]) Decodes the input using the decoder layers and computes the outputs.
encode(x[, layers]) Encodes the input using the encoder layers.
forward(input) Passes the input through the model and returns the outputs.
build(numeric_fts, binary_fts, categorical_fts)[source]

Constructs the autoencoder model.

Parameters
numeric_fts

The names of the numeric features.

binary_fts

The names of the binary features.

categorical_fts

The dictionary mapping categorical feature names to dictionaries containing the categories of the feature.

decode(x, layers=None)[source]

Decodes the input using the decoder layers and computes the outputs.

Parameters
x

The encoded input tensor to decode.

layers

The number of layers to use for decoding. Defaults to None, will use all decoder layers.

Returns
tuple of Union[torch.Tensor, List[torch.Tensor]]

A tuple containing the numeric (Tensor), binary (Tensor), and categorical outputs (List[torch.Tensor]) of the model.

encode(x, layers=None)[source]

Encodes the input using the encoder layers.

Parameters
x

The input tensor to encode.

layers

The number of layers to use for encoding. Defaults to None, will use all encoder layers.

Returns
torch.Tensor

The encoded output tensor.

forward(input)[source]

Passes the input through the model and returns the outputs.

Parameters
input

The input tensor.

Returns
tuple of Union[torch.Tensor, List[torch.Tensor]]

A tuple containing the numeric (Tensor), binary (Tensor), and categorical outputs (List[torch.Tensor]) of the model.

class AutoEncoder(*args, **kwargs)[source]

Methods

__call__(*args, **kwargs) Call self as a function.
compute_baseline_performance(in_, out_) Computes baseline performance from a strong prediction for the identity function on a swapped (noisy) input.
compute_loss_from_targets(num, bin, cat, ...) Computes the loss from targets.
decode_outputs_to_df(num, bin, cat) Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.
df_predict(df) Runs end-to-end model.
encode_input(df) Handles raw df inputs.
fit(training_data[, rank, world_size, ...]) Fit the model in a distributed or centralized fashion, depending on self.distributed_training with early stopping based on validation loss.
get_anomaly_score(df) Returns a per-row loss of the input dataframe.
get_anomaly_score_losses(df) Run the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).
get_deep_stack_features(df) Records and outputs all internal representations of input df as row-wise vectors.
get_representation(df[, layer]) Computes latent feature vector from hidden layer given input dataframe.
get_results_from_dataset(dataset, preloaded_df) Returns a pandas dataframe of inference results and losses for a given dataset.
prepare_df(df) Does data preparation on copy of input dataframe.
preprocess_data(df, shuffle_rows_in_batch, ...) Preprocesses a pandas dataframe df for input into the autoencoder model.
preprocess_training_data(df[, ...]) Wrapper function around self.preprocess_data feeding in the args suitable for a training set.
preprocess_validation_data(df[, ...]) Wrapper function around self.preprocess_data feeding in the args suitable for a validation set.
build_input_tensor
compute_loss
compute_targets
create_binary_col_max
create_categorical_col_max
create_numerical_col_max
get_anomaly_score_with_losses
get_feature_count
get_results
get_scaler
get_variable_importance
return_feature_names
scale_losses
build_input_tensor(df)[source]

compute_baseline_performance(in_, out_)[source]

Baseline performance is computed by generating a strong prediction for the identity function (predicting input==output) with a swapped (noisy) input, and computing the loss against the unaltered original data.

This should be roughly the loss we expect when the encoder degenerates into the identity function solution.

Returns the net loss on the baseline performance computation (sum of all losses).
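
The idea behind the baseline can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `mse` and `baseline_loss` are hypothetical, every feature is scored with mean squared error for simplicity, and the data is made up. The library itself works on tensors and uses a separate loss per feature type.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def baseline_loss(swapped_columns, original_columns):
    """Sum of per-feature losses when the 'model' just echoes its noisy input.

    Hypothetical helper: an identity model returns the swapped input as its
    prediction, so the loss is measured between swapped and original data.
    """
    return sum(
        mse(swapped_columns[name], original_columns[name])
        for name in original_columns
    )


# Made-up data: one value in each column was replaced by swap noise.
original = {"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 0.0, 1.0, 1.0]}
swapped = {"a": [1.0, 3.0, 3.0, 4.0], "b": [0.0, 0.0, 1.0, 0.0]}

baseline = baseline_loss(swapped, original)
print(f"baseline net loss: {baseline:.3f}")
```

A trained model whose validation loss is not clearly below this number has probably learned little beyond copying its input.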

compute_loss(num, bin, cat, target_df, should_log=True, _id=False)[source]

compute_loss_from_targets(num, bin, cat, num_target, bin_target, cat_target, should_log=True, _id=False)[source]

Computes the loss from targets.

Parameters
num

numerical data tensor

bin

binary data tensor

cat

list of categorical data tensors

num_target

target numerical data tensor

bin_target

target binary data tensor

cat_target

list of target categorical data tensors

should_log

whether to log the loss in self.logger, by default True

_id

whether the current step is an id validation step (for logging), by default False

Returns
Tuple[Union[float, List[float]]]

A tuple containing the mean mse/bce losses, list of mean cce losses, and mean net loss
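
As a rough sketch of how the three loss types named in the return value relate, the plain-Python example below computes a mean squared error for one numeric feature, a binary cross-entropy for one binary feature, and one categorical cross-entropy per categorical feature. The function names are hypothetical, the inputs are probabilities rather than raw model outputs, and treating the net loss as the unweighted mean of the per-feature losses is an assumption, not something this page states.

```python
import math


def mean(values):
    return sum(values) / len(values)


def mse_loss(pred, target):
    return mean([(p - t) ** 2 for p, t in zip(pred, target)])


def bce_loss(prob, target, eps=1e-12):
    # Binary cross-entropy on predicted probabilities (clamped to avoid log(0)).
    return mean([
        -(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps)))
        for p, t in zip(prob, target)
    ])


def cce_loss(prob_rows, target_idx, eps=1e-12):
    # Categorical cross-entropy: negative log-probability of the true class.
    return mean([-math.log(max(row[i], eps)) for row, i in zip(prob_rows, target_idx)])


def combine_losses(num, num_target, bin_, bin_target, cat, cat_target):
    """Hypothetical helper returning (mse, bce, [cce per feature], net)."""
    mse = mse_loss(num, num_target)
    bce = bce_loss(bin_, bin_target)
    cce = [cce_loss(p, t) for p, t in zip(cat, cat_target)]
    net = mean([mse, bce] + cce)  # assumption: unweighted mean over features
    return mse, bce, cce, net


mse, bce, cce, net = combine_losses(
    num=[0.5, 1.5], num_target=[1.0, 1.0],
    bin_=[0.9, 0.2], bin_target=[1, 0],
    cat=[[[0.7, 0.3], [0.1, 0.9]]], cat_target=[[0, 1]],
)
print(mse, bce, cce, net)
```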

compute_targets(df)[source]

create_binary_col_max(bin_names, bce_loss)[source]

create_categorical_col_max(cat_names, cce_loss)[source]

create_numerical_col_max(num_names, mse_loss)[source]

decode_outputs_to_df(num, bin, cat)[source]

Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.

df_predict(df)[source]

Runs end-to-end model. Interprets output and creates a dataframe. Outputs dataframe with same shape as input containing model predictions.

encode_input(df)[source]

Handles raw df inputs. Passes categories through embedding layers.

fit(training_data, rank=0, world_size=1, epochs=1, validation_data=None, run_validation=False, use_val_for_loss_stats=False)[source]

Fit the model in a distributed or centralized fashion, depending on self.distributed_training, with early stopping based on validation loss. If run_validation is True, validation_data will be used for validation during training and early stopping will be applied based on the patience argument.

Parameters
training_data

data object of training data

rank

the rank of the current process

world_size

the total number of processes

epochs

the number of epochs to train for, by default 1

validation_data

the validation data object (with __iter__() that yields a batch at a time), by default None

run_validation

whether to perform validation during training, by default False

use_val_for_loss_stats

whether to populate loss stats in the main process (rank 0) for z-score calculation using the validation set. If set to False, loss stats would be populated using the train_dataloader, which can be slow due to data size. By default False, but using the validation set to populate loss stats is strongly recommended (for both efficiency and model efficacy).

Raises
ValueError

If run_validation or use_val_for_loss_stats is True but validation_data is not provided.
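
Patience-based early stopping, as mentioned in the description of fit, commonly follows the pattern sketched below. This is a generic illustration, not the library's code: the function name `epochs_until_stop` is hypothetical, and the exact stopping criterion used by fit is not specified on this page.

```python
def epochs_until_stop(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses)  # never triggered: ran all epochs


# Made-up validation losses, one per epoch.
losses = [1.0, 0.8, 0.85, 0.9, 0.7]
print(epochs_until_stop(losses, patience=2))
```

A larger patience tolerates noisier validation curves at the cost of extra epochs.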

get_anomaly_score(df)[source]

Returns a per-row loss of the input dataframe. Does not corrupt inputs.

get_anomaly_score_losses(df)[source]

Run the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).

get_anomaly_score_with_losses(df)[source]

get_deep_stack_features(df)[source]

Records and outputs all internal representations of input df as row-wise vectors. Output is a 2-d array with len() == len(df).

get_feature_count()[source]

get_representation(df, layer=0)[source]

Computes latent feature vector from hidden layer given input dataframe.

The layer argument (int) specifies which layer to get. By default (layer=0), returns the “encoding” layer. layer < 0 counts layers back from the encoding layer; layer > 0 counts layers forward from the encoding layer.
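
The relative indexing described for the layer argument can be pictured with a small sketch. The function `pick_layer`, the layer names, and the position of the encoding layer are all hypothetical; the sketch only shows how an offset relative to the encoding layer maps to a position in an ordered list of layers.

```python
def pick_layer(activations, encoding_index, layer=0):
    """Select an entry by its offset from the encoding layer.

    layer == 0 selects the encoding layer, negative values step back towards
    the input, positive values step forward towards the output.
    """
    index = encoding_index + layer
    if not 0 <= index < len(activations):
        raise IndexError(f"layer offset {layer} is outside the network")
    return activations[index]


# Hypothetical network with two encoder layers before and two decoder
# layers after the encoding layer.
layers = ["enc_0", "enc_1", "encoding", "dec_0", "dec_1"]

print(pick_layer(layers, encoding_index=2))            # the encoding layer
print(pick_layer(layers, encoding_index=2, layer=-1))  # one layer back
print(pick_layer(layers, encoding_index=2, layer=1))   # one layer forward
```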

get_results(df, return_abs=False)[source]

get_results_from_dataset(dataset, preloaded_df, return_abs=False)[source]

Returns a pandas dataframe of inference results and losses for a given dataset. Note: this function requires the whole inference set to be loaded into memory as a pandas dataframe.

Parameters
dataset

dataset for inference

preloaded_df

a pandas dataframe that contains the original data

return_abs

whether the absolute value of the loss scalers should be returned, by default False

Returns
pd.DataFrame

inference result with losses of each feature

get_scaler(name)[source]

get_variable_importance(num_names, cat_names, bin_names, mse_loss, bce_loss, cce_loss, cloudtrail_df)[source]

prepare_df(df)[source]

Does data preparation on copy of input dataframe.

Parameters
df

The pandas dataframe to process

Returns
pandas.DataFrame

A processed copy of df.

preprocess_data(df, shuffle_rows_in_batch, include_original_input_tensor, include_swapped_input_by_feature_type)[source]

Preprocesses a pandas dataframe df for input into the autoencoder model.

Parameters
df

The input dataframe to preprocess.

shuffle_rows_in_batch

Whether to shuffle the rows of the dataframe before processing.

include_original_input_tensor

Whether to process the df into an input tensor without swapping and include it in the returned data dict. Note: training requires only the swapped input tensor, while validation can use both.

include_swapped_input_by_feature_type

Whether to process the swapped df into num/bin/cat feature tensors and include them in the returned data dict. This is useful for baseline performance evaluation for validation.

Returns
Dict[str, Union[int, torch.Tensor]]

A dict containing the preprocessed input data and targets by feature type.

preprocess_training_data(df, shuffle_rows_in_batch=True)[source]

Wrapper function around self.preprocess_data feeding in the args suitable for a training set.

preprocess_validation_data(df, shuffle_rows_in_batch=False)[source]

Wrapper function around self.preprocess_data feeding in the args suitable for a validation set.

return_feature_names()[source]

scale_losses(mse, bce, cce)[source]

class BasicLogger(fts, baseline_loss=0.0)[source]

A minimal class for logging training progress.

Methods

end_epoch
id_val_step
training_step
val_step
end_epoch()[source]

id_val_step(losses)[source]

training_step(losses)[source]

val_step(losses)[source]

class CompleteLayer(*args, **kwargs)[source]

Implements a layer with linear transformation and optional activation and dropout.

Methods

__call__(*args, **kwargs) Call self as a function.
forward(x) Performs a forward pass through the CompleteLayer object.
interpret_activation([act]) Interprets the name of the activation function and returns the appropriate PyTorch function.
forward(x)[source]

Performs a forward pass through the CompleteLayer object.

Parameters
x

The input tensor to the CompleteLayer object.

Returns
torch.Tensor

The output tensor of the CompleteLayer object after processing the input through all layers.

interpret_activation(act=None)[source]

Interprets the name of the activation function and returns the appropriate PyTorch function.

Parameters
act

The name of the activation function to interpret. Defaults to None if no activation function is desired.

Returns
PyTorch function

The PyTorch activation function that corresponds to the given name.

Raises
Exception

If the activation function name is not recognized.
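
The name-to-function lookup described above can be illustrated with a dictionary dispatch. Everything in this sketch is an assumption made for illustration: the function `lookup_activation` is hypothetical, the three recognized names are examples only (this page does not list the supported names), the functions operate on plain floats instead of tensors, and returning the identity for None is a guess at what "no activation" means.

```python
import math

# Hypothetical table of recognized names; the real set is not documented here.
ACTIVATIONS = {
    "relu": lambda v: max(0.0, v),
    "tanh": math.tanh,
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
}


def lookup_activation(act=None):
    """Map an activation name to a callable, or raise if it is unknown."""
    if act is None:
        return lambda v: v  # assumption: no activation means identity
    try:
        return ACTIVATIONS[act]
    except KeyError:
        raise Exception(f"activation {act!r} not recognized") from None


relu = lookup_activation("relu")
print(relu(-2.0), relu(3.0))
```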

class DFEncoderDataLoader(*args, **kwargs)[source]

Methods

__call__(*args, **kwargs) Call self as a function.
get_distributed_training_dataloader_from_dataset(...) Returns a distributed training DataLoader given a dataset and other arguments.
get_distributed_training_dataloader_from_df(...) A helper function to get a distributed training DataLoader given a pandas dataframe.
get_distributed_training_dataloader_from_path(...) A helper function to get a distributed training DataLoader given a path to a folder containing data.
static get_distributed_training_dataloader_from_dataset(dataset, rank, world_size, pin_memory=False, num_workers=0)[source]

Returns a distributed training DataLoader given a dataset and other arguments.

Parameters
dataset

The dataset to load the data from.

rank

The rank of the current process.

world_size

The number of processes to distribute the data across.

pin_memory

Whether to pin memory when loading data, by default False.

num_workers

The number of worker processes to use for loading data, by default 0.

Returns
DataLoader

The training DataLoader with DistributedSampler for distributed training.
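
To show what rank and world_size mean for data distribution, the sketch below splits sample indices across processes in an interleaved fashion. It is a simplified picture, not the sampler's implementation: the function `shard_indices` is hypothetical, and PyTorch's DistributedSampler can additionally shuffle indices and pad or drop samples so that every process receives the same number, which is omitted here.

```python
def shard_indices(num_samples, rank, world_size):
    """Return the sample indices handled by one process (interleaved split)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(range(rank, num_samples, world_size))


# Ten samples split across three processes.
for rank in range(3):
    print(rank, shard_indices(10, rank, world_size=3))
```

Each process sees a disjoint subset, so together the processes cover the dataset once per epoch.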

static get_distributed_training_dataloader_from_df(model, df, rank, world_size, pin_memory=False, num_workers=0)[source]

A helper function to get a distributed training DataLoader given a pandas dataframe.

Parameters
model

The autoencoder model used to get relevant params and the preprocessing func.

df

The pandas dataframe containing the data.

rank

The rank of the current process.

world_size

The number of processes to distribute the data across.

pin_memory

Whether to pin memory when loading data, by default False.

num_workers

The number of worker processes to use for loading data, by default 0.

Returns
DFEncoderDataLoader

The training DataLoader with DistributedSampler for distributed training.

static get_distributed_training_dataloader_from_path(model, data_folder, rank, world_size, load_data_fn=pandas.read_csv, pin_memory=False, num_workers=0)[source]

A helper function to get a distributed training DataLoader given a path to a folder containing data.

Parameters
model

The autoencoder model used to get relevant params and the preprocessing func.

data_folder

The path to the folder containing the data.

rank

The rank of the current process.

world_size

The number of processes to distribute the data across.

load_data_fn

A function for loading data from a provided file path into a pandas.DataFrame, by default pandas.read_csv.

pin_memory

Whether to pin memory when loading data, by default False.

num_workers

The number of worker processes to use for loading data, by default 0.

Returns
DFEncoderDataLoader

The training DataLoader with DistributedSampler for distributed training.

class DataframeDataset(*args, **kwargs)[source]
Attributes
batch_size

num_samples

Returns the number of samples in the dataset.

preprocess_fn

shuffle_batch_indices

shuffle_rows_in_batch

Methods

__call__(*args, **kwargs) Call self as a function.
property batch_size

property num_samples

Returns the number of samples in the dataset.

property preprocess_fn

property shuffle_batch_indices

property shuffle_rows_in_batch

class DistributedAutoEncoder(*args, **kwargs)[source]

Methods

__call__(*args, **kwargs) Call self as a function.
class EncoderDataFrame(*args, **kwargs)[source]

Methods

__call__(*args, **kwargs) Call self as a function.
swap([likelihood]) Performs random swapping of data.
swap(likelihood=0.15)[source]

Performs random swapping of data.

Parameters
likelihood

The probability of a value being randomly replaced with a value from a different row, by default 0.15.

Returns
pandas.DataFrame

A copy of the dataframe with equal size.
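
Swap noise of this kind can be sketched without pandas, using a list of dicts as a stand-in for a dataframe. The function `swap_noise` and its `seed` parameter are hypothetical, and the details (drawing the donor row uniformly, staying within the same column) are assumptions about a typical implementation rather than a description of this method's internals.

```python
import random


def swap_noise(rows, likelihood=0.15, seed=None):
    """Return a noisy copy of `rows`; the input itself is left untouched.

    Each cell is replaced, with probability `likelihood`, by the value found
    in the same column of a different, randomly chosen row.
    """
    rng = random.Random(seed)
    n = len(rows)
    noisy = [dict(row) for row in rows]
    if n < 2:
        return noisy  # no other row to swap with
    for i, row in enumerate(noisy):
        for col in row:
            if rng.random() < likelihood:
                j = rng.randrange(n - 1)
                if j >= i:
                    j += 1  # skip the row's own index
                row[col] = rows[j][col]
    return noisy


# Made-up data with five rows and two columns.
table = [{"a": i, "b": str(i)} for i in range(5)]
print(swap_noise(table, likelihood=0.5, seed=0))
```

Because replacement values come from the column itself, the noisy copy keeps each column's marginal distribution roughly intact.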

class FileSystemDataset(*args, **kwargs)[source]

A dataset class that reads data in batches from a folder and applies preprocessing to each batch. Note: this class assumes that the data is saved as small CSV files in one folder.

Attributes
batch_size

num_samples

Returns the number of samples in the dataset.

preprocess_fn

shuffle_batch_indices

shuffle_rows_in_batch

Methods

__call__(*args, **kwargs) Call self as a function.
get_preloaded_data() Loads all data from the files into memory and returns it as a pandas.DataFrame.
property batch_size

get_preloaded_data()[source]

Loads all data from the files into memory and returns it as a pandas.DataFrame.

property num_samples

Returns the number of samples in the dataset.

property preprocess_fn

property shuffle_batch_indices

property shuffle_rows_in_batch

class GaussRankScaler[source]

So-called “Gauss Rank” scaling. Forces a transformation, uses bins to perform inverse mapping.

Uses sklearn QuantileTransformer to work.

Methods

fit
fit_transform
inverse_transform
transform
fit(x)[source]

fit_transform(x)[source]

inverse_transform(x)[source]

transform(x)[source]
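
The core idea of Gauss rank scaling can be sketched with the standard library alone: rank the values, turn ranks into quantiles strictly between 0 and 1, and push those through the inverse normal CDF. This is a simplified stand-in, not what the class does internally: the function `gauss_rank` is hypothetical, ties are broken by position, and sklearn's QuantileTransformer instead interpolates over estimated quantiles, which also makes the inverse mapping possible.

```python
from statistics import NormalDist


def gauss_rank(values):
    """Map values to normal scores based only on their rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()  # standard normal: mean 0, sigma 1
    scaled = [0.0] * n
    for rank, i in enumerate(order):
        quantile = (rank + 0.5) / n  # keeps quantiles away from 0 and 1
        scaled[i] = normal.inv_cdf(quantile)
    return scaled


# The outlier 1000.0 ends up only slightly above the other values.
print(gauss_rank([10.0, 1000.0, 12.0, 11.0]))
```

Because only ranks matter, extreme outliers are compressed into the tails of a normal distribution instead of dominating the scale.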

class IpynbLogger(*args, **kwargs)[source]

Plots logging data in a Jupyter notebook.

Methods

end_epoch
id_val_step
plot_progress
training_step
val_step
end_epoch(val_losses=None)[source]

plot_progress()[source]

class ModifiedScaler[source]

Implements scaling using modified z score. Reference: https://www.ibm.com/docs/el/cognos-analytics/11.1.0?topic=terms-modified-z-score

Methods

fit
fit_transform
inverse_transform
transform
MAD_SCALING_FACTOR = 1.486

MEANAD_SCALING_FACTOR = 1.253314

fit(x)[source]

fit_transform(x)[source]

inverse_transform(x)[source]

transform(x)[source]
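
The two constants above match the modified z-score as it is commonly defined: deviations from the median are divided by 1.486 times the median absolute deviation (MAD), and by 1.253314 times the mean absolute deviation when the MAD is zero. The sketch below follows that definition in plain Python. The function `modified_z_scores` is hypothetical, and both the fallback order and the all-zero result for constant input are assumptions about behavior that this page does not spell out.

```python
from statistics import median

MAD_SCALING_FACTOR = 1.486
MEANAD_SCALING_FACTOR = 1.253314


def modified_z_scores(values):
    """Scale values by their distance from the median."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad != 0:
        scale = MAD_SCALING_FACTOR * mad
    else:
        # MAD is zero when more than half the values equal the median;
        # fall back to the mean absolute deviation.
        scale = MEANAD_SCALING_FACTOR * (sum(abs_dev) / len(abs_dev))
    if scale == 0:
        return [0.0 for _ in values]  # assumption: constant input maps to 0
    return [(v - med) / scale for v in values]


print(modified_z_scores([1.0, 2.0, 3.0, 4.0, 100.0]))
print(modified_z_scores([1.0, 1.0, 1.0, 1.0, 5.0]))
```

Using the median instead of the mean keeps a single extreme value from inflating the scale that is used to judge it.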

class NullScaler[source]

Methods

fit
fit_transform
inverse_transform
transform
fit(x)[source]

fit_transform(x)[source]

inverse_transform(x)[source]

transform(x)[source]

class StandardScaler[source]

Implements standard (mean/std) scaling.

Methods

fit
fit_transform
inverse_transform
transform
fit(x)[source]

fit_transform(x)[source]

inverse_transform(x)[source]

transform(x)[source]
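
The fit / transform / inverse_transform pattern shared by the scalers on this page can be illustrated for mean/std scaling as follows. The class `SimpleStandardScaler` is a hypothetical plain-Python stand-in, not the library class: it works on lists of floats, uses the population standard deviation, and substitutes 1.0 for a zero standard deviation, all of which are assumptions.

```python
from statistics import fmean, pstdev


class SimpleStandardScaler:
    """Hypothetical mean/std scaler mirroring the four documented methods."""

    def fit(self, x):
        self.mean = fmean(x)
        self.std = pstdev(x) or 1.0  # assumption: avoid dividing by zero
        return self

    def transform(self, x):
        return [(v - self.mean) / self.std for v in x]

    def inverse_transform(self, x):
        return [v * self.std + self.mean for v in x]

    def fit_transform(self, x):
        return self.fit(x).transform(x)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaler = SimpleStandardScaler()
scaled = scaler.fit_transform(data)
restored = scaler.inverse_transform(scaled)
print(scaled)
```

Statistics are learned once in fit and reused by transform, so validation or inference data is scaled with the training set's mean and standard deviation.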

class TensorboardXLogger(logdir='logdir/', run=None, *args, **kwargs)[source]

Methods

end_epoch
id_val_step
show_embeddings
training_step
val_step
end_epoch(val_losses=None)[source]

id_val_step(losses)[source]

show_embeddings(categories)[source]

training_step(losses)[source]

val_step(losses)[source]

Modules

morpheus.models.dfencoder.ae_module

morpheus.models.dfencoder.autoencoder

morpheus.models.dfencoder.dataframe

morpheus.models.dfencoder.dataloader

morpheus.models.dfencoder.distributed_ae

morpheus.models.dfencoder.logging

morpheus.models.dfencoder.multiprocessing

morpheus.models.dfencoder.scalers

© Copyright 2024, NVIDIA. Last updated on Apr 25, 2024.