morpheus.models.dfencoder
- class AEModule(*args, **kwargs)[source]
Auto Encoder PyTorch Module.
Methods
- __call__(*args, **kwargs): Call self as a function.
- build(numeric_fts, binary_fts, categorical_fts): Constructs the autoencoder model.
- decode(x[, layers]): Decodes the input using the decoder layers and computes the outputs.
- encode(x[, layers]): Encodes the input using the encoder layers.
- forward(input): Passes the input through the model and returns the outputs.
- build(numeric_fts, binary_fts, categorical_fts)[source]
Constructs the autoencoder model.
- Parameters
- numeric_fts: The names of the numeric features.
- binary_fts: The names of the binary features.
- categorical_fts: The dictionary mapping categorical feature names to dictionaries containing the categories of the feature.
- decode(x, layers=None)[source]
Decodes the input using the decoder layers and computes the outputs.
- Parameters
- x: The encoded input tensor to decode.
- layers: The number of layers to use for decoding. Defaults to None, in which case all decoder layers are used.
- Returns
- tuple of Union[torch.Tensor, List[torch.Tensor]]
A tuple containing the numeric (Tensor), binary (Tensor), and categorical outputs (List[torch.Tensor]) of the model.
- encode(x, layers=None)[source]
Encodes the input using the encoder layers.
- Parameters
- x: The input tensor to encode.
- layers: The number of layers to use for encoding. Defaults to None, in which case all encoder layers are used.
- Returns
- torch.Tensor
The encoded output tensor.
- forward(input)[source]
Passes the input through the model and returns the outputs.
- Parameters
- input: The input tensor.
- Returns
- tuple of Union[torch.Tensor, List[torch.Tensor]]
A tuple containing the numeric (Tensor), binary (Tensor), and categorical outputs (List[torch.Tensor]) of the model.
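The three-part output can be unpacked directly; a minimal sketch, assuming `module` is an already-built AEModule and `x` is a preprocessed input tensor (both hypothetical here):

```python
import torch

# Hypothetical setup: `module` is a built AEModule, `x` has shape
# (batch_size, total_input_dim) after preprocessing.
num_out, bin_out, cat_out = module(x)  # __call__ dispatches to forward()

print(num_out.shape)        # torch.Tensor of reconstructed numeric features
print(bin_out.shape)        # torch.Tensor of reconstructed binary features
for logits in cat_out:      # List[torch.Tensor]: one logits tensor per
    print(logits.shape)     # categorical feature, (batch_size, n_categories)
```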
- class AutoEncoder(*args, **kwargs)[source]
Methods
- __call__(*args, **kwargs): Call self as a function.
- compute_baseline_performance(in_, out_): Computes baseline performance by generating a strong prediction for the identity function with a swapped (noisy) input.
- compute_loss_from_targets(num, bin, cat, ...): Computes the loss from targets.
- decode_outputs_to_df(num, bin, cat): Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.
- df_predict(df): Runs the end-to-end model.
- encode_input(df): Handles raw df inputs.
- fit(training_data[, rank, world_size, ...]): Fits the model in a distributed or centralized fashion, depending on self.distributed_training, with early stopping based on validation loss.
- get_anomaly_score(df): Returns a per-row loss of the input dataframe.
- get_anomaly_score_losses(df): Runs the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).
- get_deep_stack_features(df): Records and outputs all internal representations of the input df as row-wise vectors.
- get_representation(df[, layer]): Computes the latent feature vector from a hidden layer given the input dataframe.
- get_results_from_dataset(dataset, preloaded_df): Returns a pandas dataframe of inference results and losses for a given dataset.
- prepare_df(df): Does data preparation on a copy of the input dataframe.
- preprocess_data(df, shuffle_rows_in_batch, ...): Preprocesses a pandas dataframe df for input into the autoencoder model.
- preprocess_training_data(df[, ...]): Wrapper around self.preprocess_data feeding in the args suitable for a training set.
- preprocess_validation_data(df[, ...]): Wrapper around self.preprocess_data feeding in the args suitable for a validation set.
Other methods: build_input_tensor, compute_loss, compute_targets, create_binary_col_max, create_categorical_col_max, create_numerical_col_max, get_anomaly_score_with_losses, get_feature_count, get_results, get_scaler, get_variable_importance, return_feature_names, scale_losses
- build_input_tensor(df)[source]
- compute_baseline_performance(in_, out_)[source]
Baseline performance is computed by generating a strong prediction for the identity function (predicting input==output) with a swapped (noisy) input, and computing the loss against the unaltered original data.
This should be roughly the loss we expect when the encoder degenerates into the identity function solution.
Returns the net loss on the baseline performance computation (sum of all losses).
- compute_loss(num, bin, cat, target_df, should_log=True, _id=False)[source]
- compute_loss_from_targets(num, bin, cat, num_target, bin_target, cat_target, should_log=True, _id=False)[source]
Computes the loss from targets.
- Parameters
- num: numerical data tensor
- bin: binary data tensor
- cat: list of categorical data tensors
- num_target: target numerical data tensor
- bin_target: target binary data tensor
- cat_target: list of target categorical data tensors
- should_log: whether to log the loss in self.logger, by default True
- _id: whether the current step is an id validation step (for logging), by default False
- Returns
- Tuple[Union[float, List[float]]]
A tuple containing the mean MSE/BCE losses, the list of mean CCE losses, and the mean net loss.
- compute_targets(df)[source]
- create_binary_col_max(bin_names, bce_loss)[source]
- create_categorical_col_max(cat_names, cce_loss)[source]
- create_numerical_col_max(num_names, mse_loss)[source]
- decode_outputs_to_df(num, bin, cat)[source]
Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.
- df_predict(df)[source]
Runs the end-to-end model. Interprets the output and creates a dataframe with the same shape as the input containing the model predictions.
- encode_input(df)[source]
Handles raw df inputs. Passes categories through embedding layers.
- fit(training_data, rank=0, world_size=1, epochs=1, validation_data=None, run_validation=False, use_val_for_loss_stats=False)[source]
Fits the model in a distributed or centralized fashion, depending on self.distributed_training, with early stopping based on validation loss. If run_validation is True, validation_data is used for validation during training and early stopping is applied based on the patience argument.
- Parameters
- training_data: data object of training data
- rank: the rank of the current process
- world_size: the total number of processes
- epochs: the number of epochs to train for, by default 1
- validation_data: the validation data object (with an __iter__() that yields a batch at a time), by default None
- run_validation: whether to perform validation during training, by default False
- use_val_for_loss_stats: whether to populate loss stats in the main process (rank 0) for z-score calculation using the validation set. If set to False, loss stats are populated using the train_dataloader, which can be slow due to data size. By default False, but using the validation set to populate loss stats is strongly recommended (for both efficiency and model efficacy).
- Raises
- ValueError
If run_validation or use_val_for_loss_stats is True but validation_data is not provided.
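A minimal centralized training sketch. The constructor hyperparameters are omitted, and passing raw pandas DataFrames for training_data/validation_data is an assumption for illustration (the documented types are batch-yielding data objects, so adapt as needed):

```python
import pandas as pd
from morpheus.models.dfencoder import AutoEncoder

# Hypothetical input files.
train_df = pd.read_csv("train.csv")
val_df = pd.read_csv("val.csv")

model = AutoEncoder()  # constructor kwargs (layer sizes, lr, ...) not shown

# Centralized training (rank=0, world_size=1 are the defaults); loss stats
# for z-score calculation come from the validation set, as recommended.
model.fit(
    train_df,
    epochs=10,
    validation_data=val_df,
    run_validation=True,
    use_val_for_loss_stats=True,
)
```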
- get_anomaly_score(df)[source]
Returns a per-row loss of the input dataframe. Does not corrupt inputs.
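Once fitted, per-row scores and reconstructions can be pulled together; a sketch continuing from the fit example above (the threshold rule is illustrative, not part of the API):

```python
import numpy as np

scores = np.asarray(model.get_anomaly_score(val_df))  # one loss per row
preds = model.df_predict(val_df)                      # same shape as input

# Flag rows whose reconstruction loss is unusually high (illustrative rule).
threshold = scores.mean() + 3 * scores.std()
anomalies = val_df[scores > threshold]
```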
- get_anomaly_score_losses(df)[source]
Runs the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).
- get_anomaly_score_with_losses(df)[source]
- get_deep_stack_features(df)[source]
Records and outputs all internal representations of the input df as row-wise vectors. The output is a 2-d array with len() == len(df).
- get_feature_count()[source]
- get_representation(df, layer=0)[source]
Computes the latent feature vector from a hidden layer given the input dataframe.
The layer argument (int) specifies which layer to get. By default (layer=0), the "encoding" layer is returned; layer < 0 counts layers back from the encoding layer, and layer > 0 counts layers forward from the encoding layer.
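A sketch of pulling latent vectors for downstream use such as clustering; `model` and `val_df` are assumed from the fit example above:

```python
# Latent vectors from the encoding layer (layer=0), one vector per input row.
z = model.get_representation(val_df)

# One layer before the encoding layer.
z_prev = model.get_representation(val_df, layer=-1)

# All internal representations stacked row-wise (see get_deep_stack_features).
deep = model.get_deep_stack_features(val_df)  # 2-d, len(deep) == len(val_df)
```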
- get_results(df, return_abs=False)[source]
- get_results_from_dataset(dataset, preloaded_df, return_abs=False)[source]
Returns a pandas dataframe of inference results and losses for a given dataset. Note: this function requires the whole inference set to be loaded into memory as a pandas dataframe.
- Parameters
- dataset: dataset for inference
- preloaded_df: a pandas dataframe that contains the original data
- return_abs: whether the absolute value of the loss scalers should be returned, by default False
- Returns
- pd.DataFrame
Inference results with the losses of each feature.
- get_scaler(name)[source]
- get_variable_importance(num_names, cat_names, bin_names, mse_loss, bce_loss, cce_loss, cloudtrail_df)[source]
- prepare_df(df)[source]
Does data preparation on a copy of the input dataframe.
- Parameters
- df: The pandas dataframe to process.
- Returns
- pandas.DataFrame
A processed copy of df.
- preprocess_data(df, shuffle_rows_in_batch, include_original_input_tensor, include_swapped_input_by_feature_type)[source]
Preprocesses a pandas dataframe df for input into the autoencoder model.
- Parameters
- df: The input dataframe to preprocess.
- shuffle_rows_in_batch: Whether to shuffle the rows of the dataframe before processing.
- include_original_input_tensor: Whether to process the df into an input tensor without swapping and include it in the returned data dict. Note: training requires only the swapped input tensor, while validation can use both.
- include_swapped_input_by_feature_type: Whether to process the swapped df into num/bin/cat feature tensors and include them in the returned data dict. This is useful for baseline performance evaluation during validation.
- Returns
- Dict[str, Union[int, torch.Tensor]]
A dict containing the preprocessed input data and targets by feature type.
- preprocess_training_data(df, shuffle_rows_in_batch=True)[source]
Wrapper function around self.preprocess_data, feeding in the args suitable for a training set.
- preprocess_validation_data(df, shuffle_rows_in_batch=False)[source]
Wrapper function around self.preprocess_data, feeding in the args suitable for a validation set.
- return_feature_names()[source]
- scale_losses(mse, bce, cce)[source]
- class BasicLogger(fts, baseline_loss=0.0)[source]
A minimal class for logging training progress.
Methods
end_epoch, id_val_step, training_step, val_step
- end_epoch()[source]
- id_val_step(losses)[source]
- training_step(losses)[source]
- val_step(losses)[source]
- class CompleteLayer(*args, **kwargs)[source]
Implements a layer with a linear transformation and optional activation and dropout.
Methods
- __call__(*args, **kwargs): Call self as a function.
- forward(x): Performs a forward pass through the CompleteLayer object.
- interpret_activation([act]): Interprets the name of the activation function and returns the appropriate PyTorch function.
- forward(x)[source]
Performs a forward pass through the CompleteLayer object.
- Parameters
- x: The input tensor to the CompleteLayer object.
- Returns
- torch.Tensor
The output tensor of the CompleteLayer object after processing the input through all layers.
- interpret_activation(act=None)[source]
Interprets the name of the activation function and returns the appropriate PyTorch function.
- Parameters
- act: The name of the activation function to interpret. Defaults to None if no activation function is desired.
- Returns
- PyTorch function
The PyTorch activation function that corresponds to the given name.
- Raises
- Exception
If the activation function name is not recognized.
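A short usage sketch. The constructor arguments shown (input dim, output dim, activation name, dropout rate) are assumptions based on common dfencoder usage, since the documented signature is only (*args, **kwargs):

```python
import torch
from morpheus.models.dfencoder import CompleteLayer

# Assumed constructor arguments; verify against the class before relying on them.
layer = CompleteLayer(32, 16, activation="relu", dropout=0.1)

x = torch.randn(8, 32)
y = layer(x)                 # linear transform -> activation -> dropout
print(y.shape)               # expected: torch.Size([8, 16])

# Maps an activation name to the corresponding PyTorch function; raises if
# the name is not recognized.
tanh_fn = layer.interpret_activation("tanh")
```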
- class DFEncoderDataLoader(*args, **kwargs)[source]
Methods
- __call__(*args, **kwargs): Call self as a function.
- get_distributed_training_dataloader_from_dataset(...): Returns a distributed training DataLoader given a dataset and other arguments.
- get_distributed_training_dataloader_from_df(...): A helper function to get a distributed training DataLoader given a pandas dataframe.
- get_distributed_training_dataloader_from_path(...): A helper function to get a distributed training DataLoader given a path to a folder containing data.
- static get_distributed_training_dataloader_from_dataset(dataset, rank, world_size, pin_memory=False, num_workers=0)[source]
Returns a distributed training DataLoader given a dataset and other arguments.
- Parameters
- dataset: The dataset to load the data from.
- rank: The rank of the current process.
- world_size: The number of processes to distribute the data across.
- pin_memory: Whether to pin memory when loading data, by default False.
- num_workers: The number of worker processes to use for loading data, by default 0.
- Returns
- DataLoader
The training DataLoader with DistributedSampler for distributed training.
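A sketch of how the helper slots into a standard torch.distributed worker; the process-group setup and the choice of dataset class are assumptions for illustration:

```python
import torch.distributed as dist
from morpheus.models.dfencoder import DFEncoderDataLoader

def make_training_loader(dataset, rank: int, world_size: int):
    """Build a distributed loader inside an already-initialized process group.

    `dataset` is assumed to be one of the dataset classes in this module
    (e.g. FileSystemDataset); rank/world_size come from the launcher.
    """
    # The launcher must have set up the process group already, e.g. with
    # dist.init_process_group("nccl", rank=rank, world_size=world_size).
    assert dist.is_initialized(), "call dist.init_process_group(...) first"
    return DFEncoderDataLoader.get_distributed_training_dataloader_from_dataset(
        dataset,
        rank=rank,
        world_size=world_size,
        pin_memory=True,   # speeds up host-to-GPU copies
        num_workers=2,     # parallel data-loading workers
    )
```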
- static get_distributed_training_dataloader_from_df(model, df, rank, world_size, pin_memory=False, num_workers=0)[source]
A helper function to get a distributed training DataLoader given a pandas dataframe.
- Parameters
- model: The autoencoder model used to get relevant params and the preprocessing func.
- df: The pandas dataframe containing the data.
- rank: The rank of the current process.
- world_size: The number of processes to distribute the data across.
- pin_memory: Whether to pin memory when loading data, by default False.
- num_workers: The number of worker processes to use for loading data, by default 0.
- Returns
- DFEncoderDataLoader
The training DataLoader with DistributedSampler for distributed training.
- static get_distributed_training_dataloader_from_path(model, data_folder, rank, world_size, load_data_fn=pandas.read_csv, pin_memory=False, num_workers=0)[source]
A helper function to get a distributed training DataLoader given a path to a folder containing data.
- Parameters
- model: The autoencoder model used to get relevant params and the preprocessing func.
- data_folder: The path to the folder containing the data.
- rank: The rank of the current process.
- world_size: The number of processes to distribute the data across.
- load_data_fn: A function for loading data from a provided file path into a pandas.DataFrame, by default pd.read_csv.
- pin_memory: Whether to pin memory when loading data, by default False.
- num_workers: The number of worker processes to use for loading data, by default 0.
- Returns
- DFEncoderDataLoader
The training DataLoader with DistributedSampler for distributed training.
- class DataframeDataset(*args, **kwargs)[source]
- Attributes
- batch_size
- num_samples: Returns the number of samples in the dataset.
- preprocess_fn
- shuffle_batch_indices
- shuffle_rows_in_batch
Methods
- __call__(*args, **kwargs): Call self as a function.
- property batch_size
- property num_samples
Returns the number of samples in the dataset.
- property preprocess_fn
- property shuffle_batch_indices
- property shuffle_rows_in_batch
- class DistributedAutoEncoder(*args, **kwargs)[source]
Methods
- __call__(*args, **kwargs): Call self as a function.
- class EncoderDataFrame(*args, **kwargs)[source]
Methods
- __call__(*args, **kwargs): Call self as a function.
- swap([likelihood]): Performs random swapping of data.
- swap(likelihood=0.15)[source]
Performs random swapping of data.
- Parameters
- likelihood: The probability of a value being randomly replaced with a value from a different row. By default 0.15.
- Returns
- pandas.DataFrame
A copy of the dataframe of the same size.
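The swap operation supplies the "noise" for denoising-style training; a sketch, assuming EncoderDataFrame can be constructed like a pandas DataFrame (it is documented here only as (*args, **kwargs)):

```python
import pandas as pd
from morpheus.models.dfencoder import EncoderDataFrame

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})

edf = EncoderDataFrame(df)          # assumed: behaves like a pandas DataFrame
noisy = edf.swap(likelihood=0.15)   # ~15% of values replaced from other rows

assert noisy.shape == df.shape      # same shape, values partially swapped
```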
- class FileSystemDataset(*args, **kwargs)[source]
A dataset class that reads data in batches from a folder and applies preprocessing to each batch. Note: this class assumes that the data is saved in small CSV files in one folder.
- Attributes
- batch_size
- num_samples: Returns the number of samples in the dataset.
- preprocess_fn
- shuffle_batch_indices
- shuffle_rows_in_batch
Methods
- __call__(*args, **kwargs): Call self as a function.
- get_preloaded_data(): Loads all data from the files into memory and returns it as a pandas.DataFrame.
- property batch_size
- get_preloaded_data()[source]
Loads all data from the files into memory and returns it as a pandas.DataFrame.
- property num_samples
Returns the number of samples in the dataset.
- property preprocess_fn
- property shuffle_batch_indices
- property shuffle_rows_in_batch
- class GaussRankScaler[source]
So-called "Gauss Rank" scaling: forces the feature distribution toward a Gaussian and uses bins to perform the inverse mapping.
Uses the sklearn QuantileTransformer under the hood.
Methods
fit, fit_transform, inverse_transform, transform
- fit(x)[source]
- fit_transform(x)[source]
- inverse_transform(x)[source]
- transform(x)[source]
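All scalers in this module share the same fit/transform/inverse_transform surface; a sketch with GaussRankScaler on a skewed sample:

```python
import numpy as np
from morpheus.models.dfencoder import GaussRankScaler

x = np.random.default_rng(0).exponential(size=1000)  # heavily skewed input

scaler = GaussRankScaler()
z = scaler.fit_transform(x)           # output is approximately Gaussian

x_back = scaler.inverse_transform(z)  # approximate round trip back to x
```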
- class IpynbLogger(*args, **kwargs)[source]
Plots logging data in a Jupyter notebook.
Methods
end_epoch, id_val_step, plot_progress, training_step, val_step
- end_epoch(val_losses=None)[source]
- plot_progress()[source]
- class ModifiedScaler[source]
Implements scaling using the modified z-score. Reference: https://www.ibm.com/docs/el/cognos-analytics/11.1.0?topic=terms-modified-z-score
Methods
fit, fit_transform, inverse_transform, transform
- MAD_SCALING_FACTOR = 1.486
- MEANAD_SCALING_FACTOR = 1.253314
- fit(x)[source]
- fit_transform(x)[source]
- inverse_transform(x)[source]
- transform(x)[source]
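The two class constants match the referenced IBM definition: values are centered on the median and scaled by 1.486 x the median absolute deviation (MAD), falling back to 1.253314 x the mean absolute deviation when the MAD is zero. A sketch of that computation (illustrative, not the class's exact code):

```python
import numpy as np

MAD_SCALING_FACTOR = 1.486        # used when the MAD is nonzero
MEANAD_SCALING_FACTOR = 1.253314  # fallback when the MAD is zero

def modified_z_score(x: np.ndarray) -> np.ndarray:
    med = np.median(x)
    mad = np.median(np.abs(x - med))    # median absolute deviation
    if mad != 0:
        return (x - med) / (MAD_SCALING_FACTOR * mad)
    mean_ad = np.mean(np.abs(x - med))  # mean absolute deviation
    return (x - med) / (MEANAD_SCALING_FACTOR * mean_ad)
```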
- class NullScaler[source]
Methods
fit, fit_transform, inverse_transform, transform
- fit(x)[source]
- fit_transform(x)[source]
- inverse_transform(x)[source]
- transform(x)[source]
- class StandardScaler[source]
Implements standard (mean/std) scaling.
Methods
fit, fit_transform, inverse_transform, transform
- fit(x)[source]
- fit_transform(x)[source]
- inverse_transform(x)[source]
- transform(x)[source]
- class TensorboardXLogger(logdir='logdir/', run=None, *args, **kwargs)[source]
Methods
end_epoch, id_val_step, show_embeddings, training_step, val_step
- end_epoch(val_losses=None)[source]
- id_val_step(losses)[source]
- show_embeddings(categories)[source]
- training_step(losses)[source]
- val_step(losses)[source]