- class AutoEncoder(*args, **kwargs)[source]
Bases: torch.nn.Module
Methods
- __call__(*args, **kwargs): Call self as a function.
- compute_baseline_performance(in_, out_): Computes baseline performance from a strong identity-function prediction on a swapped (noisy) input.
- compute_loss_from_targets(num, bin, cat, ...): Computes the loss from targets.
- decode_outputs_to_df(num, bin, cat): Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.
- df_predict(df): Runs end-to-end model.
- encode_input(df): Handles raw df inputs.
- fit(train_data[, epochs, val_data, ...]): Does training in the specified mode (indicated by self.distributed_training).
- get_anomaly_score(df): Returns a per-row loss of the input dataframe.
- get_anomaly_score_losses(df): Runs the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).
- get_deep_stack_features(df): Records and outputs all internal representations of input df as row-wise vectors.
- get_representation(df[, layer]): Computes latent feature vector from hidden layer given input dataframe.
- get_results_from_dataset(dataset, preloaded_df): Returns a pandas dataframe of inference results and losses for a given dataset.
- prepare_df(df): Does data preparation on copy of input dataframe.
- preprocess_data(df, shuffle_rows_in_batch, ...): Preprocesses a pandas dataframe df for input into the autoencoder model.
- preprocess_train_data(df[, ...]): Wrapper function around self.preprocess_data, feeding in the args suitable for a training set.
- preprocess_validation_data(df[, ...]): Wrapper function around self.preprocess_data, feeding in the args suitable for a validation set.
- train_epoch(n_updates, input_df, df[, pbar]): Run regular epoch.
- train_megabatch_epoch(n_updates, df): Run epoch doing 'megabatch' updates, preprocessing data in large chunks.
- build_input_tensor
- compute_loss
- compute_targets
- create_binary_col_max
- create_categorical_col_max
- create_numerical_col_max
- do_backward
- get_anomaly_score_with_losses
- get_results
- get_scaler
- get_variable_importance
- return_feature_names
- scale_losses
- compute_baseline_performance(in_, out_)[source]
Baseline performance is computed by generating a strong prediction for the identity function (predicting input==output) with a swapped (noisy) input, and computing the loss against the unaltered original data. This should be roughly the loss we expect when the encoder degenerates into the identity function solution.
Returns the net loss on the baseline performance computation (sum of all losses).
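The idea behind this baseline can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes a single numerical column, mean squared error as the loss, and a hypothetical helper `swap_noise` that replaces a value with one drawn from another row of the same column.

```python
import random

def swap_noise(values, swap_prob, rng):
    """Hypothetical helper: with probability swap_prob, replace a value
    with the value of a randomly chosen row from the same column."""
    return [rng.choice(values) if rng.random() < swap_prob else v for v in values]

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def identity_baseline_loss(original, swap_prob=0.15, seed=0):
    """Loss of a model that simply copies its noisy input to its output."""
    rng = random.Random(seed)
    noisy = swap_noise(original, swap_prob, rng)
    # The identity "prediction" is the noisy input itself,
    # scored against the unaltered original data.
    return mse(noisy, original)

column = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9]
baseline = identity_baseline_loss(column, swap_prob=0.5)
no_noise = identity_baseline_loss(column, swap_prob=0.0)
```

A trained autoencoder whose validation loss is no better than `baseline` has probably learned little beyond copying its input.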
- compute_loss_from_targets(num, bin, cat, num_target, bin_target, cat_target, should_log=True, _id=False)[source]
Computes the loss from targets.
- Parameters:
- num : torch.Tensor
Numerical data tensor.
- bin : torch.Tensor
Binary data tensor.
- cat : List[torch.Tensor]
List of categorical data tensors.
- num_target : torch.Tensor
Target numerical data tensor.
- bin_target : torch.Tensor
Target binary data tensor.
- cat_target : List[torch.Tensor]
List of target categorical data tensors.
- should_log : bool, optional
Whether to log the loss in self.logger, by default True.
- _id : bool, optional
Whether the current step is an id validation step (for logging), by default False.
- Returns:
- Tuple[Union[float, List[float]]]
A tuple containing the mean mse/bce losses, list of mean cce losses, and mean net loss.
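The three loss families named in the return value (MSE for numerical, BCE for binary, and CCE for categorical features) can be illustrated with a small plain-Python sketch. It is not the library's code: it works on lists rather than tensors, the function names are hypothetical, and combining the parts into `net` by a plain sum is an assumption made only for illustration.

```python
import math

def mean_mse(pred, target):
    """Mean squared error for a numerical feature."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def mean_bce(prob, target, eps=1e-12):
    """Mean binary cross-entropy; prob holds predicted probabilities."""
    total = 0.0
    for p, t in zip(prob, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(target)

def mean_cce(logits, target_idx):
    """Mean categorical cross-entropy from raw logits and integer targets."""
    total = 0.0
    for row, t in zip(logits, target_idx):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[t]
    return total / len(target_idx)

mse_loss = mean_mse([0.0, 1.0], [0.0, 3.0])
bce_loss = mean_bce([0.5, 0.5], [1, 0])
cce_losses = [mean_cce([[0.0, 0.0]], [1])]   # one entry per categorical feature
net = mse_loss + bce_loss + sum(cce_losses)  # assumption: plain sum of the parts
```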
- decode_outputs_to_df(num, bin, cat)[source]
Converts the model outputs of the numerical, binary, and categorical features back into a pandas dataframe.
- df_predict(df)[source]
Runs end-to-end model. Interprets output and creates a dataframe. Outputs dataframe with same shape as input containing model predictions.
- encode_input(df)[source]
Handles raw df inputs. Passes categories through embedding layers.
- fit(train_data, epochs=1, val_data=None, run_validation=False, use_val_for_loss_stats=False, rank=None, world_size=None)[source]
Does training in the specified mode (indicated by self.distributed_training).
- Parameters:
- train_data : pandas.DataFrame (centralized) or torch.utils.data.DataLoader (distributed)
Data for training.
- epochs : int, optional
Number of epochs to run training, by default 1.
- val_data : pandas.DataFrame (centralized) or torch.utils.data.DataLoader (distributed), optional
Data for validation and computing loss stats, by default None.
- run_validation : bool, optional
Whether to collect validation loss for each epoch during training, by default False.
- use_val_for_loss_stats : bool, optional
Whether to use the validation set for loss statistics collection (for z score calculation), by default False.
- rank : int, optional
The rank of the current process, by default None. Required for distributed training.
- world_size : int, optional
The total number of processes, by default None. Required for distributed training.
- Raises:
- TypeError
If train_data is not a pandas dataframe in centralized training mode.
- ValueError
If rank and world_size are not provided in distributed training mode.
- TypeError
If train_data is not a pandas dataframe, a torch.utils.data.DataLoader, or a torch.utils.data.Dataset in distributed training mode.
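The z score mentioned under use_val_for_loss_stats can be read as standardizing a loss against statistics collected on a reference set (training or validation data). The following is a minimal sketch under that reading, not the library's scaler: the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics

def fit_loss_stats(reference_losses):
    """Collect the mean and population standard deviation of reference losses."""
    return statistics.fmean(reference_losses), statistics.pstdev(reference_losses)

def z_score(loss, mean, std):
    """Standardize a loss; fall back to 0.0 when the reference set has no spread."""
    return (loss - mean) / std if std > 0 else 0.0

# Losses observed on the reference set, then one new loss to score.
mean, std = fit_loss_stats([1.0, 2.0, 3.0, 4.0, 5.0])
z = z_score(7.0, mean, std)
```

A large positive `z` means the row is reconstructed much worse than typical reference rows, which is the usual signal for an anomaly.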
- get_anomaly_score(df)[source]
Returns a per-row loss of the input dataframe. Does not corrupt inputs.
- get_anomaly_score_losses(df)[source]
Runs the input dataframe df through the autoencoder to get the recovery losses by feature type (numerical/boolean/categorical).
- get_deep_stack_features(df)[source]
Records and outputs all internal representations of the input df as row-wise vectors. The output is a 2-D array with len() == len(df).
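The shape of the deep stack output can be illustrated with a plain-Python sketch. It assumes that the "internal representations" are the per-layer activations and that they are concatenated along the feature axis; both points, and the name `stack_layer_outputs`, are assumptions for illustration rather than the library's implementation.

```python
def stack_layer_outputs(layer_outputs):
    """Concatenate per-layer activations row by row.

    layer_outputs is a list of layers; each layer is a list of rows,
    and each row is a list of floats. The result has one vector per
    input row, so len(result) equals the number of rows.
    """
    n_rows = len(layer_outputs[0])
    if any(len(layer) != n_rows for layer in layer_outputs):
        raise ValueError("every layer must produce one vector per input row")
    return [
        [x for layer in layer_outputs for x in layer[i]]
        for i in range(n_rows)
    ]

# Two input rows, a hidden layer of width 3 and an encoding layer of width 2.
hidden_1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
encoding = [[1.0, 2.0], [3.0, 4.0]]
features = stack_layer_outputs([hidden_1, encoding])
```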
- get_representation(df, layer=0)[source]
Computes latent feature vector from hidden layer given input dataframe.
The layer argument (int) specifies which layer to get. By default (layer=0), the “encoding” layer is returned. layer < 0 counts layers back from the encoding layer; layer > 0 counts layers forward from the encoding layer.
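The relative indexing can be sketched as an offset from the position of the encoding layer. The helper `resolve_layer` and the layer names below are hypothetical and only illustrate the counting rule; they are not part of the class.

```python
def resolve_layer(layer_names, encoding_index, layer=0):
    """Map a relative layer offset to an entry of layer_names."""
    index = encoding_index + layer
    if not 0 <= index < len(layer_names):
        raise IndexError(f"layer={layer} is outside the network")
    return layer_names[index]

# Hypothetical network with the encoding layer at position 2.
layers = ["enc_1", "enc_2", "encoding", "dec_1", "dec_2"]

default = resolve_layer(layers, 2)             # layer=0: the encoding layer
one_back = resolve_layer(layers, 2, layer=-1)  # one layer before the encoding
one_ahead = resolve_layer(layers, 2, layer=1)  # one layer after the encoding
```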
- get_results_from_dataset(dataset, preloaded_df, return_abs=False)[source]
Returns a pandas dataframe of inference results and losses for a given dataset. Note: this function requires the whole inference set to be loaded into memory as a pandas dataframe.
- Parameters:
- dataset : torch.utils.data.Dataset
Dataset for inference.
- preloaded_df : pd.DataFrame
A pandas dataframe that contains the original data.
- return_abs : bool, optional
Whether the absolute value of the loss scalers should be returned, by default False.
- Returns:
- pd.DataFrame
Inference result with losses of each feature.
- prepare_df(df)[source]
Does data preparation on copy of input dataframe.
- Parameters:
- df : pandas.DataFrame
The pandas dataframe to process
- Returns:
- pandas.DataFrame
A processed copy of df.
- preprocess_data(df, shuffle_rows_in_batch, include_original_input_tensor, include_swapped_input_by_feature_type)[source]
Preprocesses a pandas dataframe df for input into the autoencoder model.
- Parameters:
- df : pandas.DataFrame
The input dataframe to preprocess.
- shuffle_rows_in_batch : bool
Whether to shuffle the rows of the dataframe before processing.
- include_original_input_tensor : bool
Whether to process the df into an input tensor without swapping and include it in the returned data dict. Note: training requires only the swapped input tensor, while validation can use both.
- include_swapped_input_by_feature_type : bool
Whether to process the swapped df into num/bin/cat feature tensors and include them in the returned data dict. This is useful for baseline performance evaluation during validation.
- Returns:
- Dict[str, Union[int, torch.Tensor]]
A dict containing the preprocessed input data and targets by feature type.
- preprocess_train_data(df, shuffle_rows_in_batch=True)[source]
Wrapper function around self.preprocess_data, feeding in the args suitable for a training set.
- preprocess_validation_data(df, shuffle_rows_in_batch=False)[source]
Wrapper function around self.preprocess_data, feeding in the args suitable for a validation set.
- train_epoch(n_updates, input_df, df, pbar=None)[source]
Run regular epoch.
- train_megabatch_epoch(n_updates, df)[source]
Run epoch doing ‘megabatch’ updates, preprocessing data in large chunks.
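One way to read "preprocessing data in large chunks" is that the expensive preprocessing step runs once per large chunk, and the regular mini-batch updates are then drawn from the preprocessed chunk. The sketch below follows that reading in plain Python. It is an assumption about the general technique, not the method's actual internals, and `iter_megabatches` and `scale` are hypothetical names.

```python
def iter_megabatches(rows, megabatch_size, batch_size, preprocess):
    """Yield mini-batches, calling preprocess once per large chunk."""
    for start in range(0, len(rows), megabatch_size):
        chunk = preprocess(rows[start:start + megabatch_size])
        for b in range(0, len(chunk), batch_size):
            yield chunk[b:b + batch_size]

calls = []  # records the size of every chunk handed to the preprocessor

def scale(chunk):
    """Stand-in for an expensive preprocessing step."""
    calls.append(len(chunk))
    return [x / 10 for x in chunk]

batches = list(
    iter_megabatches(list(range(10)), megabatch_size=4, batch_size=2, preprocess=scale)
)
```

With ten rows, the preprocessor runs three times (chunks of 4, 4, and 2 rows) while five mini-batches are produced, which is the saving the chunked approach is after.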