DFP Training Pipe Module

This module function consolidates multiple DFP pipeline modules relevant to the training process into a single module.

Key

Type

Description

Example Value

Default Value

timestamp_column_name str Name of the timestamp column used in the data. “timestamp” -
cache_dir str Directory to cache the rolling window data. “/tmp/cache” -
batching_options dict Options for batching files. See Below -
user_splitting_options dict Options for splitting data by user. See Below -
stream_aggregation_options dict Options for aggregating data by stream. See Below -
preprocessing_options dict Options for preprocessing the data. - -
dfencoder_options dict Options for configuring the data frame encoder, used for training the model. See Below -
mlflow_writer_options dict Options for the MLflow model writer, which is responsible for saving the trained model. See Below -

Key

Type

Description

Example Value

Default Value

end_time str End time of the time range to process. “2023-03-01T00:00:00” -
iso_date_regex_pattern str ISO date regex pattern. “\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}” -
parser_kwargs dict Keyword arguments to pass to the parser. {} -
period str Time period to batch the data. “1min” -
sampling_rate_s float Sampling rate in seconds. 60 -
start_time str Start time of the time range to process. “2023-02-01T00:00:00” -

Key

Type

Description

Example Value

Default Value

fallback_username str Fallback user to use if no model is found for a user. “generic” -
include_generic bool Include generic models in the results. true -
include_individual bool Include individual models in the results. true -
only_users List[str] List of users to include in the results. [] -
skip_users List[str] List of users to exclude from the results. [] -
userid_column_name str Column name for the user ID. “user_id” -

Key

Type

Description

Example Value

Default Value

cache_mode string The user ID to use if the user ID is not found “batch” batch
min_history int Minimum history to trigger a new training event 1 1
max_history int Maximum history to include in a new training event 0 0
timestamp_column_name string Name of the column containing timestamps ‘timestamp’ timestamp
aggregation_span string Lookback timespan for training data in a new training event “60d” 60d
cache_to_disk bool Whether or not to cache streaming data to disk false false
cache_dir string Directory to use for caching streaming data “./.cache” ./.cache

Parameter

Type

Description

Example Value

Default Value

feature_columns list List of feature columns to train on [“column1”, “column2”, “column3”] -
epochs int Number of epochs to train for 50 -
model_kwargs dict Keyword arguments to pass to the model {“encoder_layers”: [64, 32], “decoder_layers”: [32, 64], “activation”: “relu”, “swap_p”: 0.1, “lr”: 0.001, “lr_decay”: 0.9, “batch_size”: 32, “verbose”: 1, “optimizer”: “adam”, “scalar”: “min_max”, “min_cats”: 10, “progress_bar”: false, “device”: “cpu”} -
validation_size float Size of the validation set 0.1 -

Key

Type

Description

Example Value

Default Value

conda_env string Conda environment for the model “path/to/conda_env.yml” [Required]
databricks_permissions dictionary Permissions for the model See Below None
experiment_name_formatter string Formatter for the experiment name “experiment_name_{timestamp}” [Required]
model_name_formatter string Formatter for the model name “model_name_{timestamp}” [Required]
timestamp_column_name string Name of the timestamp column “timestamp” timestamp
Previous DFP Split Users Module
Next DFP Training Module
© Copyright 2024, NVIDIA. Last updated on Apr 25, 2024.