DFP Training Pipe Module
This module function consolidates multiple DFP pipeline modules relevant to the training process into a single module.
Key |
Type |
Description |
Example Value |
Default Value |
---|---|---|---|---|
timestamp_column_name |
str | Name of the timestamp column used in the data. | “timestamp” | - |
cache_dir |
str | Directory to cache the rolling window data. | “/tmp/cache” | - |
batching_options |
dict | Options for batching files. | See Below | - |
user_splitting_options |
dict | Options for splitting data by user. | See Below | - |
stream_aggregation_options |
dict | Options for aggregating data by stream. | See Below | - |
preprocessing_options |
dict | Options for preprocessing the data. | - |
- |
dfencoder_options |
dict | Options for configuring the data frame encoder, used for training the model. | See Below | - |
mlflow_writer_options |
dict | Options for the MLflow model writer, which is responsible for saving the trained model. | See Below | - |
Key |
Type |
Description |
Example Value |
Default Value |
---|---|---|---|---|
end_time |
str | End time of the time range to process. | “2023-03-01T00:00:00” | - |
iso_date_regex_pattern |
str | ISO date regex pattern. | “\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}” | - |
parser_kwargs |
dict | Keyword arguments to pass to the parser. | {} | - |
period |
str | Time period to batch the data. | “1min” | - |
sampling_rate_s |
float | Sampling rate in seconds. | 60 | - |
start_time |
str | Start time of the time range to process. | “2023-02-01T00:00:00” | - |
Key |
Type |
Description |
Example Value |
Default Value |
---|---|---|---|---|
fallback_username |
str | Fallback user to use if no model is found for a user. | “generic” | - |
include_generic |
bool | Include generic models in the results. | true | - |
include_individual |
bool | Include individual models in the results. | true | - |
only_users |
List[str] | List of users to include in the results. | [] | - |
skip_users |
List[str] | List of users to exclude from the results. | [] | - |
userid_column_name |
str | Column name for the user ID. | “user_id” | - |
Key |
Type |
Description |
Example Value |
Default Value |
---|---|---|---|---|
cache_mode |
string | Mode for managing user cache. Setting to batch flushes cache once trigger conditions are met. Otherwise, continue to aggregate user’s history. |
“batch” | batch |
min_history |
int | Minimum history to trigger a new training event | 1 | 1 |
max_history |
int | Maximum history to include in a new training event | 0 | 0 |
timestamp_column_name |
string | Name of the column containing timestamps | ‘timestamp’ | timestamp |
aggregation_span |
string | Lookback timespan for training data in a new training event | “60d” | 60d |
cache_to_disk |
bool | Whether or not to cache streaming data to disk | false | false |
cache_dir |
string | Directory to use for caching streaming data | “./.cache” | ./.cache |
Parameter |
Type |
Description |
Example Value |
Default Value |
---|---|---|---|---|
feature_columns |
list | List of feature columns to train on | [“column1”, “column2”, “column3”] | - |
epochs |
int | Number of epochs to train for | 50 | - |
model_kwargs |
dict | Keyword arguments to pass to the model | {“encoder_layers”: [64, 32], “decoder_layers”: [32, 64], “activation”: “relu”, “swap_p”: 0.1, “lr”: 0.001, “lr_decay”: 0.9, “batch_size”: 32, “verbose”: 1, “optimizer”: “adam”, “scalar”: “min_max”, “min_cats”: 10, “progress_bar”: false, “device”: “cpu”} | - |
validation_size |
float | Size of the validation set | 0.1 | - |
Key |
Type |
Description |
Example Value |
Default Value |
---|---|---|---|---|
conda_env |
string | Conda environment for the model | “path/to/conda_env.yml” | [Required] |
databricks_permissions |
dictionary | Permissions for the model | See Below | None |
experiment_name_formatter |
string | Formatter for the experiment name | “experiment_name_{timestamp}” | [Required] |
model_name_formatter |
string | Formatter for the model name | “model_name_{timestamp}” | [Required] |
timestamp_column_name |
string | Name of the timestamp column | “timestamp” | timestamp |