DFP Training Pipe Module
This module consolidates the DFP pipeline modules used in the training process into a single module.
| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
|  | str | Name of the timestamp column used in the data. | `"timestamp"` |  |
|  | str | Directory to cache the rolling window data. | `"/tmp/cache"` |  |
|  | dict | Options for batching files. | See Below |  |
|  | dict | Options for splitting data by user. | See Below |  |
|  | dict | Options for aggregating data by stream. | See Below |  |
|  | dict | Options for preprocessing the data. |  |  |
|  | dict | Options for configuring the data frame encoder, used for training the model. | See Below |  |
|  | dict | Options for the MLflow model writer, which is responsible for saving the trained model. | See Below |  |

Options for batching files:

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
|  | str | End time of the time range to process. | `"2023-03-01T00:00:00"` |  |
|  | str | ISO date regex pattern. | `"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"` |  |
|  | dict | Keyword arguments to pass to the parser. | `{}` |  |
|  | str | Time period to batch the data. | `"1min"` |  |
|  | float | Sampling rate in seconds. | `60` |  |
|  | str | Start time of the time range to process. | `"2023-02-01T00:00:00"` |  |
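
As a rough illustration of how the example values for the batching options relate to each other, the sketch below checks that the start and end times match the ISO date pattern and describe a valid range. The key names (`start_time`, `end_time`, `iso_date_regex_pattern`, `period`, `sampling_rate_s`, `parser_kwargs`) and the helper `check_time_range` are assumptions made for illustration; they are not confirmed by the table, and this is not the module's own validation logic.

```python
import re
from datetime import datetime

# Hypothetical key names; the values are the example values from the table.
batching_options = {
    "start_time": "2023-02-01T00:00:00",
    "end_time": "2023-03-01T00:00:00",
    "iso_date_regex_pattern": r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}",
    "period": "1min",
    "sampling_rate_s": 60,
    "parser_kwargs": {},
}


def check_time_range(options):
    """Return (start, end) as datetimes, or raise ValueError if invalid."""
    pattern = re.compile(options["iso_date_regex_pattern"])
    for key in ("start_time", "end_time"):
        # fullmatch rejects values with extra leading or trailing characters.
        if pattern.fullmatch(options[key]) is None:
            raise ValueError(f"{key} does not match the ISO date pattern")
    start = datetime.fromisoformat(options["start_time"])
    end = datetime.fromisoformat(options["end_time"])
    if start >= end:
        raise ValueError("start_time must be earlier than end_time")
    return start, end


start, end = check_time_range(batching_options)
print(f"Processing window covers {(end - start).days} days")
```

Running a check like this before starting a pipeline can surface a malformed timestamp early, rather than partway through a training run.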

Options for splitting data by user:

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
|  | str | Fallback user to use if no model is found for a user. | `"generic"` |  |
|  | bool | Include generic models in the results. | `true` |  |
|  | bool | Include individual models in the results. | `true` |  |
|  | List[str] | List of users to include in the results. | `[]` |  |
|  | List[str] | List of users to exclude from the results. | `[]` |  |
|  | str | Column name for the user ID. | `"user_id"` |  |
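
The include list, exclude list, and fallback user described above can be pictured with a small sketch. The function names `select_users` and `resolve_model_user` are hypothetical, and the semantics shown (an empty include list means "no restriction", and the exclude list is applied afterwards) are an assumption about how such options are commonly interpreted, not a statement of how the module behaves.

```python
def select_users(user_ids, include=None, exclude=None):
    """Filter user IDs with an optional include list and exclude list.

    Assumption: an empty include list places no restriction on users.
    """
    include = set(include or [])
    exclude = set(exclude or [])
    selected = []
    for user in user_ids:
        if include and user not in include:
            continue  # an include list is set and this user is not on it
        if user in exclude:
            continue  # explicitly excluded
        selected.append(user)
    return selected


def resolve_model_user(user, users_with_models, fallback="generic"):
    """Return the user whose model should be used, or the fallback user."""
    return user if user in users_with_models else fallback


users = ["alice", "bob", "carol"]
print(select_users(users, include=[], exclude=["bob"]))
print(resolve_model_user("dave", {"alice", "carol"}))
```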

Options for aggregating data by stream:

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
|  | string | The user ID to use if the user ID is not found. | `"batch"` |  |
|  | int | Minimum history to trigger a new training event. | `1` |  |
|  | int | Maximum history to include in a new training event. | `0` |  |
|  | string | Name of the column containing timestamps. | `"timestamp"` |  |
|  | string | Lookback timespan for training data in a new training event. | `"60d"` |  |
|  | bool | Whether or not to cache streaming data to disk. | `false` |  |
|  | string | Directory to use for caching streaming data. | `"./.cache"` |  |
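
To make the lookback timespan and minimum history options more concrete, here is a minimal sketch that keeps only the records inside a `"60d"` window and checks whether enough history remains to train. The helpers `parse_days` and `lookback_window` are hypothetical, the parser handles only day-based spans such as `"60d"`, and treating the minimum history as a row count is an assumption.

```python
from datetime import datetime, timedelta


def parse_days(span):
    """Parse a span such as "60d" into a timedelta (days only, for illustration)."""
    if not span.endswith("d"):
        raise ValueError(f"unsupported span: {span!r}")
    return timedelta(days=int(span[:-1]))


def lookback_window(timestamps, span, now):
    """Return the timestamps that fall within `span` before `now`."""
    cutoff = now - parse_days(span)
    return [ts for ts in timestamps if ts >= cutoff]


now = datetime(2023, 3, 1)
history = [now - timedelta(days=d) for d in (10, 59, 61)]
recent = lookback_window(history, "60d", now)

min_history = 1  # example value from the table, assumed to be a row count
print(len(recent), len(recent) >= min_history)
```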

Options for configuring the data frame encoder:

| Parameter | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
|  | list | List of feature columns to train on. | `["column1", "column2", "column3"]` |  |
|  | int | Number of epochs to train for. | `50` |  |
|  | dict | Keyword arguments to pass to the model. | `{"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"}` |  |
|  | float | Size of the validation set. | `0.1` |  |
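
The validation set size of `0.1` can be read as a fraction of the training rows held out for validation. The sketch below illustrates that reading with a hypothetical helper, `split_train_validation`; both the fractional interpretation and the choice to hold out the last rows are assumptions, and the encoder itself may split the data differently.

```python
def split_train_validation(rows, validation_size=0.1):
    """Hold out the last `validation_size` fraction of rows for validation."""
    if not 0.0 <= validation_size < 1.0:
        raise ValueError("validation_size must be in the range [0, 1)")
    n_validation = round(len(rows) * validation_size)
    split_at = len(rows) - n_validation
    return rows[:split_at], rows[split_at:]


rows = list(range(100))  # stand-in for 100 training records
train, validation = split_train_validation(rows, validation_size=0.1)
print(len(train), len(validation))
```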

Options for the MLflow model writer:

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
|  | string | Conda environment for the model. | `"path/to/conda_env.yml"` |  |
|  | dictionary | Permissions for the model. | See Below |  |
|  | string | Formatter for the experiment name. | `"experiment_name_{timestamp}"` |  |
|  | string | Formatter for the model name. | `"model_name_{timestamp}"` |  |
|  | string | Name of the timestamp column. | `"timestamp"` |  |
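
The name formatters above contain a `{timestamp}` placeholder. One plausible way to picture how such a template is filled in is Python's `str.format`, as sketched below. The helper `render_name` and the `%Y%m%d` timestamp layout are assumptions for illustration; the writer may substitute different fields or use a different timestamp format.

```python
from datetime import datetime, timezone

# Example values from the table.
experiment_name_formatter = "experiment_name_{timestamp}"
model_name_formatter = "model_name_{timestamp}"


def render_name(formatter, **fields):
    """Fill the placeholders in a name formatter using str.format."""
    return formatter.format(**fields)


# Hypothetical timestamp layout; chosen only to keep the names readable.
stamp = datetime(2023, 3, 1, tzinfo=timezone.utc).strftime("%Y%m%d")
print(render_name(experiment_name_formatter, timestamp=stamp))
print(render_name(model_name_formatter, timestamp=stamp))
```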