DFP Training Pipe Module
This module consolidates the DFP pipeline modules used in the training process into a single module.

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
| timestamp_column_name | str | Name of the timestamp column used in the data. | "timestamp" | - |
| cache_dir | str | Directory to cache the rolling window data. | "/tmp/cache" | - |
| batching_options | dict | Options for batching files. | See below | - |
| user_splitting_options | dict | Options for splitting data by user. | See below | - |
| stream_aggregation_options | dict | Options for aggregating data by stream. | See below | - |
| preprocessing_options | dict | Options for preprocessing the data. | - | - |
| dfencoder_options | dict | Options for configuring the data frame encoder, used for training the model. | See below | - |
| mlflow_writer_options | dict | Options for the MLflow model writer, which is responsible for saving the trained model. | See below | - |
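
As a rough illustration, the top-level keys listed above might be assembled into a single Python dictionary as follows. This is a minimal sketch, not the module's own loading code: the values are the example values from the table, the nested option dictionaries are left empty, and `DOCUMENTED_KEYS` and `check_keys` are hypothetical helpers introduced only for this example.

```python
# Hypothetical sketch: top-level configuration built from the documented keys.
training_pipe_config = {
    "timestamp_column_name": "timestamp",
    "cache_dir": "/tmp/cache",
    "batching_options": {},
    "user_splitting_options": {},
    "stream_aggregation_options": {},
    "preprocessing_options": {},
    "dfencoder_options": {},
    "mlflow_writer_options": {},
}

# Key names copied from the table above; used only for this illustrative check.
DOCUMENTED_KEYS = (
    "timestamp_column_name",
    "cache_dir",
    "batching_options",
    "user_splitting_options",
    "stream_aggregation_options",
    "preprocessing_options",
    "dfencoder_options",
    "mlflow_writer_options",
)


def check_keys(config: dict) -> tuple[list[str], list[str]]:
    """Return (missing, unknown) key names relative to the documented table."""
    missing = [key for key in DOCUMENTED_KEYS if key not in config]
    unknown = [key for key in config if key not in DOCUMENTED_KEYS]
    return missing, unknown


missing, unknown = check_keys(training_pipe_config)
print("missing:", missing, "unknown:", unknown)
```

A check like this mainly catches misspelled keys (for example `batch_options` instead of `batching_options`) before the configuration is handed to a pipeline.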

batching_options

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
| end_time | str | End time of the time range to process. | "2023-03-01T00:00:00" | - |
| iso_date_regex_pattern | str | Regular expression that matches ISO 8601 date-time strings. | `"\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}"` | - |
| parser_kwargs | dict | Keyword arguments to pass to the parser. | {} | - |
| period | str | Time period to batch the data. | "1min" | - |
| sampling_rate_s | float | Sampling rate in seconds. | 60 | - |
| start_time | str | Start time of the time range to process. | "2023-02-01T00:00:00" | - |
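
To make the interplay of `start_time`, `end_time`, and `iso_date_regex_pattern` concrete, here is a minimal sketch using only the Python standard library. It assumes the timestamps are embedded in file names and that the range is half-open (`start_time` inclusive, `end_time` exclusive); neither assumption is stated in the table. The file names and the `in_time_range` helper are hypothetical.

```python
import re
from datetime import datetime

# Example values from the table; the regex is written as a Python raw string.
batching_options = {
    "end_time": "2023-03-01T00:00:00",
    "iso_date_regex_pattern": r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}",
    "parser_kwargs": {},
    "period": "1min",
    "sampling_rate_s": 60,
    "start_time": "2023-02-01T00:00:00",
}

start = datetime.fromisoformat(batching_options["start_time"])
end = datetime.fromisoformat(batching_options["end_time"])


def in_time_range(name: str) -> bool:
    """Return True if `name` contains a timestamp within [start, end).

    Assumption: the half-open interval is chosen for this sketch only.
    """
    match = re.search(batching_options["iso_date_regex_pattern"], name)
    if match is None:
        return False
    return start <= datetime.fromisoformat(match.group(0)) < end


# Hypothetical file names used purely for illustration.
for name in ("logs_2023-02-15T12:30:00.json", "logs_2023-03-05T00:00:00.json"):
    print(name, in_time_range(name))
```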

user_splitting_options

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
| fallback_username | str | Fallback user to use if no model is found for a user. | "generic" | - |
| include_generic | bool | Include generic models in the results. | true | - |
| include_individual | bool | Include individual models in the results. | true | - |
| only_users | list[str] | List of users to include in the results. | [] | - |
| skip_users | list[str] | List of users to exclude from the results. | [] | - |
| userid_column_name | str | Column name for the user ID. | "user_id" | - |
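
The table does not say how `only_users` and `skip_users` interact, so the following is only one plausible reading, sketched in plain Python: a non-empty `only_users` acts as an allow-list, and otherwise `skip_users` acts as a deny-list. The `keep_user` helper, the sample rows, and the user names are all hypothetical.

```python
# Example values from the table, with one hypothetical entry in skip_users.
user_splitting_options = {
    "fallback_username": "generic",
    "include_generic": True,
    "include_individual": True,
    "only_users": [],
    "skip_users": ["svc_backup"],
    "userid_column_name": "user_id",
}


def keep_user(user_id: str, options: dict) -> bool:
    """Assumed semantics: only_users (if non-empty) wins over skip_users."""
    only_users = options.get("only_users") or []
    skip_users = options.get("skip_users") or []
    if only_users:
        return user_id in only_users
    return user_id not in skip_users


# Hypothetical records keyed by the configured user ID column.
rows = [
    {"user_id": "alice", "event": "login"},
    {"user_id": "svc_backup", "event": "sync"},
    {"user_id": "bob", "event": "login"},
]
column = user_splitting_options["userid_column_name"]
kept = [row for row in rows if keep_user(row[column], user_splitting_options)]
print([row[column] for row in kept])
```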

stream_aggregation_options

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
| cache_mode | str | Mode for managing the user cache. Setting it to "batch" flushes the cache once the trigger conditions are met; any other value continues to aggregate the user’s history. | "batch" | "batch" |
| min_history | int | Minimum history to trigger a new training event | 1 | 1 |
| max_history | int | Maximum history to include in a new training event | 0 | 0 |
| timestamp_column_name | str | Name of the column containing timestamps | "timestamp" | "timestamp" |
| aggregation_span | str | Look-back time span for training data in a new training event | "60d" | "60d" |
| cache_to_disk | bool | Whether or not to cache streaming data to disk | false | false |
| cache_dir | str | Directory to use for caching streaming data | "./.cache" | "./.cache" |

dfencoder_options

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
| feature_columns | list | List of feature columns to train on | ["column1", "column2", "column3"] | - |
| epochs | int | Number of epochs to train for | 50 | - |
| model_kwargs | dict | Keyword arguments to pass to the model | {"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"} | - |
| validation_size | float | Size of the validation set | 0.1 | - |
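
As an illustration of how these values fit together, the sketch below writes the options as a Python dictionary and works out how many rows would be held out for validation. It assumes `validation_size` is a fraction of the training rows, which the table does not state explicitly; `split_counts` is a hypothetical helper, and only a subset of the example `model_kwargs` is shown.

```python
# Example values from the table; model_kwargs is abbreviated for readability.
dfencoder_options = {
    "feature_columns": ["column1", "column2", "column3"],
    "epochs": 50,
    "model_kwargs": {
        "encoder_layers": [64, 32],
        "decoder_layers": [32, 64],
        "activation": "relu",
        "lr": 0.001,
        "batch_size": 32,
        "device": "cpu",
    },
    "validation_size": 0.1,
}


def split_counts(n_rows: int, validation_size: float) -> tuple[int, int]:
    """Return (train_rows, validation_rows), treating validation_size as a fraction."""
    n_validation = round(n_rows * validation_size)
    return n_rows - n_validation, n_validation


train_rows, validation_rows = split_counts(1000, dfencoder_options["validation_size"])
print(train_rows, validation_rows)
```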

mlflow_writer_options

| Key | Type | Description | Example Value | Default Value |
|---|---|---|---|---|
| conda_env | str | Conda environment for the model | "path/to/conda_env.yml" | [Required] |
| databricks_permissions | dict | Permissions for the model | - | None |
| experiment_name_formatter | str | Formatter for the experiment name | "experiment_name_{timestamp}" | [Required] |
| model_name_formatter | str | Formatter for the model name | "model_name_{timestamp}" | [Required] |
| timestamp_column_name | str | Name of the timestamp column | "timestamp" | "timestamp" |