DFP Training Pipe Module

This module function consolidates multiple DFP pipeline modules relevant to the training process into a single module.

The module exposes the following configurable parameters:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `timestamp_column_name` | str | Name of the timestamp column used in the data. | "timestamp" | - |
| `cache_dir` | str | Directory to cache the rolling window data. | "/tmp/cache" | - |
| `batching_options` | dict | Options for batching files. | See Below | - |
| `user_splitting_options` | dict | Options for splitting data by user. | See Below | - |
| `stream_aggregation_options` | dict | Options for aggregating data by stream. | See Below | - |
| `preprocessing_options` | dict | Options for preprocessing the data. | - | - |
| `dfencoder_options` | dict | Options for configuring the data frame encoder, used for training the model. | See Below | - |
| `mlflow_writer_options` | dict | Options for the MLflow model writer, which is responsible for saving the trained model. | See Below | - |
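As a rough sketch, the options above might be assembled into a single nested configuration dictionary along these lines (the variable name and the empty placeholder dictionaries are illustrative only; each nested dictionary is documented in the tables that follow):

```python
# Illustrative top-level configuration for the DFP training pipe module.
# Each nested dictionary is documented in the tables below.
dfp_training_pipe_conf = {
    "timestamp_column_name": "timestamp",
    "cache_dir": "/tmp/cache",
    "batching_options": {},            # see the batching_options table
    "user_splitting_options": {},      # see the user_splitting_options table
    "stream_aggregation_options": {},  # see the stream_aggregation_options table
    "preprocessing_options": {},
    "dfencoder_options": {},           # see the dfencoder_options table
    "mlflow_writer_options": {},       # see the mlflow_writer_options table
}
```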

The following keys can be set under `batching_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `end_time` | str | End time of the time range to process. | "2023-03-01T00:00:00" | - |
| `iso_date_regex_pattern` | str | ISO date regex pattern. | "\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}" | - |
| `parser_kwargs` | dict | Keyword arguments to pass to the parser. | {} | - |
| `period` | str | Time period to batch the data. | "1min" | - |
| `sampling_rate_s` | float | Sampling rate in seconds. | 60 | - |
| `start_time` | str | Start time of the time range to process. | "2023-02-01T00:00:00" | - |
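For illustration, a `batching_options` dictionary populated with the example values from the table above might look like this:

```python
# Example batching_options built from the example values in the table above.
batching_options = {
    "end_time": "2023-03-01T00:00:00",
    "iso_date_regex_pattern": r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}",
    "parser_kwargs": {},
    "period": "1min",
    "sampling_rate_s": 60,
    "start_time": "2023-02-01T00:00:00",
}
```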

The following keys can be set under `user_splitting_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `fallback_username` | str | Fallback user to use if no model is found for a user. | "generic" | - |
| `include_generic` | bool | Include generic models in the results. | true | - |
| `include_individual` | bool | Include individual models in the results. | true | - |
| `only_users` | List[str] | List of users to include in the results. | [] | - |
| `skip_users` | List[str] | List of users to exclude from the results. | [] | - |
| `userid_column_name` | str | Column name for the user ID. | "user_id" | - |
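Similarly, a `user_splitting_options` dictionary built from the example values above might look like this:

```python
# Example user_splitting_options built from the example values in the table above.
user_splitting_options = {
    "fallback_username": "generic",
    "include_generic": True,
    "include_individual": True,
    "only_users": [],
    "skip_users": [],
    "userid_column_name": "user_id",
}
```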

The following keys can be set under `stream_aggregation_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `cache_mode` | str | Mode for managing the streaming data cache. | "batch" | batch |
| `min_history` | int | Minimum history required to trigger a new training event. | 1 | 1 |
| `max_history` | int | Maximum history to include in a new training event. | 0 | 0 |
| `timestamp_column_name` | str | Name of the column containing timestamps. | "timestamp" | timestamp |
| `aggregation_span` | str | Lookback timespan for training data in a new training event. | "60d" | 60d |
| `cache_to_disk` | bool | Whether or not to cache streaming data to disk. | false | false |
| `cache_dir` | str | Directory to use for caching streaming data. | "./.cache" | ./.cache |
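An illustrative `stream_aggregation_options` dictionary using the example values above:

```python
# Example stream_aggregation_options built from the example values in the table above.
stream_aggregation_options = {
    "cache_mode": "batch",
    "min_history": 1,
    "max_history": 0,
    "timestamp_column_name": "timestamp",
    "aggregation_span": "60d",
    "cache_to_disk": False,
    "cache_dir": "./.cache",
}
```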

The following keys can be set under `dfencoder_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `feature_columns` | list | List of feature columns to train on. | ["column1", "column2", "column3"] | - |
| `epochs` | int | Number of epochs to train for. | 50 | - |
| `model_kwargs` | dict | Keyword arguments to pass to the model. | {"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"} | - |
| `validation_size` | float | Size of the validation set. | 0.1 | - |
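An illustrative `dfencoder_options` dictionary using the example values above:

```python
# Example dfencoder_options built from the example values in the table above.
dfencoder_options = {
    "feature_columns": ["column1", "column2", "column3"],
    "epochs": 50,
    "model_kwargs": {
        "encoder_layers": [64, 32],
        "decoder_layers": [32, 64],
        "activation": "relu",
        "swap_p": 0.1,
        "lr": 0.001,
        "lr_decay": 0.9,
        "batch_size": 32,
        "verbose": 1,
        "optimizer": "adam",
        "scalar": "min_max",
        "min_cats": 10,
        "progress_bar": False,
        "device": "cpu",
    },
    "validation_size": 0.1,
}
```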

The following keys can be set under `mlflow_writer_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `conda_env` | str | Conda environment for the model. | "path/to/conda_env.yml" | [Required] |
| `databricks_permissions` | dict | Permissions for the model. | See Below | None |
| `experiment_name_formatter` | str | Formatter for the experiment name. | "experiment_name_{timestamp}" | [Required] |
| `model_name_formatter` | str | Formatter for the model name. | "model_name_{timestamp}" | [Required] |
| `timestamp_column_name` | str | Name of the timestamp column. | "timestamp" | timestamp |
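An illustrative `mlflow_writer_options` dictionary using the example values above (the `conda_env` path and the formatter strings are placeholders):

```python
# Example mlflow_writer_options built from the example values in the table above.
# databricks_permissions is left at its default of None here.
mlflow_writer_options = {
    "conda_env": "path/to/conda_env.yml",
    "databricks_permissions": None,
    "experiment_name_formatter": "experiment_name_{timestamp}",
    "model_name_formatter": "model_name_{timestamp}",
    "timestamp_column_name": "timestamp",
}
```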
