DFP Training Pipe Module

This module function consolidates multiple DFP pipeline modules relevant to the training process into a single module.

The module exposes the following configurable parameters:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `timestamp_column_name` | str | Name of the timestamp column used in the data. | "timestamp" | - |
| `cache_dir` | str | Directory to cache the rolling window data. | "/tmp/cache" | - |
| `batching_options` | dict | Options for batching files. | See Below | - |
| `user_splitting_options` | dict | Options for splitting data by user. | See Below | - |
| `stream_aggregation_options` | dict | Options for aggregating data by stream. | See Below | - |
| `preprocessing_options` | dict | Options for preprocessing the data. | - | - |
| `dfencoder_options` | dict | Options for configuring the data frame encoder, used for training the model. | See Below | - |
| `mlflow_writer_options` | dict | Options for the MLflow model writer, which is responsible for saving the trained model. | See Below | - |
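As a rough sketch, the options above might be assembled into a single nested configuration dictionary along these lines (the variable name and the empty placeholder dictionaries are illustrative only; each nested dictionary is documented in the tables that follow):

```python
# Illustrative top-level configuration for the DFP training pipe module.
# Each nested dictionary is documented in the tables below.
dfp_training_pipe_conf = {
    "timestamp_column_name": "timestamp",
    "cache_dir": "/tmp/cache",
    "batching_options": {},            # see the batching_options table
    "user_splitting_options": {},      # see the user_splitting_options table
    "stream_aggregation_options": {},  # see the stream_aggregation_options table
    "preprocessing_options": {},
    "dfencoder_options": {},           # see the dfencoder_options table
    "mlflow_writer_options": {},       # see the mlflow_writer_options table
}
```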

The following keys can be set under `batching_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `end_time` | str | End time of the time range to process. | "2023-03-01T00:00:00" | - |
| `iso_date_regex_pattern` | str | ISO date regex pattern. | "\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}" | - |
| `parser_kwargs` | dict | Keyword arguments to pass to the parser. | {} | - |
| `period` | str | Time period to batch the data. | "1min" | - |
| `sampling_rate_s` | float | Sampling rate in seconds. | 60 | - |
| `start_time` | str | Start time of the time range to process. | "2023-02-01T00:00:00" | - |
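For illustration, a `batching_options` dictionary populated with the example values from the table above might look like this:

```python
# Example batching_options built from the example values in the table above.
batching_options = {
    "end_time": "2023-03-01T00:00:00",
    "iso_date_regex_pattern": r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}",
    "parser_kwargs": {},
    "period": "1min",
    "sampling_rate_s": 60,
    "start_time": "2023-02-01T00:00:00",
}
```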

The following keys can be set under `user_splitting_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `fallback_username` | str | Fallback user to use if no model is found for a user. | "generic" | - |
| `include_generic` | bool | Include generic models in the results. | true | - |
| `include_individual` | bool | Include individual models in the results. | true | - |
| `only_users` | List[str] | List of users to include in the results. | [] | - |
| `skip_users` | List[str] | List of users to exclude from the results. | [] | - |
| `userid_column_name` | str | Column name for the user ID. | "user_id" | - |
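Similarly, a `user_splitting_options` dictionary built from the example values above might look like this:

```python
# Example user_splitting_options built from the example values in the table above.
user_splitting_options = {
    "fallback_username": "generic",
    "include_generic": True,
    "include_individual": True,
    "only_users": [],
    "skip_users": [],
    "userid_column_name": "user_id",
}
```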

The following keys can be set under `stream_aggregation_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `cache_mode` | str | Mode for managing the streaming data cache. | "batch" | batch |
| `min_history` | int | Minimum history required to trigger a new training event. | 1 | 1 |
| `max_history` | int | Maximum history to include in a new training event. | 0 | 0 |
| `timestamp_column_name` | str | Name of the column containing timestamps. | "timestamp" | timestamp |
| `aggregation_span` | str | Lookback timespan for training data in a new training event. | "60d" | 60d |
| `cache_to_disk` | bool | Whether or not to cache streaming data to disk. | false | false |
| `cache_dir` | str | Directory to use for caching streaming data. | "./.cache" | ./.cache |
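An illustrative `stream_aggregation_options` dictionary using the example values above:

```python
# Example stream_aggregation_options built from the example values in the table above.
stream_aggregation_options = {
    "cache_mode": "batch",
    "min_history": 1,
    "max_history": 0,
    "timestamp_column_name": "timestamp",
    "aggregation_span": "60d",
    "cache_to_disk": False,
    "cache_dir": "./.cache",
}
```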

The following keys can be set under `dfencoder_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `feature_columns` | list | List of feature columns to train on. | ["column1", "column2", "column3"] | - |
| `epochs` | int | Number of epochs to train for. | 50 | - |
| `model_kwargs` | dict | Keyword arguments to pass to the model. | {"encoder_layers": [64, 32], "decoder_layers": [32, 64], "activation": "relu", "swap_p": 0.1, "lr": 0.001, "lr_decay": 0.9, "batch_size": 32, "verbose": 1, "optimizer": "adam", "scalar": "min_max", "min_cats": 10, "progress_bar": false, "device": "cpu"} | - |
| `validation_size` | float | Size of the validation set. | 0.1 | - |
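An illustrative `dfencoder_options` dictionary using the example values above:

```python
# Example dfencoder_options built from the example values in the table above.
dfencoder_options = {
    "feature_columns": ["column1", "column2", "column3"],
    "epochs": 50,
    "model_kwargs": {
        "encoder_layers": [64, 32],
        "decoder_layers": [32, 64],
        "activation": "relu",
        "swap_p": 0.1,
        "lr": 0.001,
        "lr_decay": 0.9,
        "batch_size": 32,
        "verbose": 1,
        "optimizer": "adam",
        "scalar": "min_max",
        "min_cats": 10,
        "progress_bar": False,
        "device": "cpu",
    },
    "validation_size": 0.1,
}
```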

The following keys can be set under `mlflow_writer_options`:

| Key | Type | Description | Example Value | Default Value |
| --- | --- | --- | --- | --- |
| `conda_env` | str | Conda environment for the model. | "path/to/conda_env.yml" | [Required] |
| `databricks_permissions` | dict | Permissions for the model. | See Below | None |
| `experiment_name_formatter` | str | Formatter for the experiment name. | "experiment_name_{timestamp}" | [Required] |
| `model_name_formatter` | str | Formatter for the model name. | "model_name_{timestamp}" | [Required] |
| `timestamp_column_name` | str | Name of the timestamp column. | "timestamp" | timestamp |
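An illustrative `mlflow_writer_options` dictionary using the example values above (the `conda_env` path and the formatter strings are placeholders):

```python
# Example mlflow_writer_options built from the example values in the table above.
# databricks_permissions is left at its default of None here.
mlflow_writer_options = {
    "conda_env": "path/to/conda_env.yml",
    "databricks_permissions": None,
    "experiment_name_formatter": "experiment_name_{timestamp}",
    "model_name_formatter": "model_name_{timestamp}",
    "timestamp_column_name": "timestamp",
}
```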
