Data Preparation Configuration

Configure parameters to prepare your data for NVIDIA NeMo Safe Synthesizer, including holdout size and columns for grouping and ordering.

Overview

In NVIDIA NeMo Safe Synthesizer, “data preparation” refers to training-time configuration that controls how tabular records are grouped, ordered, and serialized into training sequences for Tabular Fine-Tuning.

If your data is not event-driven, you are unlikely to need to adjust any of the parameters in this section.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| group_training_examples_by | str or None | Column name used to group related records into a single training example (for event-driven data). | None |
| order_training_examples_by | str or None | Column used to order records within each group (e.g., a timestamp). Requires group_training_examples_by. | None |
| max_sequences_per_example | int, "auto", or None | Limits how many sequences are added per example; otherwise the pipeline fills the model context. "auto" becomes 1 when Differential Privacy is enabled, and None otherwise. DP requires this to be 1. | "auto" |
| holdout | float or int | Fraction (0.0-1.0) or absolute number of records to hold out from the training data for evaluation purposes. | 0.05 |
| max_holdout | int | Maximum number of records to hold out; caps the holdout regardless of how holdout is specified. | 2000 |
| random_state | int or None | Seed for the holdout split to ensure reproducibility. If omitted, a random integer is chosen automatically. | Randomly chosen if omitted |
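The "auto" rule for max_sequences_per_example can be illustrated with a short Python sketch. This is not the product's implementation: the function name resolve_max_sequences, the dp_enabled flag, and the error message are hypothetical and exist only to make the documented rule concrete.

```python
def resolve_max_sequences(value="auto", dp_enabled=False):
    """Illustrative resolution of max_sequences_per_example (hypothetical helper).

    Documented rule: "auto" becomes 1 when Differential Privacy is enabled
    and None otherwise, and DP requires the value to be 1.
    """
    if value == "auto":
        return 1 if dp_enabled else None
    if dp_enabled and value != 1:
        # The wording of this message is an assumption, not the product's text.
        raise ValueError("Differential Privacy requires max_sequences_per_example to be 1.")
    return value


with_dp = resolve_max_sequences("auto", dp_enabled=True)
without_dp = resolve_max_sequences("auto", dp_enabled=False)
```

In this sketch, a value of None means no explicit limit is applied, which corresponds to the pipeline filling the model context.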

When to Use Grouping

Grouping is most helpful for event-driven or multi-record-per-entity data. You can use grouping on its own to help the model learn inter-record correlations within an entity (for example, all transactions for a customer). If you also care about sequence dynamics within the group, pair grouping with order_training_examples_by so the model can learn temporal or sequential relationships (e.g., event timestamps).

If your data has no natural entity grouping or sequential structure (e.g., single-row examples with a categorical tag like genre), grouping is usually unnecessary.
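To make the idea of grouping and ordering concrete, here is a minimal Python sketch that buckets records by an entity column and sorts each bucket by a timestamp column. It only illustrates the concept; how NeMo Safe Synthesizer actually serializes groups into training sequences is not shown here, and the helper group_and_order and the sample records are hypothetical.

```python
from collections import defaultdict


def group_and_order(records, group_by, order_by=None):
    """Bucket records by `group_by`; optionally sort each bucket by `order_by`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record)
    if order_by is not None:
        for rows in groups.values():
            rows.sort(key=lambda row: row[order_by])
    return dict(groups)


# Hypothetical event-driven data: several transactions per customer.
events = [
    {"customer_id": "c1", "event_time": "2024-01-03", "amount": 20},
    {"customer_id": "c2", "event_time": "2024-01-01", "amount": 5},
    {"customer_id": "c1", "event_time": "2024-01-01", "amount": 12},
]

grouped = group_and_order(events, group_by="customer_id", order_by="event_time")
for customer, rows in grouped.items():
    print(customer, [row["amount"] for row in rows])
```

Each bucket corresponds to what the text above calls a group: all records for one entity, in a stable order, so that relationships between those records can be learned together.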

When to Adjust the Holdout

If your dataset has fewer than 500 records, consider setting holdout to 0 to disable it and use all of the data for training. If your dataset has fewer than 200 records, you must set holdout to 0 to avoid an error.
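The following Python sketch shows one way the holdout settings could combine into a final holdout size. The defaults (0.05 and 2000) and the thresholds (200 records, 10 held-out records) come from this page; everything else is an assumption for illustration, including the function name effective_holdout, the rule that floats below 1 are fractions while other values are absolute counts, and rounding down.

```python
MIN_RECORDS_FOR_HOLDOUT = 200  # threshold from the "Dataset Too Small" error
MIN_HOLDOUT_RECORDS = 10       # threshold from the "Holdout Too Small" error


def effective_holdout(n_records, holdout=0.05, max_holdout=2000):
    """Estimate how many records would be held out (illustrative only)."""
    if holdout == 0:
        return 0  # holdout disabled: all records are used for training
    if n_records < MIN_RECORDS_FOR_HOLDOUT:
        raise ValueError("Dataset must have at least 200 records to use holdout.")
    # Assumption: a float below 1 is a fraction; anything else is a count.
    if isinstance(holdout, float) and holdout < 1:
        size = int(n_records * holdout)  # assumption: round down
    else:
        size = int(holdout)
    size = min(size, max_holdout)  # max_holdout caps either form
    if size < MIN_HOLDOUT_RECORDS:
        raise ValueError("Holdout dataset must have at least 10 records.")
    return size


capped = effective_holdout(10_000, holdout=0.2, max_holdout=1000)
small = effective_holdout(150, holdout=0)
```

Under these assumptions, a 10,000-record dataset with holdout: 0.2 would nominally hold out 2,000 records, but max_holdout: 1000 caps the result at 1,000.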

Example

data:
  group_training_examples_by: customer_id
  order_training_examples_by: event_time
  max_sequences_per_example: "auto"  # 1 if DP is enabled, otherwise null
  holdout: 0.2
  max_holdout: 1000
  random_state: 42  # optional; omit to auto-pick a random seed

Error Handling

Common Errors

Dataset Too Small

Dataset must have at least 200 records to use holdout.

Solution: Use a larger dataset or disable holdout by setting holdout: 0.

Holdout Too Small

Holdout dataset must have at least 10 records.

Solution: Increase holdout size or use a larger input dataset.

Missing Group Column

Group by column 'customer_id' not found in input dataset columns!

Solution: Verify that the column name exists in your dataset, or set group_training_examples_by: null.

Group Column Has Missing Values

Group by column 'customer_id' has missing values. Please remove/replace them.

Solution: Clean your data to remove or impute missing values in the grouping column.
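Both grouping errors can be caught before submitting a job with a simple pre-flight check. The sketch below is a hypothetical helper, not part of NeMo Safe Synthesizer: it assumes the data is available as a list of dictionaries and treats None and NaN as missing. The error messages mirror the ones documented above.

```python
import math


def check_group_column(records, group_by):
    """Raise ValueError if `group_by` is absent or has missing values (illustrative)."""
    if group_by is None:
        return  # grouping disabled, nothing to validate
    columns = set()
    for record in records:
        columns.update(record)
    if group_by not in columns:
        raise ValueError(f"Group by column '{group_by}' not found in input dataset columns!")
    for record in records:
        value = record.get(group_by)
        # Assumption: None and float NaN both count as missing values.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            raise ValueError(
                f"Group by column '{group_by}' has missing values. Please remove/replace them."
            )


# Hypothetical sample with one missing customer_id.
rows = [
    {"customer_id": "c1", "amount": 10},
    {"customer_id": None, "amount": 7},
]

try:
    check_group_column(rows, "customer_id")
except ValueError as exc:
    print(exc)
```

Dropping or imputing the offending rows and then re-running the check is one way to confirm that the dataset is ready for grouping.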