
Data Preparation Configuration#

Configure parameters to prepare your data for NVIDIA NeMo Safe Synthesizer, including holdout size and columns for grouping and ordering.

Overview#

In NVIDIA NeMo Safe Synthesizer, “data preparation” refers to training-time configuration that controls how tabular records are grouped, ordered, and serialized into training sequences for Tabular Fine-Tuning.

If your data is not event-driven, you are unlikely to need to adjust any of the parameters in this section.

Parameters#

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `group_training_examples_by` | `str` or `None` | Column name used to group related records into a single training example (for event-driven data). | `None` |
| `order_training_examples_by` | `str` or `None` | Column used to order records within each group (e.g., a timestamp). Requires `group_training_examples_by`. | `None` |
| `max_sequences_per_example` | `int`, `"auto"`, or `None` | Limits how many sequences are added per example; otherwise the pipeline fills the model context. `"auto"` becomes `1` when Differential Privacy is enabled, and `None` otherwise. DP requires this to be `1`. | `"auto"` |
| `holdout` | `float` or `int` | Fraction (0.0-1.0) or absolute number of records to hold out from the training data for evaluation. | `0.05` |
| `max_holdout` | `int` | Maximum number of records to hold out; caps the holdout regardless of how `holdout` is specified. | `2000` |
| `random_state` | `int` or `None` | Seed for the holdout split to ensure reproducibility. If omitted, a random integer is chosen automatically. | Randomly chosen |
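To make the interaction between `holdout` and `max_holdout` concrete, here is a minimal Python sketch. It is not the Safe Synthesizer implementation: the function name `resolve_holdout_size` is hypothetical, and it assumes that a `float` is read as a fraction, an `int` as an absolute count, and that fractional sizes are rounded down.

```python
from typing import Union


def resolve_holdout_size(
    n_records: int,
    holdout: Union[float, int] = 0.05,
    max_holdout: int = 2000,
) -> int:
    """Hypothetical helper: how many records end up in the holdout set."""
    if isinstance(holdout, float):
        # Assumption: a float is a fraction of the dataset.
        if not 0.0 <= holdout <= 1.0:
            raise ValueError("A fractional holdout must be between 0.0 and 1.0")
        requested = int(n_records * holdout)  # assumption: round down
    else:
        # Assumption: an int is an absolute number of records.
        requested = holdout
    # max_holdout caps the result however holdout was specified.
    return min(requested, max_holdout, n_records)


# With the defaults, 5% of 100,000 records would be 5,000, so the cap applies.
default_size = resolve_holdout_size(100_000)
# With the values from the example on this page (holdout 0.2, max_holdout 1000).
example_size = resolve_holdout_size(10_000, holdout=0.2, max_holdout=1000)
```

Under these assumptions `default_size` is 2000 and `example_size` is 1000, which shows that on large datasets `max_holdout` is usually the value that determines the holdout size.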

When to Use Grouping#

Grouping is most helpful for event-driven data or data with multiple records per entity. Used on its own, grouping helps the model learn correlations between records within an entity (for example, all transactions for a customer). If you also care about sequence dynamics within a group, pair it with `order_training_examples_by` so the model can learn temporal or sequential relationships (for example, from event timestamps).

If your data has no natural entity grouping or sequential structure (e.g., single-row examples with a categorical tag like genre), grouping is usually unnecessary.
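As a conceptual illustration of what grouping and ordering mean for the records themselves, the following Python sketch collects rows by entity and sorts each group. It is not how Safe Synthesizer builds training examples internally; the function `group_and_order` and the sample records are hypothetical, with column names borrowed from the example on this page.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def group_and_order(
    records: List[Record],
    group_by: str,
    order_by: Optional[str] = None,
) -> Dict[Any, List[Record]]:
    """Hypothetical helper: bucket records per entity, then sort each bucket."""
    groups: Dict[Any, List[Record]] = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record)
    if order_by is not None:
        for rows in groups.values():
            # ISO-8601 date strings sort correctly as plain strings.
            rows.sort(key=lambda row: row[order_by])
    return dict(groups)


# Made-up sample data: three transactions for two customers.
records = [
    {"customer_id": "c1", "event_time": "2024-01-03", "amount": 40},
    {"customer_id": "c2", "event_time": "2024-01-01", "amount": 15},
    {"customer_id": "c1", "event_time": "2024-01-01", "amount": 25},
]
grouped = group_and_order(records, "customer_id", "event_time")
```

Here `grouped["c1"]` holds both of that customer's transactions in chronological order, which is the kind of unit a single training example would represent when grouping and ordering are both set.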

When to Adjust the Holdout#

If your dataset has fewer than 500 records, consider setting `holdout` to `0` to disable the holdout and use all of the data for training. If it has fewer than 200 records, you must set `holdout` to `0` to avoid an error.

Example#

data:
  group_training_examples_by: customer_id
  order_training_examples_by: event_time
  max_sequences_per_example: "auto"  # 1 if DP is enabled, otherwise null
  holdout: 0.2
  max_holdout: 1000
  random_state: 42  # optional; omit to auto-pick a random seed
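The `"auto"` value of `max_sequences_per_example` in the example above resolves according to whether Differential Privacy is enabled. The Python sketch below restates that rule from the parameter table; the function name `resolve_max_sequences`, the `dp_enabled` flag, and the error text are hypothetical and only serve the illustration.

```python
from typing import Optional, Union


def resolve_max_sequences(
    value: Union[int, str, None],
    dp_enabled: bool,
) -> Optional[int]:
    """Hypothetical helper: resolve "auto" and enforce the DP constraint."""
    if value == "auto":
        # Per the parameter table: 1 with DP, None (no limit) without it.
        value = 1 if dp_enabled else None
    if dp_enabled and value != 1:
        # Per the parameter table, DP requires exactly one sequence per example.
        raise ValueError(
            "Differential Privacy requires max_sequences_per_example to be 1"
        )
    return value


with_dp = resolve_max_sequences("auto", dp_enabled=True)
without_dp = resolve_max_sequences("auto", dp_enabled=False)
```

With this rule, leaving the setting at `"auto"` is the safe choice in both cases; an explicit value other than `1` only makes sense when Differential Privacy is disabled.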

Error Handling#

Common Errors#

Dataset Too Small#

Dataset must have at least 200 records to use holdout.

Solution: Use a larger dataset, or disable the holdout by setting `holdout: 0`.

Holdout Too Small#

Holdout dataset must have at least 10 records.

Solution: Increase the holdout size or use a larger input dataset.
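Both holdout errors can be anticipated before submitting a job. The sketch below is a hypothetical pre-flight check, not part of Safe Synthesizer: the thresholds (200 and 10) and the messages are taken from the errors on this page, while the function name `holdout_problem`, the order of the checks, and the treatment of a size of `0` as "holdout disabled" are assumptions.

```python
from typing import Optional

MIN_RECORDS_FOR_HOLDOUT = 200  # from the "Dataset Too Small" error
MIN_HOLDOUT_RECORDS = 10       # from the "Holdout Too Small" error


def holdout_problem(n_records: int, holdout_size: int) -> Optional[str]:
    """Hypothetical check: return an error message, or None if all is well.

    holdout_size is the number of records that would be held out,
    after any fraction has been converted to a count.
    """
    if holdout_size == 0:
        return None  # assumption: a holdout of 0 disables these checks
    if n_records < MIN_RECORDS_FOR_HOLDOUT:
        return "Dataset must have at least 200 records to use holdout."
    if holdout_size < MIN_HOLDOUT_RECORDS:
        return "Holdout dataset must have at least 10 records."
    return None


# 150 records with a holdout of 15: the dataset itself is too small.
small_dataset = holdout_problem(n_records=150, holdout_size=15)
# 1,000 records with a holdout of 5: the holdout is too small.
small_holdout = holdout_problem(n_records=1000, holdout_size=5)
```

A check like this makes it clear which of the two remedies applies: disabling the holdout for a small dataset, or raising the holdout when only the held-out portion is too small.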

Missing Group Column#

Group by column 'customer_id' not found in input dataset columns!

Solution: Verify that the column name exists in your dataset, or set `group_training_examples_by: null`.

Group Column Has Missing Values#

Group by column 'customer_id' has missing values. Please remove/replace them.

Solution: Clean your data to remove or impute missing values in the grouping column.
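The two grouping errors can also be caught ahead of time with a quick check on the input data. This is a minimal sketch using plain Python dictionaries as rows; the function name `group_column_problem` is hypothetical, and treating `None`, empty strings, and NaN as "missing" is an assumption about what the service considers a missing value.

```python
import math
from typing import Any, Dict, List, Optional


def _is_missing(value: Any) -> bool:
    # Assumption: None, empty strings, and NaN all count as missing.
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def group_column_problem(
    records: List[Dict[str, Any]],
    group_by: Optional[str],
) -> Optional[str]:
    """Hypothetical check: return an error message, or None if all is well."""
    if group_by is None:
        return None  # grouping is disabled
    columns = set().union(*(record.keys() for record in records))
    if group_by not in columns:
        return f"Group by column '{group_by}' not found in input dataset columns!"
    if any(_is_missing(record.get(group_by)) for record in records):
        return (
            f"Group by column '{group_by}' has missing values. "
            "Please remove/replace them."
        )
    return None


# Made-up rows: the second one has no customer_id value.
rows = [
    {"customer_id": "c1", "event_time": "2024-01-01"},
    {"customer_id": None, "event_time": "2024-01-02"},
]
problem = group_column_problem(rows, "customer_id")
```

For the sample rows, `problem` contains the missing-values message, so the second row would need to be removed or given a valid `customer_id` before grouping is enabled.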