Data Preparation Configuration
Configure parameters to prepare your data for NVIDIA NeMo Safe Synthesizer, including holdout size and columns for grouping and ordering.
Overview
In NVIDIA NeMo Safe Synthesizer, “data preparation” refers to training-time configuration that controls how tabular records are grouped, ordered, and serialized into training sequences for Tabular Fine-Tuning.
If your data is not event-driven, you are unlikely to need to adjust any parameters in this section.
Parameters
Parameter | Type | Description | Default |
---|---|---|---|
group_training_examples_by | | Column name used to group related records into a single training example (for event-driven data). | |
order_training_examples_by | | Column used to order records within each group (e.g., a timestamp). Requires group_training_examples_by. | |
max_sequences_per_example | | Limits how many sequences are added per example; otherwise the pipeline fills the model context. | |
holdout | | Fraction (0.0-1.0) or absolute number of records to hold out from the training data for evaluation. | |
max_holdout | | Maximum number of records to hold out; caps the holdout regardless of how holdout is specified. | |
random_state | | Seed for the holdout split to ensure reproducibility. If omitted, a random integer is chosen automatically. | Randomly chosen if omitted |
When to Use Grouping
Grouping is most helpful for event-driven or multi-record-per-entity data. You can use grouping on its own to help the model learn inter-record correlations within an entity (for example, all transactions for a customer). If you also care about sequence dynamics within the group, pair grouping with order_training_examples_by so the model can learn temporal or sequential relationships (e.g., event timestamps).
If your data has no natural entity grouping or sequential structure (e.g., single-row examples with a categorical tag like genre), grouping is usually unnecessary.
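To make the effect of grouping and ordering concrete, here is a minimal Python sketch. It only illustrates the idea of bucketing records by one column and sorting each bucket by another; it is not Safe Synthesizer's actual serialization code. The function name group_and_order and the sample records are hypothetical.

```python
from collections import defaultdict


def group_and_order(records, group_by, order_by=None):
    """Bucket records by `group_by`; optionally sort each bucket by `order_by`.

    Hypothetical helper for illustration only, not part of any NVIDIA API.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record)
    if order_by is not None:
        for rows in groups.values():
            rows.sort(key=lambda r: r[order_by])
    return dict(groups)


# Made-up sample data: two customers, one with two events out of order.
records = [
    {"customer_id": "a", "event_time": "2024-01-03", "amount": 20},
    {"customer_id": "b", "event_time": "2024-01-01", "amount": 5},
    {"customer_id": "a", "event_time": "2024-01-01", "amount": 12},
]

grouped = group_and_order(records, "customer_id", "event_time")
for customer, rows in grouped.items():
    print(customer, [(r["event_time"], r["amount"]) for r in rows])
```

Each key in the result corresponds conceptually to one training example: all of a customer's records together, in event order.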
When to Adjust the Holdout
If your data has fewer than 500 records, consider setting holdout to 0 to disable it and use all the data for training. If your data has fewer than 200 records, you must set holdout to 0; otherwise the holdout step raises an error.
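The following Python sketch estimates how many records a given holdout setting would reserve, which can help when choosing values before a run. It is an approximation built on assumptions, not the product's implementation: it assumes values below 1 are fractions and values of 1 or more are absolute counts, that fractional sizes are truncated, and that the 200-record and 10-record minimums behave as the error messages on this page describe. The function name planned_holdout_size is hypothetical.

```python
MIN_RECORDS_FOR_HOLDOUT = 200  # taken from the "Dataset Too Small" message
MIN_HOLDOUT_RECORDS = 10       # taken from the "Holdout Too Small" message


def planned_holdout_size(n_records, holdout, max_holdout=None):
    """Estimate the holdout size for a dataset (illustrative assumptions only)."""
    if holdout == 0:
        return 0  # holdout disabled: every record is used for training
    if n_records < MIN_RECORDS_FOR_HOLDOUT:
        raise ValueError("Dataset must have at least 200 records to use holdout.")
    # Assumption: < 1 means a fraction of the data, >= 1 means a record count.
    size = int(n_records * holdout) if holdout < 1 else int(holdout)
    if max_holdout is not None:
        size = min(size, max_holdout)  # max_holdout caps either form
    if size < MIN_HOLDOUT_RECORDS:
        raise ValueError("Holdout dataset must have at least 10 records.")
    return size


print(planned_holdout_size(10_000, 0.2, max_holdout=1000))
print(planned_holdout_size(300, 0.25))
print(planned_holdout_size(150, 0))
```

Under these assumptions, a 10,000-record dataset with holdout 0.2 and max_holdout 1000 reserves 1000 records, because the cap applies before the fraction's 2000.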
Example
data:
  group_training_examples_by: customer_id
  order_training_examples_by: event_time
  max_sequences_per_example: "auto" # 1 if DP is enabled, otherwise null
  holdout: 0.2
  max_holdout: 1000
  random_state: 42 # optional; omit to auto-pick a random seed
Error Handling
Common Errors
Dataset Too Small
Dataset must have at least 200 records to use holdout.
Solution: Use a larger dataset or disable holdout by setting holdout: 0.
Holdout Too Small
Holdout dataset must have at least 10 records.
Solution: Increase holdout size or use a larger input dataset.
Missing Group Column
Group by column 'customer_id' not found in input dataset columns!
Solution: Verify that the column name exists in your dataset, or set group_training_examples_by: null.
Group Column Has Missing Values
Group by column 'customer_id' has missing values. Please remove/replace them.
Solution: Clean your data to remove or impute missing values in the grouping column.
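Both grouping errors can be caught before submitting a job with a small local check. The sketch below is one way to do that in plain Python, assuming records are loaded as a list of dictionaries and that "missing" means None or NaN; the product may apply a different definition. The helper names is_missing and check_group_column are hypothetical, and the messages simply mirror the ones shown above.

```python
import math


def is_missing(value):
    """Assumption: a value is missing if it is None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_group_column(records, group_by):
    """Return a list of problems with the grouping column (empty if none)."""
    if group_by is None:
        return []  # grouping disabled, nothing to check
    if not any(group_by in record for record in records):
        return [f"Group by column '{group_by}' not found in input dataset columns!"]
    if any(is_missing(record.get(group_by)) for record in records):
        return [
            f"Group by column '{group_by}' has missing values. "
            "Please remove/replace them."
        ]
    return []


# Made-up sample data with one missing customer_id.
rows = [
    {"customer_id": "a", "amount": 12},
    {"customer_id": None, "amount": 7},
]
print(check_group_column(rows, "customer_id"))
```

Dropping or imputing the flagged rows and running the check again should return an empty list.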