Training Configuration#
Detailed reference for the training hyperparameters used when training synthetic data generation models in NVIDIA NeMo Safe Synthesizer.
Overview#
Training hyperparameters control how the model learns from your tabular data. These parameters directly affect training performance, model quality, and resource usage. The system provides sensible defaults while allowing fine-tuning for specific use cases.
Core Training Parameters#
Training Data Configuration#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| num_input_records_to_sample | int/auto | Number of records for training | "auto" | ≥ 0 |
{
"training": {
"num_input_records_to_sample": "auto" # Automatic selection
# OR
"num_input_records_to_sample": 50000 # Specific count
}
}
Guidelines for num_input_records_to_sample:
“auto”: System selects based on dataset size and complexity
Specific Count: Direct control over training data volume
Epoch Relationship: A value equal to the number of records in the dataset corresponds to one epoch; larger values train for multiple epochs, smaller values subsample the data (see the sketch after this list)
Performance Impact: More records = longer training, potentially better quality
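To make the epoch relationship concrete, here is a minimal sketch (the helper below is hypothetical and assumes simple uniform sampling, which may differ from the service's actual behavior):

```python
# Hypothetical helper: relate num_input_records_to_sample to effective epochs.
def effective_epochs(num_input_records_to_sample: int, dataset_rows: int) -> float:
    """Approximate number of passes over the dataset during training."""
    return num_input_records_to_sample / dataset_rows

print(effective_epochs(50_000, 50_000))   # 1.0 -> one full epoch
print(effective_epochs(100_000, 50_000))  # 2.0 -> multiple epochs
print(effective_epochs(25_000, 50_000))   # 0.5 -> subsampling
```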
Batch Size Configuration#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| batch_size | int | Training batch size per device | 1 | ≥ 1 |
| gradient_accumulation_steps | int | Steps to accumulate gradients | 8 | ≥ 1 |
{
"training": {
"batch_size": 1,
"gradient_accumulation_steps": 8 # Effective batch size = 1 * 8 = 8
}
}
Effective Batch Size Calculation:
Effective Batch Size = batch_size × gradient_accumulation_steps × num_devices
Memory vs Performance Trade-offs:
Small Batch (1-2): Lower memory, stable training, slower convergence
Large Batch (4-8): Higher memory, faster convergence, potential instability
Gradient Accumulation: Simulate larger batches without memory increase
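As a quick sanity check of the effective batch size formula above (plain arithmetic; num_devices depends on your hardware and is assumed to be 1 here):

```python
# Effective Batch Size = batch_size x gradient_accumulation_steps x num_devices
batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single-GPU training

effective_batch_size = batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 8
```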
Context Length Scaling#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| rope_scaling_factor | int/auto | Scaling factor for context window | "auto" | 1 ≤ value ≤ 6 |
{
"training": {
"rope_scaling_factor": "auto" # Automatic scaling
# OR
"rope_scaling_factor": 2 # 2x context length
}
}
Scaling Impact:
Enables processing longer tabular sequences
May affect training stability and performance
Requires more GPU memory
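As a rough illustration of the scaling factor (the base context length below is a hypothetical placeholder; the real value depends on the underlying model, not on this setting):

```python
# BASE_CONTEXT_LENGTH is a hypothetical value for illustration only.
BASE_CONTEXT_LENGTH = 2048

def scaled_context(rope_scaling_factor: int) -> int:
    """Usable context window after applying the scaling factor."""
    return BASE_CONTEXT_LENGTH * rope_scaling_factor

print(scaled_context(2))  # 4096 tokens -> room for wider rows / longer sequences
```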
Optimization Parameters#
Learning Rate Configuration#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| learning_rate | float | Initial learning rate | 0.0005 | 0 < value < 1 |
| weight_decay | float | Weight decay factor | 0.01 | 0 < value < 1 |
| warmup_ratio | float | Warmup ratio for learning rate | 0.05 | > 0 |
{
"training": {
"learning_rate": 0.0005,
"weight_decay": 0.01,
"warmup_ratio": 0.05
}
}
Learning Rate Guidelines:
Conservative (1e-5 to 1e-4): Stable training, slower convergence
Standard (3e-4 to 1e-3): Balanced performance (recommended)
Aggressive (1e-3+): Fast convergence, potential instability
Learning Rate Scheduler#
| Parameter | Type | Description | Default |
|---|---|---|---|
| lr_scheduler | str | Learning rate scheduler type | "cosine" |
{
"training": {
"lr_scheduler": "cosine" # Cosine annealing
# Other options: "linear", "polynomial", "constant"
}
}
Scheduler Types:
cosine: Smooth decay, good for most cases
linear: Linear decay from initial to 0
polynomial: Polynomial decay curve
constant: No learning rate decay
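For intuition, here is a minimal sketch of a cosine schedule with linear warmup, the common shape behind the cosine option (illustrative only; the service's exact scheduler implementation may differ):

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 0.0005,
               warmup_ratio: float = 0.05) -> float:
    """Linear warmup followed by cosine decay to zero (illustrative only)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(10, 1000))    # still warming up
print(lr_at_step(500, 1000))   # mid-training, decaying
print(lr_at_step(1000, 1000))  # approximately 0 at the end
```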
LoRA-Specific Parameters#
LoRA Architecture Settings#
| Parameter | Type | Description | Default | Range |
|---|---|---|---|---|
| lora_r | int | LoRA rank (complexity) | 32 | > 0 |
| lora_alpha_over_r | float | Alpha/rank ratio | 1.0 | 0.5-3.0 |
| use_rslora | bool | Rank-stabilized LoRA | True | True/False |
{
"training": {
"lora_r": 32,
"lora_alpha_over_r": 1.0,
"use_rslora": true
}
}
LoRA Alpha Calculation:
lora_alpha = lora_r × lora_alpha_over_r
Parameter Relationships:
Higher rank = more trainable parameters = better expressiveness
Alpha controls scaling of LoRA updates
RSLoRA provides better training stability
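A small sketch of these relationships, assuming the standard LoRA scaling of alpha / r and the rank-stabilized variant's alpha / sqrt(r); the service's internals are not shown here:

```python
import math

def lora_scaling(lora_r: int, lora_alpha_over_r: float, use_rslora: bool) -> float:
    """Scale applied to LoRA updates: alpha / r normally, alpha / sqrt(r) with rsLoRA."""
    lora_alpha = lora_r * lora_alpha_over_r  # lora_alpha = lora_r x lora_alpha_over_r
    if use_rslora:
        return lora_alpha / math.sqrt(lora_r)
    return lora_alpha / lora_r

print(lora_scaling(32, 1.0, use_rslora=False))  # 1.0
print(lora_scaling(32, 1.0, use_rslora=True))   # ~5.66, i.e. alpha / sqrt(r)
```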
Target Module Selection#
{
"training": {
"lora_target_modules": [
"q_proj", # Query projection
"k_proj", # Key projection
"v_proj", # Value projection
"o_proj" # Output projection
]
}
}
Module Selection Strategies:
Attention-Only (Default):
"lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
Fast training, good for most tabular data
Lower memory usage
Sufficient for pattern learning
Full Attention + MLP:
"lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Slower training, higher memory
Better for complex data patterns
More expressive model capacity
Unsloth Optimization#
{
"training": {
"use_unsloth": false # Standard training
# OR
"use_unsloth": true # Unsloth optimization
}
}
Unsloth Configuration:
Automatically optimizes model architecture
Provides 2-5x speedup for supported models
Incompatible with differential privacy
Validation Configuration#
Validation Settings#
| Parameter | Type | Description | Default | Range |
|---|---|---|---|---|
| validation_ratio | float | Fraction of data for validation | 0.0 | 0.0-1.0 |
| validation_steps | int | Steps between validation checks | 15 | > 0 |
{
"training": {
"validation_ratio": 0.1, # 10% for validation
"validation_steps": 15 # Validate every 15 steps
}
}
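For example, with a hypothetical 50,000-record training sample, these settings behave roughly as follows:

```python
# Illustrative numbers only; the record count is a placeholder.
num_records = 50_000
validation_ratio = 0.1
validation_steps = 15

validation_records = int(num_records * validation_ratio)
training_records = num_records - validation_records

print(validation_records)  # 5000 records held out for validation
print(training_records)    # 45000 records left for training
# Validation metrics are computed every validation_steps optimizer steps.
```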
Hyperparameter Tuning Guidelines#
Performance Tuning#
Start with Defaults: Use default values as baseline
Adjust Gradually: Change one parameter at a time
Monitor Validation: Use validation to guide tuning
Consider Resources: Balance quality vs resource constraints
Common Tuning Patterns#
For Better Quality:
Increase lora_r (32 → 64)
Add more lora_target_modules
Increase num_input_records_to_sample
Lower learning_rate for stability
For Faster Training:
Enable use_unsloth
Increase batch_size (if memory allows)
Reduce lora_r (32 → 16)
Use fewer lora_target_modules
For Memory Efficiency (combined in the sketch after this list):
Set batch_size=1
Increase gradient_accumulation_steps
Reduce lora_r
Enable use_unsloth
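Putting the memory-efficiency recommendations together, a sketch of the relevant training block might look like this (values are illustrative, not prescriptive):

```python
# Sketch of a memory-lean training block (illustrative values only).
memory_efficient_training = {
    "training": {
        "batch_size": 1,                    # smallest per-device batch
        "gradient_accumulation_steps": 16,  # keep the effective batch size up without extra memory
        "lora_r": 16,                       # fewer trainable LoRA parameters
        "use_unsloth": True,                # skip if differential privacy is required
    }
}
```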