Training Configuration#
Detailed reference for the training hyperparameters used when training synthetic data generation models in NVIDIA NeMo Safe Synthesizer.
Overview#
Training hyperparameters control how the model learns from your tabular data. These parameters directly affect training performance, model quality, and resource usage. The system provides sensible defaults while allowing fine-tuning for specific use cases.
Core Training Parameters#
Training Data Configuration#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| num_input_records_to_sample | int/auto | Number of records for training | "auto" | ≥ 0 |
{
"training": {
"num_input_records_to_sample": "auto" # Automatic selection
# OR
"num_input_records_to_sample": 50000 # Specific count
}
}
Guidelines for num_input_records_to_sample:
“auto”: System selects based on dataset size and complexity
Specific Count: Direct control over training data volume
Epoch Relationship: A value equal to the number of records in the dataset corresponds to one epoch; larger values train for multiple epochs, smaller values subsample the data (see the sketch after this list)
Performance Impact: More records = longer training, potentially better quality
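To make the epoch relationship concrete, here is a minimal sketch (the helper below is hypothetical and assumes simple uniform sampling, which may differ from the service's actual behavior):

```python
# Hypothetical helper: relate num_input_records_to_sample to effective epochs.
def effective_epochs(num_input_records_to_sample: int, dataset_rows: int) -> float:
    """Approximate number of passes over the dataset during training."""
    return num_input_records_to_sample / dataset_rows

print(effective_epochs(50_000, 50_000))   # 1.0 -> one full epoch
print(effective_epochs(100_000, 50_000))  # 2.0 -> multiple epochs
print(effective_epochs(25_000, 50_000))   # 0.5 -> subsampling
```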
Batch Size Configuration#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| batch_size | int | Training batch size per device | 1 | ≥ 1 |
| gradient_accumulation_steps | int | Steps to accumulate gradients | 8 | ≥ 1 |
{
"training": {
"batch_size": 1,
"gradient_accumulation_steps": 8 # Effective batch size = 1 * 8 = 8
}
}
Effective Batch Size Calculation:
Effective Batch Size = batch_size × gradient_accumulation_steps × num_devices
Memory vs Performance Trade-offs:
Small Batch (1-2): Lower memory, stable training, slower convergence
Large Batch (4-8): Higher memory, faster convergence, potential instability
Gradient Accumulation: Simulate larger batches without memory increase
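As a quick sanity check of the effective batch size formula above (plain arithmetic; num_devices depends on your hardware and is assumed to be 1 here):

```python
# Effective Batch Size = batch_size x gradient_accumulation_steps x num_devices
batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single-GPU training

effective_batch_size = batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 8
```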
Context Length Scaling#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| rope_scaling_factor | int/auto | Scaling factor for context window | "auto" | 1 ≤ value ≤ 6 |
{
"training": {
"rope_scaling_factor": "auto" # Automatic scaling
# OR
"rope_scaling_factor": 2 # 2x context length
}
}
Scaling Impact:
Enables processing longer tabular sequences
May affect training stability and performance
Requires more GPU memory
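As a rough illustration of the scaling factor (the base context length below is a hypothetical placeholder; the real value depends on the underlying model, not on this setting):

```python
# BASE_CONTEXT_LENGTH is a hypothetical value for illustration only.
BASE_CONTEXT_LENGTH = 2048

def scaled_context(rope_scaling_factor: int) -> int:
    """Usable context window after applying the scaling factor."""
    return BASE_CONTEXT_LENGTH * rope_scaling_factor

print(scaled_context(2))  # 4096 tokens -> room for wider rows / longer sequences
```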
Optimization Parameters#
Learning Rate Configuration#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| learning_rate | float | Initial learning rate | 0.0005 | 0 < value < 1 |
| weight_decay | float | Weight decay factor | 0.01 | 0 < value < 1 |
| warmup_ratio | float | Warmup ratio for learning rate | 0.05 | > 0 |
{
"training": {
"learning_rate": 0.0005,
"weight_decay": 0.01,
"warmup_ratio": 0.05
}
}
Learning Rate Guidelines:
Conservative (1e-5 to 1e-4): Stable training, slower convergence
Standard (3e-4 to 1e-3): Balanced performance (recommended)
Aggressive (1e-3+): Fast convergence, potential instability
Learning Rate Scheduler#
| Parameter | Type | Description | Default |
|---|---|---|---|
| lr_scheduler | str | Learning rate scheduler type | "cosine" |
{
"training": {
"lr_scheduler": "cosine" # Cosine annealing
# Other options: "linear", "polynomial", "constant"
}
}
Scheduler Types:
cosine: Smooth decay, good for most cases
linear: Linear decay from initial to 0
polynomial: Polynomial decay curve
constant: No learning rate decay
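For intuition, here is a minimal sketch of a cosine schedule with linear warmup, the common shape behind the cosine option (illustrative only; the service's exact scheduler implementation may differ):

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 0.0005,
               warmup_ratio: float = 0.05) -> float:
    """Linear warmup followed by cosine decay to zero (illustrative only)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(10, 1000))    # still warming up
print(lr_at_step(500, 1000))   # mid-training, decaying
print(lr_at_step(1000, 1000))  # approximately 0 at the end
```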
LoRA-Specific Parameters#
LoRA Architecture Settings#
| Parameter | Type | Description | Default | Range |
|---|---|---|---|---|
| lora_r | int | LoRA rank (complexity) | 32 | > 0 |
| lora_alpha_over_r | float | Alpha/rank ratio | 1.0 | 0.5-3.0 |
| use_rslora | bool | Rank-stabilized LoRA | True | True/False |
{
"training": {
"lora_r": 32,
"lora_alpha_over_r": 1.0,
"use_rslora": true
}
}
LoRA Alpha Calculation:
lora_alpha = lora_r × lora_alpha_over_r
Parameter Relationships:
Higher rank = more trainable parameters = better expressiveness
Alpha controls scaling of LoRA updates
RSLoRA provides better training stability
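A small sketch of these relationships, assuming the standard LoRA scaling of alpha / r and the rank-stabilized variant's alpha / sqrt(r); the service's internals are not shown here:

```python
import math

def lora_scaling(lora_r: int, lora_alpha_over_r: float, use_rslora: bool) -> float:
    """Scale applied to LoRA updates: alpha / r normally, alpha / sqrt(r) with rsLoRA."""
    lora_alpha = lora_r * lora_alpha_over_r  # lora_alpha = lora_r x lora_alpha_over_r
    if use_rslora:
        return lora_alpha / math.sqrt(lora_r)
    return lora_alpha / lora_r

print(lora_scaling(32, 1.0, use_rslora=False))  # 1.0
print(lora_scaling(32, 1.0, use_rslora=True))   # ~5.66, i.e. alpha / sqrt(r)
```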
Target Module Selection#
{
"training": {
"lora_target_modules": [
"q_proj", # Query projection
"k_proj", # Key projection
"v_proj", # Value projection
"o_proj" # Output projection
]
}
}
Module Selection Strategies:
Attention-Only (Default):
"lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
Fast training, good for most tabular data
Lower memory usage
Sufficient for pattern learning
Full Attention + MLP:
"lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Slower training, higher memory
Better for complex data patterns
More expressive model capacity
Unsloth Optimization#
{
"training": {
"use_unsloth": false # Standard training
# OR
"use_unsloth": true # Unsloth optimization
}
}
Unsloth Configuration:
Automatically optimizes model architecture
Provides 2-5x speedup for supported models
Incompatible with differential privacy
Validation Configuration#
Validation Settings#
| Parameter | Type | Description | Default | Range |
|---|---|---|---|---|
| validation_ratio | float | Fraction of data for validation | 0.0 | 0.0-1.0 |
| validation_steps | int | Steps between validation checks | 15 | > 0 |
{
"training": {
"validation_ratio": 0.1, # 10% for validation
"validation_steps": 15 # Validate every 15 steps
}
}
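For example, with a hypothetical 50,000-record training sample, these settings behave roughly as follows:

```python
# Illustrative numbers only; the record count is a placeholder.
num_records = 50_000
validation_ratio = 0.1
validation_steps = 15

validation_records = int(num_records * validation_ratio)
training_records = num_records - validation_records

print(validation_records)  # 5000 records held out for validation
print(training_records)    # 45000 records left for training
# Validation metrics are computed every validation_steps optimizer steps.
```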
Hyperparameter Tuning Guidelines#
Performance Tuning#
Start with Defaults: Use default values as baseline
Adjust Gradually: Change one parameter at a time
Monitor Validation: Use validation to guide tuning
Consider Resources: Balance quality vs resource constraints
Common Tuning Patterns#
For Better Quality:
Increase lora_r (32 → 64)
Add more lora_target_modules
Increase num_input_records_to_sample
Lower learning_rate for stability
For Faster Training:
Enable use_unsloth
Increase batch_size (if memory allows)
Reduce lora_r (32 → 16)
Use fewer lora_target_modules
For Memory Efficiency (combined in the sketch after this list):
Set batch_size=1
Increase gradient_accumulation_steps
Reduce lora_r
Enable use_unsloth
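Putting the memory-efficiency recommendations together, a sketch of the relevant training block might look like this (values are illustrative, not prescriptive):

```python
# Sketch of a memory-lean training block (illustrative values only).
memory_efficient_training = {
    "training": {
        "batch_size": 1,                    # smallest per-device batch
        "gradient_accumulation_steps": 16,  # keep the effective batch size up without extra memory
        "lora_r": 16,                       # fewer trainable LoRA parameters
        "use_unsloth": True,                # skip if differential privacy is required
    }
}
```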