Training Configuration#

Detailed reference for the training hyperparameters that control synthetic data generation model training in NVIDIA NeMo Safe Synthesizer.

Overview#

Training hyperparameters control how the model learns from your tabular data. These parameters directly affect training performance, model quality, and resource usage. The system provides sensible defaults while allowing fine-tuning for specific use cases.

Core Training Parameters#

Training Data Configuration#

| Parameter | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| num_input_records_to_sample | int/auto | Number of records for training | "auto" | ≥ 0 |

{
  "training": {
    "num_input_records_to_sample": "auto"  # Automatic selection
    # OR
    "num_input_records_to_sample": 50000   # Specific count
  }
}

Guidelines for num_input_records_to_sample:

  • “auto”: System selects based on dataset size and complexity

  • Specific Count: Direct control over training data volume

  • Epoch Relationship: A value equal to the number of records in the dataset corresponds to 1 epoch; larger values train for multiple epochs; smaller values subsample the data (see the worked example after this list)

  • Performance Impact: More records = longer training, potentially better quality
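
As a worked illustration of the epoch relationship, assume a hypothetical input dataset of 20,000 records; the values below are illustrative, not recommendations:

{
  "training": {
    "num_input_records_to_sample": 40000  # 40,000 / 20,000 records = 2 epochs
    # A value of 10,000 would instead subsample half of the dataset (0.5 epochs)
  }
}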

Batch Size Configuration#

| Parameter | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| batch_size | int | Training batch size per device | 1 | ≥ 1 |
| gradient_accumulation_steps | int | Steps to accumulate gradients | 8 | ≥ 1 |

{
  "training": {
    "batch_size": 1,
    "gradient_accumulation_steps": 8  # Effective batch size = 1 * 8 = 8
  }
}

Effective Batch Size Calculation:

Effective Batch Size = batch_size × gradient_accumulation_steps × num_devices

Memory vs Performance Trade-offs:

  • Small Batch (1-2): Lower memory, stable training, slower convergence

  • Large Batch (4-8): Higher memory, faster convergence, potential instability

  • Gradient Accumulation: Simulates larger batches without increasing memory (see the sketch below)
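
For example, a minimal sketch of reaching an effective batch size of 16 on a single device through gradient accumulation; the specific values are assumptions for illustration:

{
  "training": {
    "batch_size": 2,                   # Per-device batch size
    "gradient_accumulation_steps": 8   # Effective batch size = 2 × 8 × 1 device = 16
  }
}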

Context Length Scaling#

| Parameter | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| rope_scaling_factor | int/auto | Scaling factor for the context window | "auto" | 1 ≤ value ≤ 6 |

{
  "training": {
    "rope_scaling_factor": "auto"  # Automatic scaling
    # OR
    "rope_scaling_factor": 2       # 2x context length
  }
}

Scaling Impact:

  • Enables processing longer tabular sequences (see the example after this list)

  • May affect training stability and performance

  • Requires more GPU memory
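
As an illustration, a hedged configuration for unusually wide records that exceed the base context window; the factor of 4 is an assumed value, not a tuned recommendation:

{
  "training": {
    "rope_scaling_factor": 4  # 4x the base context window; expect higher GPU memory usage
  }
}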

Optimization Parameters#

Learning Rate Configuration#

| Parameter | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| learning_rate | float | Initial learning rate for the AdamW optimizer | 0.0005 | 0 < value < 1 |
| weight_decay | float | Weight decay factor | 0.01 | 0 < value < 1 |
| warmup_ratio | float | Warmup ratio for the learning rate schedule | 0.05 | > 0 |

{
  "training": {
    "learning_rate": 0.0005,
    "weight_decay": 0.01,
    "warmup_ratio": 0.05
  }
}

Learning Rate Guidelines:

  • Conservative (1e-5 to 1e-4): Stable training, slower convergence (example after this list)

  • Standard (3e-4 to 1e-3): Balanced performance (recommended)

  • Aggressive (1e-3+): Fast convergence, potential instability
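
For instance, a conservative variant of the default settings, assuming stability matters more than training speed; the learning rate of 0.0001 is an illustrative value within the conservative range above:

{
  "training": {
    "learning_rate": 0.0001,  # Within the conservative 1e-5 to 1e-4 range
    "weight_decay": 0.01,
    "warmup_ratio": 0.05
  }
}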

Learning Rate Scheduler#

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| lr_scheduler | str | Learning rate scheduler type | "cosine" |

{
  "training": {
    "lr_scheduler": "cosine"  # Cosine annealing
    # Other options: "linear", "polynomial", "constant"
  }
}

Scheduler Types:

  • cosine: Smooth decay, good for most cases

  • linear: Linear decay from initial to 0

  • polynomial: Polynomial decay curve

  • constant: No learning rate decay

LoRA-Specific Parameters#

LoRA Architecture Settings#

| Parameter | Type | Description | Default | Range |
| --- | --- | --- | --- | --- |
| lora_r | int | LoRA rank (complexity) | 32 | > 0 |
| lora_alpha_over_r | float | Alpha/rank ratio | 1.0 | 0.5-3.0 |
| use_rslora | bool | Rank-stabilized LoRA | True | True/False |

{
  "training": {
    "lora_r": 32,
    "lora_alpha_over_r": 1.0,
    "use_rslora": true
  }
}

LoRA Alpha Calculation:

lora_alpha = lora_r × lora_alpha_over_r

Parameter Relationships:

  • A higher rank means more trainable parameters and greater expressiveness (see the example after this list)

  • Alpha controls scaling of LoRA updates

  • RSLoRA provides better training stability
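
For example, a higher-capacity LoRA configuration; the rank of 64 is an illustrative assumption, and the resulting alpha follows the calculation above:

{
  "training": {
    "lora_r": 64,
    "lora_alpha_over_r": 1.0,  # lora_alpha = 64 × 1.0 = 64
    "use_rslora": true
  }
}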

Target Module Selection#

{
  "training": {
    "lora_target_modules": [
      "q_proj",    # Query projection
      "k_proj",    # Key projection  
      "v_proj",    # Value projection
      "o_proj"     # Output projection
    ]
  }
}

Module Selection Strategies:

Attention-Only (Default):

"lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
  • Fast training, good for most tabular data

  • Lower memory usage

  • Sufficient for pattern learning

Full Attention + MLP:

"lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
  • Slower training, higher memory

  • Better for complex data patterns

  • More expressive model capacity

Unsloth Optimization#

{
  "training": {
    "use_unsloth": false  # Standard training
    # OR  
    "use_unsloth": true   # Unsloth optimization
  }
}

Unsloth Configuration:

  • Automatically optimizes model architecture

  • Provides 2-5x speedup for supported models

  • Incompatible with differential privacy

Validation Configuration#

Validation Settings#

| Parameter | Type | Description | Default | Range |
| --- | --- | --- | --- | --- |
| validation_ratio | float | Fraction of data for validation | 0.0 | 0.0-1.0 |
| validation_steps | int | Steps between validation checks | 15 | > 0 |

{
  "training": {
    "validation_ratio": 0.1,      # 10% for validation
    "validation_steps": 15        # Validate every 15 steps
  }
}

Hyperparameter Tuning Guidelines#

Performance Tuning#

  1. Start with Defaults: Use default values as baseline

  2. Adjust Gradually: Change one parameter at a time

  3. Monitor Validation: Use validation to guide tuning

  4. Consider Resources: Balance quality vs resource constraints

Common Tuning Patterns#

For Better Quality (example configuration after this list):

  • Increase lora_r (32 → 64)

  • Add more lora_target_modules

  • Increase num_input_records_to_sample

  • Lower learning_rate for stability
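
A hedged sketch combining these quality-oriented adjustments; the specific values (rank 64, 100,000 records, learning rate 0.0002) are illustrative assumptions rather than validated recommendations:

{
  "training": {
    "lora_r": 64,
    "lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "num_input_records_to_sample": 100000,
    "learning_rate": 0.0002  # Lower than the 0.0005 default for extra stability
  }
}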

For Faster Training (example configuration after this list):

  • Enable use_unsloth

  • Increase batch_size (if memory allows)

  • Reduce lora_r (32 → 16)

  • Use fewer lora_target_modules
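
A hedged sketch combining these speed-oriented adjustments, assuming sufficient GPU memory for the larger batch; the specific values and the reduced module subset are illustrative:

{
  "training": {
    "use_unsloth": true,   # Not compatible with differential privacy
    "batch_size": 4,
    "lora_r": 16,
    "lora_target_modules": ["q_proj", "v_proj"]  # Reduced attention-only subset (assumed)
  }
}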

For Memory Efficiency (example configuration after this list):

  • Set batch_size=1

  • Increase gradient_accumulation_steps

  • Reduce lora_r

  • Enable use_unsloth
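
A hedged sketch combining these memory-saving adjustments; the accumulation value of 16 and rank of 16 are illustrative assumptions:

{
  "training": {
    "batch_size": 1,
    "gradient_accumulation_steps": 16,  # Preserves effective batch size without extra memory
    "lora_r": 16,
    "use_unsloth": true                 # Not compatible with differential privacy
  }
}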