Generation Configuration#

Configure synthetic data generation parameters in NVIDIA NeMo Safe Synthesizer, including sampling settings, quality control, and structured generation options.

Overview#

Generation configuration controls how synthetic data is created from trained models. These parameters affect the quality, diversity, and validity of generated synthetic records, with built-in quality control mechanisms to ensure reliable output.

Generation Parameters#

Core Generation Settings#

| Parameter | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| num_records | int | Number of synthetic records to generate | 1000 | 0 < value ≤ 130,000 |
| temperature | float | Sampling temperature for randomness | 0.9 | > 0 |
| repetition_penalty | float | Penalty for token repetition | 1.0 | ≥ 1.0 |
| top_p | float | Nucleus sampling probability | 1.0 | 0 < value ≤ 1 |

Example:

{
  "generation": {
    "num_records": 5000,
    "temperature": 0.8,
    "repetition_penalty": 1.1,
    "top_p": 0.95
  }
}
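
To apply these settings, include the generation block in a job's config. A minimal sketch using the Python client that also appears in the batch example later in this section (this assumes client is an already-instantiated NeMo Microservices client):

# `client` is assumed to be an instantiated NeMo Microservices Python client
job = client.beta.safe_synthesizer.jobs.create(
    name="generation-example",
    project="default",
    spec={
        "data_source": "hf://datasets/default/safe-synthesizer/dataset.csv",
        "config": {
            "enable_synthesis": True,
            "generation": {
                "num_records": 5000,
                "temperature": 0.8,
                "repetition_penalty": 1.1,
                "top_p": 0.95,
            },
        },
    },
)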

Quality Control Settings#

| Parameter | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| patience | int | Number of consecutive invalid batches tolerated before generation stops | 1 | ≥ 1 |
| invalid_fraction_threshold | float | Fraction of invalid records in a batch above which the batch counts as invalid | 0.8 | 0.0 ≤ value ≤ 1.0 |
| use_structured_generation | bool | Enable structured generation | False | True/False |

Example:

{
  "generation": {
    "patience": 3,
    "invalid_fraction_threshold": 0.6,
    "use_structured_generation": true
  }
}

Record Generation Limits#

Maximum Record Constraints#

MAX_RECORDS_TO_GENERATE = 130000
if num_records > MAX_RECORDS_TO_GENERATE:
    raise ParameterError(
        "The number of records requested to generate is larger than the max "
        f"allowed number of {MAX_RECORDS_TO_GENERATE}. Reduce the "
        "number of records or break into smaller batches."
    )
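
A request above the cap fails outright, so split large totals client-side before submitting jobs. A minimal sketch (the helper name and its default batch size are illustrative, not part of the service API):

def split_num_records(total: int, batch_size: int = 50_000) -> list[dict]:
    """Split a total record count into per-job generation configs under the cap."""
    configs = []
    remaining = total
    while remaining > 0:
        n = min(batch_size, remaining)
        configs.append({"generation": {"num_records": n}})
        remaining -= n
    return configs

# split_num_records(130_000) -> batches of 50,000, 50,000, and 30,000,
# matching the batch_configs example below.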

Batch Processing for Large Datasets#

Note

Before you start, make sure that you have:

  • Stored your CSV files locally

  • Uploaded them with the following commands:

export HF_ENDPOINT="http://localhost:3000/v1/hf"
huggingface-cli upload --repo-type dataset default/safe-synthesizer dataset.csv

For datasets requiring more than 130,000 records:

# Split into multiple jobs
batch_configs = [
    {"generation": {"num_records": 50000}},  # Batch 1
    {"generation": {"num_records": 50000}},  # Batch 2
    {"generation": {"num_records": 30000}}   # Batch 3
]

# Submit one job per batch using the Python client
# (`client` is an instantiated NeMo Microservices client)
jobs = []
for i, config in enumerate(batch_configs):
    job_request = {
        "name": f"batch-{i}",
        "project": "default",
        "spec": {
            "data_source": "hf://datasets/default/safe-synthesizer/dataset.csv",
            "config": {
                "enable_synthesis": True,
                "enable_replace_pii": True,
                "replace_pii": {"steps": [{"rows": {"update": [{"entity": ["email"], "value": "column.entity | fake"}]}}]},
                **config,  # per-batch generation settings from batch_configs
            },
        },
    }

    job = client.beta.safe_synthesizer.jobs.create(**job_request)
    jobs.append(job)

# Access results after each job completes
for job in jobs:
    results = client.beta.safe_synthesizer.jobs.results.list(job.id)
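
Fetching results assumes the job has finished, so in practice poll until completion first. A sketch of such a loop, where the jobs.retrieve accessor and the status field are assumptions (check your client version's reference for the exact names):

import time

def wait_for_completion(client, job_id: str, interval_seconds: float = 10.0):
    """Poll until the job reaches a terminal state (accessor names are assumed)."""
    while True:
        job = client.beta.safe_synthesizer.jobs.retrieve(job_id)  # hypothetical accessor
        if job.status in ("completed", "error", "cancelled"):     # hypothetical states
            return job
        time.sleep(interval_seconds)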

Sampling Parameters#

Temperature Control#

Control randomness in synthetic data generation:

{
  "generation": {
    "temperature": 0.8
  }
}

Temperature Guidelines:

  • Low (0.1-0.5): Conservative generation, consistent patterns

  • Medium (0.6-1.0): Balanced diversity and consistency

  • High (1.0-2.0): High diversity, more creative generation
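
For intuition, temperature rescales the model's token logits before sampling: probabilities are computed as softmax(logits / T). A minimal NumPy sketch of the mechanism (illustrative, not the service's internal implementation):

import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Sample a token index after rescaling logits by 1/temperature."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Lower temperature sharpens the distribution (more conservative output);
# higher temperature flattens it (more diverse output).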

Nucleus Sampling (Top-P)#

Control token selection probability:

{
  "generation": {
    "top_p": 0.9   # Consider top 90% probability mass
  }
}

Top-P Guidelines:

  • High (0.95-1.0): Include more token options, higher diversity

  • Medium (0.8-0.95): Balanced selection (recommended)

  • Low (0.5-0.8): Focused selection, more consistent output
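
Concretely, nucleus sampling keeps only the smallest set of tokens whose cumulative probability reaches top_p, then renormalizes. An illustrative NumPy sketch (not the service's internal code):

import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Zero out tokens outside the top-p nucleus and renormalize."""
    order = np.argsort(probs)[::-1]                  # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set reaching top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()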

Repetition Penalty#

Prevent repetitive generation patterns:

{
  "generation": {
    "repetition_penalty": 1.1
  }
}

Penalty Guidelines:

  • 1.0: No repetition penalty

  • 1.1-1.2: Light penalty for minor repetition reduction

  • 1.3-1.5: Strong penalty for highly diverse output
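
One common formulation (introduced in the CTRL paper and typical of HuggingFace-style implementations; the service's exact internals are not documented here) divides the logits of already-generated tokens by the penalty when positive and multiplies when negative:

import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_ids: list[int],
                             penalty: float) -> np.ndarray:
    """Discourage tokens that already appear in the generated sequence."""
    adjusted = logits.copy()
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive logits
        else:
            adjusted[token_id] *= penalty   # push negative logits further down
    return adjusted

# penalty = 1.0 leaves logits unchanged; larger values suppress repeats more.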

Quality Control#

Stopping Conditions#

Configure automatic stopping for poor-quality generation:

{
  "generation": {
    "patience": 3,                      # Allow 3 consecutive bad batches
    "invalid_fraction_threshold": 0.8   # Stop if >80% records invalid
  }
}
The stopping logic, sketched after this list:

  1. Monitor the invalid record fraction per batch

  2. Count consecutive batches exceeding threshold

  3. Stop generation when patience limit reached

  4. Prevent extremely long generation jobs with poor quality
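
A minimal sketch of this logic (parameter names mirror the config, but the control flow is an assumption based on the description above, not the service's code):

def should_stop(invalid_fractions: list[float], patience: int,
                invalid_fraction_threshold: float) -> bool:
    """Stop once `patience` consecutive batches exceed the invalid threshold."""
    consecutive_bad = 0
    for fraction in invalid_fractions:          # per-batch invalid fractions, in order
        if fraction > invalid_fraction_threshold:
            consecutive_bad += 1
            if consecutive_bad >= patience:
                return True
        else:
            consecutive_bad = 0                 # a good batch resets the count
    return False

# e.g. should_stop([0.9, 0.85, 0.95], patience=3, invalid_fraction_threshold=0.8) -> True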

Structured Generation#

Enable schema-aware generation for better format control:

{
  "generation": {
    "use_structured_generation": true
  }
}

Structured Generation Features:

  • Schema Validation: Ensure outputs match expected format

  • Type Enforcement: Maintain data types and constraints

  • Format Control: Generate valid tabular structures

  • Quality Improvement: Reduce invalid record generation
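
Structured generation constrains decoding to the table's schema; the record-validation idea behind it can be illustrated with a plain-Python sketch (the schema format here is illustrative, not the service's):

def is_valid_record(record: dict, schema: dict[str, type]) -> bool:
    """Check that a generated record has every expected column with the right type."""
    return (
        set(record) == set(schema)
        and all(isinstance(record[col], t) for col, t in schema.items())
    )

schema = {"age": int, "income": float, "city": str}
is_valid_record({"age": 42, "income": 55000.0, "city": "Austin"}, schema)           # True
is_valid_record({"age": "forty-two", "income": 55000.0, "city": "Austin"}, schema)  # False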

Generation Process Control#

Batch Generation#

The system generates data in batches with adaptive sizing:

def get_next_num_prompts(self) -> int:
    """Return the optimal number of prompts for the next batch."""
    if self.num_valid_records > 0:
        # Estimate yield from completed batches and size the next batch to
        # cover the remaining records (attribute names here are illustrative).
        valid_records_per_prompt = self.num_valid_records / self.num_prompts
        num_prompts_needed = round(self.num_records_remaining / valid_records_per_prompt)
        return min(self.max_prompts_per_batch, num_prompts_needed + self.buffer)
    # No valid records yet: fall back to the default batch size.
    return self.default_num_prompts

Adaptive Batch Sizing:

  • Starts with default batch size

  • Adjusts based on valid record generation rate

  • Optimizes for target record count

  • Includes buffer for generation variability

Progress Monitoring#

The system provides detailed progress monitoring:

# Generation progress logging
logger.info(
    f"Progress: {generation.num_valid_records} out of {num_records} records generated."
)

# Performance metrics
records_per_second = batch.num_valid_records / duration
logger.info(f"Generation speed: {records_per_second:.1f} records per second.")

Custom Stopping Conditions#

{
  "generation": {
    "patience": 5,                    # Allow more bad batches
    "invalid_fraction_threshold": 0.3  # Stricter quality requirement
  }
}

Stopping Behavior:

  • Monitor running average of invalid record fraction

  • Count consecutive batches exceeding threshold

  • Stop generation when patience limit reached

  • Prevent infinite generation loops