Generation Configuration#

Configure synthetic data generation parameters in NVIDIA NeMo Safe Synthesizer, including sampling settings, quality control, and structured generation options.

Overview#

Generation configuration controls how synthetic data is created from trained models. These parameters affect the quality, diversity, and validity of generated synthetic records, with built-in quality control mechanisms to ensure reliable output.

Generation Parameters#

Core Generation Settings#

| Parameter | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| num_records | int | Number of synthetic records to generate | 1000 | 0 < value ≤ 130,000 |
| temperature | float | Sampling temperature for randomness | 0.9 | > 0 |
| repetition_penalty | float | Penalty for token repetition | 1.0 | ≥ 1.0 |
| top_p | float | Nucleus sampling probability | 1.0 | 0 < value ≤ 1 |

Example:

{
  "generation": {
    "num_records": 5000,
    "temperature": 0.8,
    "repetition_penalty": 1.1,
    "top_p": 0.95
  }
}
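
To apply these settings, include the generation block in a job's config. A minimal sketch using the Python client that also appears in the batch example later in this section (this assumes client is an already-instantiated NeMo Microservices client):

# `client` is assumed to be an instantiated NeMo Microservices Python client
job = client.beta.safe_synthesizer.jobs.create(
    name="generation-example",
    project="default",
    spec={
        "data_source": "hf://datasets/default/safe-synthesizer/dataset.csv",
        "config": {
            "enable_synthesis": True,
            "generation": {
                "num_records": 5000,
                "temperature": 0.8,
                "repetition_penalty": 1.1,
                "top_p": 0.95,
            },
        },
    },
)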

Quality Control Settings#

| Parameter | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| patience | int | Number of consecutive invalid batches tolerated before generation stops | 1 | ≥ 1 |
| invalid_fraction_threshold | float | Fraction of invalid records in a batch above which the batch counts as invalid | 0.8 | 0.0 ≤ value ≤ 1.0 |
| use_structured_generation | bool | Enable structured generation | False | True/False |

Example:

{
  "generation": {
    "patience": 3,
    "invalid_fraction_threshold": 0.6,
    "use_structured_generation": true
  }
}

Record Generation Limits#

Maximum Record Constraints#

MAX_RECORDS_TO_GENERATE = 130000
if num_records > MAX_RECORDS_TO_GENERATE:
    raise ParameterError(
        "The number of records requested to generate is larger than the max "
        f"allowed number of {MAX_RECORDS_TO_GENERATE}. Reduce the "
        "number of records or break into smaller batches."
    )
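
A request above the cap fails outright, so split large totals client-side before submitting jobs. A minimal sketch (the helper name and its default batch size are illustrative, not part of the service API):

def split_num_records(total: int, batch_size: int = 50_000) -> list[dict]:
    """Split a total record count into per-job generation configs under the cap."""
    configs = []
    remaining = total
    while remaining > 0:
        n = min(batch_size, remaining)
        configs.append({"generation": {"num_records": n}})
        remaining -= n
    return configs

# split_num_records(130_000) -> batches of 50,000, 50,000, and 30,000,
# matching the batch_configs example below.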

Batch Processing for Large Datasets#

Note

Before you start, make sure that you have:

  • Stored your CSV files locally

  • Uploaded them with the following commands:

export HF_ENDPOINT="http://localhost:3000/v1/hf"
huggingface-cli upload --repo-type dataset default/safe-synthesizer dataset.csv

For datasets requiring more than 130,000 records:

# Split into multiple jobs
batch_configs = [
    {"generation": {"num_records": 50000}},  # Batch 1
    {"generation": {"num_records": 50000}},  # Batch 2
    {"generation": {"num_records": 30000}}   # Batch 3
]

# Submit one job per batch using the Python client
# (`client` is an instantiated NeMo Microservices client)
jobs = []
for i, config in enumerate(batch_configs):
    job_request = {
        "name": f"batch-{i}",
        "project": "default",
        "spec": {
            "data_source": "hf://datasets/default/safe-synthesizer/dataset.csv",
            "config": {
                "enable_synthesis": True,
                "enable_replace_pii": True,
                "replace_pii": {"steps": [{"rows": {"update": [{"entity": ["email"], "value": "column.entity | fake"}]}}]},
                **config,  # per-batch generation settings from batch_configs
            },
        },
    }

    job = client.beta.safe_synthesizer.jobs.create(**job_request)
    jobs.append(job)

# Access results after each job completes
for job in jobs:
    results = client.beta.safe_synthesizer.jobs.results.list(job.id)
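
Fetching results assumes the job has finished, so in practice poll until completion first. A sketch of such a loop, where the jobs.retrieve accessor and the status field are assumptions (check your client version's reference for the exact names):

import time

def wait_for_completion(client, job_id: str, interval_seconds: float = 10.0):
    """Poll until the job reaches a terminal state (accessor names are assumed)."""
    while True:
        job = client.beta.safe_synthesizer.jobs.retrieve(job_id)  # hypothetical accessor
        if job.status in ("completed", "error", "cancelled"):     # hypothetical states
            return job
        time.sleep(interval_seconds)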

Sampling Parameters#

Temperature Control#

Control randomness in synthetic data generation:

{
  "generation": {
    "temperature": 0.8
  }
}

Temperature Guidelines:

  • Low (0.1-0.5): Conservative generation, consistent patterns

  • Medium (0.6-1.0): Balanced diversity and consistency

  • High (1.0-2.0): High diversity, more creative generation
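
For intuition, temperature rescales the model's token logits before sampling: probabilities are computed as softmax(logits / T). A minimal NumPy sketch of the mechanism (illustrative, not the service's internal implementation):

import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Sample a token index after rescaling logits by 1/temperature."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Lower temperature sharpens the distribution (more conservative output);
# higher temperature flattens it (more diverse output).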

Nucleus Sampling (Top-P)#

Control token selection probability:

{
  "generation": {
    "top_p": 0.9   # Consider top 90% probability mass
  }
}

Top-P Guidelines:

  • High (0.95-1.0): Include more token options, higher diversity

  • Medium (0.8-0.95): Balanced selection (recommended)

  • Low (0.5-0.8): Focused selection, more consistent output
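
Concretely, nucleus sampling keeps only the smallest set of tokens whose cumulative probability reaches top_p, then renormalizes. An illustrative NumPy sketch (not the service's internal code):

import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Zero out tokens outside the top-p nucleus and renormalize."""
    order = np.argsort(probs)[::-1]                  # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set reaching top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()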

Repetition Penalty#

Prevent repetitive generation patterns:

{
  "generation": {
    "repetition_penalty": 1.1
  }
}

Penalty Guidelines:

  • 1.0: No repetition penalty

  • 1.1-1.2: Light penalty for minor repetition reduction

  • 1.3-1.5: Strong penalty for highly diverse output
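
One common formulation (introduced in the CTRL paper and typical of HuggingFace-style implementations; the service's exact internals are not documented here) divides the logits of already-generated tokens by the penalty when positive and multiplies when negative:

import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_ids: list[int],
                             penalty: float) -> np.ndarray:
    """Discourage tokens that already appear in the generated sequence."""
    adjusted = logits.copy()
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive logits
        else:
            adjusted[token_id] *= penalty   # push negative logits further down
    return adjusted

# penalty = 1.0 leaves logits unchanged; larger values suppress repeats more.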

Quality Control#

Stopping Conditions#

Configure automatic stopping for poor-quality generation:

{
  "generation": {
    "patience": 3,                      # Allow 3 consecutive bad batches
    "invalid_fraction_threshold": 0.8   # Stop if >80% records invalid
  }
}
The stopping logic, sketched after this list:

  1. Monitor the invalid record fraction per batch

  2. Count consecutive batches exceeding threshold

  3. Stop generation when patience limit reached

  4. Prevent extremely long generation jobs with poor quality
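
A minimal sketch of this logic (parameter names mirror the config, but the control flow is an assumption based on the description above, not the service's code):

def should_stop(invalid_fractions: list[float], patience: int,
                invalid_fraction_threshold: float) -> bool:
    """Stop once `patience` consecutive batches exceed the invalid threshold."""
    consecutive_bad = 0
    for fraction in invalid_fractions:          # per-batch invalid fractions, in order
        if fraction > invalid_fraction_threshold:
            consecutive_bad += 1
            if consecutive_bad >= patience:
                return True
        else:
            consecutive_bad = 0                 # a good batch resets the count
    return False

# e.g. should_stop([0.9, 0.85, 0.95], patience=3, invalid_fraction_threshold=0.8) -> True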

Structured Generation#

Enable schema-aware generation for better format control:

{
  "generation": {
    "use_structured_generation": true
  }
}

Structured Generation Features:

  • Schema Validation: Ensure outputs match expected format

  • Type Enforcement: Maintain data types and constraints

  • Format Control: Generate valid tabular structures

  • Quality Improvement: Reduce invalid record generation
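
Structured generation constrains decoding to the table's schema; the record-validation idea behind it can be illustrated with a plain-Python sketch (the schema format here is illustrative, not the service's):

def is_valid_record(record: dict, schema: dict[str, type]) -> bool:
    """Check that a generated record has every expected column with the right type."""
    return (
        set(record) == set(schema)
        and all(isinstance(record[col], t) for col, t in schema.items())
    )

schema = {"age": int, "income": float, "city": str}
is_valid_record({"age": 42, "income": 55000.0, "city": "Austin"}, schema)           # True
is_valid_record({"age": "forty-two", "income": 55000.0, "city": "Austin"}, schema)  # False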

Generation Process Control#

Batch Generation#

The system generates data in batches with adaptive sizing:

def get_next_num_prompts(self) -> int:
    """Return the optimal number of prompts for the next batch."""
    if self.num_valid_records > 0:
        # Estimate yield from completed batches and size the next batch to
        # cover the remaining records (attribute names here are illustrative).
        valid_records_per_prompt = self.num_valid_records / self.num_prompts
        num_prompts_needed = round(self.num_records_remaining / valid_records_per_prompt)
        return min(self.max_prompts_per_batch, num_prompts_needed + self.buffer)
    # No valid records yet: fall back to the default batch size.
    return self.default_num_prompts

Adaptive Batch Sizing:

  • Starts with default batch size

  • Adjusts based on valid record generation rate

  • Optimizes for target record count

  • Includes buffer for generation variability

Progress Monitoring#

The system provides detailed progress monitoring:

# Generation progress logging
logger.info(
    f"Progress: {generation.num_valid_records} out of {num_records} records generated."
)

# Performance metrics
records_per_second = batch.num_valid_records / duration
logger.info(f"Generation speed: {records_per_second:.1f} records per second.")

Custom Stopping Conditions#

{
  "generation": {
    "patience": 5,                    # Allow more bad batches
    "invalid_fraction_threshold": 0.3  # Stricter quality requirement
  }
}

Stopping Behavior:

  • Monitor running average of invalid record fraction

  • Count consecutive batches exceeding threshold

  • Stop generation when patience limit reached

  • Prevent infinite generation loops