Generation Configuration#
Configure synthetic data generation parameters in NVIDIA NeMo Safe Synthesizer, including sampling settings, quality control, and structured generation options.
Overview#
Generation configuration controls how synthetic data is created from trained models. These parameters affect the quality, diversity, and validity of generated synthetic records, with built-in quality control mechanisms to ensure reliable output.
Generation Parameters#
Core Generation Settings#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| `num_records` | int | Number of synthetic records to generate | 1000 | 0 < value ≤ 130,000 |
| `temperature` | float | Sampling temperature for randomness | 0.9 | > 0 |
| `repetition_penalty` | float | Penalty for token repetition | 1.0 | ≥ 1.0 |
| `top_p` | float | Nucleus sampling probability | 1.0 | 0 < value ≤ 1 |
```json
{
  "generation": {
    "num_records": 5000,
    "temperature": 0.8,
    "repetition_penalty": 1.1,
    "top_p": 0.95
  }
}
```
Quality Control Settings#
| Parameter | Type | Description | Default | Validation |
|---|---|---|---|---|
| `patience` | int | Consecutive invalid batches before stopping | 1 | ≥ 1 |
| `invalid_fraction_threshold` | float | Invalid record fraction threshold | 0.8 | 0.0 ≤ value ≤ 1.0 |
| `use_structured_generation` | bool | Enable structured generation | False | True/False |
```json
{
  "generation": {
    "patience": 3,
    "invalid_fraction_threshold": 0.6,
    "use_structured_generation": true
  }
}
```
Record Generation Limits#
Maximum Record Constraints#
```python
MAX_RECORDS_TO_GENERATE = 130000

# Requests above the limit are rejected (values up to and including the
# limit are allowed, matching the validation rule in the table above)
if num_records > MAX_RECORDS_TO_GENERATE:
    raise ParameterError(
        "The number of records requested to generate is larger than the max "
        f"allowed number of {MAX_RECORDS_TO_GENERATE}. Reduce the "
        "number of records or break into smaller batches."
    )
```
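To stay under this limit, a large target count can be split into compliant per-job chunks before submission. A minimal sketch; the `split_record_count` helper is illustrative, not part of the API:

```python
MAX_RECORDS_TO_GENERATE = 130000

def split_record_count(total: int, batch_size: int = 50000) -> list[int]:
    """Split a large record target into per-job counts under the limit."""
    if not 0 < batch_size <= MAX_RECORDS_TO_GENERATE:
        raise ValueError("batch_size must be within the per-job maximum")
    counts = [batch_size] * (total // batch_size)
    if total % batch_size:
        counts.append(total % batch_size)
    return counts

print(split_record_count(130000))  # [50000, 50000, 30000]
```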
Batch Processing for Large Datasets#
Note

Before you start, make sure that you have:

- Stored your CSVs locally
- Uploaded them using the following steps:

```bash
export HF_ENDPOINT="http://localhost:3000/v1/hf"
huggingface-cli upload --repo-type dataset default/safe-synthesizer dataset.csv
```
For datasets requiring more than 130,000 records:
```python
# Split the target count into multiple jobs under the per-job limit
batch_configs = [
    {"generation": {"num_records": 50000}},  # Batch 1
    {"generation": {"num_records": 50000}},  # Batch 2
    {"generation": {"num_records": 30000}},  # Batch 3
]

# Submit one job per batch through the client
jobs = []
for i, config in enumerate(batch_configs):
    job_request = {
        "name": f"batch-{i}",
        "project": "default",
        "spec": {
            "data_source": "hf://datasets/default/safe-synthesizer/dataset.csv",
            "config": {
                "enable_synthesis": True,
                "enable_replace_pii": True,
                "replace_pii": {"steps": [{"rows": {"update": [{"entity": ["email"], "value": "column.entity | fake"}]}}]},
                **config,  # apply this batch's generation settings
            },
        },
    }
    jobs.append(client.beta.safe_synthesizer.jobs.create(**job_request))

# Access results after each job completes
for job in jobs:
    results = client.beta.safe_synthesizer.jobs.results.list(job.id)
```
Sampling Parameters#
Temperature Control#
Control randomness in synthetic data generation:
```json
{
  "generation": {
    "temperature": 0.1  # Very conservative, low diversity
    # OR
    "temperature": 0.8  # Balanced diversity (recommended)
    # OR
    "temperature": 1.5  # High diversity, more variation
  }
}
```
Temperature Guidelines:

- Low (0.1-0.5): Conservative generation, consistent patterns
- Medium (0.6-1.0): Balanced diversity and consistency
- High (1.0-2.0): High diversity, more creative generation
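To see why temperature behaves this way, note that it divides the model's logits before the softmax: values below 1 sharpen the distribution, values above 1 flatten it. A minimal, self-contained sketch (not the synthesizer's actual sampler):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then normalize into probabilities."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.1))  # nearly one-hot: conservative
print(softmax_with_temperature(logits, 1.5))  # flatter: more diverse
```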
Nucleus Sampling (Top-P)#
Control token selection probability:
```json
{
  "generation": {
    "top_p": 0.9  # Consider top 90% probability mass
  }
}
```
Top-P Guidelines:

- High (0.95-1.0): Include more token options, higher diversity
- Medium (0.8-0.95): Balanced selection (recommended)
- Low (0.5-0.8): Focused selection, more consistent output
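Conceptually, top-p keeps the smallest set of highest-probability tokens whose cumulative probability reaches the threshold, then renormalizes over that set. An illustrative sketch of the filtering step, not the service's internal implementation:

```python
def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Zero out tokens outside the nucleus (the smallest set of
    highest-probability tokens covering top_p mass) and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.9))  # drops the 0.05 tail token
```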
Repetition Penalty#
Prevent repetitive generation patterns:
```json
{
  "generation": {
    "repetition_penalty": 1.0  # No penalty (default)
    # OR
    "repetition_penalty": 1.1  # Light penalty
    # OR
    "repetition_penalty": 1.3  # Strong penalty
  }
}
```
Penalty Guidelines:

- 1.0: No repetition penalty
- 1.1-1.2: Light penalty for minor repetition reduction
- 1.3-1.5: Strong penalty for highly diverse output
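One common formulation of this penalty (from the CTRL paper) divides an already-generated token's logit by the penalty when it is positive and multiplies when negative; whether Safe Synthesizer uses exactly this form is not stated here, but it illustrates why values ≥ 1.0 discourage repeats:

```python
def apply_repetition_penalty(logits: list[float], seen_ids: set[int], penalty: float) -> list[float]:
    """Discourage tokens that have already been generated (CTRL-style)."""
    out = list(logits)
    for token_id in seen_ids:
        if out[token_id] > 0:
            out[token_id] /= penalty   # shrink positive logits
        else:
            out[token_id] *= penalty   # push negative logits lower
    return out

print(apply_repetition_penalty([3.0, -1.0, 2.0], {0, 1}, 1.3))  # [~2.31, -1.3, 2.0]
```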
Quality Control#
Stopping Conditions#
Configure automatic stopping for poor-quality generation:
```json
{
  "generation": {
    "patience": 3,  # Allow 3 consecutive bad batches
    "invalid_fraction_threshold": 0.8  # Stop if >80% of records invalid
  }
}
```
The stopping mechanism works as follows (see the sketch below):

- Monitor the invalid record fraction per batch
- Count consecutive batches exceeding the threshold
- Stop generation when the patience limit is reached
- Prevent extremely long generation jobs with poor quality
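A minimal sketch of this logic, assuming the consecutive-batch counter resets whenever a batch passes the threshold:

```python
def should_stop(invalid_fractions: list[float], threshold: float, patience: int) -> bool:
    """Stop once `patience` consecutive batches exceed the invalid threshold."""
    consecutive = 0
    for fraction in invalid_fractions:  # one entry per batch, in order
        if fraction > threshold:
            consecutive += 1
            if consecutive >= patience:
                return True
        else:
            consecutive = 0  # a good batch resets the counter
    return False

print(should_stop([0.9, 0.85, 0.95], threshold=0.8, patience=3))  # True
print(should_stop([0.9, 0.2, 0.9], threshold=0.8, patience=3))    # False
```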
Structured Generation#
Enable schema-aware generation for better format control:
```json
{
  "generation": {
    "use_structured_generation": true
  }
}
```
Structured Generation Features:

- Schema Validation: Ensure outputs match the expected format
- Type Enforcement: Maintain data types and constraints
- Format Control: Generate valid tabular structures
- Quality Improvement: Reduce invalid record generation
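Structured generation constrains the model during decoding; a rough intuition for what schema validation buys can be had from a post-hoc check like the toy sketch below (the schema and field names are hypothetical):

```python
EXPECTED_SCHEMA = {"name": str, "age": int, "email": str}  # hypothetical schema

def is_valid_record(record: dict, schema: dict) -> bool:
    """Check that a record has exactly the expected fields with the expected types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[field], expected) for field, expected in schema.items())

print(is_valid_record({"name": "Ada", "age": 36, "email": "ada@x.io"}, EXPECTED_SCHEMA))   # True
print(is_valid_record({"name": "Ada", "age": "36", "email": "ada@x.io"}, EXPECTED_SCHEMA)) # False
```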
Generation Process Control#
Batch Generation#
The system generates data in batches with adaptive sizing:
```python
def get_next_num_prompts(self) -> int:
    """Return the optimal number of prompts for the next batch.

    Attribute names for the target count, per-batch cap, buffer, and
    default batch size are illustrative completions of this snippet.
    """
    num_records_remaining = self.target_num_records - self.num_valid_records
    if self.num_valid_records > 0:
        # Estimate yield from completed batches and size the next batch to match
        valid_records_per_prompt = self.num_valid_records / self.num_prompts
        num_prompts_needed = round(num_records_remaining / valid_records_per_prompt)
        return min(self.max_prompts_per_batch, num_prompts_needed + self.buffer)
    # No valid records yet: start with the default batch size
    return self.default_num_prompts
```
Adaptive Batch Sizing:

- Starts with a default batch size
- Adjusts based on the valid record generation rate
- Optimizes toward the target record count
- Includes a buffer for generation variability
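As a worked example: if 100 prompts have produced 800 valid records (8 per prompt) and 200 records remain, the next batch needs about 25 prompts plus a buffer. The constants below are illustrative:

```python
num_valid_records = 800
num_prompts = 100
num_records_remaining = 200
buffer = 5                     # illustrative headroom for variability
max_prompts_per_batch = 1000   # illustrative per-batch cap

valid_records_per_prompt = num_valid_records / num_prompts                     # 8.0
num_prompts_needed = round(num_records_remaining / valid_records_per_prompt)   # 25
next_num_prompts = min(max_prompts_per_batch, num_prompts_needed + buffer)
print(next_num_prompts)  # 30
```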
Progress Monitoring#
The system provides detailed progress monitoring:
```python
# Generation progress logging
logger.info(
    f"Progress: {generation.num_valid_records} out of {num_records} records generated."
)

# Performance metrics
records_per_second = batch.num_valid_records / duration
logger.info(f"Generation speed: {records_per_second:.1f} records per second.")
```
Custom Stopping Conditions#
```json
{
  "generation": {
    "patience": 5,  # Allow more bad batches
    "invalid_fraction_threshold": 0.3  # Stricter quality requirement
  }
}
```
Stopping Behavior:

- Monitor the running average of the invalid record fraction
- Count consecutive batches exceeding the threshold
- Stop generation when the patience limit is reached
- Prevent infinite generation loops