# Parameters Reference

This page provides a complete reference for all configuration parameters available when creating NeMo Safe Synthesizer jobs. These schemas are automatically extracted from the authoritative OpenAPI specification, ensuring they are always in sync with the API.

## Top-Level Configuration

The `SafeSynthesizerParameters` schema defines the main configuration structure for Safe Synthesizer jobs.

| Parameter | Type | Description |
|---|---|---|
| `data` | object | Configuration controlling how input data is grouped and split for training and evaluation. |
| `evaluation` | object | Parameters for evaluating the quality of generated synthetic data. |
| `training` | object | Hyperparameters for model training, such as learning rate, batch size, and LoRA adapter settings. |
| `generation` | object | Parameters governing synthetic data generation, including temperature, top-p, and the number of records to produce. |
| `privacy` | object | Differential-privacy hyperparameters. When `None`, differential privacy is disabled entirely. |
| `time_series` | object | Configuration for time-series mode. The time-series pipeline is currently experimental. |
| `replace_pii` | object | PII replacement configuration. When `None`, PII replacement is skipped. |
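
To make the nesting concrete, here is a minimal sketch of the overall parameter object as a plain Python dict. All values are placeholders; each nested object is detailed in the sections below.

```python
# Illustrative sketch of the top-level SafeSynthesizerParameters shape.
# Values are placeholders; each nested object is detailed in its own section.
safe_synthesizer_parameters = {
    "data": {"holdout": 0.05},            # data grouping and splitting
    "training": {"batch_size": 1},        # fine-tuning hyperparameters
    "generation": {"num_records": 1000},  # synthetic data generation
    "evaluation": {"enabled": True},      # quality and privacy evaluation
    "privacy": None,                      # None disables differential privacy
    "time_series": None,                  # experimental time-series mode
    "replace_pii": None,                  # None skips PII replacement
}
```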

## Data Parameters

Configuration for how to shape or use the input data, including grouping, ordering, and holdout settings.

| Parameter | Type | Description |
|---|---|---|
| `group_training_examples_by` | string | Column to group training examples by. Useful when you want the model to learn inter-record correlations for a given grouping of records. |
| `order_training_examples_by` | string | Column to order training examples by. Useful when you want the model to learn sequential relationships for a given ordering of records. If you provide this parameter, you must also provide `group_training_examples_by`. |
| `max_sequences_per_example` | string \| integer | Default: `auto`. If specified, adds at most this number of sequences per example. Supports `'auto'`, where a value of 1 is chosen if differential privacy is enabled and 10 otherwise. If not specified or set to `'auto'`, fills up the context. Required for DP to limit the contribution of each example. |
| `holdout` | number | Default: 0.05. Number of records to hold out for evaluation. A float between 0 and 1 holds out that ratio of records; an integer greater than 1 holds out that number of records; 0 disables holdout entirely. Must be >= 0. |
| `max_holdout` | integer | Default: 2000. Maximum number of records to hold out. Overrides any behavior set by `holdout`. Must be >= 0. |
| `random_state` | integer | Random seed for the holdout split, to ensure reproducibility. |
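
As an illustration, the following dict combines the data parameters above for a grouped, ordered dataset. The column names `user_id` and `event_time` are hypothetical.

```python
# Hypothetical data configuration for grouped, ordered training examples.
data_params = {
    "group_training_examples_by": "user_id",     # hypothetical grouping column
    "order_training_examples_by": "event_time",  # requires group_training_examples_by
    "max_sequences_per_example": "auto",         # 1 under DP, 10 otherwise
    "holdout": 0.05,       # hold out 5% of records for evaluation
    "max_holdout": 2000,   # never hold out more than 2000 records
    "random_state": 42,    # reproducible holdout split
}
```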

## Training Parameters

Hyperparameters for model fine-tuning, including learning rate, batch size, and LoRA configuration.

| Parameter | Type | Description |
|---|---|---|
| `num_input_records_to_sample` | string \| integer | Default: `auto`. Number of records the model will see during training; a proxy for training time. If the value equals the input dataset size, this is like training for a single epoch; if larger, like training for multiple (possibly fractional) epochs; if smaller, like training for a fraction of an epoch. Supports `'auto'`, where a reasonable value is chosen based on other config parameters and the data. |
| `batch_size` | integer | Default: 1. The batch size per device for training. Must be >= 1. |
| `gradient_accumulation_steps` | integer | Default: 8. Number of update steps to accumulate gradients for before performing a backward/update pass. This technique increases the effective batch size that fits into GPU memory. Must be >= 1. |
| `weight_decay` | number | Default: 0.01. The weight decay applied in the AdamW optimizer to all layers except bias and LayerNorm weights. Must be in (0, 1). |
| `warmup_ratio` | number | Default: 0.05. Ratio of total training steps used for a linear warmup from 0 to the learning rate. Must be > 0. |
| `lr_scheduler` | string | Default: `cosine`. The scheduler type to use. See the HuggingFace documentation of `SchedulerType` for all possible values. |
| `learning_rate` | string \| number | Default: `auto`. The initial learning rate for the `AdamW` optimizer. Must be in (0, 1). Setting to `'auto'` uses a model-specific default if one exists. |
| `lora_r` | integer | Default: 32. The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters. Must be > 0. |
| `lora_alpha_over_r` | number | Default: 1.0. The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2. Must be in [0.5, 3]. |
| `lora_target_modules` | string[] | Default: `['q_proj', 'k_proj', 'v_proj', 'o_proj']`. The list of transformer modules to apply LoRA to. Possible modules: `'q_proj'`, `'k_proj'`, `'v_proj'`, `'o_proj'`, `'gate_proj'`, `'up_proj'`, `'down_proj'`. |
| `use_unsloth` | string \| boolean | Default: `auto`. Whether to use Unsloth for optimized training. |
| `rope_scaling_factor` | string \| integer | Default: `auto`. Scale the base LLM's context length by this factor using RoPE scaling. Must be >= 1 or `'auto'`. |
| `validation_ratio` | number | Default: 0.0. The fraction of the training data used for validation. Must be in [0, 1]. If set to 0, no validation is performed; if larger than 0, validation loss is computed and reported throughout training. |
| `validation_steps` | integer | Default: 15. The number of steps between validation checks, passed through to the HF Trainer arguments. Must be > 0. |
| `pretrained_model` | string | Default: `HuggingFaceTB/SmolLM3-3B`. Pretrained model to use for fine-tuning. May be a Hugging Face model ID (loaded from the Hugging Face Hub or cache) or a local path. See the security note in the docs before using untrusted sources. |
| `quantize_model` | boolean | Default: `false`. Whether to quantize the model during training. This can reduce memory usage and potentially speed up training, but may also impact model accuracy. |
| `quantization_bits` | integer | Default: 8. Allowed: 4, 8. The number of bits to use for quantization if `quantize_model` is `True`. |
| `peft_implementation` | string | Default: `QLORA`. The PEFT (Parameter-Efficient Fine-Tuning) implementation to use: `'lora'` for Low-Rank Adaptation, `'QLORA'` for Quantized LoRA. |
| `max_vram_fraction` | number | Default: 0.8. The fraction of the total VRAM to use for training. Modify this to allow longer sequences. Must be in [0, 1]. |
| `attn_implementation` | string | Default: `kernels-community/vllm-flash-attn3`. The attention implementation to use for model loading. The default uses Flash Attention 3 via the HuggingFace Kernels Hub (requires the `kernels` pip package; falls back to `'sdpa'` if it is not installed). Other common values: `'flash_attention_2'` (requires the `flash-attn` pip package), `'sdpa'` (PyTorch scaled dot-product attention), `'eager'` (standard PyTorch). Custom HuggingFace Kernels Hub paths (e.g. `'kernels-community/flash-attn2'`) are also supported. |
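
As a sketch, the snippet below configures a 4-bit QLoRA run through the builder's `.with_train()` method. The Example Configuration below shows `.with_train()` accepting three of these parameters by name; it is assumed here that the remaining schema fields are likewise accepted as keyword arguments, and the values are illustrative rather than tuned.

```python
# Sketch: 4-bit QLoRA training setup. `builder` is a SafeSynthesizerJobBuilder
# (see the Example Configuration below); keyword support for all fields is assumed.
builder = builder.with_train(
    num_input_records_to_sample="auto",  # let the service pick a training budget
    batch_size=1,
    gradient_accumulation_steps=8,       # effective batch size = 1 * 8 = 8
    learning_rate="auto",                # model-specific default if one exists
    lora_r=32,
    lora_alpha_over_r=1.0,               # alpha = 1.0 * 32 = 32
    peft_implementation="QLORA",
    quantize_model=True,
    quantization_bits=4,                 # allowed: 4 or 8
)
```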

## Generation Parameters

Configuration for synthetic data generation after training, including number of records, temperature, and structured generation options.

| Parameter | Type | Description |
|---|---|---|
| `num_records` | integer | Default: 1000. Number of records to generate. |
| `temperature` | number | Default: 0.9. Sampling temperature for controlling randomness (higher = more random). |
| `repetition_penalty` | number | Default: 1.0. The value used to control the likelihood of the model repeating the same token. Must be > 0. |
| `top_p` | number | Default: 1.0. Nucleus sampling probability for token selection. Must be in (0, 1]. |
| `patience` | integer | Default: 3. Number of consecutive generations in which the `invalid_fraction_threshold` is reached before stopping generation. Must be >= 1. |
| `invalid_fraction_threshold` | number | Default: 0.8. The fraction of invalid records that will stop generation after the `patience` limit is reached. Must be in [0, 1]. |
| `use_structured_generation` | boolean | Default: `false`. Whether to use structured generation for better format control. |
| `structured_generation_backend` | string | Default: `auto`. Allowed: `auto`, `xgrammar`, `guidance`, `outlines`, `lm-format-enforcer`. The backend used by vLLM when `use_structured_generation` is `True`; `'auto'` lets vLLM choose the backend. |
| `structured_generation_schema_method` | string | Default: `regex`. Allowed: `regex`, `json_schema`. The method used to generate the schema from your dataset and pass it to the generation backend. `'regex'` uses a custom regex-construction method that tends to be more comprehensive than `'json_schema'`, at the cost of speed. |
| `structured_generation_use_single_sequence` | boolean | Default: `false`. Whether to use a regex that matches exactly one sequence or record if `max_sequences_per_example` is 1. |
| `enforce_timeseries_fidelity` | boolean | Default: `false`. Enforce time-series fidelity by enforcing the order, intervals, and start and end times of the records. |
| `validation` | object | Validation parameters controlling validation logic and automatic fixes when parsing LLM output and converting it to tabular data. |
| `attention_backend` | string | Default: `auto`. The attention backend for the vLLM engine. Common values: `'FLASHINFER'`, `'FLASH_ATTN'`, `'TRITON_ATTN'`, `'FLEX_ATTENTION'`. If `None` or `'auto'`, vLLM auto-selects the best available backend. |
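
The sketch below enables structured generation through `.with_generate()`, assuming the generation schema fields are accepted as keyword arguments the way `num_records` and `temperature` are in the Example Configuration below.

```python
# Sketch: structured generation. `builder` is a SafeSynthesizerJobBuilder (see
# the Example Configuration below); keyword support for all fields is assumed.
builder = builder.with_generate(
    num_records=5000,
    temperature=0.9,
    top_p=0.95,
    use_structured_generation=True,
    structured_generation_backend="xgrammar",  # or 'auto' to let vLLM choose
    invalid_fraction_threshold=0.8,
    patience=3,  # stop after 3 consecutive generations over the threshold
)
```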

## Differential Privacy Parameters

Hyperparameters for differentially private training with DP-SGD. Enable these for formal privacy guarantees.

| Parameter | Type | Description |
|---|---|---|
| `dp_enabled` | boolean | Default: `false`. Enable differentially private training with DP-SGD. |
| `epsilon` | number | Default: 8.0. Target privacy budget; lower values provide stronger privacy. Must be > 0. |
| `delta` | string \| number | Default: `auto`. Probability of accidentally leaking information. Should be much smaller than 1/n, where n is the number of training records. Setting to `'auto'` uses a delta of 1/n^1.2. Must be in [0, 1) or `'auto'`. |
| `per_sample_max_grad_norm` | number | Default: 1.0. Maximum L2 norm for per-sample gradient clipping. Must be > 0. |
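
A sketch of a DP-SGD configuration via `.with_differential_privacy()`. The method appears with `dp_enabled` and `epsilon` in the Example Configuration below; passing `delta` and `per_sample_max_grad_norm` alongside them is assumed here.

```python
# Sketch: DP-SGD training with an explicit privacy budget. `builder` is a
# SafeSynthesizerJobBuilder; the extra keyword arguments are assumed.
builder = builder.with_differential_privacy(
    dp_enabled=True,
    epsilon=8.0,                   # lower epsilon = stronger privacy
    delta="auto",                  # 1/n^1.2 for n training records
    per_sample_max_grad_norm=1.0,  # per-sample gradient clipping norm
)
```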

## Evaluation Parameters

Configuration for synthetic data quality and privacy assessment, including membership inference attack (MIA), attribute inference attack (AIA), and PII replay detection.

| Parameter | Type | Description |
|---|---|---|
| `mia_enabled` | boolean | Default: `true`. Enable membership inference attack evaluation for privacy assessment. |
| `aia_enabled` | boolean | Default: `true`. Enable attribute inference attack evaluation for privacy assessment. |
| `sqs_report_columns` | integer | Default: 250. Number of columns to include in statistical quality reports. |
| `sqs_report_rows` | integer | Default: 5000. Number of rows to include in statistical quality reports. |
| `mandatory_columns` | integer | Number of mandatory columns that must be used in evaluation. |
| `enabled` | boolean | Default: `true`. Enable or disable evaluation. |
| `quasi_identifier_count` | integer | Default: 3. Number of quasi-identifiers to sample for privacy attacks. |
| `pii_replay_enabled` | boolean | Default: `true`. Enable PII replay detection. |
| `pii_replay_entities` | string[] | List of entities for PII replay. If not provided, default entities are used. |
| `pii_replay_columns` | string[] | List of columns for PII replay. If not provided, only entities are used. |
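
For reference, here is an illustrative evaluation object as a plain dict; the entity and column names in the PII replay fields are hypothetical.

```python
# Illustrative evaluation configuration; PII replay entity/column names are
# hypothetical placeholders, not a documented entity list.
evaluation_params = {
    "enabled": True,
    "mia_enabled": True,          # membership inference attack
    "aia_enabled": True,          # attribute inference attack
    "quasi_identifier_count": 3,  # quasi-identifiers sampled per attack
    "sqs_report_columns": 250,
    "sqs_report_rows": 5000,
    "pii_replay_enabled": True,
    "pii_replay_entities": ["email", "phone_number"],  # hypothetical entities
    "pii_replay_columns": ["notes"],                   # hypothetical column
}
```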

## PII Replacement Configuration

Configuration for PII detection and replacement. See PII Replacement for conceptual documentation.

| Parameter | Type | Description |
|---|---|---|
| `globals` | object | Global configuration options. |
| `steps` (required) | object[] | List of transformation steps to perform on the input data. |
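
A minimal sketch of the `replace_pii` object's shape; the step contents are placeholders, since the step schema is covered in the PII Replacement docs rather than here.

```python
# Minimal sketch of the replace_pii object's top-level shape. Step contents
# are placeholders -- see the PII Replacement docs for the actual step schema.
replace_pii_config = {
    "globals": {},  # global configuration options
    "steps": [      # required: list of transformation steps
        # {...}     # step definitions go here
    ],
}
```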

### Column Classification Config (`replace_pii.globals.classify`)

Column classification is configured via the SDK builder’s `.with_classify_model_provider(provider_name)` method. The provider name can be unqualified (the builder prepends the current workspace) or fully qualified as `workspace/provider_name`.

If omitted, column classification is skipped and PII detection falls back to heuristic defaults, which may reduce accuracy.
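
For example, with a hypothetical provider named `my-provider` in the `default` workspace:

```python
# Both calls configure the same provider; "my-provider" is a hypothetical name.
builder = builder.with_classify_model_provider("my-provider")           # workspace prepended
builder = builder.with_classify_model_provider("default/my-provider")   # fully qualified
```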


## Example Configuration

Here’s an example showing a complete job configuration using the Python SDK:

```python
import os

import pandas as pd

from nemo_platform import NeMoPlatform
from nemo_platform.beta.safe_synthesizer.job_builder import SafeSynthesizerJobBuilder

# Placeholders: substitute your own DataFrame and endpoint.
df: pd.DataFrame = pd.DataFrame()
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

builder = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        num_input_records_to_sample=10000,
        learning_rate=0.0005,
        batch_size=1,
    )
    .with_generate(
        num_records=5000,
        temperature=0.9,
    )
    .with_differential_privacy(
        dp_enabled=True,
        epsilon=8.0,
    )
    .with_replace_pii()
    .synthesize()
)
job = builder.create_job(name="my-job", project="my-project")
```