Differential Privacy#
Configure model training with differential privacy in NVIDIA NeMo Safe Synthesizer to obtain mathematical privacy guarantees.
Overview#
Generating synthetic data already provides a high level of privacy protection, and this is often a sufficient balance between privacy and utility. Some use cases, however, require an even greater level of privacy.
Differential Privacy (DP) is generally regarded as the gold standard of privacy. NeMo Safe Synthesizer leverages the Differentially Private Stochastic Gradient Descent (DP-SGD) algorithm to provide formal privacy guarantees. This algorithm introduces calibrated noise during model training.
Differential Privacy Guarantee#
Core Promise: Differential privacy ensures that removing or changing any single record’s data from a dataset has only a negligible effect on what the algorithm learns or outputs.
Mathematical Formulation: For any algorithm M, any two datasets D1 and D2 that differ by exactly one record, and any set of outputs S, the following bound holds:
P[M(D1) ∈ S] ≤ exp(ε) × P[M(D2) ∈ S] + δ
Where:
M is the training algorithm
D1 and D2 are neighboring datasets that differ in exactly one record
ε is epsilon, the privacy budget
δ is delta, the failure probability
S is any subset of possible outputs
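To make the bound concrete, here is a minimal sketch that checks the guarantee for randomized response, a classic ε-differentially-private mechanism with δ = 0. This toy mechanism is purely illustrative and is not part of NeMo Safe Synthesizer:

```python
import math

# Randomized response on a single bit: report the true bit with
# probability p = e^eps / (1 + e^eps), otherwise report the flipped bit.
eps = math.log(3)                        # epsilon = ln 3, so p = 0.75
p = math.exp(eps) / (1 + math.exp(eps))

# Likelihood of outputting 1 when the differing record's bit is 1 vs. 0:
p_out1_true1 = p                         # 0.75
p_out1_true0 = 1 - p                     # 0.25
ratio = p_out1_true1 / p_out1_true0      # 3.0

# The DP bound P[M(D1) ∈ S] ≤ exp(ε) × P[M(D2) ∈ S] + δ holds with
# δ = 0, since the worst-case ratio equals exp(eps) exactly.
assert ratio <= math.exp(eps) + 1e-9
```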
DP-SGD Implementation#
Differential privacy is implemented using DP-SGD, which modifies the standard training process to formally limit how much any individual training record can influence the final model, while maintaining training effectiveness. The algorithm enhances standard SGD with the following steps (sketched in code after the list):
Per-sample gradient computation: Calculate gradients individually for each sample in the mini-batch
Gradient clipping: Clip each gradient to a maximum L2 norm of per_sample_max_grad_norm to bound sensitivity
Noise injection: Add calibrated Gaussian noise to the aggregated clipped gradients
Privacy accounting: Track privacy expenditure using Rényi Differential Privacy (RDP) accounting to ensure that the cumulative privacy loss across all training steps stays within the specified (ε, δ) bounds.
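The following is a minimal NumPy sketch of a single DP-SGD update, assuming per-sample gradients are already computed. The function dp_sgd_step and its arguments are illustrative, not the NeMo Safe Synthesizer API:

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, max_grad_norm, noise_multiplier, lr):
    """One simplified DP-SGD update over a mini-batch of per-sample gradients."""
    # 1. Clip each per-sample gradient to an L2 norm of at most max_grad_norm.
    clipped = [
        g * min(1.0, max_grad_norm / (np.linalg.norm(g) + 1e-12))
        for g in per_sample_grads
    ]
    # 2. Sum the clipped gradients and add Gaussian noise with per-coordinate
    #    standard deviation noise_multiplier * max_grad_norm.
    noise = np.random.normal(0.0, noise_multiplier * max_grad_norm, size=params.shape)
    noisy_grad = (np.sum(clipped, axis=0) + noise) / len(per_sample_grads)
    # 3. Apply a standard gradient descent update with the privatized gradient.
    return params - lr * noisy_grad
```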
The noise scale is automatically calibrated based on the following inputs (a calibration example follows the list):
Target privacy budget (ε, δ)
Batch size and dataset size
Total number of training steps
Gradient clipping bound
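NeMo Safe Synthesizer performs this calibration internally. To see how these inputs interact, here is a sketch using the accountant utility from the open-source Opacus library (assuming Opacus 1.x; the specific values are illustrative):

```python
from opacus.accountants.utils import get_noise_multiplier

# Solve for the Gaussian noise multiplier that achieves a target (epsilon, delta)
# given the sampling rate and training length, using RDP accounting.
sigma = get_noise_multiplier(
    target_epsilon=8.0,         # target privacy budget
    target_delta=1e-5,          # target failure probability
    sample_rate=256 / 50_000,   # batch size / dataset size
    epochs=3,                   # total number of passes over the data
    accountant="rdp",           # Rényi Differential Privacy accountant
)
print(f"Calibrated noise multiplier: {sigma:.2f}")
```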
By default, record-level differential privacy is used. However, when the group_training_examples_by parameter is set, user-level differential privacy is employed: the guarantees apply not to single records but to the groups of records defined by that parameter. That is, if an entire group is removed, the results will not change significantly.
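A hypothetical configuration sketch for user-level privacy, written as a Python dict for consistency with the other examples; the column name user_id and the exact placement of group_training_examples_by within the configuration are illustrative assumptions:

```python
# All records sharing the same user_id value are treated as one unit by the
# DP guarantee (user-level differential privacy).
config = {
    "group_training_examples_by": "user_id",  # hypothetical column name
    "privacy": {
        "dp": True,
        "epsilon": 8.0,
        "delta": "auto",
    },
}
```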
Privacy vs Utility Trade-off#
Model Quality:
Lower epsilon values may reduce synthetic data quality, as the noise added to meet the DP guarantee might drown out the signal in the gradients.
Hyperparameters may need to be adjusted to maintain reasonable quality, such as lowering the learning rate.
Training Speed:
DP training is typically 2-3x slower than standard training due to the additional computational overhead of per-sample gradient processing. In some cases, GPU memory requirements may also increase if the batch size is raised beyond the default when using DP training.
Differential Privacy Configuration#
Basic DP Settings#
| Parameter | Type | Description | Default | Range |
|---|---|---|---|---|
| dp | bool | Enable differential privacy | False | True/False |
| epsilon | float | Privacy budget (lower = more privacy) | 8.0 | > 0 |
| delta | float or "auto" | Risk of exposure (failure probability) | "auto" | 0 < value < 1 |
| per_sample_max_grad_norm | float | Gradient clipping threshold | 1.0 | > 0 |
```json
{
  "privacy": {
    "dp": true,
    "epsilon": 8.0,
    "delta": "auto",
    "per_sample_max_grad_norm": 1.0
  }
}
```
Privacy Budget (Epsilon)#
Epsilon Guidelines:
Epsilon (ε) controls the overall privacy guarantee. Smaller ε = stronger privacy, more noise; larger ε = weaker privacy, less noise.
Generally, use an epsilon value between 4.0 and 12.0 depending on use case sensitivity and dataset size. Start at ε ∈ [8, 12] and reduce from there as needed.
Delta Parameter#
Delta (δ) is the probability that the DP guarantee may fail. It should be very small—on the order of 1/n where n is the number of training records.
Delta Calculation:
"auto": delta = 1 / n^1.2 (recommended default), where n is the number of training records.
Manual: Provide an explicit probability. Typical values are between 1e-6 and 1e-4, depending on dataset size and requirements.
Generally the auto calculation for delta is sufficient. It is far more common to adjust epsilon than delta.
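For example, a quick computation of the "auto" delta for a 50,000-record dataset:

```python
# "auto" delta: 1 / n**1.2, where n is the number of training records.
n = 50_000
delta = 1 / n ** 1.2
print(f"auto delta for n={n}: {delta:.2e}")  # ~2.3e-06
```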
DP Training Best Practices#
Use Larger Batches: DP benefits from larger batch sizes, since the standard deviation of the noise added to the average batch gradient decreases almost linearly with batch size (see the numeric sketch after this list).
If you encounter out-of-memory errors, reduce the batch size.
Monitor Convergence: Watch training and validation loss to confirm the model converges.
If training fails to converge, lower the learning rate and/or increase the batch size.
Increase Training Data: More training data improves DP synthetic data quality. Based on testing at NVIDIA, 10,000+ records are generally recommended.
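The batch-size effect noted in the first practice above can be seen numerically. Holding the noise multiplier σ and clipping bound C fixed, the per-coordinate noise standard deviation on the averaged batch gradient is σ × C / B, so doubling the batch size roughly halves the effective noise (the values below are illustrative):

```python
# Effective per-coordinate noise std on the averaged batch gradient,
# with the noise multiplier and clipping bound held fixed.
sigma, C = 1.1, 1.0  # illustrative noise multiplier and clipping bound
for B in (64, 128, 256, 512):
    print(f"batch size {B:>3}: effective noise std {sigma * C / B:.5f}")
```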