Create Job#
Create a NeMo Safe Synthesizer job to process your data through PII replacement and the synthetic data generation pipeline.
Prerequisites#
Before you can create a NeMo Safe Synthesizer job, make sure that you have:
Obtained the base URL of your NeMo Safe Synthesizer service
Set the SAFE_SYN_BASE_URL environment variable to your NeMo Safe Synthesizer service endpoint
Uploaded your dataset to Data Store (for example, <DATASET_ID>). Refer to Datasets for dataset management details.
export SAFE_SYN_BASE_URL="https://your-safe-synthesizer-service-url"
export NEMO_MICROSERVICES_DATASTORE_URL="https://your-datastore-service-url"
Note
Environment Variables: SAFE_SYN_BASE_URL is used by the client SDK to connect to the Safe Synthesizer API service. The service itself uses NEMO_MICROSERVICES_DATASTORE_URL to access datasets from the Data Store.
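Before creating a client, it can help to confirm that the variables from the prerequisites are actually set. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper written for this page, not part of the SDK. It checks both exported variables, even though only SAFE_SYN_BASE_URL is read by the client.

```python
from typing import Mapping

# Variable names taken from the prerequisites above.
REQUIRED_VARS = ("SAFE_SYN_BASE_URL", "NEMO_MICROSERVICES_DATASTORE_URL")


def missing_env_vars(environ: Mapping[str, str], required=REQUIRED_VARS) -> list[str]:
    """Return the required variable names that are unset or blank (hypothetical helper)."""
    return [name for name in required if not environ.get(name, "").strip()]


# Example with an explicit mapping; pass os.environ to check the real environment.
example_env = {"SAFE_SYN_BASE_URL": "https://your-safe-synthesizer-service-url"}
print(missing_env_vars(example_env))  # ['NEMO_MICROSERVICES_DATASTORE_URL']
```

Calling `missing_env_vars(os.environ)` at the top of a script gives a clearer failure than the `KeyError` raised by `os.environ['SAFE_SYN_BASE_URL']`.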
To Create a NeMo Safe Synthesizer Job#
Choose one of the following job configurations based on your privacy and synthetic data requirements.
PII Redaction Only#
For basic PII detection and redaction without synthetic data generation:
import os
from nemo_microservices import NeMoMicroservices

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['SAFE_SYN_BASE_URL']
)

# Create PII redaction job
job = client.beta.safe_synthesizer.jobs.create(
    name="pii-redaction",
    project="default",
    spec={
        "data_source": "<DATASET_ID>",
        "config": {
            "enable_synthesis": False,
            "enable_replace_pii": True,
            "replace_pii": {
                "globals": {
                    "classify": {"enable_classify": True},
                    "ner": {"ner_threshold": 0.3},
                    "locales": ["en_US"]
                },
                "steps": [{
                    "rows": {
                        "update": [
                            {"condition": "column.entity == 'email'", "value": "fake.email()"},
                            {"condition": "column.entity == 'phone_number'", "value": "fake.phone_number()"}
                        ]
                    }
                }]
            }
        }
    },
)
print(f"Created job: {job.id}")
print(f"Job status: {job.status}")
curl -X POST "${SAFE_SYN_BASE_URL}/v1beta1/safe-synthesizer/jobs" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "pii-redaction",
    "project": "default",
    "spec": {
      "data_source": "<DATASET_ID>",
      "config": {
        "enable_synthesis": false,
        "enable_replace_pii": true,
        "replace_pii": {
          "globals": {
            "classify": {"enable_classify": true},
            "ner": {"ner_threshold": 0.3},
            "locales": ["en_US"]
          },
          "steps": [{
            "rows": {
              "update": [
                {"condition": "column.entity == \"email\"", "value": "fake.email()"},
                {"condition": "column.entity == \"phone_number\"", "value": "fake.phone_number()"}
              ]
            }
          }]
        }
      }
    }
  }' | jq
Example PII Redaction Response
{
  "id": "job-abc123def456",
  "name": "pii-redaction",
  "project": "default",
  "status": "created",
  "created_at": "2024-01-15T10:30:00.000Z",
  "updated_at": "2024-01-15T10:30:00.000Z"
}
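The `replace_pii` section repeats the same rule shape for every entity. As a sketch, the rules can be generated from a mapping of entity names to replacement expressions. Both helper functions below are hypothetical and only reproduce the structure shown in the request above; they are not SDK functions.

```python
def build_pii_update_rules(replacements: dict[str, str]) -> list[dict[str, str]]:
    """Turn {entity: replacement_expression} into the `update` rule list (hypothetical helper)."""
    return [
        {"condition": f"column.entity == '{entity}'", "value": expression}
        for entity, expression in replacements.items()
    ]


def build_replace_pii_config(replacements, ner_threshold=0.3, locales=("en_US",)):
    """Assemble a `replace_pii` section with the same keys as the example request."""
    return {
        "globals": {
            "classify": {"enable_classify": True},
            "ner": {"ner_threshold": ner_threshold},
            "locales": list(locales),
        },
        "steps": [{"rows": {"update": build_pii_update_rules(replacements)}}],
    }


replace_pii = build_replace_pii_config(
    {"email": "fake.email()", "phone_number": "fake.phone_number()"}
)
print(replace_pii["steps"][0]["rows"]["update"])
```

The resulting dictionary can be passed as the `"replace_pii"` value inside `spec["config"]`.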
Full Pipeline (PII, Train, Generate, Evaluate)#
For a complete NeMo Safe Synthesizer job including PII replacement, model training, synthetic data generation, and privacy evaluation:
# Create full pipeline job (reuses the client initialized in the PII redaction example)
job = client.beta.safe_synthesizer.jobs.create(
    name="safe-synthesizer-full",
    project="default",
    spec={
        "data_source": "<DATASET_ID>",
        "config": {
            "enable_synthesis": True,
            "enable_replace_pii": True,
            "replace_pii": {
                "globals": {
                    "classify": {"enable_classify": True},
                    "ner": {"ner_threshold": 0.3},
                    "locales": ["en_US"]
                },
                "steps": [{
                    "rows": {
                        "update": [
                            {"condition": "column.entity == 'email'", "value": "fake.email()"},
                            {"condition": "column.entity == 'phone_number'", "value": "fake.phone_number()"}
                        ]
                    }
                }]
            },
            "data": {
                "holdout": 0.2,
                "max_holdout": 1000,
            },
            "training": {
                "pretrained_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # Default model
                "num_input_records_to_sample": "auto",
                "lora_r": 16,
                "lora_alpha_over_r": 1.0,
                "group_training_examples_by": None
            },
            "generation": {
                "num_records": 1000,
                "temperature": 0.8,
                "repetition_penalty": 1.2,
                "use_structured_generation": True
            },
            "privacy": {
                "privacy_hyperparams": {"dp": True, "epsilon": 8.0, "delta": "auto"}
            },
            "evaluation": {"mia_enabled": True, "aia_enabled": True}
        }
    },
)
print(f"Created full pipeline job: {job.id}")
print(f"Job status: {job.status}")
curl -X POST "${SAFE_SYN_BASE_URL}/v1beta1/safe-synthesizer/jobs" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "safe-synthesizer-full",
    "project": "default",
    "spec": {
      "data_source": "<DATASET_ID>",
      "config": {
        "enable_synthesis": true,
        "enable_replace_pii": true,
        "data": {
          "holdout": 0.2,
          "max_holdout": 1000
        },
        "training": {
          "pretrained_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "num_input_records_to_sample": "auto",
          "lora_r": 16,
          "lora_alpha_over_r": 1.0,
          "group_training_examples_by": null
        },
        "generation": {
          "num_records": 1000,
          "temperature": 0.8,
          "repetition_penalty": 1.2,
          "use_structured_generation": true
        },
        "privacy": {
          "privacy_hyperparams": {"dp": true, "epsilon": 8.0, "delta": "auto"}
        },
        "evaluation": {"mia_enabled": true, "aia_enabled": true},
        "replace_pii": {
          "globals": {
            "classify": {"enable_classify": true},
            "ner": {"ner_threshold": 0.3},
            "locales": ["en_US"]
          },
          "steps": [{
            "rows": {
              "update": [
                {"condition": "column.entity == \"email\"", "value": "fake.email()"},
                {"condition": "column.entity == \"phone_number\"", "value": "fake.phone_number()"}
              ]
            }
          }]
        }
      }
    }
  }' | jq
Example Full Pipeline Response
{
  "id": "job-def456ghi789",
  "name": "safe-synthesizer-full",
  "project": "default",
  "status": "created",
  "created_at": "2024-01-15T11:00:00.000Z",
  "updated_at": "2024-01-15T11:00:00.000Z"
}
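Hand-written JSON bodies are easy to get subtly wrong, for example with a stray or missing comma. One option is to build the request body as a Python dictionary and serialize it, so that `True`, `False`, and `None` become `true`, `false`, and `null` automatically. The sketch below uses only the standard library; `build_job_request` is a hypothetical helper, and the config is a trimmed subset of the full pipeline example used purely for illustration.

```python
import json


def build_job_request(name: str, project: str, data_source: str, config: dict) -> str:
    """Serialize a job request body in the shape used by the examples (hypothetical helper)."""
    body = {
        "name": name,
        "project": project,
        "spec": {"data_source": data_source, "config": config},
    }
    return json.dumps(body, indent=2)


# Trimmed config for illustration only; see the full pipeline example for all sections.
config = {
    "enable_synthesis": True,
    "enable_replace_pii": False,
    "data": {"holdout": 0.2, "max_holdout": 1000},
    "training": {"lora_r": 16, "group_training_examples_by": None},
}

payload = build_job_request("safe-synthesizer-full", "default", "<DATASET_ID>", config)
print(payload)
```

The printed string can be saved to a file and sent with `curl -d @file.json`, which avoids shell quoting problems.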
Configuration Options#
The NeMo Safe Synthesizer job configuration supports several key sections:
Job Control Flags#
enable_synthesis: Boolean flag to enable synthetic data generation
enable_replace_pii: Boolean flag to enable PII replacement processing
Configuration Sections#
replace_pii: Configure PII detection and transformation rules (when enable_replace_pii is true)
config: Main configuration object containing:
  data: Configure train/test data splitting for evaluation
  training: Specify model training parameters and hyperparameters
  generation: Control synthetic data generation settings
  privacy: Enable differential privacy and set privacy parameters
  evaluation: Configure privacy evaluation metrics (MIA, AIA)
New Holdout Parameters#
The holdout functionality allows you to automatically split your dataset for evaluation:
holdout: Fraction (0.0-1.0) or absolute number of records to hold out for evaluation
max_holdout: Maximum number of records to hold out (caps the holdout when using fractions)
random_state: Optional random state for a reproducible holdout split
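To make the interaction between holdout and max_holdout concrete, here is a rough sketch of how the holdout size could be resolved. It assumes that values below 1 are treated as fractions, that fractional results are rounded down, and that max_holdout only caps fractional holdouts. The service's actual rounding and boundary behavior are not documented here and may differ. `resolve_holdout_size` is a hypothetical function, not part of the SDK.

```python
from typing import Optional, Union


def resolve_holdout_size(
    n_records: int,
    holdout: Union[int, float],
    max_holdout: Optional[int] = None,
) -> int:
    """Estimate how many records are held out (assumed rules, see lead-in)."""
    if n_records < 0 or holdout < 0:
        raise ValueError("n_records and holdout must be non-negative")
    if holdout < 1:
        # Assumption: values below 1 are fractions of the dataset, rounded down.
        size = int(n_records * holdout)
        if max_holdout is not None:
            size = min(size, max_holdout)  # the cap applies to fractional holdouts
    else:
        # Assumption: values of 1 or more are absolute record counts.
        size = int(holdout)
    return min(size, n_records)  # never hold out more records than exist


# With the settings from the full pipeline example and 10,000 input records,
# the 20% fraction would give 2,000 records, which the cap reduces to 1,000.
print(resolve_holdout_size(10_000, 0.2, max_holdout=1000))  # 1000
```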
Job Management#
After creating a job, you can:
Monitor job status to track progress
View job logs for detailed processing information
Cancel running jobs if needed
Access job results once processing is complete
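Monitoring usually means polling the job status until it reaches a terminal state. The sketch below is deliberately independent of the SDK: `fetch_status` is a caller-supplied function that returns the current status string, and the terminal status names are assumptions, since only "created" appears in the examples on this page. Adjust both to match your deployment.

```python
import time
from typing import Callable

# Assumed terminal status names; replace with the values your service reports.
ASSUMED_TERMINAL_STATUSES = frozenset({"completed", "error", "cancelled"})


def wait_for_job(
    fetch_status: Callable[[], str],
    terminal=ASSUMED_TERMINAL_STATUSES,
    poll_interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status() until a terminal status is seen or the timeout expires."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still in status {status!r} after {timeout} seconds")
        sleep(poll_interval)


# Demonstration with a fake status sequence instead of a real API call.
fake_statuses = iter(["created", "running", "completed"])
final = wait_for_job(lambda: next(fake_statuses), sleep=lambda _seconds: None)
print(final)  # completed
```

In real use, `fetch_status` would wrap whichever SDK or REST call returns the job's current status.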
Error Handling#
Common issues when creating jobs:
Invalid dataset ID: Ensure the dataset exists in your Data Store
Configuration errors: Validate your JSON configuration syntax
Resource limits: Check if you have sufficient compute resources
Permission errors: Verify your API access permissions
Configuration Validation#
The API validates job configurations before processing. Common validation errors include:
Invalid model names: Ensure the pretrained model exists and is accessible
Malformed PII replacement rules: Check JSON syntax in transformation conditions
Resource conflicts: Verify training parameters are compatible with available resources
Example validation error response
{
  "detail": "Invalid configuration: pretrained_model 'invalid-model' not found",
  "type": "validation_error"
}
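When calling the REST endpoint directly, an error body like the validation example above can be reduced to a one-line message for logs. This is a minimal sketch that assumes the response carries the `detail` and `type` keys shown in that example. Other error shapes may exist, so it falls back to placeholder text. `summarize_error` is a hypothetical helper.

```python
import json


def summarize_error(body: str) -> str:
    """Format an error response body as '[type] detail' (assumed response shape)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return f"Unparseable error response: {body[:200]}"
    if not isinstance(payload, dict):
        return f"Unexpected error response: {payload!r}"
    error_type = payload.get("type", "unknown_error")
    detail = payload.get("detail", "no detail provided")
    return f"[{error_type}] {detail}"


example_body = """{
  "detail": "Invalid configuration: pretrained_model 'invalid-model' not found",
  "type": "validation_error"
}"""
print(summarize_error(example_body))
```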
# Example with error handling (reuses the client initialized earlier)
dataset_id = "<DATASET_ID>"  # ID of a dataset uploaded to Data Store

try:
    job = client.beta.safe_synthesizer.jobs.create(
        name="my-job",
        project="default",
        spec={
            "data_source": dataset_id,
            "config": {
                "enable_replace_pii": False,
                "enable_synthesis": False,
            }
        },
    )
    print(f"Job created successfully: {job.id}")
except Exception as e:
    # Handle specific error cases here (for example, validation or permission errors)
    print(f"Failed to create job: {e}")