Create Job#

Create a NeMo Safe Synthesizer job to process your data through PII replacement and the synthetic data generation pipeline.

Prerequisites#

Before you can create a NeMo Safe Synthesizer job, make sure that you have:

  • Obtained the base URL of your NeMo Safe Synthesizer service

  • Set the SAFE_SYN_BASE_URL environment variable to your NeMo Safe Synthesizer service endpoint

  • Uploaded a dataset to the Data Store (for example, <DATASET_ID>). Refer to Datasets for dataset management details.

export SAFE_SYN_BASE_URL="https://your-safe-synthesizer-service-url"
export NEMO_MICROSERVICES_DATASTORE_URL="https://your-datastore-service-url"

Note

Environment Variables: SAFE_SYN_BASE_URL is used by the client SDK to connect to the Safe Synthesizer API service. The service itself uses NEMO_MICROSERVICES_DATASTORE_URL to access datasets from the Data Store.
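As a quick sanity check before initializing the client, you can fail fast when a required variable is missing. This is an illustrative sketch, not part of the SDK; the `require_env` helper and the placeholder URL are this example's own:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, raising a clear error if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# The setdefault call is here only so the example is self-contained;
# in practice you would export SAFE_SYN_BASE_URL in your shell instead.
os.environ.setdefault("SAFE_SYN_BASE_URL", "https://your-safe-synthesizer-service-url")
base_url = require_env("SAFE_SYN_BASE_URL")
```

`NEMO_MICROSERVICES_DATASTORE_URL` is read server-side by the service, so checking it locally is only useful when you run the service yourself.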


To Create a NeMo Safe Synthesizer Job#

Choose one of the following job configurations based on your privacy and synthetic data requirements.

PII Redaction Only#

For basic PII detection and redaction without synthetic data generation:

import os
from nemo_microservices import NeMoMicroservices

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['SAFE_SYN_BASE_URL']
)

# Create PII redaction job
job = client.beta.safe_synthesizer.jobs.create(
    name="pii-redaction",
    project="default",
    spec={
        "data_source": "<DATASET_ID>",
        "config": {
          "enable_synthesis": False,
          "enable_replace_pii": True,
          "replace_pii": {
              "globals": {
                  "classify": {"enable_classify": True},
                  "ner": {"ner_threshold": 0.3},
                  "locales": ["en_US"]
              },
              "steps": [{
                  "rows": {
                      "update": [
                          {"condition": "column.entity == 'email'", "value": "fake.email()"},
                          {"condition": "column.entity == 'phone_number'", "value": "fake.phone_number()"}
                      ]
                  }
              }]
          }
        }
    },
)

print(f"Created job: {job.id}")
print(f"Job status: {job.status}")
curl -X POST "${SAFE_SYN_BASE_URL}/v1beta1/safe-synthesizer/jobs" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "pii-redaction",
    "project": "default",
    "spec": {
      "data_source": "<DATASET_ID>",
      "config": {
        "enable_synthesis": false,
        "enable_replace_pii": true,
        "replace_pii": {
          "globals": {
            "classify": {"enable_classify": true},
            "ner": {"ner_threshold": 0.3},
            "locales": ["en_US"]
          },
          "steps": [{
            "rows": {
              "update": [
                {"condition": "column.entity == \"email\"", "value": "fake.email()"},
                {"condition": "column.entity == \"phone_number\"", "value": "fake.phone_number()"}
              ]
            }
          }]
        }
      }
    }
  }' | jq
Example PII Redaction Response
{
  "id": "job-abc123def456",
  "name": "pii-redaction",
  "project": "default",
  "status": "created",
  "created_at": "2024-01-15T10:30:00.000Z",
  "updated_at": "2024-01-15T10:30:00.000Z"
}

Full Pipeline (PII, Train, Generate, Evaluate)#

For a complete NeMo Safe Synthesizer job including PII replacement, model training, synthetic data generation, and privacy evaluation:

# Create full pipeline job
job = client.beta.safe_synthesizer.jobs.create(
    name="safe-synthesizer-full",
    project="default",
    spec={
        "data_source": "<DATASET_ID>",
        "config": {
            "enable_synthesis": True,
            "enable_replace_pii": True,
            "replace_pii": {
                "globals": {
                    "classify": {"enable_classify": True},
                    "ner": {"ner_threshold": 0.3},
                    "locales": ["en_US"]
                },
                "steps": [{
                    "rows": {
                        "update": [
                            {"condition": "column.entity == 'email'", "value": "fake.email()"},
                            {"condition": "column.entity == 'phone_number'", "value": "fake.phone_number()"}
                        ]
                    }
                }]
            },
            "data": {
                "holdout": 0.2,
                "max_holdout": 1000,
            },
            "training": {
                "pretrained_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # Default model
                "num_input_records_to_sample": "auto",
                "lora_r": 16,
                "lora_alpha_over_r": 1.0
                "group_training_examples_by": None
            },
            "generation": {
                "num_records": 1000,
                "temperature": 0.8,
                "repetition_penalty": 1.2,
                "use_structured_generation": True
            },
            "privacy": {
                "privacy_hyperparams": {"dp": True, "epsilon": 8.0, "delta": "auto"}
            },
            "evaluation": {"mia_enabled": True, "aia_enabled": True}
        }
    },
)

print(f"Created full pipeline job: {job.id}")
print(f"Job status: {job.status}")
curl -X POST "${SAFE_SYN_BASE_URL}/v1beta1/safe-synthesizer/jobs" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "safe-synthesizer-full",
    "project": "default",
    "spec": {
      "data_source": "<DATASET_ID>",
      "config": {
        "enable_synthesis": true,
        "enable_replace_pii": true,
        "data": {
          "holdout": 0.2,
          "max_holdout": 1000,
        },
        "training": {
          "pretrained_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "num_input_records_to_sample": "auto",
          "lora_r": 16,
          "lora_alpha_over_r": 1.0
          "group_training_examples_by": null
        },
        "generation": {
          "num_records": 1000,
          "temperature": 0.8,
          "repetition_penalty": 1.2,
          "use_structured_generation": true
        },
        "privacy": {
          "privacy_hyperparams": {"dp": true, "epsilon": 8.0, "delta": "auto"}
        },
        "evaluation": {"mia_enabled": true, "aia_enabled": true},
        "replace_pii": {
          "globals": {
            "classify": {"enable_classify": true},
            "ner": {"ner_threshold": 0.3},
            "locales": ["en_US"]
          },
          "steps": [{
            "rows": {
              "update": [
                {"condition": "column.entity == \"email\"", "value": "fake.email()"},
                {"condition": "column.entity == \"phone_number\"", "value": "fake.phone_number()"}
              ]
            }
          }]
        }
      }
    }
  }' | jq
Example Full Pipeline Response
{
  "id": "job-def456ghi789",
  "name": "safe-synthesizer-full",
  "project": "default",
  "status": "created",
  "created_at": "2024-01-15T11:00:00.000Z",
  "updated_at": "2024-01-15T11:00:00.000Z"
}

Configuration Options#

The NeMo Safe Synthesizer job configuration supports several key sections:

Job Control Flags#

  • enable_synthesis: Boolean flag to enable synthetic data generation

  • enable_replace_pii: Boolean flag to enable PII replacement processing
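The two flags are independent, so a job can run either stage on its own or both together, as in the examples above. The fragments below are illustrative config snippets, not complete job specs:

```python
# PII replacement only (as in the first example above)
redact_only = {"enable_replace_pii": True, "enable_synthesis": False}

# Synthesis only, skipping PII replacement
synthesize_only = {"enable_replace_pii": False, "enable_synthesis": True}

# Full pipeline: PII replacement followed by synthesis
full_pipeline = {"enable_replace_pii": True, "enable_synthesis": True}
```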

Configuration Sections#

  • config: Main configuration object containing:

    • replace_pii: PII detection and transformation rules (applied when enable_replace_pii is true)

    • data: Train/test data splitting for evaluation

    • training: Model training parameters and hyperparameters

    • generation: Synthetic data generation settings

    • privacy: Differential privacy settings and parameters

    • evaluation: Privacy evaluation metrics (MIA, AIA)

Holdout Parameters#

The holdout options automatically split your dataset into training and evaluation sets:

  • holdout: Fraction (0.0-1.0) or absolute number of records to hold out for evaluation

  • max_holdout: Maximum number of records to hold out (caps the holdout when a fraction is used)

  • random_state: Optional random seed for a reproducible holdout split
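The split itself happens server-side, but the interaction of a fractional holdout with max_holdout can be sketched in a few lines. This is an illustrative model of the documented behavior, not the service's actual implementation:

```python
def effective_holdout(num_records: int, holdout, max_holdout=None) -> int:
    """Approximate the number of records held out for evaluation.

    A float in [0.0, 1.0] is treated as a fraction of the dataset;
    an int is treated as an absolute record count.
    """
    if isinstance(holdout, float) and 0.0 <= holdout <= 1.0:
        count = int(num_records * holdout)
    else:
        count = int(holdout)
    # max_holdout caps the holdout when a fraction would exceed it.
    if max_holdout is not None:
        count = min(count, max_holdout)
    return count

# With the full-pipeline settings above (holdout=0.2, max_holdout=1000),
# 20% of a 10,000-record dataset would be 2,000 records, capped at 1,000.
capped = effective_holdout(10_000, 0.2, max_holdout=1_000)
```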

Job Management#

After creating a job, you can monitor its status, retrieve its results once it completes, and cancel it if it is no longer needed.
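For example, a simple polling loop can wait for a job to finish. The loop below is a generic sketch: the retrieval call it wraps and the exact set of terminal status strings are assumptions to verify against the jobs API reference:

```python
import time

# Assumed terminal states; confirm the actual status values in the API reference.
TERMINAL_STATUSES = {"completed", "error", "cancelled"}

def wait_for_job(get_status, poll_seconds: float = 10, timeout_seconds: float = 3600) -> str:
    """Poll until the job reaches a terminal state or the timeout elapses.

    `get_status` is any zero-argument callable returning the job's current
    status string, e.g. a lambda wrapping an SDK job-retrieval call.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job not in a terminal state after {timeout_seconds}s")
```

With the client from the examples above, this might look like `wait_for_job(lambda: client.beta.safe_synthesizer.jobs.retrieve(job.id).status)`; the `retrieve` method name is an assumption here.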

Error Handling#

Common issues when creating jobs:

  • Invalid dataset ID: Ensure the dataset exists in your Data Store

  • Configuration errors: Validate your JSON configuration syntax

  • Resource limits: Check if you have sufficient compute resources

  • Permission errors: Verify your API access permissions

Configuration Validation#

The API validates job configurations before processing. Common validation errors include:

  • Invalid model names: Ensure the pretrained model exists and is accessible

  • Malformed PII replacement rules: Check JSON syntax in transformation conditions

  • Resource conflicts: Verify training parameters are compatible with available resources

Example Validation Error Response
{
  "detail": "Invalid configuration: pretrained_model 'invalid-model' not found",
  "type": "validation_error"
}
# Example with error handling
try:
    job = client.beta.safe_synthesizer.jobs.create(
        name="my-job",
        project="default",
        spec={
            "data_source": dataset_id,
            "config": {
              "enable_replace_pii": False,
              "enable_synthesis": False,
            }
        },
    )
    print(f"Job created successfully: {job.id}")
    
except Exception as e:
    print(f"Failed to create job: {e}")
    # Handle specific error cases