Dispatch SDG to a Cluster#
This guide covers configuring an env.toml profile and running sdg/data_designer on Lepton or Slurm. Generation is CPU-only (no GPUs needed) and calls a remote LLM endpoint, so the step fits naturally on a CPU node with outbound network access.
env.toml Profile Shape#
Add a profile to env.toml in the repository root. The example below targets a Lepton CPU node:
[lepton_sdg_data_designer]
executor = "lepton"
container_image = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
nemo_run_dir = "/mnt/shared/nemo-run"
nodes = 1
gpus_per_node = 0
resource_shape = "cpu.large"
node_group = "your-node-group"
shared_memory_size = 1024
can_be_preempted = true
queue_priority = "mid-4000"
startup_commands = [
"python -m pip install --quiet --break-system-packages 'data-designer==0.5.5'"
]
mounts = [
{ path = "/your-nfs-source", mount_path = "/mnt/shared", from = "node-nfs:your-nfs-id" }
]
[lepton_sdg_data_designer.env_vars]
NVIDIA_API_KEY = "${oc.env:NVIDIA_API_KEY}"
Run#
$ uv run --no-sync nemotron steps run sdg/data_designer -c default --batch lepton_sdg_data_designer num_records=1000
Use --run instead of --batch to stream logs interactively.
Known Gotchas#
These are the failure modes that commonly affect first-time cluster SDG runs.
data-designer is not pre-installed in the container#
The NeMo container image does not include data-designer. Install it at startup via startup_commands:
startup_commands = [
"python -m pip install --quiet --break-system-packages 'data-designer==0.5.5'"
]
Do not omit --break-system-packages. Without it, pip refuses to install into the system Python on recent NeMo images.
NVIDIA_API_KEY is not forwarded automatically#
Unlike HF_TOKEN and WANDB_API_KEY, NVIDIA_API_KEY is not automatically forwarded to the container. Declare it explicitly in the env_vars section:
[lepton_sdg_data_designer.env_vars]
NVIDIA_API_KEY = "${oc.env:NVIDIA_API_KEY}"
Set it in your local shell before submitting the job:
$ export NVIDIA_API_KEY="your-api-key"
$ uv run --no-sync nemotron steps run sdg/data_designer -c default --batch lepton_sdg_data_designer num_records=1000
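Because the key is read from the local shell at submit time, it can help to fail fast when it is missing. The sketch below is an illustrative pre-submit check, not something the CLI provides. `missing_env_vars` is a hypothetical helper name.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty.

    `environ` defaults to the current process environment; pass a dict to
    check a different mapping.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(["NVIDIA_API_KEY"])
    if missing:
        # Stop before submitting a job that cannot authenticate.
        raise SystemExit(f"Set these variables before submitting: {', '.join(missing)}")
```

An empty string is treated the same as an unset variable, since an exported but blank key is as unusable as a missing one.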
Container image: always look up, never guess#
Do not invent image tags. nemo:latest does not exist on nvcr.io. Check src/nemotron/steps/sdg/data_designer/step.py header comments or src/nemotron/steps/env/env_toml/config/lepton.yaml for known-good image references before setting container_image.
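A simple guard can at least catch an untagged or floating image reference before submission. This is a minimal sketch with hypothetical helper names (`image_tag`, `is_pinned`). It inspects only the string and does not query nvcr.io, so it cannot confirm that a tag exists. The files named above remain the source of known-good references.

```python
from typing import Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag of a container image reference, or None if it has none."""
    # Ignore a digest suffix such as "@sha256:..." if one is present.
    name = image.split("@", 1)[0]
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as "localhost:5000".
    last_component = name.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return None
    return last_component.rsplit(":", 1)[1]


def is_pinned(image: str) -> bool:
    """True when the image carries an explicit tag other than "latest"."""
    tag = image_tag(image)
    return tag is not None and tag != "latest"
```

With the image used in this guide, `is_pinned("nvcr.io/nvidia/nemo:25.11.nemotron_3_nano")` is true, while `nvcr.io/nvidia/nemo:latest` and a bare `nvcr.io/nvidia/nemo` are both rejected.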
Preemption and queue-priority fields were not wired (now fixed)#
can_be_preempted, can_preempt, and queue_priority are now forwarded from env.toml to LeptonExecutor. Older versions of the repo silently ignored these fields. If you are on one of those versions, upgrade before expecting preemption scheduling to take effect.
Slurm Profile#
For Slurm, replace the Lepton-specific fields with Slurm equivalents. The startup_commands and env_vars gotchas apply equally:
[slurm-sdg]
executor = "slurm"
container_image = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
nemo_run_dir = "/lustre/team/nemo-run"
nodes = 1
gpus_per_node = 0
run_partition = "cpu"
batch_partition = "cpu"
startup_commands = [
"python -m pip install --quiet --break-system-packages 'data-designer==0.5.5'"
]
[slurm-sdg.env_vars]
NVIDIA_API_KEY = "${oc.env:NVIDIA_API_KEY}"
Tip
On clusters where the default partition requires GPUs (for example, NVIDIA’s dlw cluster), set run_partition and batch_partition to a CPU-capable partition. gpus_per_node = 0 alone is not sufficient — the partition itself must accept zero-GPU jobs.
Verify Before Scaling#
Before submitting a large batch, run a small preview through the cluster profile:
$ uv run --no-sync nemotron steps run sdg/data_designer -c default --run lepton_sdg_data_designer preview=true num_records=2
Confirm the job reaches Running, the model alias check succeeds, and two records are returned before submitting the full job.
Next Steps#
env.toml reference: docs/nemo_runspec/nemo-run.md (full profile field reference).
CLI flags: CLI Reference.
Troubleshooting: Troubleshooting (full failure-mode reference).