Troubleshooting

Failure modes for local runs and cluster dispatch. For cluster-specific setup, see Dispatch SDG to a Cluster.

Local Run Failures

Unknown column type: 'person' or similar ValueError

Cause: The YAML declares a column type that step.py’s build_columns() does not recognise. Currently supported types: category, seed, llm_text, llm_structured, llm_judge.

Solution: Check the spelling against the supported types listed above. The person and datetime samplers are not supported yet; to use them, extend step.py as described in the extension reference in Config Schema.

config must declare a non-empty columns: list

Cause: The YAML has an empty or missing columns: block.

Solution: Add at least one column spec. A minimal config must include at least one llm_text or llm_structured column that produces output content.
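Both column errors above can be caught before a run with a small pre-flight check. The following is a minimal sketch, not step.py's actual build_columns(): it assumes the YAML has already been loaded into a dict and that each column spec stores its type under a `type` key. The helper name `validate_columns` and the `name` key in the sample are hypothetical. The supported type list is copied from the entry above.

```python
# Hypothetical pre-flight check; not the real step.py implementation.
SUPPORTED_TYPES = {"category", "seed", "llm_text", "llm_structured", "llm_judge"}


def validate_columns(config: dict) -> list[str]:
    """Return a list of human-readable problems found in the columns block."""
    columns = config.get("columns")
    if not isinstance(columns, list) or not columns:
        return ["config must declare a non-empty columns: list"]

    problems = []
    for index, spec in enumerate(columns):
        # Assumption: each column spec is a mapping with a "type" key.
        col_type = spec.get("type") if isinstance(spec, dict) else None
        if col_type not in SUPPORTED_TYPES:
            problems.append(f"column {index}: unknown column type: {col_type!r}")
    return problems


if __name__ == "__main__":
    sample = {"columns": [{"name": "topic", "type": "category"},
                          {"name": "who", "type": "person"}]}
    for problem in validate_columns(sample):
        print(problem)
```

An empty result means neither of the two errors should occur for that config, under the assumptions stated above.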

Jinja2 template references an undefined variable

Cause: A prompt uses {{ column_name }}, but column_name is not a declared column, a seed field in seed_dataset.fields, or an earlier column in the list.

Solution: Add the column or seed field, or fix the typo. Run with preview=true num_records=2 to catch this cheaply before a full generation job.
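A typo in a template reference can also be spotted offline, without calling a model. The sketch below is an approximation under stated assumptions: it uses a regular expression rather than the Jinja2 parser, so it only recognises simple references such as `{{ topic }}` or `{{ topic | upper }}`, and it does not check column ordering. The function name `undefined_references` and the sample names are hypothetical.

```python
import re

# Matches the leading identifier of simple references like {{ topic }}.
# Assumption: prompts use plain variable names, not arbitrary expressions.
_REFERENCE = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")


def undefined_references(prompt: str, available: set[str]) -> set[str]:
    """Return names used in the prompt that are not in the available set."""
    return {name for name in _REFERENCE.findall(prompt) if name not in available}


if __name__ == "__main__":
    available = {"topic", "persona"}  # declared columns plus seed fields
    prompt = "Write about {{ topic }} for {{ audiance }}."
    print(undefined_references(prompt, available))
```

For templates that use loops or more complex expressions, Jinja2's own `jinja2.meta.find_undeclared_variables` on a parsed template is likely the more robust choice.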

Model health check fails at startup

Cause: Data Designer probes the model endpoint at startup. If the model is not available from the configured provider, or if NVIDIA_API_KEY is not set, the probe fails and the step exits before generating any records.

Solution:

  • Confirm that NVIDIA_API_KEY is set in your shell: export NVIDIA_API_KEY="...".

  • Add skip_health_check: true to the model spec to bypass the probe (useful for local or vLLM endpoints that aren’t in the provider catalog).
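Whether the key is actually visible to the process can be confirmed with a few lines of Python before launching the step. This is a minimal sketch; the helper name `missing_env_vars` is hypothetical and is not part of Data Designer.

```python
import os


def missing_env_vars(required: list[str], env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(["NVIDIA_API_KEY"])
    print("missing:", missing if missing else "none")
```

An empty value counts as missing here, since an empty key would presumably fail the probe in the same way as an unset one.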

Output JSONL is empty or has fewer records than num_records

Cause: Data Designer skips or drops records where the structured output doesn’t validate against output_format, or where the LLM returns a refusal.

Solution:

  • Run with preview=true and inspect a sample for refusals or schema mismatches.

  • Simplify the output_format if the model consistently fails to match a complex schema.

  • Raise max_tokens if responses are being cut off mid-JSON.
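To quantify the shortfall, the output file can be scanned line by line and compared against num_records. The sketch below assumes one JSON object per line; the function name `summarize_jsonl` and the sample records are hypothetical.

```python
import json


def summarize_jsonl(lines, expected: int) -> dict:
    """Count parseable records and report the shortfall against expected."""
    valid = 0
    invalid = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            json.loads(line)
            valid += 1
        except json.JSONDecodeError:
            invalid += 1
    return {"valid": valid, "invalid": invalid, "missing": max(expected - valid, 0)}


if __name__ == "__main__":
    sample = ['{"text": "ok"}', '{"text": "cut off', ""]
    print(summarize_jsonl(sample, expected=3))
```

Because the function accepts any iterable of strings, an open file handle for the output JSONL can be passed in directly. A large missing count with no invalid lines would be consistent with records being dropped during generation rather than corrupted on write.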

Cluster Dispatch Failures

Job exits immediately with No such file or directory (launch script)

Cause: nemo_run_dir is not on shared storage. The data-mover sidecar writes the launch script to nemo_run_dir, but the main container cannot see it if the path is local to a different node or not mounted.

Solution: Set nemo_run_dir to a path on the shared NFS mount and add the corresponding mounts entry to the env.toml profile. See Dispatch SDG to a Cluster.

data-designer import error inside the container

Cause: The NeMo container image does not pre-install data-designer.

Solution: Add to startup_commands:

startup_commands = [
    "python -m pip install --quiet --break-system-packages 'data-designer==0.5.5'"
]

Job rejected or OOM-killed immediately on a CPU node

Cause: The default shared_memory_size (65536 MB) exceeds the available RAM on the CPU node type.

Solution: Set shared_memory_size = 1024 in the env.toml profile. The SDG step does not use shared memory, so the lower value is safe.

NVIDIA_API_KEY not available inside the container

Cause: NVIDIA_API_KEY is not automatically forwarded to the job environment the way HF_TOKEN and WANDB_API_KEY are.

Solution: Declare it explicitly in the env.toml profile:

[lepton_sdg_data_designer.env_vars]
NVIDIA_API_KEY = "${oc.env:NVIDIA_API_KEY}"

Then set it in your shell before submitting the job: export NVIDIA_API_KEY="...".