Troubleshooting#

This page lists common symptoms when you run nemotron steps run translate/nemo_curator and shows the field, flag, or environment variable to inspect first. Each table pairs a symptom with a concrete remedy. For stage flow and design rationale, see the explanation pages linked from Concepts.

Authentication and Credentials#

Symptom	What to do
HTTP 401 or 403 from the chat-completions endpoint, or a Curator log line about a missing API key	Confirm the variable named in `server.api_key_env` is exported in the shell that launches the run. The starter `default.yaml` expects `NVIDIA_API_KEY`; export it with `export NVIDIA_API_KEY="<api-key>"` and rerun. See Run LLM Translation.
FAITH scoring fails with a credentials error even though `backend` is `nmt`, `google`, or `aws`	FAITH always uses the large language model (LLM) client under `server`. Keep `server.api_key_env` populated whenever `faith_eval.enabled` is `true`, or set `faith_eval.enabled=false` for a diagnostic run. See Run FAITH Evaluation.
Google backend rejects the request with a permission or project error	Confirm application default credentials are present in the environment that runs the step. Do not paste secrets into `default.yaml`. See Run Google or AWS Translation.

Model and Endpoint Configuration#

Symptom	What to do
HTTP 404 or a “model not found” message from the LLM endpoint	Hosted catalogs retire identifiers frequently. List the models your tenant currently exposes and pin `server.model` to one of them before large batch jobs. See Run LLM Translation.
Google translation rejects the request because `project_id` is missing	API version `v3` requires project metadata. Set both `google.project_id` and `google.api_version=v3`, or downgrade `google.api_version` to a release that does not require the project. See Run Google or AWS Translation.
NMT requests time out before the service responds	Raise `nmt.timeout` to match observed server latency, lower `nmt.batch_size` so each request returns sooner, and confirm `nmt.server_url` resolves from the host that runs the step. See Run NMT Translation.

Throttling and Concurrency#

Symptom	What to do
HTTP 429 responses, bursty failures, or sustained slowdowns from a hosted LLM endpoint	Lower `max_concurrent_requests` in your YAML and rerun on a smaller slice of data. Confirm your tenant quota covers the planned batch size. See Translation YAML Reference.
A self-hosted NMT service returns errors under load	Reduce `nmt.max_concurrent_requests` and `nmt.batch_size` together, then raise them only after the service reports healthy throughput. See Run NMT Translation.

Inputs and Output Layout#

Symptom	What to do
Reader errors about mixed file types when `input_path` points at a directory containing both JSONL and Parquet files	Curator readers expect one record format per directory. Split the inputs into separate directories for JSON Lines (JSONL) and Parquet, or set `input_path` to a single file. See Input and Output Format.
Ray worker logs show `Creating virtual environment at: .venv` followed by `ModuleNotFoundError: No module named 'ray'`	Export `RAY_ENABLE_UV_RUN_RUNTIME_ENV=0` before running local `uv run --no-sync nemotron steps run translate/nemo_curator ...`. This keeps Ray workers in the synchronized Nemotron environment.
Empty JSONL input fails with `No data read from files in task file_group_0`	The reader found no records. Treat the run as an empty-input validation failure, confirm the input path is correct, and rerun with a non-empty file or directory.
Output shards do not appear under `output_dir` after the run reports success	The writer emits partitioned files, not a single merged file. Inspect the shard pattern under `output_dir` and confirm `output_format` matches what downstream consumers expect. See Input and Output Format.

FAITH Evaluation#

Symptom	What to do
Every translated row is dropped after FAITH runs	The `faith_eval.threshold` value may be too strict for the chosen scorer model. Lower the threshold, set `faith_eval.filter_enabled=false` while you tune, or override the scorer with `faith_eval.model_name`. See Run FAITH Evaluation.
FAITH scores look inconsistent across runs of the same data	Pin both `server.model` and `faith_eval.model_name` to specific identifiers so scorer drift does not move the threshold under you. See FAITH Evaluation Inside Translation.