Troubleshooting NeMo Auditor#
Use this documentation to troubleshoot issues that can arise when you run audit jobs with NVIDIA NeMo Auditor.
Troubleshooting Audit Jobs#
The first step for troubleshooting audit jobs is to check for audit job progress.
import os from nemo_microservices import NeMoMicroservices client = NeMoMicroservices(base_url=os.getenv("AUDITOR_BASE_URL")) status = client.beta.audit.jobs.get_status(job_id) print(status.model_dump_json(indent=2))
curl "${AUDITOR_BASE_URL}/v1beta1/audit/jobs/${JOB_ID}/status" \ -H "Accept: application/json" | jq
In the following example output, the job is active, but the completed probes value is
0.{ "job_id": "job-5cxic84crn8k1sxf4xaqb2", "status": "active", "status_details": { "progress": { "nprobes": 1, "nprobes_complete": 0 } }, "error_details": null, "steps": [ { "name": "audit", "status": "active", "status_details": {}, "error_details": {}, "tasks": [ { "id": "91479a57c5374c88a4a55be1c3d45c55", "status": "active", "status_details": {}, "error_details": {}, "error_stack": null } ] } ] }
If the message in the job status response does not indicate an error, the next step is to check the logs.
client = NeMoMicroservices(base_url=os.getenv("AUDITOR_BASE_URL")) logs = client.beta.audit.jobs.get_logs(job_id) print("".join(log.message for log in logs.data[-10:]))
curl "${AUDITOR_BASE_URL}/v1beta1/audit/jobs/${JOB_ID}/logs" \ -H "Accept: text/plain"
The logs often report 404 or 429 HTTP status codes. For these status codes, check the URI for the job target.
Another trouble scenario is that the target model is unresponsive and connections eventually time out:
2025-09-02 14:19:22,928 DEBUG response_closed.complete 2025-09-02 14:19:22,924 DEBUG Encountered httpx.TimeoutException Traceback (most recent call last): File "/app/.venv/lib/python3.11/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions yield ...
If the issue is still unclear, get the target for the job and the try to run inference with the target model.
Jobs on Kubernetes#
On Kubernetes, when you start an audit job, the core-api microservice starts an instance of NeMo Auditor microservice as a Kubernetes job. Getting the status for a failed job reports the name of the job and the pod:
{
"job_id": "job-ppdpvptwim5w5b5hxgeumw",
"status": "error",
"status_details": {},
"error_details": {
"pods": [
{
"name": "jobstep-7f5cramwfhfithi9fuieat-5jlnf",
"phase": "Failed",
"conditions": [
{
"type": "PodReadyToStartContainers",
"status": "False",
"reason": null,
"message": null
},
{
"type": "Initialized",
"status": "True",
"reason": null,
"message": null
},
{
"type": "Ready",
"status": "False",
"reason": "PodFailed",
"message": null
},
{
"type": "ContainersReady",
"status": "False",
"reason": "PodFailed",
"message": null
},
{
"type": "PodScheduled",
"status": "True",
"reason": null,
"message": null
}
],
"containers": [
{
"name": "nemo-job-task",
"ready": false,
"restart_count": 0,
"terminated": {
"exit_code": 1,
"reason": "Error",
"message": null
}
}
]
}
]
},
"steps": [
{
"name": "audit",
"status": "error",
"status_details": {},
"error_details": {
"pods": [
{
"name": "jobstep-7f5cramwfhfithi9fuieat-5jlnf",
"phase": "Failed",
"conditions": [
{
"type": "PodReadyToStartContainers",
"status": "False",
"reason": null,
"message": null
},
{
"type": "Initialized",
"status": "True",
"reason": null,
"message": null
},
{
"type": "Ready",
"status": "False",
"reason": "PodFailed",
"message": null
},
{
"type": "ContainersReady",
"status": "False",
"reason": "PodFailed",
"message": null
},
{
"type": "PodScheduled",
"status": "True",
"reason": null,
"message": null
}
],
"containers": [
{
"name": "nemo-job-task",
"ready": false,
"restart_count": 0,
"terminated": {
"exit_code": 1,
"reason": "Error",
"message": null
}
}
]
}
]
},
"tasks": [
{
"id": "f45369be-5a06-4dba-ba9f-8fc49a235092",
"status": "error",
"status_details": {},
"error_details": {},
"error_stack": null
}
]
}
]
}
You can run kubectl logs for the failed pod to get more information about the failed job.