Troubleshooting NeMo Auditor

Use this documentation to troubleshoot issues that can arise when you run audit jobs with NVIDIA NeMo Auditor.

Troubleshooting Audit Jobs

  1. Start by checking the progress of the audit job.

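    # Python SDK
    import os
    from nemo_microservices import NeMoMicroservices

    client = NeMoMicroservices(base_url=os.getenv("AUDITOR_BASE_URL"))

    # job_id is the ID of the audit job to inspect. Reading it from the JOB_ID
    # environment variable mirrors the curl request and is an assumption.
    job_id = os.environ["JOB_ID"]
    status = client.beta.audit.jobs.get_status(job_id)
    print(status)

    # Equivalent request with curl
    curl "${AUDITOR_BASE_URL}/v1beta1/audit/jobs/${JOB_ID}/status" \
      -H "Accept: application/json" | jq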
    

    In the following example output, the job is active, but probes_complete is 0, so no probes have finished yet.

    AuditJobStatus(message=None, progress=Progress(probes_complete=0, probes_total=22), status='ACTIVE')
    
    {
      "status": "ACTIVE",
      "message": null,
      "progress": {
        "probes_total": 22,
        "probes_complete": 0
      }
    }
    
  2. If the message in the job status response does not indicate an error, the next step is to check the logs.

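    # Python SDK (reuses the imports and job_id from step 1)
    client = NeMoMicroservices(base_url=os.getenv("AUDITOR_BASE_URL"))
    logs = client.beta.audit.jobs.get_logs(job_id)
    # Print only the last 10 lines of the log
    print("\n".join(logs.split("\n")[-10:]))

    # Equivalent request with curl
    curl "${AUDITOR_BASE_URL}/v1beta1/audit/jobs/${JOB_ID}/logs" \
      -H "Accept: text/plain" | tail -n 10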
    

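    The logs often report 404 or 429 HTTP status codes. For these status codes, check the URI for the job target. A 404 (Not Found) usually points to an incorrect URI, and a 429 (Too Many Requests) means the endpoint at that URI is rate limiting requests.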

    Another possible failure is an unresponsive target model, where connections eventually time out. In that case, the logs contain entries like the following:

    2025-09-02 14:19:22,928  DEBUG  response_closed.complete
    2025-09-02 14:19:22,924  DEBUG  Encountered httpx.TimeoutException
    Traceback (most recent call last):
    File "/app/.venv/lib/python3.11/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
     yield
    ...
    
  3. If the issue is still unclear, get the target for the job and then try to run inference with the target model.
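
The checks in steps 1 and 2 can be combined into a small triage helper. The following is a minimal sketch, not part of the NeMo Auditor SDK: the function name triage_audit_job, the hint strings, and the log-matching heuristic are all assumptions made for illustration. It takes the status response as a dictionary shaped like the JSON output in step 1, plus optional log text, and returns a one-line hint.

```python
import re

# Hypothetical helper for illustration; it is not part of the NeMo Auditor SDK.
# Matches 404 or 429 as standalone numbers, skipping digits that directly
# follow ',', '.', ':' or another digit (for example, timestamp milliseconds).
_HTTP_CODE = re.compile(r"(?<![\d,.:])(404|429)(?!\d)")


def triage_audit_job(status: dict, log_text: str = "") -> str:
    """Return a one-line hint from a status payload and recent log text."""
    # Step 1: an explicit message in the status response takes priority.
    message = status.get("message")
    if message:
        return f"status message reported: {message}"

    # Step 2: look for the log patterns described in the steps above.
    match = _HTTP_CODE.search(log_text)
    if match:
        return f"logs show HTTP {match.group(1)}: check the URI for the job target"
    if "TimeoutException" in log_text:
        return "logs show timeouts: the target model may be unresponsive"

    # Otherwise, summarize probe progress.
    progress = status.get("progress") or {}
    done = progress.get("probes_complete", 0)
    total = progress.get("probes_total", 0)
    if status.get("status") == "ACTIVE" and done == 0:
        return f"job is active with 0 of {total} probes complete: check the logs"
    return f"{done} of {total} probes complete: no obvious problem found"


# Status payload copied from the example output in step 1.
example_status = {
    "status": "ACTIVE",
    "message": None,
    "progress": {"probes_total": 22, "probes_complete": 0},
}
print(triage_audit_job(example_status))
```

The log matching is a plain-text heuristic, because the exact log format is not specified here. Treat a hint as a pointer to the relevant step and not as a diagnosis.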