Troubleshoot

Resolve common issues encountered when using NVIDIA NIM for Batch Molecular Dynamics.

Diagnostic Logs

Errors typically appear in the following two places:

  • Docker logs: Check the container logs for startup, out-of-memory (OOM), and model-load issues using docker logs alchemi-bmd.

  • HTTP responses: Check the 4xx or 5xx status and the response body for validation and request errors.

Container Fails to Start

If the container fails to start or cannot download the model, verify your NGC API key configuration:

  1. Export NGC_API_KEY in the shell and pass it to the container using -e NGC_API_KEY.

  2. Verify that the key is valid:

    curl -s -H "Authorization: Bearer $NGC_API_KEY" https://api.ngc.nvidia.com/v2/org
    
  3. Ensure that the host is logged in to the NGC Docker registry:

    docker login nvcr.io
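
Before launching the container, a quick pre-flight check can catch a missing or empty key. The following Python sketch only inspects the environment and does not contact NGC; the helper name check_ngc_api_key is illustrative and not part of the NIM. Because a child process sees only exported variables, running it from the same shell also confirms that the key was exported rather than merely set.

```python
import os


def check_ngc_api_key(env=None):
    """Return a list of problems found with NGC_API_KEY (empty list if none)."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("NGC_API_KEY")
    problems = []
    if key is None:
        problems.append("NGC_API_KEY is not set or not exported")
    elif not key.strip():
        problems.append("NGC_API_KEY is set but empty")
    elif key != key.strip():
        problems.append("NGC_API_KEY has leading or trailing whitespace")
    return problems
```

An empty list means that the variable is present and looks well-formed. It does not prove that the key is valid, so the curl check above is still needed.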
    

Out-of-Memory (OOM) Errors

GPU out-of-memory errors appear in the container logs (for example, CUDA out-of-memory). If an OOM error occurs during inference, do the following:

  1. Check the maximum supported system size for the GPU using the /v1/status endpoint:

    curl -s http://localhost:8000/v1/status | python3 -m json.tool
    

    The max_system_size field indicates the maximum number of atoms per batch.

  2. Reduce the batch size by setting the environment variable:

    -e ALCHEMI_NIM_BATCH_SIZE=10000
    
  3. Use a GPU with more memory. Refer to the Support Matrix for tested hardware.
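
To catch oversized inputs before they reach the GPU, the atom count of each structure can be compared against max_system_size. The Python sketch below assumes that max_system_size is a top-level field of the /v1/status JSON response and that the service listens on localhost:8000, as in the curl example above. The helper names are illustrative.

```python
import json
import urllib.request


def fetch_max_system_size(base_url="http://localhost:8000"):
    """Read max_system_size from /v1/status.

    Assumption: the field sits at the top level of the JSON response.
    """
    with urllib.request.urlopen(f"{base_url}/v1/status", timeout=10) as resp:
        status = json.load(resp)
    return int(status["max_system_size"])


def oversized_structures(atom_counts, max_system_size):
    """Return the indices of structures with more atoms than the limit."""
    return [i for i, n in enumerate(atom_counts) if n > max_system_size]
```

For example, oversized_structures([100, 20000, 5000], 10000) flags only the second structure (index 1). Any flagged structure cannot fit in a single batch on the current GPU.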

Simulation Instability (NaN Energies)

If simulations produce NaN energies or crash mid-trajectory, do the following:

  • Reduce the timestep: The default dt of 1.0 fs might be too large.

    "config": {"dt": 0.5}
    
  • Check input coordinates: Ensure that coordinates are physically reasonable. Overlapping atoms or extreme distortions cause numerical instability.

  • Adjust NPT settings: Reduce barostat_diag_max and barostat_shear_max to use more conservative cell moves.

  • Verify PBC mode: Periodic structures require ALCHEMI_NIM_PBC=true (the default) and a valid cell in the request. For isolated molecules, set ALCHEMI_NIM_PBC=false.
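
The coordinate check can be automated on the client side before a request is sent. The following Python sketch flags non-finite coordinates and atom pairs that sit closer than a chosen threshold. It assumes positions are given in angstroms; the 0.5 default is an illustrative cutoff, not a value taken from the NIM. Distances are computed without periodic images, so close contacts across a cell boundary are not detected.

```python
import math
from itertools import combinations


def coordinate_problems(positions, min_distance=0.5):
    """Return human-readable problems for a list of (x, y, z) positions."""
    problems = []
    finite = []
    for i, p in enumerate(positions):
        if all(math.isfinite(c) for c in p):
            finite.append((i, p))
        else:
            problems.append(f"atom {i} has a non-finite coordinate")
    # Pairwise check is O(n^2); fine for a quick sanity pass on modest systems.
    for (i, p), (j, q) in combinations(finite, 2):
        d = math.dist(p, q)
        if d < min_distance:
            problems.append(f"atoms {i} and {j} are only {d:.3f} apart")
    return problems
```

An empty result does not guarantee a stable trajectory, but a non-empty one usually points to an input problem that a smaller timestep will not fix.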

PBC Mode Mismatch

Periodic boundary conditions (PBC) mode is fixed at container startup. Mixing periodic and non-periodic structures in the same container instance is not supported.

  • For inorganic solids (periodic systems), start the container with ALCHEMI_NIM_PBC=true, which is the default.

  • For isolated molecules, start the container with ALCHEMI_NIM_PBC=false.

  • To run both workloads, launch two containers: one with ALCHEMI_NIM_PBC=true for periodic systems and one with ALCHEMI_NIM_PBC=false for isolated molecules.

A PBC mode mismatch is reported in the container logs or returned as an error response from the API.

Model Loading Failure

If the container fails to load an externally mounted model, do the following:

  • Verify that the volume mount path matches ALCHEMI_NIM_MODEL_PATH. Refer to Custom Models for per-model launch examples.

  • Check file permissions. The container might run as a non-root user. Ensure that the model files are readable.

  • If the container crashes when a custom model is mounted read-only (:ro), mount the directory as read-write instead (omit :ro). The NIM might require write access to the cache.
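
A permissions problem can be spotted on the host before the directory is mounted. The following Python sketch lists files and subdirectories that lack the read bit for "others". This is a conservative heuristic, since the user ID that the container runs as is not known here. The helper name is illustrative, and the check is intended for a Linux host.

```python
import os
import stat


def files_not_world_readable(model_dir):
    """Return sorted paths under model_dir that lack the 'others' read bit."""
    flagged = []
    for root, dirs, files in os.walk(model_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            mode = os.stat(path).st_mode
            # S_IROTH is the read permission bit for users outside owner/group.
            if not mode & stat.S_IROTH:
                flagged.append(path)
    return sorted(flagged)
```

Directories also need the execute bit to be traversable, which this sketch does not check. A flagged path is not necessarily a problem if the container user happens to own the files.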

Empty or Invalid Atoms Input

Empty or malformed atoms structures might be rejected with a 422 validation error. Ensure that each structure has at least two atoms (if required by the model) and that the request body is well-formed JSON.

Service Status During Startup

During startup or restart, the NIM might report a non-ready status while it loads models and estimates batch sizes. The readiness endpoint might return a technical message (for example, Triton connection refused) until the service is fully operational.

Wait until /v1/health/ready returns {"status":"ready"} before sending inference requests.
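
The wait can be scripted so that client code does not send requests too early. The following Python sketch polls the readiness endpoint until it returns {"status":"ready"} or a timeout expires. It assumes the service is reachable at localhost:8000, as in the /v1/status example; the function names, the 600-second timeout, and the 5-second interval are illustrative choices.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_ready(url):
    """Return the parsed JSON body of the readiness endpoint, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, HTTP errors, and non-JSON bodies all mean "not ready yet".
        return None


def wait_until_ready(url="http://localhost:8000/v1/health/ready",
                     timeout_s=600, interval_s=5,
                     fetch=fetch_ready, sleep=time.sleep):
    """Poll until the body is {"status": "ready"}; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch(url)
        if isinstance(body, dict) and body.get("status") == "ready":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() before the first inference request returns True once the service reports ready. The fetch and sleep parameters exist only so that the loop can be exercised without a running container.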

Health Check Not Ready

If the /v1/health/ready endpoint does not return {"status":"ready"} after several minutes, do the following:

  • Wait. The NIM performs batch size estimation at startup. This process can take 2 to 5 minutes depending on the GPU.

  • Check the container logs for errors:

    docker logs alchemi-bmd
    
  • Ensure that the GPU has enough free memory. Other processes consuming GPU memory reduce the available batch size.