Troubleshoot#
Resolve common issues encountered when using NVIDIA NIM for Batched Geometry Relaxation.
Diagnostic Logs#
Errors typically appear in the following two places:
Docker logs: Check the container logs for startup, out-of-memory (OOM), and model-load issues:
docker logs alchemi-bgr
HTTP responses: Check HTTP 4xx or 5xx status codes and the response body for validation and request errors.
Container Fails to Start#
If the container fails to start or cannot download the model, verify your NGC API key configuration:
Export your NGC_API_KEY in the shell and pass it to the docker run command using -e NGC_API_KEY.
Verify that the key is valid:
curl -s -H "Authorization: Bearer $NGC_API_KEY" https://api.ngc.nvidia.com/v2/org
Ensure that the host is logged in to the NGC Docker registry:
docker login nvcr.io
Out-of-Memory Errors#
GPU out-of-memory (OOM) errors appear in the container logs (for example,
CUDA out-of-memory). If an OOM error occurs during inference, do the following:
Check the maximum supported system size for the GPU by using the /v1/status endpoint:
curl -s http://localhost:8000/v1/status | python3 -m json.tool
The max_system_size field indicates the maximum number of atoms per batch.
Reduce the batch size by setting the environment variable in your docker run command:
-e ALCHEMI_NIM_BATCH_SIZE=10000
Use a GPU with more memory. Refer to the Support Matrix for tested hardware.
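A client can compare its structures against the reported limit before submitting them. The following is a minimal sketch, assuming that the /v1/status response is a JSON object with a top-level integer max_system_size field. The helper names fetch_status and find_oversized are hypothetical and are not part of the NIM API.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url="http://localhost:8000"):
    """Return the parsed JSON body of GET /v1/status."""
    with urlopen(f"{base_url}/v1/status", timeout=10) as resp:
        return json.load(resp)


def find_oversized(atom_counts, status):
    """Return indices of structures whose atom count exceeds max_system_size.

    Assumption: `status` has a top-level integer field `max_system_size`.
    """
    limit = int(status["max_system_size"])
    return [i for i, n in enumerate(atom_counts) if n > limit]
```

Calling find_oversized(atom_counts, fetch_status()) against a running container would list the structures that cannot fit in a single batch on the current GPU.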
Structures Do Not Converge#
If optimization results show "converged": false for many structures, do the
following:
Increase ALCHEMI_NIM_BGR_MAX_STEPS beyond the default of 2000:
-e ALCHEMI_NIM_BGR_MAX_STEPS=5000
Loosen the force tolerance ALCHEMI_NIM_BGR_OPTTOL (the default is 0.005 eV/Å):
-e ALCHEMI_NIM_BGR_OPTTOL=0.01
Use the per-request opttol parameter for individual problematic structures.
Check whether the input structures have reasonable initial geometry. Severely distorted structures might not converge within the step limit.
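To apply a looser per-request tolerance only where it is needed, first separate out the structures that did not converge. The following is a minimal sketch, assuming that the results are available as a list of dictionaries, one per structure, each carrying the "converged" flag. The surrounding response layout and the helper name are assumptions.

```python
def unconverged_indices(results):
    """Return positions of results whose "converged" flag is not True.

    Assumption: `results` is a list of dicts, one per structure.
    A missing "converged" key is treated as not converged.
    """
    return [i for i, r in enumerate(results) if r.get("converged") is not True]
```

The returned indices identify which input structures to inspect for distorted geometry or to resubmit with a different opttol value.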
PBC Mode Mismatch#
Periodic boundary conditions (PBC) mode is fixed at container startup. Mixing periodic and non-periodic structures in the same container instance is not supported.
For inorganic solids (periodic systems), start the container with ALCHEMI_NIM_PBC=true, which is the default.
For isolated molecules, start the container with ALCHEMI_NIM_PBC=false.
To run both workloads, launch separate containers for periodic and non-periodic workloads.
A PBC mode mismatch might be reported in the container logs or returned as an error response from the API.
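Because one container serves only one PBC mode, a mixed dataset has to be split before submission. The following is a sketch, assuming that each structure is a dictionary with a hypothetical boolean "periodic" key set by your own preprocessing. The NIM request schema might represent periodicity differently.

```python
def split_by_pbc(structures, key="periodic"):
    """Split structures into (periodic, non_periodic) lists.

    Assumption: each structure is a dict with a boolean under `key`.
    The key name is hypothetical, not taken from the NIM schema.
    """
    periodic, non_periodic = [], []
    for structure in structures:
        target = periodic if structure[key] else non_periodic
        target.append(structure)
    return periodic, non_periodic
```

Each list can then be sent to the container instance that was started with the matching ALCHEMI_NIM_PBC value.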
Model Loading Failure#
If the container fails to load an externally mounted model, do the following:
Verify that the volume mount path matches ALCHEMI_NIM_MODEL_PATH. Refer to Custom Models for per-model launch examples.
Check file permissions. The container might run as a non-root user, so ensure that the model files are readable.
If the container crashes or fails to start when using a read-only volume mount (:ro) for custom models, mount the model or cache directory as read-write (omit :ro). The NIM might require write access to the cache.
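File permission problems can be checked on the host before launching the container. The following sketch lists entries under a model path that the current user cannot read. The helper name is hypothetical, and the user inside the container might differ from the host user running this check, so treat the result as a first pass only.

```python
import os


def unreadable_paths(model_path):
    """Return paths at or under model_path that the current user cannot read.

    A path that does not exist is reported as a problem as well.
    """
    if not os.path.exists(model_path):
        return [model_path]
    if os.path.isfile(model_path):
        return [] if os.access(model_path, os.R_OK) else [model_path]

    problems = []
    for root, dirs, files in os.walk(model_path):
        for name in dirs + files:
            path = os.path.join(root, name)
            if not os.access(path, os.R_OK):
                problems.append(path)
    return problems
```

An empty list means that every file and directory was readable by the user who ran the check.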
Empty or Invalid Atoms Input#
Empty atoms arrays or malformed structures might be rejected with a 422
validation error. Ensure each structure has at least two atoms and that the
request body is well-formed. Single-atom structures are not supported by all
backends.
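Catching these cases on the client avoids a round trip that ends in a 422 response. The following is a sketch, assuming that each structure is a dictionary whose atoms are stored in a list under a hypothetical "atoms" key. Adjust the key to match the actual request schema.

```python
def invalid_structures(structures, atoms_key="atoms", min_atoms=2):
    """Return (index, reason) pairs for structures likely to be rejected.

    Assumption: each structure is a dict with a list under `atoms_key`.
    The key name is hypothetical, not taken from the NIM schema.
    """
    problems = []
    for i, structure in enumerate(structures):
        atoms = structure.get(atoms_key) if isinstance(structure, dict) else None
        if not isinstance(atoms, list):
            problems.append((i, "missing or malformed atoms"))
        elif len(atoms) < min_atoms:
            problems.append((i, f"only {len(atoms)} atom(s)"))
    return problems
```

An empty result means that every structure passed these two checks. It does not guarantee that the server accepts the request.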
Service Status During Startup or Restart#
During startup or restart, the NIM might report a non-ready status while it
loads models and estimates batch sizes. The readiness endpoint might return a
technical message (for example, Triton connection refused) until the service
is fully operational.
Wait until /v1/health/ready returns {"status":"ready"} before sending
inference requests.
If the /v1/health/ready endpoint does not return {"status":"ready"} after
several minutes, do the following:
Wait. The NIM performs batch size estimation at startup. This process can take 2 to 5 minutes depending on the GPU.
Check the container logs for errors:
docker logs alchemi-bgr
Ensure that the GPU has enough free memory. Other processes consuming GPU memory reduce the available batch size.
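The wait for readiness can be automated with a small polling loop. The following is a minimal sketch that uses only the Python standard library. The function names, the 600-second timeout, and the 5-second interval are illustrative choices, not documented values.

```python
import json
import time
from urllib.request import urlopen


def fetch_ready(base_url="http://localhost:8000"):
    """Return the parsed body of /v1/health/ready, or None if unavailable."""
    try:
        with urlopen(f"{base_url}/v1/health/ready", timeout=5) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures and HTTP error statuses;
        # ValueError covers a body that is not valid JSON.
        return None


def wait_until_ready(fetch=fetch_ready, timeout_s=600, interval_s=5,
                     sleep=time.sleep):
    """Poll until the body is {"status": "ready"}; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch()
        if isinstance(body, dict) and body.get("status") == "ready":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

If wait_until_ready() returns False, continue with the container log and GPU memory checks described above.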