Troubleshooting Guide for NVIDIA Earth-2 FourCastNet NIM#
This guide provides solutions to common issues encountered when deploying and using the NVIDIA Earth-2 FourCastNet NIM.
Container and Deployment Issues#
Container Fails to Start#
Error Downloading Model#
Symptoms:
nimlib.exceptions.ManifestDownloadError: Error downloading manifest: The requested operation requires an API key, but none was found
Possible causes and solutions:
Missing or invalid NGC API key. Verify that the key is set, and export a valid key if it is missing:
echo $NGC_API_KEY
export NGC_API_KEY=<your_ngc_api_key>
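The same pre-flight check can be scripted before launching the container. A minimal sketch (the `NGC_API_KEY` variable name comes from this guide; the helper itself is hypothetical):

```python
import os

def ngc_key_present() -> bool:
    """Return True if NGC_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("NGC_API_KEY", "").strip())
```

Note that this only confirms the key is set, not that it is valid; an invalid key still produces the manifest download error above.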
Triton Model Failed to Launch (No GPU Visible)#
Symptoms:
Container logs mention "no CUDA-capable device" or "no GPUs are available".
Possible causes and solutions:
NVIDIA Container Toolkit not installed or not configured. Verify that Docker can access the GPU:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
If this fails, reinstall or reconfigure the NVIDIA Container Toolkit.
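A quick host-side variant of the same check can be scripted as well. This is a sketch: it only confirms that `nvidia-smi` runs successfully on the host, not that the container runtime is configured:

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """True if nvidia-smi exists on the host and can list at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi", "-L"],
                                capture_output=True, timeout=10)
        return result.returncode == 0
    except (subprocess.SubprocessError, OSError):
        return False
```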
Container Starts But Health Check Fails#
Symptoms:
/v1/health/ready returns {"status":"not ready"} or HTTP 503.
Possible causes and solutions:
Model still downloading or initializing. Check progress in the container logs:
docker ps
docker logs <container_name_or_id>
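Rather than retrying by hand, a client can poll the readiness endpoint until the model finishes initializing. A minimal sketch, assuming the default port mapping of 8000:

```python
import json
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:8000/v1/health/ready",
                     timeout_s=600, interval_s=10):
    """Poll /v1/health/ready until it reports ready, or give up."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200 and json.load(resp).get("status") == "ready":
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # container may still be starting up
        time.sleep(interval_s)
    return False
```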
Insufficient shared memory. Restart the container with a larger --shm-size:
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
-p 8000:8000 \
-e NGC_API_KEY \
nvcr.io/nim/nvidia/fourcastnet:2.0.0
If you have many CPU cores and see shared-memory-related errors, try increasing --shm-size further (for example, 8g).
Port Already in Use#
Symptoms:
The error indicates that 0.0.0.0:8000 is already in use.
Solution:
sudo lsof -i :8000
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
-p 8080:8000 \
-e NGC_API_KEY \
nvcr.io/nim/nvidia/fourcastnet:2.0.0
Update your client URLs to use port 8080 (for example, http://localhost:8080/v1/infer).
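The same port check can be done from Python without lsof. A sketch that probes the local TCP port directly:

```python
import socket

def port_in_use(port: int) -> bool:
    """True if something is already listening on the given local TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        return s.connect_ex(("127.0.0.1", port)) == 0
```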
API and Inference Issues#
Request Timeout#
Symptoms:
requests.exceptions.ReadTimeout: the client times out while the container continues computing.
Solution:
Increase your client timeout and/or reduce simulation_length. For example:
import requests
url = "http://localhost:8000/v1/infer"
data = {"input_time": "2020-01-01T00:00:00Z", "simulation_length": 40}
headers = {"accept": "application/x-tar"}
with open("fcn_inputs.npy", "rb") as f:
    files = {"input_array": ("input_array", f)}
    r = requests.post(url, headers=headers, data=data, files=files, timeout=600)
r.raise_for_status()
HTTP 400 or 500 Errors on /v1/infer#
Common causes and solutions:
Invalid input_time format: must be RFC 3339 without fractional seconds (for example, 2023-01-01T00:00:00Z). See the OpenAPI schema on the API reference page.
simulation_length out of range: ensure it is within the allowed range for the service (see OpenAPI schema).
Invalid variables parameter: if provided, it must be a comma-separated list of variable IDs (for example, t2m,z500,u10m).
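Assuming the constraints above, a client-side pre-check can catch these problems before the request is sent. The helper names are illustrative; the authoritative rules are in the OpenAPI schema:

```python
from datetime import datetime

def valid_input_time(value: str) -> bool:
    """RFC 3339 instant without fractional seconds, e.g. 2023-01-01T00:00:00Z."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except ValueError:
        return False

def valid_variables(value: str) -> bool:
    """Comma-separated variable IDs with no empty or padded entries."""
    parts = value.split(",")
    return all(part and part == part.strip() for part in parts)
```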
Data Format Issues#
Input Array Shape or Dtype Mismatch#
Symptoms:
HTTP 500 response
Server-side error about unexpected tensor shape or dtype
Solution:
Validate the input array before sending it:
import numpy as np
x = np.load("fcn_inputs.npy")
print("shape:", x.shape)
print("dtype:", x.dtype)
The OpenAPI schema requires input_array to have shape (batch size, 1, variables, latitudes, longitudes) and recommends FP32. The expected sizes for variables/latitudes/longitudes vary by model; refer to the model card linked in the quickstart guide.
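Building on the print-based check above, a helper can assert the generic constraints directly. The exact variable/latitude/longitude sizes are model-specific, so only the 5-D layout and dtype are checked here; the helper name is illustrative:

```python
import numpy as np

def check_input_array(path="fcn_inputs.npy"):
    """Assert the documented (batch, 1, variables, latitudes, longitudes) layout."""
    x = np.load(path)
    assert x.ndim == 5, f"expected 5 dimensions, got {x.ndim}"
    assert x.shape[1] == 1, f"expected size 1 on axis 1, got {x.shape[1]}"
    assert x.dtype == np.float32, f"expected float32, got {x.dtype}"
    return x.shape
```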
Performance and Resource Issues#
Slow Requests#
Symptoms:
First request is slower than subsequent requests
Large forecasts take longer than expected
Solutions:
Start with a smaller simulation_length and a smaller variables subset to validate your pipeline.
Use the benchmarking script on the performance page to estimate expected latencies for your hardware.
Air Gap / Offline Deployment Issues#
Model Not Found in Offline Mode#
Symptoms:
Container tries to download model assets but cannot reach NGC.
Solutions:
Use the caching workflow described in the deployment guide to pre-populate the model cache.
For fully offline deployments, ensure your mounted cache directory is readable by the container user.
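The readability check can be scripted as a small sketch; the path argument is whatever host directory you mount as the model cache:

```python
import os

def cache_readable(path: str) -> bool:
    """True if the mounted cache directory exists and is readable."""
    return os.path.isdir(path) and os.access(path, os.R_OK)
```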
Getting Additional Help#
If you continue to experience issues after trying these solutions:
Check container logs:
docker logs --tail 200 <container_name_or_id>
Include the following in any support request:
NIM container version
GPU model + driver version
Docker version
Input array shape/dtype
Request parameters (input_time, simulation_length, variables)
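The version details in the checklist above can be gathered with a short script. This is a sketch: the commands are assumed to exist on the host, and the input array shape and request parameters must still be added by hand:

```python
import platform
import subprocess

def collect_diagnostics() -> dict:
    """Collect host-side version info for a support request."""
    def run(cmd):
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
            return out.stdout.strip() or "unavailable"
        except (subprocess.SubprocessError, OSError):
            return "unavailable"

    return {
        "python": platform.python_version(),
        "docker": run(["docker", "--version"]),
        "nvidia_driver": run(["nvidia-smi", "--query-gpu=driver_version",
                              "--format=csv,noheader"]),
    }
```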