Troubleshooting Guide for NVIDIA Earth-2 FourCastNet NIM#

This guide provides solutions to common issues encountered when deploying and using the NVIDIA Earth-2 FourCastNet NIM.

Container and Deployment Issues#

Container Fails to Start#

Error Downloading Model#

Symptoms:

```text
nimlib.exceptions.ManifestDownloadError: Error downloading manifest: The requested operation requires an API key, but none was found
```

Possible causes and solutions:

  • Missing or invalid NGC API key. Verify that the key is set in your environment, and export it if not:

```shell
echo $NGC_API_KEY
export NGC_API_KEY=<your_ngc_api_key>
```

Triton Model Failed to Launch (No GPU Visible)#

Symptoms:

  • Container logs mention no CUDA-capable device or no GPUs are available.

Possible causes and solutions:

  • NVIDIA Container Toolkit not installed or not configured. Verify that containers can see the GPU:

```shell
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

If this fails, reinstall or reconfigure the NVIDIA Container Toolkit.
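Reconfiguring usually means re-registering the NVIDIA runtime with Docker. A minimal sketch, assuming the NVIDIA Container Toolkit itself is already installed:

```shell
# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Confirm GPUs are visible from inside a container again.
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```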

Container Starts But Health Check Fails#

Symptoms:

  • /v1/health/ready returns {"status":"not ready"} or HTTP 503.

Possible causes and solutions:

  • Model still downloading or initializing. Check the container status and follow the logs for download progress:

```shell
docker ps
docker logs <container_name_or_id>
```
  • Insufficient shared memory. Restart the container with a larger --shm-size:

```shell
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
  -p 8000:8000 \
  -e NGC_API_KEY \
  nvcr.io/nim/nvidia/fourcastnet:2.0.0
```

If you have many CPU cores and still see shared-memory-related errors, increase --shm-size further (for example, to 8g).
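Rather than retrying inference requests by hand while the model initializes, you can poll the readiness endpoint from your client. A minimal sketch (the endpoint path matches the one above; the timing values are arbitrary choices, not service requirements):

```python
import time

import requests


def wait_until_ready(base_url="http://localhost:8000",
                     timeout_s=600, poll_interval_s=5):
    """Poll /v1/health/ready until the NIM reports ready or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            r = requests.get(f"{base_url}/v1/health/ready", timeout=10)
            if r.status_code == 200:
                return True  # Service is ready to accept inference requests.
        except requests.ConnectionError:
            pass  # Container may still be starting up; keep polling.
        time.sleep(poll_interval_s)
    return False
```

Call `wait_until_ready()` once after `docker run` and only start sending `/v1/infer` requests when it returns `True`.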

Port Already in Use#

Symptoms:

  • Error indicates 0.0.0.0:8000 is already in use.

Solution:

Identify the process using the port, then either stop it or map the NIM to a different host port:

```shell
sudo lsof -i :8000
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
  -p 8080:8000 \
  -e NGC_API_KEY \
  nvcr.io/nim/nvidia/fourcastnet:2.0.0
```

Update your client URLs to use port 8080 (for example, http://localhost:8080/v1/infer).

API and Inference Issues#

Request Timeout#

Symptoms:

  • requests.exceptions.ReadTimeout

  • Client-side timeout while the container continues computing

Solution:

Increase your client timeout and/or reduce simulation_length. For example:

```python
import requests

url = "http://localhost:8000/v1/infer"
data = {"input_time": "2020-01-01T00:00:00Z", "simulation_length": 40}
headers = {"accept": "application/x-tar"}

# Keep the file handle open for the duration of the upload,
# and allow up to 10 minutes for the forecast to complete.
with open("fcn_inputs.npy", "rb") as f:
    files = {"input_array": ("input_array", f)}
    r = requests.post(url, headers=headers, data=data, files=files, timeout=600)
r.raise_for_status()
```

HTTP 400 or 500 Errors on /v1/infer#

Common causes and solutions:

  • Invalid input_time format: must be RFC 3339 without fractional seconds (for example, 2023-01-01T00:00:00Z). See the OpenAPI schema on the API reference page.

  • simulation_length out of range: ensure it is within the allowed range for the service (see OpenAPI schema).

  • Invalid variables parameter: if provided, it must be a comma-separated list of variable IDs (for example, t2m,z500,u10m).
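You can catch the first of these problems client-side before sending a request. A minimal sketch of a pre-flight check for input_time (the authoritative grammar is the OpenAPI schema; this only checks the RFC 3339, no-fractional-seconds shape described above):

```python
from datetime import datetime


def is_valid_input_time(value: str) -> bool:
    """Accept RFC 3339 timestamps without fractional seconds,
    e.g. 2023-01-01T00:00:00Z."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except ValueError:
        return False
```

For example, `is_valid_input_time("2023-01-01T00:00:00Z")` is true, while a timestamp with fractional seconds or a missing trailing Z is rejected.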

Data Format Issues#

Input Array Shape or Dtype Mismatch#

Symptoms:

  • HTTP 500 response

  • Server-side error about unexpected tensor shape or dtype

Solution:

Validate the input array before sending it:

```python
import numpy as np

x = np.load("fcn_inputs.npy")
print("shape:", x.shape)
print("dtype:", x.dtype)
```

The OpenAPI schema requires input_array to have shape (batch size, 1, variables, latitudes, longitudes) and recommends FP32. The expected sizes for variables/latitudes/longitudes vary by model; refer to the model card linked in the quickstart guide.
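To turn that manual inspection into an automatic check, a small validator can report problems before the request leaves the client. A sketch, assuming only the shape and dtype constraints stated above; the expected variables/latitudes/longitudes counts are placeholders to be filled in from the model card:

```python
import numpy as np


def check_input_array(x, expected_tail=None):
    """Return a list of problems with a FourCastNet-style input array.

    expected_tail, if given, is the expected
    (variables, latitudes, longitudes) tuple from the model card.
    """
    problems = []
    if x.ndim != 5:
        problems.append(f"expected 5 dims (batch, 1, vars, lat, lon), got {x.ndim}")
    elif x.shape[1] != 1:
        problems.append(f"expected second dim of 1, got {x.shape[1]}")
    if x.dtype != np.float32:
        problems.append(f"expected float32, got {x.dtype}")
    if expected_tail is not None and x.ndim == 5 \
            and tuple(x.shape[2:]) != tuple(expected_tail):
        problems.append(f"expected trailing dims {expected_tail}, got {x.shape[2:]}")
    return problems
```

An empty return list means the array passes these checks; otherwise each entry describes one mismatch to fix before calling /v1/infer.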

Performance and Resource Issues#

Slow Requests#

Symptoms:

  • First request is slower than subsequent requests

  • Large forecasts take longer than expected

Solutions:

  • Start with a smaller simulation_length and a smaller variables subset to validate your pipeline.

  • Use the benchmarking script on the performance page to estimate expected latencies for your hardware.
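To separate one-off warm-up cost from steady-state latency yourself, you can time a few identical requests. A minimal sketch that wraps any zero-argument callable (for example, the requests.post call from the timeout example above):

```python
import time


def time_calls(fn, n=3):
    """Call fn n times and return per-call durations in seconds.

    The first duration typically includes warm-up work
    (model load, CUDA initialization, caches).
    """
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations
```

If the first duration dominates but later ones are stable, the slowness is warm-up rather than steady-state throughput.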

Air Gap / Offline Deployment Issues#

Model Not Found in Offline Mode#

Symptoms:

  • Container tries to download model assets but cannot reach NGC.

Solutions:

  • Use the caching workflow described in the deployment guide to pre-populate the model cache.

  • For fully offline deployments, ensure your mounted cache directory is readable by the container user.
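As an illustration of the second point, the pre-populated cache is bind-mounted into the container. A sketch only: the host directory and the in-container cache path below are assumptions, so substitute the values from the deployment guide:

```shell
# Host directory pre-populated via the caching workflow (assumed path).
export LOCAL_NIM_CACHE=~/.cache/nim

# Ensure the container user can read the cache, then mount it.
chmod -R a+rX "$LOCAL_NIM_CACHE"
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  nvcr.io/nim/nvidia/fourcastnet:2.0.0
```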

Getting Additional Help#

If you continue to experience issues after trying these solutions:

  • Check the container logs:

```shell
docker logs --tail 200 <container_name_or_id>
```
  • Include the following in any support request:

    • NIM container version

    • GPU model + driver version

    • Docker version

    • Input array shape/dtype

    • Request parameters (input_time, simulation_length, variables)