Troubleshooting Guide for NVIDIA Earth-2 Correction Diffusion NIM#
This guide provides solutions to common issues encountered when deploying and using the NVIDIA Earth-2 Correction Diffusion (CorrDiff) NIM.
Container and Deployment Issues#
Container Fails to Start#
Error Downloading Model:#
Symptoms:
nimlib.exceptions.ManifestDownloadError: Error downloading manifest: The requested operation requires an API key, but none was found
Possible Causes and Solutions:
Missing or Invalid NGC API Key
# Verify your NGC_API_KEY is set
echo $NGC_API_KEY

# If empty, set it
export NGC_API_KEY=<your_ngc_api_key>
Triton Model Failed to Launch#
Symptoms
In the logs, you might see something like this:
INFO:__main__:W1009 18:10:13.950191 862160 pinned_memory_manager.cc:273] "Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected"
INFO:__main__:I1009 18:10:13.950235 862160 cuda_memory_manager.cc:117] "CUDA memory pool disabled"
INFO:__main__:E1009 18:10:13.968103 862160 server.cc:248] "CudaDriverHelper has not been initialized."
INFO:__main__:I1009 18:10:13.970863 862160 model_lifecycle.cc:473] "loading: corrdiff:1"
INFO:__main__:E1009 18:10:13.971563 862160 model_lifecycle.cc:654] "failed to load 'corrdiff' version 1: Invalid argument: instance group corrdiff_0 of model corrdiff has kind KIND_GPU but no GPUs are available"
INFO:__main__:I1009 18:10:13.971589 862160 model_lifecycle.cc:789] "failed to load 'corrdiff'"
Possible Causes and Solutions:
NVIDIA Container Toolkit Not Properly Configured
# Test GPU access with Docker
docker run --rm --runtime=nvidia --gpus all ...
If this fails, reinstall or reconfigure the NVIDIA Container Toolkit.
Insufficient Docker Resources
Ensure Docker has access to adequate disk space (minimum 64GB free)
Ensure GPU has enough VRAM to support CorrDiff
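As a quick pre-flight check, the free-disk-space requirement can be verified from Python's standard library. This is a minimal sketch (the 64 GB threshold mirrors the requirement above; the function name is illustrative):

```python
import shutil

def has_free_space(path: str, required_gb: float = 64) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

print(has_free_space("/"))
```

In practice, point it at Docker's data root (commonly `/var/lib/docker`), since that is where images and the model cache land.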
Container Starts But Health Check Fails#
Symptoms:
Container is running (docker ps shows it)
/v1/health/ready returns {"status":"not ready"} or 503 error
/v1/health/live returns {"status":"not live"}
Possible Causes and Solutions:
Model Still Downloading
The initial model download takes several minutes (~4GB model).
# Check container logs for download progress
docker logs <container_name_or_id>
Wait for the message:
Started HTTPService at 0.0.0.0:8090
Insufficient Shared Memory
The CorrDiff NIM requires adequate shared memory for Triton’s Python backend.
# Restart container with increased shared memory
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
-p 8000:8000 \
-e NGC_API_KEY \
nvcr.io/nim/nvidia/corrdiff:1.1.0
Important
For systems with many CPU cores, you may need to increase --shm-size beyond the default 4g. Try 8g or larger if you encounter shared memory errors.
GPU Not Accessible
# Verify GPU visibility inside the container
docker exec <container_id> nvidia-smi
If GPUs are not visible, check your --gpus flag or CUDA_VISIBLE_DEVICES setting.
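Rather than checking the health endpoints by hand, the wait for readiness can be scripted. Below is a minimal polling sketch; the `check` callable, function name, and timings are illustrative (not part of the NIM API), and in practice `check` would issue a GET to `/v1/health/ready` and test for HTTP 200:

```python
import time

def wait_until_ready(check, timeout_s: float = 600, interval_s: float = 10) -> bool:
    """Poll `check()` until it returns True or `timeout_s` elapses."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# Example wiring (requires the `requests` package):
# ready = wait_until_ready(
#     lambda: requests.get("http://localhost:8000/v1/health/ready").status_code == 200
# )
```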
Port Already in Use#
Symptoms:
Error:
Socket "0.0.0.0:8000" already in use
Solution:
# Check what's using port 8000
sudo lsof -i :8000
# Use a different port
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
-p 8080:8000 \
-e NGC_API_KEY \
nvcr.io/nim/nvidia/corrdiff:1.1.0
# Update your inference URLs to use the new port
# http://localhost:8080/v1/infer
Data Format Issues#
Input Array Shape Mismatch#
Symptoms:
500 Internal Server Error on /v1/infer
Error message about array dimensions
This is an example of a missing variable:
INFO:__main__:2025-10-09 11:26:50 | ERROR | corrdiff:00 | 1.model:202 - The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0
"timestamp": "2025-10-09 11:26:50,444", "level": "ERROR", "message": "[StatusCode.INVALID_ARGUMENT] The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0"
ERROR:nimlib.nim_inference_api_builder.api:[StatusCode.INVALID_ARGUMENT] The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0
This is an example of a mismatch in the lat/lon domain:
"timestamp": "2025-10-09 11:29:44,818", "level": "ERROR", "message": "[StatusCode.INVALID_ARGUMENT] [request id: edb45a57-a53d-11f0-b140-d7c36a2dca65] unexpected shape for input 'INPUT' for model 'corrdiff'. Expected [-1,-1,-1,129,301], got [1,1,38,130,301]. "
ERROR:nimlib.nim_inference_api_builder.api:[StatusCode.INVALID_ARGUMENT] [request id: edb45a57-a53d-11f0-b140-d7c36a2dca65] unexpected shape for input 'INPUT' for model 'corrdiff'. Expected [-1,-1,-1,129,301], got [1,1,38,130,301].
Expected Format:
The CorrDiff US GEFS-HRRR model expects:
Shape: (1, 1, 39, 129, 301)
Batch size: 1 (required)
Ensemble: 1 (required)
Variables: 39 (7 GEFS select + 31 GEFS pressure + 1 lead time)
Latitude: 129 points
Longitude: 301 points
Data Type: float32 (recommended)
Bounding Box: [225°E, 21°N] to [300°E, 53°N] (CONUS)
Solution:
# Verify your input array shape
import numpy as np
input_array = np.load("corrdiff_inputs.npy")
print(f"Shape: {input_array.shape}")
print(f"Dtype: {input_array.dtype}")
# Expected output:
# Shape: (1, 1, 39, 129, 301)
# Dtype: float32
If the shape is incorrect, regenerate the input using the Earth2Studio script from the quickstart guide.
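The 129x301 grid follows directly from the CONUS bounding box at 0.25° resolution, which gives a quick way to sanity-check a custom preprocessing pipeline:

```python
import numpy as np

# 21N..53N and 225E..300E at 0.25 deg, endpoints inclusive
lat = np.arange(21.0, 53.0 + 0.25, 0.25)
lon = np.arange(225.0, 300.0 + 0.25, 0.25)
print(len(lat), len(lon))  # 129 301
```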
Missing or Incorrect Variables#
Symptoms:
Model produces unexpected output
Variables appear in wrong order
Required Variables (in order):
GEFS Select Variables (0.25°):
u10m, v10m, t2m, r2m, sp, msl, tcwv
GEFS Pressure Variables (interpolated to 0.25°):
u1000, u925, u850, u700, u500, u250, v1000, v925, v850, v700, v500, v250, z1000, z925, z850, z700, z500, z200, t1000, t925, t850, t700, t500, t100, r1000, r925, r850, r700, r500, r100
Lead Time Field: Integer field denoting forecast lead time (in 3-hour increments)
Solution:
Use the exact variable lists from the quickstart guide and ensure they’re concatenated in the correct order.
Check that the z variables are in units of geopotential height, not geopotential.
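To illustrate how the 39 channels are assembled, here is a shape-only sketch of the concatenation order. The zero-filled placeholder arrays stand in for the fetched GEFS fields; in a real pipeline they come from the quickstart script:

```python
import numpy as np

select = np.zeros((1, 1, 7, 129, 301), dtype=np.float32)     # 7 GEFS select variables
pressure = np.zeros((1, 1, 31, 129, 301), dtype=np.float32)  # 31 GEFS pressure variables
lead = np.full((1, 1, 1, 129, 301), 4, dtype=np.float32)     # e.g. 12 h lead = 4 x 3 h

# Channel order must be: select, then pressure, then lead time
input_array = np.concatenate([select, pressure, lead], axis=2)
print(input_array.shape)  # (1, 1, 39, 129, 301)
```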
API and Inference Issues#
Request Timeout#
Symptoms:
requests.exceptions.ReadTimeout
TimeoutError: Request timed out
Solution:
Increase the timeout parameter in your request. Inference time varies by GPU and diffusion steps.
# Python example with appropriate timeout
import requests
# For 8 steps: use timeout >= 60s
# For 18 steps: use timeout >= 120s
# For multiple samples: multiply by number of samples
r = requests.post(
url,
headers=headers,
data={"samples": 2, "steps": 18},
files=files,
timeout=300 # 5 minutes for safety
)
# Bash example (curl has no default timeout)
curl -X POST \
--max-time 300 \
-F "input_array=@corrdiff_inputs.npy" \
-F "samples=2" \
-F "steps=18" \
-o output.tar \
http://localhost:8000/v1/infer
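The timeout guidance above can be folded into a small helper. The per-step cost below is an assumed ballpark consistent with the comments above, not a measured figure, so tune it for your GPU:

```python
def suggested_timeout(samples: int, steps: int,
                      per_step_s: float = 7.0, floor_s: float = 60.0) -> float:
    """Rough client-side timeout: scale with samples x steps, never below a floor."""
    return max(floor_s, samples * steps * per_step_s)

print(suggested_timeout(2, 18))  # 252.0
```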
400 Bad Request#
Symptoms:
HTTP 400 response
Error message in response body
Common Causes:
Invalid Input Parameters
Check parameter ranges:
samples: 1-64
steps: 1-64
seed: 0-2147483647
sampler: “euler” (only supported value)
Malformed Input Array
# Verify the array can be read
import numpy as np

try:
    arr = np.load("corrdiff_inputs.npy")
    print(f"Loaded successfully: {arr.shape}")
except Exception as e:
    print(f"Error loading array: {e}")
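The parameter ranges above can also be enforced client-side so that obviously invalid requests fail fast. A minimal sketch (the function name is illustrative):

```python
def validate_params(samples: int, steps: int, seed: int, sampler: str) -> None:
    """Raise ValueError if a request parameter is outside the documented range."""
    if not 1 <= samples <= 64:
        raise ValueError("samples must be in 1-64")
    if not 1 <= steps <= 64:
        raise ValueError("steps must be in 1-64")
    if not 0 <= seed <= 2147483647:
        raise ValueError("seed must be in 0-2147483647")
    if sampler != "euler":
        raise ValueError('sampler must be "euler"')

validate_params(samples=2, steps=18, seed=0, sampler="euler")  # passes silently
```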
Empty or Corrupted Output#
Symptoms:
Output tar file is empty
Cannot extract tar file
Output arrays are all zeros or NaN
Solution:
Check Response Status
r = requests.post(url, headers=headers, data=data, files=files, timeout=180)
print(f"Status: {r.status_code}")
if r.status_code != 200:
    print(f"Error: {r.content}")
Verify Tar File
# List contents
tar -tvf output.tar

# Should show files like:
# -rw-r--r-- 0/0 60555392 1970-01-01 00:00 000_000.npy
# -rw-r--r-- 0/0 60555392 1970-01-01 00:00 001_000.npy
Check Output Array
import numpy as np
import tarfile

with tarfile.open("output.tar") as tar:
    for member in tar.getmembers():
        arr_file = tar.extractfile(member)
        data = np.load(arr_file)
        print(f"{member.name}: shape={data.shape}, min={data.min()}, max={data.max()}")
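The all-zeros / NaN symptom can be flagged directly with a small helper dropped into an inspection loop like the one above. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def looks_corrupted(arr: np.ndarray) -> bool:
    """True if the array is all zeros or contains NaN/Inf values."""
    return bool(np.isnan(arr).any() or np.isinf(arr).any() or not arr.any())

print(looks_corrupted(np.zeros((2, 2))))      # True
print(looks_corrupted(np.array([1.0, 2.0])))  # False
```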
Performance and Resource Issues#
GPU Out of Memory (OOM)#
Symptoms:
CUDA out of memory errors in container logs
Container crashes during inference
Health check fails after first few inferences
Solution:
Reduce the target batch size using the EARTH2NIM_TARGET_BATCHSIZE environment variable:
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
-p 8000:8000 \
-e NGC_API_KEY \
-e EARTH2NIM_TARGET_BATCHSIZE=4 \
nvcr.io/nim/nvidia/corrdiff:1.1.0
Slow Inference Performance#
Symptoms:
Inference takes much longer than expected
First inference is very slow
Solutions:
Model Warm-up Required
The CorrDiff NIM uses dynamic compilation. The first 2-3 inferences will be slower as the model warms up.
# Perform warm-up inferences
for _ in range(3):
    r = requests.post(url, headers=headers, data=data, files=files, timeout=180)

# Now measure actual performance
Suboptimal Batch Size
For GPUs with sufficient memory, ensure EARTH2NIM_TARGET_BATCHSIZE is not set too low:
# For H100/A100 80GB, use default or explicit 8
-e EARTH2NIM_TARGET_BATCHSIZE=8
GPU Not Being Used
# Monitor GPU utilization during inference
nvidia-smi dmon -s u
If GPU utilization is low, check that the container has proper GPU access.
Network Latency
If the client is remote from the NIM, network latency can be significant for large data transfers.
# Use streaming for large sample counts
# See quickstart guide for streaming example
Multiple Samples Take Too Long#
Symptoms:
Requesting many samples (>8) takes excessively long
Need results as they’re generated
Solution:
Use streaming responses to receive samples as they’re completed:
import io
import tarfile
import numpy as np
import requests
url = "http://localhost:8000/v1/infer"
files = {"input_array": ("input_array", open("corrdiff_inputs.npy", "rb"))}
data = {"samples": 16, "steps": 8, "seed": 0}
headers = {"accept": "application/x-tar"}
with requests.post(url, headers=headers, files=files, data=data,
timeout=300, stream=True) as resp:
resp.raise_for_status()
with tarfile.open(fileobj=resp.raw, mode="r|") as tar_stream:
for member in tar_stream:
arr_file = io.BytesIO()
arr_file.write(tar_stream.extractfile(member).read())
arr_file.seek(0)
data = np.load(arr_file)
print(f"Received {member.name}: {data.shape}")
# Process data immediately
Earth2Studio Integration Issues#
ImportError or Module Not Found#
Symptoms:
ModuleNotFoundError: No module named 'earth2studio'
Import errors for Earth2Studio components
Solution:
# Ensure Earth2Studio is installed (>= 0.9.0)
pip install --upgrade earth2studio
# Verify installation
python -c "import earth2studio; print(earth2studio.__version__)"
Data Source Connection Errors#
Symptoms:
Connection timeout when fetching GEFS data
HTTP errors from NOAA servers
Solutions:
Network Connectivity
Ensure your system can access external data sources:
# Test connection to NOAA servers
curl -I https://www.ncei.noaa.gov/
Data Availability
GEFS data may not be immediately available for very recent timestamps. Use data from at least 6-12 hours ago:
from datetime import datetime, timedelta

# Use data from 12 hours ago, rounded down to the nearest 6-hour cycle
time = datetime.utcnow() - timedelta(hours=12)
time = time.replace(hour=(time.hour // 6) * 6, minute=0, second=0)
Rate Limiting
If fetching multiple time steps, add delays between requests:
from time import sleep  # avoid shadowing the `time` datetime variable

for lead_time in lead_times:
    input_array = fetch_input_gefs(time, lead_time)
    sleep(1)  # Avoid rate limiting
Incorrect Data Fetching#
Symptoms:
Array shape is incorrect after fetching
Variables missing or in wrong order
Solution:
Use the exact fetching script from the quickstart guide. Common mistakes:
Wrong GEFS data source class
# Correct - use both data sources
from earth2studio.data import GEFS_FX, GEFS_FX_721x1440

ds_gefs = GEFS_FX(cache=True)
ds_gefs_select = GEFS_FX_721x1440(cache=True, member="gec00")
Incorrect bounding box
# Correct crop indices for CONUS
select_data[:, 0, :, 148:277, 900:1201]  # Results in 129x301
Missing lead time field
# Must include lead time as an integer field
lead_hour = int(lead_time.total_seconds() // (3 * 60 * 60)) * np.ones((1, 1, 129, 301))
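The crop indices can be sanity-checked against the 721x1440 global 0.25° grid: on that grid, row 148 corresponds to 53°N, row 276 to 21°N, and columns 900-1200 span 225°E-300°E.

```python
import numpy as np

global_field = np.zeros((721, 1440))     # 0.25 deg global grid
conus = global_field[148:277, 900:1201]  # CONUS crop from the quickstart
print(conus.shape)  # (129, 301)
```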
Docker and System Issues#
Docker Daemon Not Running#
Symptoms:
Cannot connect to the Docker daemon
Is the docker daemon running?
Solution:
# Start Docker daemon (Ubuntu/Debian)
sudo systemctl start docker
# Enable Docker to start on boot
sudo systemctl enable docker
# Check Docker status
sudo systemctl status docker
Insufficient Disk Space#
Symptoms:
Container fails to pull
Model download fails
“No space left on device” errors
Solution:
# Check available disk space (need ~64GB)
df -h
# Clean up Docker resources
docker system prune -a
# Remove unused images
docker image prune -a
# Remove unused volumes
docker volume prune
Docker Permission Denied#
Symptoms:
permission denied while trying to connect to the Docker daemon socket
Solution:
# Add user to docker group
sudo usermod -aG docker $USER
# Log out and log back in, then verify
docker ps
Air Gap / Offline Deployment Issues#
Model Not Found in Offline Mode#
Symptoms:
Container tries to download model even with cached files
“Unable to access NGC” errors in offline environment
Solution:
Verify Cache Structure
ls -la ~/.cache/nim/

# Should contain model directories with checkpoint files
Pre-populate Cache
On a machine with NGC access:
export LOCAL_NIM_CACHE=~/nim_offline_cache
mkdir -p $LOCAL_NIM_CACHE

# Run once with NGC access to download model
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
-p 8000:8000 \
-e NGC_API_KEY \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-u $(id -u) \
nvcr.io/nim/nvidia/corrdiff:1.1.0

# Copy $LOCAL_NIM_CACHE to offline system
Launch Without NGC_API_KEY
# On offline system with pre-populated cache
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
-p 8000:8000 \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-u $(id -u) \
nvcr.io/nim/nvidia/corrdiff:1.1.0
Getting Additional Help#
If you continue to experience issues after trying these solutions:
Check Container Logs
docker logs <container_name_or_id>
docker logs --tail 100 <container_name_or_id>  # Last 100 lines
Verify Prerequisites
Review the prerequisites page to ensure all requirements are met.
Check NIM Configuration
See the configuration guide for advanced settings.
Performance Tuning
Consult the performance page for optimization guidance.
Contact Support
For enterprise support, visit NVIDIA AI Enterprise Support.
Tip
When reporting issues, include:
NIM container version
GPU model and driver version
Docker version and configuration
Complete error messages from container logs
Input array shape and dtype
Request parameters (samples, steps, etc.)