Troubleshooting Guide for NVIDIA Earth-2 Correction Diffusion NIM#

This guide provides solutions to common issues encountered when deploying and using the NVIDIA Earth-2 Correction Diffusion (CorrDiff) NIM.

Container and Deployment Issues#

Container Fails to Start#

Error Downloading Model#

Symptoms:

nimlib.exceptions.ManifestDownloadError: Error downloading manifest: The requested operation requires an API key, but none was found

Possible Causes and Solutions:

  1. Missing or Invalid NGC API Key

    # Verify your NGC_API_KEY is set
    echo $NGC_API_KEY
    
    # If empty, set it
    export NGC_API_KEY=<your_ngc_api_key>
    

Triton Model Failed to Launch#

Symptoms:

In the logs, you might see something like this:

INFO:__main__:W1009 18:10:13.950191 862160 pinned_memory_manager.cc:273] "Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected"
INFO:__main__:I1009 18:10:13.950235 862160 cuda_memory_manager.cc:117] "CUDA memory pool disabled"
INFO:__main__:E1009 18:10:13.968103 862160 server.cc:248] "CudaDriverHelper has not been initialized."
INFO:__main__:I1009 18:10:13.970863 862160 model_lifecycle.cc:473] "loading: corrdiff:1"
INFO:__main__:E1009 18:10:13.971563 862160 model_lifecycle.cc:654] "failed to load 'corrdiff' version 1: Invalid argument: instance group corrdiff_0 of model corrdiff has kind KIND_GPU but no GPUs are available"
INFO:__main__:I1009 18:10:13.971589 862160 model_lifecycle.cc:789] "failed to load 'corrdiff'"

Possible Causes and Solutions:

  1. NVIDIA Container Toolkit Not Properly Configured

    # Test GPU access with Docker
    docker run --rm --runtime=nvidia --gpus all ...
    

    If this fails, reinstall or reconfigure the NVIDIA Container Toolkit.

  2. Insufficient Docker Resources

    • Ensure Docker has access to adequate disk space (minimum 64GB free)

    • Ensure GPU has enough VRAM to support CorrDiff
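
    The disk-space check can be scripted. A minimal sketch (the path and the ~64 GiB threshold here mirror the requirement above):

    ```python
    import shutil

    def free_gib(path="/"):
        """Return free disk space at `path` in GiB."""
        return shutil.disk_usage(path).free / 2**30

    # The NIM needs roughly 64 GiB free for the image and model download.
    print(f"Free space: {free_gib():.1f} GiB")
    ```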

Container Starts But Health Check Fails#

Symptoms:

  • Container is running (docker ps shows it)

  • /v1/health/ready returns {"status":"not ready"} or 503 error

  • /v1/health/live returns {"status":"not live"}

Possible Causes and Solutions:

  1. Model Still Downloading

    The initial model download takes several minutes (~4GB model).

    # Check container logs for download progress
    docker logs <container_name_or_id>
    

    Wait for the message: Started HTTPService at 0.0.0.0:8090
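
    Rather than re-checking by hand, you can poll the readiness endpoint until the model finishes loading. This is a sketch using only the standard library; the base URL and timeout values are assumptions for a default local deployment:

    ```python
    import time
    import urllib.error
    import urllib.request

    def wait_until_ready(base_url="http://localhost:8000", timeout_s=600, interval_s=5):
        """Poll /v1/health/ready until it returns 200, or give up after timeout_s."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=5) as resp:
                    if resp.status == 200:
                        return True
            except (urllib.error.URLError, OSError):
                pass  # server not accepting connections yet
            time.sleep(interval_s)
        return False
    ```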

  2. Insufficient Shared Memory

    The CorrDiff NIM requires adequate shared memory for Triton’s Python backend.

    # Restart container with increased shared memory
    docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
        -p 8000:8000 \
        -e NGC_API_KEY \
        nvcr.io/nim/nvidia/corrdiff:1.1.0
    

    Important

    For systems with many CPU cores, you may need to increase --shm-size beyond the default 4g. Try 8g or larger if you encounter shared memory errors.

  3. GPU Not Accessible

    # Verify GPU visibility inside the container
    docker exec <container_id> nvidia-smi
    

    If GPUs are not visible, check your --gpus flag or CUDA_VISIBLE_DEVICES setting.

Port Already in Use#

Symptoms:

  • Error: Socket "0.0.0.0:8000" already in use

Solution:

# Check what's using port 8000
sudo lsof -i :8000

# Use a different port
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
    -p 8080:8000 \
    -e NGC_API_KEY \
    nvcr.io/nim/nvidia/corrdiff:1.1.0

# Update your inference URLs to use the new port
# http://localhost:8080/v1/infer

Data Format Issues#

Input Array Shape Mismatch#

Symptoms:

  • 500 Internal Server Error on /v1/infer

  • Error message about array dimensions

This is an example of a missing variable:

INFO:__main__:2025-10-09 11:26:50 | ERROR    | corrdiff:00 |  1.model:202 - The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0
"timestamp": "2025-10-09 11:26:50,444", "level": "ERROR", "message": "[StatusCode.INVALID_ARGUMENT] The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0"
ERROR:nimlib.nim_inference_api_builder.api:[StatusCode.INVALID_ARGUMENT] The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0

This is an example of a mismatch in the lat/lon domain:

"timestamp": "2025-10-09 11:29:44,818", "level": "ERROR", "message": "[StatusCode.INVALID_ARGUMENT] [request id: edb45a57-a53d-11f0-b140-d7c36a2dca65] unexpected shape for input 'INPUT' for model 'corrdiff'. Expected [-1,-1,-1,129,301], got [1,1,38,130,301]. "
ERROR:nimlib.nim_inference_api_builder.api:[StatusCode.INVALID_ARGUMENT] [request id: edb45a57-a53d-11f0-b140-d7c36a2dca65] unexpected shape for input 'INPUT' for model 'corrdiff'. Expected [-1,-1,-1,129,301], got [1,1,38,130,301]. 

Expected Format:

The CorrDiff US GEFS-HRRR model expects:

  • Shape: (1, 1, 39, 129, 301)

    • Batch size: 1 (required)

    • Ensemble: 1 (required)

    • Variables: 39 (7 GEFS select + 31 GEFS pressure + 1 lead time)

    • Latitude: 129 points

    • Longitude: 301 points

  • Data Type: float32 (recommended)

  • Bounding Box: [225°, 21°N] to [300°, 53°N] (CONUS)

Solution:

# Verify your input array shape
import numpy as np
input_array = np.load("corrdiff_inputs.npy")
print(f"Shape: {input_array.shape}")
print(f"Dtype: {input_array.dtype}")

# Expected output:
# Shape: (1, 1, 39, 129, 301)
# Dtype: float32

If the shape is incorrect, regenerate the input using the Earth2Studio script from the quickstart guide.
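If you only want to exercise the endpoint before wiring up real GEFS data, a dummy array of the expected shape works as a smoke test (the values are random and physically meaningless, so the model output will be too):

```python
import numpy as np

# (batch, ensemble, variables, lat, lon) per the expected format above
dummy = np.random.rand(1, 1, 39, 129, 301).astype(np.float32)
np.save("corrdiff_dummy_inputs.npy", dummy)
print(dummy.shape, dummy.dtype)
```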

Missing or Incorrect Variables#

Symptoms:

  • Model produces unexpected output

  • Variables appear in wrong order

Required Variables (in order):

  1. GEFS Select Variables (0.25°): u10m, v10m, t2m, r2m, sp, msl, tcwv

  2. GEFS Pressure Variables (interpolated to 0.25°): u1000, u925, u850, u700, u500, u250, v1000, v925, v850, v700, v500, v250, z1000, z925, z850, z700, z500, z200, t1000, t925, t850, t700, t500, t100, r1000, r925, r850, r700, r500, r100

  3. Lead Time Field: Integer field denoting forecast lead time (in 3-hour increments)

Solution:

Use the exact variable lists from the quickstart guide and ensure they’re concatenated in the correct order.

Check that the units for the z variables are geopotential height, not geopotential.
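
One way to avoid ordering mistakes is to keep a single channel list and index by name rather than by position. A sketch using the variable names listed above (the `idx` helper is illustrative, not part of the NIM API):

```python
select_vars = ["u10m", "v10m", "t2m", "r2m", "sp", "msl", "tcwv"]
pressure_vars = [
    "u1000", "u925", "u850", "u700", "u500", "u250",
    "v1000", "v925", "v850", "v700", "v500", "v250",
    "z1000", "z925", "z850", "z700", "z500", "z200",
    "t1000", "t925", "t850", "t700", "t500", "t100",
    "r1000", "r925", "r850", "r700", "r500", "r100",
]
channels = select_vars + pressure_vars + ["lead_time"]
idx = {name: i for i, name in enumerate(channels)}

# e.g. slice the 2 m temperature plane out of the input:
# t2m_plane = input_array[0, 0, idx["t2m"]]
```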

API and Inference Issues#

Request Timeout#

Symptoms:

  • requests.exceptions.ReadTimeout

  • TimeoutError: Request timed out

Solution:

Increase the timeout parameter in your request. Inference time varies by GPU and diffusion steps.

# Python example with appropriate timeout
import requests

# For 8 steps: use timeout >= 60s
# For 18 steps: use timeout >= 120s
# For multiple samples: multiply by number of samples

r = requests.post(
    url, 
    headers=headers, 
    data={"samples": 2, "steps": 18}, 
    files=files, 
    timeout=300  # 5 minutes for safety
)
# Bash example (curl has no default timeout)
curl -X POST \
    --max-time 300 \
    -F "input_array=@corrdiff_inputs.npy" \
    -F "samples=2" \
    -F "steps=18" \
    -o output.tar \
    http://localhost:8000/v1/infer

400 Bad Request#

Symptoms:

  • HTTP 400 response

  • Error message in response body

Common Causes:

  1. Invalid Input Parameters

    Check parameter ranges:

    • samples: 1-64

    • steps: 1-64

    • seed: 0-2147483647

    • sampler: “euler” (only supported value)

  2. Malformed Input Array

    # Verify the array can be read
    import numpy as np
    try:
        arr = np.load("corrdiff_inputs.npy")
        print(f"Loaded successfully: {arr.shape}")
    except Exception as e:
        print(f"Error loading array: {e}")
    
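
    Catching out-of-range parameters client-side saves a round trip. A minimal check mirroring the ranges above (the helper name is ours, not part of the API):

    ```python
    def validate_params(samples, steps, seed=0, sampler="euler"):
        """Raise ValueError if a request parameter is outside the documented range."""
        if not 1 <= samples <= 64:
            raise ValueError("samples must be in [1, 64]")
        if not 1 <= steps <= 64:
            raise ValueError("steps must be in [1, 64]")
        if not 0 <= seed <= 2147483647:
            raise ValueError("seed must be in [0, 2147483647]")
        if sampler != "euler":
            raise ValueError("'euler' is the only supported sampler")

    validate_params(samples=2, steps=18)  # OK
    ```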

Empty or Corrupted Output#

Symptoms:

  • Output tar file is empty

  • Cannot extract tar file

  • Output arrays are all zeros or NaN

Solution:

  1. Check Response Status

    r = requests.post(url, headers=headers, data=data, files=files, timeout=180)
    print(f"Status: {r.status_code}")
    if r.status_code != 200:
        print(f"Error: {r.content}")
    
  2. Verify Tar File

    # List contents
    tar -tvf output.tar
    
    # Should show files like:
    # -rw-r--r-- 0/0  60555392 1970-01-01 00:00 000_000.npy
    # -rw-r--r-- 0/0  60555392 1970-01-01 00:00 001_000.npy
    
  3. Check Output Array

    import numpy as np
    import tarfile
    
    with tarfile.open("output.tar") as tar:
        for member in tar.getmembers():
            arr_file = tar.extractfile(member)
            data = np.load(arr_file)
            print(f"{member.name}: shape={data.shape}, min={data.min()}, max={data.max()}")
    

Performance and Resource Issues#

GPU Out of Memory (OOM)#

Symptoms:

  • CUDA out of memory errors in container logs

  • Container crashes during inference

  • Health check fails after first few inferences

Solution:

Reduce the target batch size using the EARTH2NIM_TARGET_BATCHSIZE environment variable:

docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
    -p 8000:8000 \
    -e NGC_API_KEY \
    -e EARTH2NIM_TARGET_BATCHSIZE=4 \
    nvcr.io/nim/nvidia/corrdiff:1.1.0

Slow Inference Performance#

Symptoms:

  • Inference takes much longer than expected

  • First inference is very slow

Solutions:

  1. Model Warm-up Required

    The CorrDiff NIM uses dynamic compilation. The first 2-3 inferences will be slower as the model warms up.

    # Perform warm-up inferences
    for _ in range(3):
        r = requests.post(url, headers=headers, data=data, files=files, timeout=180)
    
    # Now measure actual performance
    
  2. Suboptimal Batch Size

    For GPUs with sufficient memory, ensure EARTH2NIM_TARGET_BATCHSIZE is not set too low:

    # For H100/A100 80GB, use default or explicit 8
    -e EARTH2NIM_TARGET_BATCHSIZE=8
    
  3. GPU Not Being Used

    # Monitor GPU utilization during inference
    nvidia-smi dmon -s u
    

    If GPU utilization is low, check that the container has proper GPU access.

  4. Network Latency

    If the client is remote from the NIM, network latency can be significant for large data transfers.

    # Use streaming for large sample counts
    # See quickstart guide for streaming example
    

Multiple Samples Take Too Long#

Symptoms:

  • Requesting many samples (>8) takes excessively long

  • Need results as they’re generated

Solution:

Use streaming responses to receive samples as they’re completed:

import io
import tarfile
import numpy as np
import requests

url = "http://localhost:8000/v1/infer"
files = {"input_array": ("input_array", open("corrdiff_inputs.npy", "rb"))}
data = {"samples": 16, "steps": 8, "seed": 0}
headers = {"accept": "application/x-tar"}

with requests.post(url, headers=headers, files=files, data=data,
                   timeout=300, stream=True) as resp:
    resp.raise_for_status()
    with tarfile.open(fileobj=resp.raw, mode="r|") as tar_stream:
        for member in tar_stream:
            # Avoid reusing the name `data`, which holds the request parameters
            arr = np.load(io.BytesIO(tar_stream.extractfile(member).read()))
            print(f"Received {member.name}: {arr.shape}")
            # Process arr immediately, before the next sample arrives

Earth2Studio Integration Issues#

ImportError or Module Not Found#

Symptoms:

  • ModuleNotFoundError: No module named 'earth2studio'

  • Import errors for Earth2Studio components

Solution:

# Ensure Earth2Studio is installed (>= 0.9.0)
pip install --upgrade earth2studio

# Verify installation
python -c "import earth2studio; print(earth2studio.__version__)"

Data Source Connection Errors#

Symptoms:

  • Connection timeout when fetching GEFS data

  • HTTP errors from NOAA servers

Solutions:

  1. Network Connectivity

    Ensure your system can access external data sources:

    # Test connection to NOAA servers
    curl -I https://www.ncei.noaa.gov/
    
  2. Data Availability

    GEFS data may not be immediately available for very recent timestamps. Use data from at least 6-12 hours ago:

    from datetime import datetime, timedelta
    
    # Use data from 12 hours ago
    time = datetime.utcnow() - timedelta(hours=12)
    time = time.replace(hour=(time.hour // 6) * 6, minute=0, second=0)
    
  3. Rate Limiting

    If fetching multiple time steps, add delays between requests:

    from time import sleep  # `import time` would shadow the `time` datetime above

    for lead_time in lead_times:
        input_array = fetch_input_gefs(time, lead_time)
        sleep(1)  # Avoid rate limiting
    

Incorrect Data Fetching#

Symptoms:

  • Array shape is incorrect after fetching

  • Variables missing or in wrong order

Solution:

Use the exact fetching script from the quickstart guide. Common mistakes:

  1. Wrong GEFS data source class

    # Correct - use both data sources
    from earth2studio.data import GEFS_FX, GEFS_FX_721x1440
    ds_gefs = GEFS_FX(cache=True)
    ds_gefs_select = GEFS_FX_721x1440(cache=True, member="gec00")
    
  2. Incorrect bounding box

    # Correct crop indices for CONUS
    select_data[:, 0, :, 148:277, 900:1201]  # Results in 129x301
    
  3. Missing lead time field

    # Must include lead time as integer field
    lead_hour = int(lead_time.total_seconds() // (3 * 60 * 60)) * np.ones((1, 1, 129, 301))
    
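
Putting the pieces together, the final array is a concatenation along the variable axis in the documented order. This sketch uses zero placeholders where the real GEFS fields would go (counts follow the 7 select + 31 pressure + 1 lead-time breakdown from the expected-format section):

```python
import numpy as np

select = np.zeros((7, 129, 301), dtype=np.float32)     # GEFS select placeholder
pressure = np.zeros((31, 129, 301), dtype=np.float32)  # GEFS pressure placeholder
lead = np.full((1, 129, 301), 4, dtype=np.float32)     # 12 h lead = 4 * 3 h

input_array = np.concatenate([select, pressure, lead])[None, None]
print(input_array.shape)  # (1, 1, 39, 129, 301)
```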

Docker and System Issues#

Docker Daemon Not Running#

Symptoms:

  • Cannot connect to the Docker daemon

  • Is the docker daemon running?

Solution:

# Start Docker daemon (Ubuntu/Debian)
sudo systemctl start docker

# Enable Docker to start on boot
sudo systemctl enable docker

# Check Docker status
sudo systemctl status docker

Insufficient Disk Space#

Symptoms:

  • Container fails to pull

  • Model download fails

  • “No space left on device” errors

Solution:

# Check available disk space (need ~64GB)
df -h

# Clean up Docker resources
docker system prune -a

# Remove unused images
docker image prune -a

# Remove unused volumes
docker volume prune

Docker Permission Denied#

Symptoms:

  • permission denied while trying to connect to the Docker daemon socket

Solution:

# Add user to docker group
sudo usermod -aG docker $USER

# Log out and log back in, then verify
docker ps

Air Gap / Offline Deployment Issues#

Model Not Found in Offline Mode#

Symptoms:

  • Container tries to download model even with cached files

  • “Unable to access NGC” errors in offline environment

Solution:

  1. Verify Cache Structure

    ls -la ~/.cache/nim/
    # Should contain model directories with checkpoint files
    
  2. Pre-populate Cache

    On a machine with NGC access:

    export LOCAL_NIM_CACHE=~/nim_offline_cache
    mkdir -p $LOCAL_NIM_CACHE
    
    # Run once with NGC access to download model
    docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
        -p 8000:8000 \
        -e NGC_API_KEY \
        -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
        -u $(id -u) \
        nvcr.io/nim/nvidia/corrdiff:1.1.0
    
    # Copy $LOCAL_NIM_CACHE to offline system
    
  3. Launch Without NGC_API_KEY

    # On offline system with pre-populated cache
    docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
        -p 8000:8000 \
        -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
        -u $(id -u) \
        nvcr.io/nim/nvidia/corrdiff:1.1.0
    

Getting Additional Help#

If you continue to experience issues after trying these solutions:

  1. Check Container Logs

    docker logs <container_name_or_id>
    docker logs --tail 100 <container_name_or_id>  # Last 100 lines
    
  2. Verify Prerequisites

    Review the prerequisites page to ensure all requirements are met.

  3. Check NIM Configuration

    See the configuration guide for advanced settings.

  4. Performance Tuning

    Consult the performance page for optimization guidance.

  5. Contact Support

    For enterprise support, visit NVIDIA AI Enterprise Support.

Tip

When reporting issues, include:

  • NIM container version

  • GPU model and driver version

  • Docker version and configuration

  • Complete error messages from container logs

  • Input array shape and dtype

  • Request parameters (samples, steps, etc.)