Troubleshooting Guide for NVIDIA Earth-2 Correction Diffusion NIM#

This guide provides solutions to common issues encountered when deploying and using the NVIDIA Earth-2 Correction Diffusion (CorrDiff) NIM.

Container and Deployment Issues#

Container Fails to Start#

Error Downloading Model#

Symptoms:

nimlib.exceptions.ManifestDownloadError: Error downloading manifest: The requested operation requires an API key, but none was found

Possible Causes and Solutions:

  1. Missing or Invalid NGC API Key

    # Verify your NGC_API_KEY is set
    echo $NGC_API_KEY
    
    # If empty, set it
    export NGC_API_KEY=<your_ngc_api_key>
    

Triton Model Failed to Launch#

Symptoms:

In the logs, you might see something like this:

INFO:__main__:W1009 18:10:13.950191 862160 pinned_memory_manager.cc:273] "Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected"
INFO:__main__:I1009 18:10:13.950235 862160 cuda_memory_manager.cc:117] "CUDA memory pool disabled"
INFO:__main__:E1009 18:10:13.968103 862160 server.cc:248] "CudaDriverHelper has not been initialized."
INFO:__main__:I1009 18:10:13.970863 862160 model_lifecycle.cc:473] "loading: corrdiff:1"
INFO:__main__:E1009 18:10:13.971563 862160 model_lifecycle.cc:654] "failed to load 'corrdiff' version 1: Invalid argument: instance group corrdiff_0 of model corrdiff has kind KIND_GPU but no GPUs are available"
INFO:__main__:I1009 18:10:13.971589 862160 model_lifecycle.cc:789] "failed to load 'corrdiff'"

Possible Causes and Solutions:

  1. NVIDIA Container Toolkit Not Properly Configured

    # Test GPU access with Docker
    docker run --rm --runtime=nvidia --gpus all ...
    

    If this fails, reinstall or reconfigure the NVIDIA Container Toolkit.

  2. Insufficient Docker Resources

    • Ensure Docker has access to adequate disk space (minimum 64GB free)

    • Ensure GPU has enough VRAM to support CorrDiff
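
    The disk-space check can be scripted. A minimal sketch (the path and the ~64 GiB threshold here mirror the requirement above):

    ```python
    import shutil

    def free_gib(path="/"):
        """Return free disk space at `path` in GiB."""
        return shutil.disk_usage(path).free / 2**30

    # The NIM needs roughly 64 GiB free for the image and model download.
    print(f"Free space: {free_gib():.1f} GiB")
    ```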

Container Starts But Health Check Fails#

Symptoms:

  • Container is running (docker ps shows it)

  • /v1/health/ready returns {"status":"not ready"} or 503 error

  • /v1/health/live returns {"status":"not live"}

Possible Causes and Solutions:

  1. Model Still Downloading

    The initial model download takes several minutes (~4GB model).

    # Check container logs for download progress
    docker logs <container_name_or_id>
    

    Wait for the message: Started HTTPService at 0.0.0.0:8090
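
    Rather than re-checking by hand, you can poll the readiness endpoint until the model finishes loading. This is a sketch using only the standard library; the base URL and timeout values are assumptions for a default local deployment:

    ```python
    import time
    import urllib.error
    import urllib.request

    def wait_until_ready(base_url="http://localhost:8000", timeout_s=600, interval_s=5):
        """Poll /v1/health/ready until it returns 200, or give up after timeout_s."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=5) as resp:
                    if resp.status == 200:
                        return True
            except (urllib.error.URLError, OSError):
                pass  # server not accepting connections yet
            time.sleep(interval_s)
        return False
    ```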

  2. Insufficient Shared Memory

    The CorrDiff NIM requires adequate shared memory for Triton’s Python backend.

    # Restart container with increased shared memory
    docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
        -p 8000:8000 \
        -e NGC_API_KEY \
        nvcr.io/nim/nvidia/corrdiff:1.1.0
    

    Important

    For systems with many CPU cores, you may need to increase --shm-size beyond the default 4g. Try 8g or larger if you encounter shared memory errors.

  3. GPU Not Accessible

    # Verify GPU visibility inside the container
    docker exec <container_id> nvidia-smi
    

    If GPUs are not visible, check your --gpus flag or CUDA_VISIBLE_DEVICES setting.

Port Already in Use#

Symptoms:

  • Error: Socket "0.0.0.0:8000" already in use

Solution:

# Check what's using port 8000
sudo lsof -i :8000

# Use a different port
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
    -p 8080:8000 \
    -e NGC_API_KEY \
    nvcr.io/nim/nvidia/corrdiff:1.1.0

# Update your inference URLs to use the new port
# http://localhost:8080/v1/infer

Data Format Issues#

Input Array Shape Mismatch#

Symptoms:

  • 500 Internal Server Error on /v1/infer

  • Error message about array dimensions

This is an example of a missing variable:

INFO:__main__:2025-10-09 11:26:50 | ERROR    | corrdiff:00 |  1.model:202 - The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0
"timestamp": "2025-10-09 11:26:50,444", "level": "ERROR", "message": "[StatusCode.INVALID_ARGUMENT] The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0"
ERROR:nimlib.nim_inference_api_builder.api:[StatusCode.INVALID_ARGUMENT] The size of tensor a (36) must match the size of tensor b (37) at non-singleton dimension 0

This is an example of a mismatch in the lat/lon domain:

"timestamp": "2025-10-09 11:29:44,818", "level": "ERROR", "message": "[StatusCode.INVALID_ARGUMENT] [request id: edb45a57-a53d-11f0-b140-d7c36a2dca65] unexpected shape for input 'INPUT' for model 'corrdiff'. Expected [-1,-1,-1,129,301], got [1,1,38,130,301]. "
ERROR:nimlib.nim_inference_api_builder.api:[StatusCode.INVALID_ARGUMENT] [request id: edb45a57-a53d-11f0-b140-d7c36a2dca65] unexpected shape for input 'INPUT' for model 'corrdiff'. Expected [-1,-1,-1,129,301], got [1,1,38,130,301]. 

Expected Format:

The CorrDiff US GEFS-HRRR model expects:

  • Shape: (1, 1, 39, 129, 301)

    • Batch size: 1 (required)

    • Ensemble: 1 (required)

    • Variables: 39 (7 GEFS select + 31 GEFS pressure + 1 lead time)

    • Latitude: 129 points

    • Longitude: 301 points

  • Data Type: float32 (recommended)

  • Bounding Box: [225°, 21°N] to [300°, 53°N] (CONUS)

Solution:

# Verify your input array shape
import numpy as np
input_array = np.load("corrdiff_inputs.npy")
print(f"Shape: {input_array.shape}")
print(f"Dtype: {input_array.dtype}")

# Expected output:
# Shape: (1, 1, 39, 129, 301)
# Dtype: float32

If the shape is incorrect, regenerate the input using the Earth2Studio script from the quickstart guide.
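If you only want to exercise the endpoint before wiring up real GEFS data, a dummy array of the expected shape works as a smoke test (the values are random and physically meaningless, so the model output will be too):

```python
import numpy as np

# (batch, ensemble, variables, lat, lon) per the expected format above
dummy = np.random.rand(1, 1, 39, 129, 301).astype(np.float32)
np.save("corrdiff_dummy_inputs.npy", dummy)
print(dummy.shape, dummy.dtype)
```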

Missing or Incorrect Variables#

Symptoms:

  • Model produces unexpected output

  • Variables appear in wrong order

Required Variables (in order):

  1. GEFS Select Variables (0.25°): u10m, v10m, t2m, r2m, sp, msl, tcwv

  2. GEFS Pressure Variables (interpolated to 0.25°): u1000, u925, u850, u700, u500, u250, v1000, v925, v850, v700, v500, v250, z1000, z925, z850, z700, z500, z200, t1000, t925, t850, t700, t500, t100, r1000, r925, r850, r700, r500, r100

  3. Lead Time Field: Integer field denoting forecast lead time (in 3-hour increments)

Solution:

Use the exact variable lists from the quickstart guide and ensure they’re concatenated in the correct order.

Check that the units for the z variables are geopotential height, not geopotential.
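
One way to avoid ordering mistakes is to keep a single channel list and index by name rather than by position. A sketch using the variable names listed above (the `idx` helper is illustrative, not part of the NIM API):

```python
select_vars = ["u10m", "v10m", "t2m", "r2m", "sp", "msl", "tcwv"]
pressure_vars = [
    "u1000", "u925", "u850", "u700", "u500", "u250",
    "v1000", "v925", "v850", "v700", "v500", "v250",
    "z1000", "z925", "z850", "z700", "z500", "z200",
    "t1000", "t925", "t850", "t700", "t500", "t100",
    "r1000", "r925", "r850", "r700", "r500", "r100",
]
channels = select_vars + pressure_vars + ["lead_time"]
idx = {name: i for i, name in enumerate(channels)}

# e.g. slice the 2 m temperature plane out of the input:
# t2m_plane = input_array[0, 0, idx["t2m"]]
```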

API and Inference Issues#

Request Timeout#

Symptoms:

  • requests.exceptions.ReadTimeout

  • TimeoutError: Request timed out

Solution:

Increase the timeout parameter in your request. Inference time varies by GPU and diffusion steps.

# Python example with appropriate timeout
import requests

# For 8 steps: use timeout >= 60s
# For 18 steps: use timeout >= 120s
# For multiple samples: multiply by number of samples

r = requests.post(
    url, 
    headers=headers, 
    data={"samples": 2, "steps": 18}, 
    files=files, 
    timeout=300  # 5 minutes for safety
)
# Bash example (curl has no default timeout)
curl -X POST \
    --max-time 300 \
    -F "input_array=@corrdiff_inputs.npy" \
    -F "samples=2" \
    -F "steps=18" \
    -o output.tar \
    http://localhost:8000/v1/infer

400 Bad Request#

Symptoms:

  • HTTP 400 response

  • Error message in response body

Common Causes:

  1. Invalid Input Parameters

    Check parameter ranges:

    • samples: 1-64

    • steps: 1-64

    • seed: 0-2147483647

    • sampler: “euler” (only supported value)

  2. Malformed Input Array

    # Verify the array can be read
    import numpy as np
    try:
        arr = np.load("corrdiff_inputs.npy")
        print(f"Loaded successfully: {arr.shape}")
    except Exception as e:
        print(f"Error loading array: {e}")
    
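
    Catching out-of-range parameters client-side saves a round trip. A minimal check mirroring the ranges above (the helper name is ours, not part of the API):

    ```python
    def validate_params(samples, steps, seed=0, sampler="euler"):
        """Raise ValueError if a request parameter is outside the documented range."""
        if not 1 <= samples <= 64:
            raise ValueError("samples must be in [1, 64]")
        if not 1 <= steps <= 64:
            raise ValueError("steps must be in [1, 64]")
        if not 0 <= seed <= 2147483647:
            raise ValueError("seed must be in [0, 2147483647]")
        if sampler != "euler":
            raise ValueError("'euler' is the only supported sampler")

    validate_params(samples=2, steps=18)  # OK
    ```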

Empty or Corrupted Output#

Symptoms:

  • Output tar file is empty

  • Cannot extract tar file

  • Output arrays are all zeros or NaN

Solution:

  1. Check Response Status

    r = requests.post(url, headers=headers, data=data, files=files, timeout=180)
    print(f"Status: {r.status_code}")
    if r.status_code != 200:
        print(f"Error: {r.content}")
    
  2. Verify Tar File

    # List contents
    tar -tvf output.tar
    
    # Should show files like:
    # -rw-r--r-- 0/0  60555392 1970-01-01 00:00 000_000.npy
    # -rw-r--r-- 0/0  60555392 1970-01-01 00:00 001_000.npy
    
  3. Check Output Array

    import numpy as np
    import tarfile
    
    with tarfile.open("output.tar") as tar:
        for member in tar.getmembers():
            arr_file = tar.extractfile(member)
            data = np.load(arr_file)
            print(f"{member.name}: shape={data.shape}, min={data.min()}, max={data.max()}")
    

Performance and Resource Issues#

GPU Out of Memory (OOM)#

Symptoms:

  • CUDA out of memory errors in container logs

  • Container crashes during inference

  • Health check fails after first few inferences

Solution:

Reduce the target batch size using the EARTH2NIM_TARGET_BATCHSIZE environment variable:

docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
    -p 8000:8000 \
    -e NGC_API_KEY \
    -e EARTH2NIM_TARGET_BATCHSIZE=4 \
    nvcr.io/nim/nvidia/corrdiff:1.1.0

Slow Inference Performance#

Symptoms:

  • Inference takes much longer than expected

  • First inference is very slow

Solutions:

  1. Model Warm-up Required

    The CorrDiff NIM uses dynamic compilation. The first 2-3 inferences will be slower as the model warms up.

    # Perform warm-up inferences
    for _ in range(3):
        r = requests.post(url, headers=headers, data=data, files=files, timeout=180)
    
    # Now measure actual performance
    
  2. Suboptimal Batch Size

    For GPUs with sufficient memory, ensure EARTH2NIM_TARGET_BATCHSIZE is not set too low:

    # For H100/A100 80GB, use default or explicit 8
    -e EARTH2NIM_TARGET_BATCHSIZE=8
    
  3. GPU Not Being Used

    # Monitor GPU utilization during inference
    nvidia-smi dmon -s u
    

    If GPU utilization is low, check that the container has proper GPU access.

  4. Network Latency

    If the client is remote from the NIM, network latency can be significant for large data transfers.

    # Use streaming for large sample counts
    # See quickstart guide for streaming example
    

Multiple Samples Take Too Long#

Symptoms:

  • Requesting many samples (>8) takes excessively long

  • Need results as they’re generated

Solution:

Use streaming responses to receive samples as they’re completed:

import io
import tarfile
import numpy as np
import requests

url = "http://localhost:8000/v1/infer"
files = {"input_array": ("input_array", open("corrdiff_inputs.npy", "rb"))}
data = {"samples": 16, "steps": 8, "seed": 0}
headers = {"accept": "application/x-tar"}

with requests.post(url, headers=headers, files=files, data=data,
                   timeout=300, stream=True) as resp:
    resp.raise_for_status()
    with tarfile.open(fileobj=resp.raw, mode="r|") as tar_stream:
        for member in tar_stream:
            # Avoid reusing the name `data`, which holds the request parameters
            arr = np.load(io.BytesIO(tar_stream.extractfile(member).read()))
            print(f"Received {member.name}: {arr.shape}")
            # Process arr immediately, before the next sample arrives

Earth2Studio Integration Issues#

ImportError or Module Not Found#

Symptoms:

  • ModuleNotFoundError: No module named 'earth2studio'

  • Import errors for Earth2Studio components

Solution:

# Ensure Earth2Studio is installed (>= 0.9.0)
pip install --upgrade earth2studio

# Verify installation
python -c "import earth2studio; print(earth2studio.__version__)"

Data Source Connection Errors#

Symptoms:

  • Connection timeout when fetching GEFS data

  • HTTP errors from NOAA servers

Solutions:

  1. Network Connectivity

    Ensure your system can access external data sources:

    # Test connection to NOAA servers
    curl -I https://www.ncei.noaa.gov/
    
  2. Data Availability

    GEFS data may not be immediately available for very recent timestamps. Use data from at least 6-12 hours ago:

    from datetime import datetime, timedelta
    
    # Use data from 12 hours ago
    time = datetime.utcnow() - timedelta(hours=12)
    time = time.replace(hour=(time.hour // 6) * 6, minute=0, second=0)
    
  3. Rate Limiting

    If fetching multiple time steps, add delays between requests:

    from time import sleep  # `import time` would shadow the `time` datetime above

    for lead_time in lead_times:
        input_array = fetch_input_gefs(time, lead_time)
        sleep(1)  # Avoid rate limiting
    

Incorrect Data Fetching#

Symptoms:

  • Array shape is incorrect after fetching

  • Variables missing or in wrong order

Solution:

Use the exact fetching script from the quickstart guide. Common mistakes:

  1. Wrong GEFS data source class

    # Correct - use both data sources
    from earth2studio.data import GEFS_FX, GEFS_FX_721x1440
    ds_gefs = GEFS_FX(cache=True)
    ds_gefs_select = GEFS_FX_721x1440(cache=True, member="gec00")
    
  2. Incorrect bounding box

    # Correct crop indices for CONUS
    select_data[:, 0, :, 148:277, 900:1201]  # Results in 129x301
    
  3. Missing lead time field

    # Must include lead time as integer field
    lead_hour = int(lead_time.total_seconds() // (3 * 60 * 60)) * np.ones((1, 1, 129, 301))
    
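
Putting the pieces together, the final array is a concatenation along the variable axis in the documented order. This sketch uses zero placeholders where the real GEFS fields would go (counts follow the 7 select + 31 pressure + 1 lead-time breakdown from the expected-format section):

```python
import numpy as np

select = np.zeros((7, 129, 301), dtype=np.float32)     # GEFS select placeholder
pressure = np.zeros((31, 129, 301), dtype=np.float32)  # GEFS pressure placeholder
lead = np.full((1, 129, 301), 4, dtype=np.float32)     # 12 h lead = 4 * 3 h

input_array = np.concatenate([select, pressure, lead])[None, None]
print(input_array.shape)  # (1, 1, 39, 129, 301)
```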

Docker and System Issues#

Docker Daemon Not Running#

Symptoms:

  • Cannot connect to the Docker daemon

  • Is the docker daemon running?

Solution:

# Start Docker daemon (Ubuntu/Debian)
sudo systemctl start docker

# Enable Docker to start on boot
sudo systemctl enable docker

# Check Docker status
sudo systemctl status docker

Insufficient Disk Space#

Symptoms:

  • Container fails to pull

  • Model download fails

  • “No space left on device” errors

Solution:

# Check available disk space (need ~64GB)
df -h

# Clean up Docker resources
docker system prune -a

# Remove unused images
docker image prune -a

# Remove unused volumes
docker volume prune

Docker Permission Denied#

Symptoms:

  • permission denied while trying to connect to the Docker daemon socket

Solution:

# Add user to docker group
sudo usermod -aG docker $USER

# Log out and log back in, then verify
docker ps

Air Gap / Offline Deployment Issues#

Model Not Found in Offline Mode#

Symptoms:

  • Container tries to download model even with cached files

  • “Unable to access NGC” errors in offline environment

Solution:

  1. Verify Cache Structure

    ls -la ~/.cache/nim/
    # Should contain model directories with checkpoint files
    
  2. Pre-populate Cache

    On a machine with NGC access:

    export LOCAL_NIM_CACHE=~/nim_offline_cache
    mkdir -p $LOCAL_NIM_CACHE
    
    # Run once with NGC access to download model
    docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
        -p 8000:8000 \
        -e NGC_API_KEY \
        -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
        -u $(id -u) \
        nvcr.io/nim/nvidia/corrdiff:1.1.0
    
    # Copy $LOCAL_NIM_CACHE to offline system
    
  3. Launch Without NGC_API_KEY

    # On offline system with pre-populated cache
    docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
        -p 8000:8000 \
        -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
        -u $(id -u) \
        nvcr.io/nim/nvidia/corrdiff:1.1.0
    

Getting Additional Help#

If you continue to experience issues after trying these solutions:

  1. Check Container Logs

    docker logs <container_name_or_id>
    docker logs --tail 100 <container_name_or_id>  # Last 100 lines
    
  2. Verify Prerequisites

    Review the prerequisites page to ensure all requirements are met.

  3. Check NIM Configuration

    See the configuration guide for advanced settings.

  4. Performance Tuning

    Consult the performance page for optimization guidance.

  5. Contact Support

    For enterprise support, visit NVIDIA AI Enterprise Support.

Tip

When reporting issues, include:

  • NIM container version

  • GPU model and driver version

  • Docker version and configuration

  • Complete error messages from container logs

  • Input array shape and dtype

  • Request parameters (samples, steps, etc.)