Troubleshooting GPU Memory Out-of-Memory Errors#

Out-of-memory (OOM) errors are one of the most common startup failures when deploying LLM NIM containers. These errors occur when the model’s memory requirements exceed the available GPU VRAM, typically because of a mismatch between the selected model profile and the hardware, or because of misconfigured system parameters.

This guide explains how GPU memory is used during startup, how to identify which phase caused the OOM error, and how to resolve it.

How GPU Memory Is Used#

During startup, NIM and the vLLM backend allocate GPU memory in stages. The largest single consumer is the model weights. You can estimate weight memory from the model’s parameter count (typically listed on the model card, or derivable from the metadata.total_size field in the model.safetensors.index.json file, which reports the total weight size in bytes) and the precision of the profile.

A good heuristic formula for per-GPU weight memory is:

weight_memory_per_gpu = total_parameters x bytes_per_parameter / TP

Where bytes_per_parameter depends on precision:

| Precision | Bytes Per Parameter |
|-----------|---------------------|
| BF16      | 2                   |
| FP16      | 2                   |
| FP8       | 1                   |
| INT4      | 0.5                 |
| NVFP4     | 0.5                 |

Examples

| Model | Precision | Tensor Parallelism (TP) | Calculation | Memory per GPU | Notes |
|-------|-----------|-------------------------|-------------|----------------|-------|
| Llama 3.1 8B | BF16 | 1 | 8 billion x 2 bytes | 16 GB | Fits on a single 24 GB GPU (for example, A10G or RTX 4090) with room for KV cache and overhead |
| Llama 3.3 70B | BF16 | 4 | 70 billion x 2 bytes / 4 GPUs | 35 GB | Fits on four A100-40GB or four A100-80GB GPUs, with varying amounts of room remaining for KV cache |
| Llama 3.3 70B | FP8 | 2 | 70 billion x 1 byte / 2 GPUs | 35 GB | Using FP8 halves the weight memory, allowing the same model to run on two GPUs instead of four |
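The calculations in the table above follow directly from the heuristic formula. A minimal sketch that reproduces them (the helper name and the 1 GB = 10^9 bytes convention are illustrative, not part of NIM):

```python
# Estimate per-GPU weight memory: total_parameters x bytes_per_parameter / TP.
# Helper name and the decimal-GB convention (1 GB = 1e9 bytes) are illustrative.
BYTES_PER_PARAM = {"BF16": 2, "FP16": 2, "FP8": 1, "INT4": 0.5, "NVFP4": 0.5}

def weight_memory_gb(total_parameters, precision, tp):
    """Per-GPU weight memory in GB for a given precision and TP degree."""
    return total_parameters * BYTES_PER_PARAM[precision] / tp / 1e9

print(weight_memory_gb(8e9, "BF16", 1))   # Llama 3.1 8B,  BF16, TP=1 -> 16.0
print(weight_memory_gb(70e9, "BF16", 4))  # Llama 3.3 70B, BF16, TP=4 -> 35.0
print(weight_memory_gb(70e9, "FP8", 2))   # Llama 3.3 70B, FP8,  TP=2 -> 35.0
```

Compare the result against the per-GPU VRAM of your target hardware, remembering that weights are only part of the budget.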

Beyond weights, GPU memory is also needed for KV cache, activations, communication buffers, and CUDA graphs. The --gpu-memory-utilization parameter (with a default value of 0.9) controls what fraction of total GPU memory is budgeted for model operations. The remainder is left unreserved for post-load operations such as CUDA graph compilation.

GPU Total Memory
├── Requested Budget = total x gpu_memory_utilization (default 0.9)
│   ├── Model Weights           (loaded from checkpoint files)
│   ├── Non-Torch Overhead      (NCCL communication buffers, CUDA context)
│   ├── Peak Activations        (intermediate computation tensors)
│   └── KV Cache                (fills all remaining budget)
│
└── Unreserved = total x (1 - gpu_memory_utilization)
    ├── CUDA Graphs             (compiled operation sequences, 1-5 GB)
    ├── Sampler Warmup          (logits pre-allocation)
    └── Fragmentation / rounding

The KV cache is allocated greedily – it expands to fill all remaining space within the budget after weights, activations, and overhead are accounted for. Everything that happens after KV cache allocation (CUDA graph capture, sampler warmup) must fit in the unreserved portion.
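The budget split described above can be computed directly. A hedged sketch, assuming an 80 GB GPU and illustrative figures for weights, overhead, and activations (the numbers are examples, not NIM measurements):

```python
# Split total GPU memory into the requested budget and the unreserved
# remainder, then derive the greedy KV cache allocation. Values in GB;
# the weights/overhead/activations figures are illustrative.
total_gb = 80.0
gpu_memory_utilization = 0.9                  # NIM/vLLM default

budget_gb = total_gb * gpu_memory_utilization      # 72.0 GB for model operations
unreserved_gb = total_gb - budget_gb               # 8.0 GB for CUDA graphs, warmup

weights_gb = 35.0       # e.g. Llama 3.3 70B FP8 at TP=2 (see table above)
overhead_gb = 3.0       # NCCL buffers, CUDA context (illustrative)
activations_gb = 4.0    # peak intermediate tensors (illustrative)

# KV cache expands greedily into whatever remains of the budget.
kv_cache_gb = budget_gb - weights_gb - overhead_gb - activations_gb
print(kv_cache_gb, unreserved_gb)
```

If `kv_cache_gb` comes out negative, the configuration cannot start at all; if `unreserved_gb` is smaller than the CUDA graph footprint, startup fails after KV cache allocation instead.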

First Step - Determine Where the OOM Error Occurred#

OOM errors happen at different stages of startup, and the resolution depends on which stage failed. Check the container logs to identify the failing phase, then follow the corresponding section below. If the default log output does not provide enough detail, refer to Enable Detailed Logging to increase verbosity.

During Weight Loading#

Symptoms

The OOM error appears early in startup, during model loading, before any messages about KV cache or graph compilation. Typical log patterns include the following:

torch.OutOfMemoryError: CUDA out of memory.

Cause

The GPU does not have enough memory to hold the model weights at the selected precision and tensor parallelism (TP) degree. For example, a 70-billion-parameter model in BF16 requires approximately 140 GB of weight memory, which does not fit on a single 80 GB GPU.

Resolution

Run list-model-profiles to find profiles that distribute the model across more GPUs or use a lower-precision quantization:

docker run --rm --gpus=all \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE} \
  list-model-profiles

Look for profiles with higher tensor parallelism (TP) or pipeline parallelism (PP), or profiles that use FP8 or NVFP4 precision – preferably one with native hardware support on your GPU to avoid a performance penalty. Refer to the weight estimation formula in How GPU Memory Is Used for a quick check of whether the weights fit. For details on profile selection, refer to Model Profiles and Selection.

During KV Cache Allocation#

KV cache allocation failures fall into two categories: insufficient memory (the KV cache genuinely does not fit) and memory fragmentation (the KV cache fits in total but the allocator cannot find contiguous blocks). Check the error message to determine which applies.

Insufficient Memory#

Symptoms

The error occurs after weights are loaded, during memory profiling or KV cache block allocation. Look for log messages mentioning KV cache, determine_available_memory, or block allocation failures. NIM may also print an advisory warning similar to the following before the crash:

WARNING: Estimated VRAM (45.2 GB) exceeds available GPU memory (39.6 GB).
Consider reducing context length with --max-model-len=4096 (estimated 30.1 GB).

Cause

The model’s default context length (max_position_embeddings from config.json) requires more KV cache memory than fits in the GPU budget after weights, activations, and overhead are subtracted. This is common when running a model with a long native context (for example, 128K tokens) on a GPU with limited free memory.
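To see why a long native context overwhelms the budget, you can estimate KV cache size from the model architecture. A sketch using the standard formula of two tensors (K and V) per layer, with the published Llama 3.1 8B configuration (32 layers, 8 KV heads via GQA, head dimension 128) assumed here for illustration:

```python
# Estimate KV cache memory for one sequence at full context length.
# Per token: 2 tensors (K and V) x layers x kv_heads x head_dim x bytes.
# Config values below are Llama 3.1 8B's; a BF16 cache is 2 bytes/element.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
max_model_len = 131072                # the model's 128K native context

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
kv_cache_gib = bytes_per_token * max_model_len / 2**30
print(kv_cache_gib)                   # 16.0 GiB for one full-length sequence
```

At `--max-model-len 4096` the same sequence needs only 0.5 GiB, which is why reducing the context length is the primary remedy.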

Resolution

Reduce the context length with --max-model-len. Use the value suggested in the NIM warning, or choose a value appropriate for your workload:

docker run --gpus=all \
  -p 8000:8000 \
  -e NGC_API_KEY \
  ${NIM_LLM_IMAGE} \
  --max-model-len 4096

Note

Reducing --max-model-len limits the maximum sequence length (input + output tokens) per request. Choose a value that fits your use case.

Memory Fragmentation#

Symptoms

The error occurs during KV cache allocation and reports that the free memory is less than the requested allocation size. The error also mentions a large amount of memory “reserved by PyTorch but unallocated” and suggests setting expandable_segments:True, similar to the following:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.09 GiB.
GPU 0 has a total capacity of 31.36 GiB of which 1.02 GiB is free.
Of the allocated memory 23.18 GiB is allocated by PyTorch, and
6.48 GiB is reserved by PyTorch but unallocated. If reserved but
unallocated memory is large try setting
PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

Cause

The PyTorch CUDA memory allocator requires contiguous memory blocks for each tensor allocation. During model loading and torch.compile, many temporary allocations fragment the GPU address space into small, non-contiguous pieces. When KV cache allocation begins, the total free memory may be sufficient, but no single contiguous block is large enough for individual per-layer KV cache tensors (typically 1+ GiB each).

This is particularly common with hybrid architectures (such as NemotronH models that combine Mamba and attention layers) where torch.compile creates extensive temporary allocations.
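The failure mode can be illustrated with a toy model of a fragmented address space (the gap sizes below are made up):

```python
# Toy illustration of fragmentation: the allocator has enough free memory
# in total, but no single contiguous gap can satisfy the request.
free_gaps_gib = [0.5, 0.75, 0.25, 0.5]   # hypothetical free regions left after torch.compile
request_gib = 1.09                       # one per-layer KV cache tensor

total_free = sum(free_gaps_gib)          # 2.0 GiB free in total
fits = any(gap >= request_gib for gap in free_gaps_gib)
print(total_free, fits)                  # enough memory overall, yet the allocation fails
```

Expandable segments sidestep this by letting an allocation span non-contiguous physical pages behind a contiguous virtual range.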

Resolution

Set PYTORCH_ALLOC_CONF=expandable_segments:True to instruct PyTorch to use CUDA virtual memory APIs, which allow allocations to grow without requiring physically contiguous pages:

docker run --gpus=all \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  ${NIM_LLM_IMAGE}

This setting does not increase memory usage – it changes how the allocator manages address space, eliminating fragmentation as a failure mode. Expandable segments are not enabled by default in PyTorch due to unresolved CUDA IPC compatibility issues, but the feature is stable and recommended by PyTorch when fragmentation-related OOM errors occur.

During CUDA Graph Compilation or Warmup#

Symptoms

The OOM error appears after weights and KV cache are successfully allocated, during CUDA graph capture or sampler warmup. Typical log patterns include the following:

torch.OutOfMemoryError: CUDA out of memory.

preceded by messages such as:

Graph capturing finished in ...

or:

compile_or_warm_up_model

Cause

The unreserved memory (by default, 10% of total GPU memory) is not large enough to accommodate CUDA graph capture, which can consume 1 to 5 GB depending on GPU architecture and model size. Smaller GPUs are disproportionately affected because 10% of a 24 GB GPU is only 2.4 GB, while CUDA graphs on some architectures require more than that amount of memory.
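The disproportionate effect on smaller GPUs is easy to quantify (the GPU sizes below are examples):

```python
# Unreserved memory = total x (1 - gpu_memory_utilization).
# CUDA graph capture (roughly 1-5 GB depending on architecture and model)
# must fit here, so small GPUs run out first at the default 0.9.
for total_gb in (24, 40, 80):
    for util in (0.9, 0.85):
        unreserved = total_gb * (1 - util)
        print(f"{total_gb} GB GPU, utilization {util}: {unreserved:.1f} GB unreserved")
```

On a 24 GB GPU the default leaves only about 2.4 GB, at the bottom of the CUDA graph range, while an 80 GB GPU leaves 8 GB and is rarely affected.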

Resolution

Reduce --gpu-memory-utilization to leave more memory unreserved. For example, lowering the value from 0.9 to 0.85 frees an additional 5% of GPU memory for CUDA graphs:

docker run --gpus=all \
  -p 8000:8000 \
  -e NGC_API_KEY \
  ${NIM_LLM_IMAGE} \
  --gpu-memory-utilization 0.85

Alternatively, disable CUDA graphs entirely by setting NIM_DISABLE_CUDA_GRAPH=1 (or equivalently, passing --enforce-eager as a CLI argument). This eliminates the graph capture memory requirement at the cost of reduced inference throughput:

docker run --gpus=all \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_DISABLE_CUDA_GRAPH=1 \
  ${NIM_LLM_IMAGE}

Enable Detailed Logging#

NIM prints a GPU Memory Report at startup when NIM_LOG_LEVEL is set to INFO (the default) or DEBUG.

To enable detailed logging, set the NIM_LOG_LEVEL environment variable:

docker run --gpus=all \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_LOG_LEVEL=INFO \
  ${NIM_LLM_IMAGE}