Date: 2025-01-13

This is the latest version of NIM.

Note: Due to the large size of this model, it is only supported on a subset of GPUs and optimization targets. See the Support Matrix for more details.

Note: For the H100 TP8 FP8 Latency Profile, we have intermittently observed higher TTFT values at low concurrency values.

Added support for vLLM fallback profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct

Known Issues

LoRA is not supported for Llama 3.1 405B Instruct

vLLM profiles are not supported for Llama 3.1 405B Instruct

Throughput-optimized profiles are not supported on A100 FP16 and H100 FP16 for Llama 3.1 405B Instruct

Cache deployment fails on an air-gapped system or read-only volume with the multi-GPU vLLM profile

Users who deploy a cache to an air-gapped system or read-only volume and intend to use the multi-GPU vLLM profile must first create the following JSON file on the system used to download and generate the cache:

echo '{ "0->0": false, "0->1": true, "1->0": true, "1->1": false }' > $NIM_CACHE_PATH/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
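The same file can also be generated from a script, which avoids quoting mistakes when the step is automated. The following is a minimal sketch, not an official NIM tool: the helper names `p2p_access_map` and `write_p2p_cache` are hypothetical, and the code assumes the file name and contents follow the two-GPU pattern shown in the command above (a GPU paired with itself is `false`, a pair of distinct GPUs is `true`).

```python
import json
from pathlib import Path


def p2p_access_map(gpu_ids):
    """Build the "i->j" map: False for a GPU paired with itself, True otherwise.

    Assumption: every pair of distinct GPUs has peer-to-peer access, which
    mirrors the two-GPU file from the release note.
    """
    return {f"{i}->{j}": i != j for i in gpu_ids for j in gpu_ids}


def write_p2p_cache(cache_root, gpu_ids=(0, 1)):
    """Write the P2P access cache file under <cache_root>/vllm/vllm/.

    Assumption: the file name is "gpu_p2p_access_cache_for_" followed by the
    comma-separated GPU ids, as in the documented two-GPU example.
    """
    name = "gpu_p2p_access_cache_for_" + ",".join(str(g) for g in gpu_ids) + ".json"
    target = Path(cache_root) / "vllm" / "vllm" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(p2p_access_map(gpu_ids), indent=4))
    return target
```

Run `write_p2p_cache(os.environ["NIM_CACHE_PATH"])` (after `import os`) on the system that downloaded the cache, before the cache is copied to the air-gapped system or mounted read-only.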

CUDA out of memory issue for Llama 2 70B v1.0.3

The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other types of GPUs might encounter a “CUDA out of memory” issue.

Llama 3.1 FP8 requires NVIDIA driver version >= 550
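The driver requirement can be checked before an FP8 profile is selected. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the check compares only the major component of the version string against 550, and the installed version is read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which must be available on the host.

```python
import subprocess

# Minimum driver major version stated in the release note for Llama 3.1 FP8.
MIN_DRIVER_MAJOR = 550


def driver_major(version):
    """Return the major component of a driver version string such as "550.54.15"."""
    return int(version.strip().split(".")[0])


def meets_fp8_requirement(version):
    """True when the driver major version is at least MIN_DRIVER_MAJOR."""
    return driver_major(version) >= MIN_DRIVER_MAJOR


def installed_driver_versions():
    """Query nvidia-smi for the driver version reported for each GPU.

    Raises if nvidia-smi is missing or exits with an error.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a GPU host, `all(meets_fp8_requirement(v) for v in installed_driver_versions())` indicates whether every reported driver satisfies the requirement.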