# Release Notes

## Release 1.0.3

### Summary
Fixes for model downloading and cache corruption.
### Language Models
- Llama 2 7B Chat
- Llama 2 13B Chat
- Llama 2 70B Chat
- Llama 3 8B Instruct
- Llama 3 70B Instruct
### Known Issues
- Llama 3 70B v1.0.3: LoRA isn’t supported on the 8x GPU configuration.
- Llama 2 70B vLLM FP16 TP2 profile restriction: NVIDIA has validated Llama 2 70B on various configurations of H100, A100, and L40S GPUs. Llama 2 70B runs on the TP4 (four GPU) and TP8 (eight GPU) profiles of H100, A100, and L40S; however, the TP2 (two GPU) profile of L40S does not have enough memory to run Llama 2 70B, and any attempt to run it on that platform can encounter a CUDA “out of memory” error.
- P-Tuning isn’t supported.
- Empty metrics values on multi-GPU TensorRT-LLM models: the metrics `gpu_cache_usage_perc`, `num_request_max`, `num_requests_running`, `num_requests_waiting`, and `prompt_tokens_total` won’t be reported for multi-GPU TensorRT-LLM models, because TensorRT-LLM currently doesn’t expose iteration statistics in orchestrator mode.
- `No tokenizer found` error when running PEFT: this warning can be safely ignored.
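The TP2 restriction above follows from simple weight-memory arithmetic. A back-of-the-envelope sketch (FP16 weights only; actual usage is higher because of KV cache, activations, and engine overhead, so this is a lower bound — the 48 GB figure is the L40S's advertised VRAM):

```python
# Rough check of why Llama 2 70B overflows a 2x L40S (TP2) configuration.
# Weight memory only; real memory use also includes KV cache, activations,
# and engine overhead, so passing this check does not guarantee a fit.

PARAMS = 70e9          # Llama 2 70B parameter count
BYTES_PER_PARAM = 2    # FP16
L40S_MEM_GB = 48       # VRAM per L40S GPU

def weights_per_gpu_gb(tp: int) -> float:
    """FP16 weight memory per GPU when sharded across `tp` GPUs."""
    return PARAMS * BYTES_PER_PARAM / tp / 1e9

for tp in (2, 4, 8):
    need = weights_per_gpu_gb(tp)
    verdict = "fits" if need < L40S_MEM_GB else "exceeds"
    print(f"TP{tp}: ~{need:.1f} GB/GPU for weights alone -> {verdict} 48 GB")
```

At TP2 the weights alone need roughly 70 GB per GPU, well past the 48 GB available, while TP4 and TP8 leave headroom for the KV cache.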
## Release 1.0.0

### Summary

This is the first general release of NIM.

### Language Models
- Llama 3 8B Instruct
- Llama 3 70B Instruct
- Mistral-7B-Instruct-v0.3
- Mixtral-8x7B-v0.1
- Mixtral-8x22B-v0.1
### Known Issues
- P-Tuning isn’t supported.
- Llama 3 70B v1.0.0: LoRA is not supported on the non-optimized 8x GPU configuration.
- Empty metrics values on multi-GPU TensorRT-LLM models: the metrics `gpu_cache_usage_perc`, `num_requests_running`, and `num_requests_waiting` won’t be reported for multi-GPU TensorRT-LLM models, because TensorRT-LLM currently doesn’t expose iteration statistics in orchestrator mode.
- `No tokenizer found` error when running PEFT: this warning can be safely ignored.
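For the empty-metrics issue above, one quick way to confirm which expected metric names are absent from a scrape is to diff the names that appear in the Prometheus-format text against the expected set. A minimal sketch; the sample payload below is illustrative, not real NIM output, and in practice the text would come from the server's metrics endpoint:

```python
# Report which expected metric names never appear in a Prometheus-format
# scrape. The sample text is a made-up stand-in for a real scrape.

EXPECTED = {
    "gpu_cache_usage_perc",
    "num_requests_running",
    "num_requests_waiting",
}

def missing_metrics(prometheus_text: str) -> set:
    """Return expected metric names absent from the scraped text."""
    seen = set()
    for line in prometheus_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Metric name ends at the label block "{" or at whitespace.
        seen.add(line.split("{")[0].split()[0])
    return EXPECTED - seen

sample = """\
# HELP request_success_total Total successful requests
request_success_total 42
num_requests_running 1
"""
print(sorted(missing_metrics(sample)))
```

An empty result set would indicate all expected metrics are being reported.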