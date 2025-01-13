Known Issues

Llama3 70B v1.0.3: LoRA isn’t supported on 8-GPU configurations.

Llama2 70B vLLM FP16 TP2 profile restriction

NVIDIA has validated Llama2 70B on various configurations of H100, A100, and L40S GPUs. Llama2 70B runs on the tp4 (four-GPU) and tp8 (eight-GPU) configurations of H100, A100, and L40S; however, the tp2 (two-GPU) configuration of L40S does not have enough memory to run Llama2 70B, and any attempt to run it on that platform can fail with a CUDA “out of memory” error.
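The memory shortfall can be seen with a rough estimate. The sketch below is illustrative only: it assumes roughly 70 billion parameters, 2 bytes per parameter for FP16, and 48 GB of memory per L40S (the publicly listed capacity), and it counts model weights alone. The helper names `weights_gb` and `fits` are hypothetical and not part of any NVIDIA tooling.

```python
# Back-of-the-envelope check: do the FP16 weights fit in aggregate GPU memory?
# All constants are assumptions for illustration, not measured values.
PARAMS = 70e9          # approximate parameter count of Llama2 70B
BYTES_PER_PARAM = 2    # FP16 stores each parameter in 2 bytes
L40S_MEMORY_GB = 48    # assumed per-GPU memory of an L40S


def weights_gb(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Return the approximate size of the model weights in GB."""
    return params * bytes_per_param / 1e9


def fits(num_gpus, gpu_memory_gb=L40S_MEMORY_GB):
    """Return True if the weights alone fit in the combined GPU memory."""
    return weights_gb() <= num_gpus * gpu_memory_gb


if __name__ == "__main__":
    for tp in (2, 4, 8):
        total = tp * L40S_MEMORY_GB
        print(f"tp{tp}: {total} GB available, "
              f"{weights_gb():.0f} GB of weights, fits={fits(tp)}")
```

Under these assumptions the weights need about 140 GB, while two L40S GPUs provide 96 GB in total. Real deployments also need memory for the KV cache and activations, so actual requirements are higher than this estimate.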

P-Tuning isn’t supported.

Empty metrics values on multi-GPU TensorRT-LLM models

The metrics gpu_cache_usage_perc, num_request_max, num_requests_running, num_requests_waiting, and prompt_tokens_total won’t be reported for multi-GPU TensorRT-LLM models, because TensorRT-LLM currently doesn’t expose iteration statistics in orchestrator mode.
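Monitoring code that expects these metrics may need to tolerate their absence. The following is a minimal sketch, assuming the metrics are scraped as Prometheus-style text (one `name{labels} value` sample per line, with `#` comment lines). The function names and the sample payload are hypothetical; only the five metric names come from the list above.

```python
# Metrics named in this known issue as unreported on multi-GPU TensorRT-LLM models.
AFFECTED_METRICS = {
    "gpu_cache_usage_perc",
    "num_request_max",
    "num_requests_running",
    "num_requests_waiting",
    "prompt_tokens_total",
}


def reported_metric_names(exposition_text):
    """Collect metric names from Prometheus-style exposition text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def absent_metrics(exposition_text, expected=AFFECTED_METRICS):
    """Return the expected metric names that the text does not report, sorted."""
    return sorted(set(expected) - reported_metric_names(exposition_text))


if __name__ == "__main__":
    # Made-up sample payload, for illustration only.
    sample = "# HELP example\nprompt_tokens_total 5\nother_metric{a='b'} 1\n"
    print(absent_metrics(sample))
```

A dashboard or alerting rule could use a check like this to treat missing values as "not available" rather than as zero.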