Running vLLM
To run a container, issue the appropriate command as explained in the Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User’s Guide and specify the registry, repository, and tags. For more information about using NGC, refer to the NGC Container User Guide.
Before you begin
If you have Docker 19.03 or later, a typical command to launch the container is:
docker run --gpus all -it --rm nvcr.io/nvidia/vllm:xx.xx-py3
If you have Docker 19.02 or earlier, a typical command to launch the container is:
nvidia-docker run -it --rm nvcr.io/nvidia/vllm:xx.xx-py3
where xx.xx is the container version; for example, 25.08.
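For example, to launch the 25.08 release of the container with Docker 19.03 or later:
docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.08-py3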
vLLM can be run by importing it as a Python module:
export VLLM_ATTENTION_BACKEND=FLASHINFER
python3 -c "
from vllm import LLM
from vllm.sampling_params import SamplingParams
llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v0.4', trust_remote_code=True, gpu_memory_utilization=0.70)
sampling_params = SamplingParams(max_tokens=50, temperature=0.0)
prompts = [
'<s> NVIDIA loves vLLM \U0001F49A',
'<s> NVIDIA loves',
]
outputs = llm.generate(prompts, sampling_params)
"
vLLM can be deployed in a client–server configuration. Start the HTTP inference server inside the container:
python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Llama-3.1-8B-Instruct-FP8 \
    --trust-remote-code \
    --tensor-parallel-size $num_gpus \
    --quantization fp8 \
    --gpu-memory-utilization 0.90
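The server listens on port 8000 by default. Before sending requests, you can confirm it is up by querying the model list endpoint of the OpenAI-compatible API (assuming the default host and port; the exact response fields may vary by release):
curl http://0.0.0.0:8000/v1/models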
From a client, issue a chat-completion request by POST-ing to the OpenAI-compatible /v1/chat/completions endpoint with a JSON body containing the model name, the messages, and any sampling parameters:
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "What is NVIDIA famous for?"}]
    }'
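Because the server exposes an OpenAI-compatible API, the same request can also be issued from Python. The following is a minimal sketch using the openai client package, which is an assumption here: it is not described in this guide and may need to be installed separately on the client. The API key can be any placeholder string unless the server was started with an API key.
# Minimal Python client sketch against the OpenAI-compatible server (assumes
# the openai package is installed and the server is reachable on port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is NVIDIA famous for?"}],
)
print(response.choices[0].message.content)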
See /workspace/README.md inside the container for information on getting started and customizing your vLLM image.
You might want to pull in data and model descriptions from locations outside the container for use by vLLM. To accomplish this, the easiest method is to mount one or more host directories as Docker bind mounts. For example:
docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/vllm:xx.xx-py3
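For example, a model checkpoint stored on the host can be mounted into the container and served directly. This is a sketch; the host path $HOME/models and the directory name my-model are placeholders for your own layout:
docker run --gpus all -it --rm -v $HOME/models:/models nvcr.io/nvidia/vllm:xx.xx-py3
# Inside the container, point vLLM at the mounted checkpoint directory:
python3 -m vllm.entrypoints.openai.api_server --model /models/my-model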