Running vLLM
To run a container, issue the appropriate command as explained in the Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User’s Guide and specify the registry, repository, and tags. For more information about using NGC, refer to the NGC Container User Guide.
Before you begin
If you have Docker 19.03 or later, a typical command to launch the container is:
docker run --gpus all -it --rm nvcr.io/nvidia/vllm:xx.xx-py3
If you have Docker 19.02 or earlier, a typical command to launch the container is:
nvidia-docker run -it --rm nvcr.io/nvidia/vllm:xx.xx-py3
where xx.xx is the container version; for example, 25.08.
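For example, to launch the 25.08 release of the container with Docker 19.03 or later:
docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.08-py3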
vLLM can be run by importing it as a Python module:
export VLLM_ATTENTION_BACKEND=FLASHINFER
python3 -c "
from vllm import LLM
from vllm.sampling_params import SamplingParams
llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v0.4', trust_remote_code=True, gpu_memory_utilization=0.70)
sampling_params = SamplingParams(max_tokens=50, temperature=0.0)
prompts = [
'<s> NVIDIA loves vLLM \U0001F49A',
'<s> NVIDIA loves',
]
outputs = llm.generate(prompts, sampling_params)
"
vLLM can be deployed in a client–server configuration. Start the HTTP inference server inside the container:
python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Llama-3.1-8B-Instruct-FP8 \
    --trust-remote-code \
    --tensor-parallel-size $num_gpus \
    --quantization fp8 \
    --gpu-memory-utilization 0.90
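The server listens on port 8000 by default. Before sending requests, you can confirm it is up by querying the model list endpoint of the OpenAI-compatible API (assuming the default host and port; the exact response fields may vary by release):
curl http://0.0.0.0:8000/v1/models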
From a client, issue a chat-completion request by POST-ing to the OpenAI-compatible /v1/chat/completions endpoint with a JSON body containing the model name, the messages, and any sampling parameters:
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "What is NVIDIA famous for?"}]
    }'
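Because the server exposes an OpenAI-compatible API, the same request can also be issued from Python. The following is a minimal sketch using the openai client package, which is an assumption here: it is not described in this guide and may need to be installed separately on the client. The API key can be any placeholder string unless the server was started with an API key.
# Minimal Python client sketch against the OpenAI-compatible server (assumes
# the openai package is installed and the server is reachable on port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is NVIDIA famous for?"}],
)
print(response.choices[0].message.content)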
See /workspace/README.md inside the container for information on getting started and customizing your vLLM image.
You might want to pull in data and model descriptions from locations outside the container for use by vLLM. To accomplish this, the easiest method is to mount one or more host directories as Docker bind mounts. For example:
docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/vllm:xx.xx-py3
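For example, a model checkpoint stored on the host can be mounted into the container and served directly. This is a sketch; the host path $HOME/models and the directory name my-model are placeholders for your own layout:
docker run --gpus all -it --rm -v $HOME/models:/models nvcr.io/nvidia/vllm:xx.xx-py3
# Inside the container, point vLLM at the mounted checkpoint directory:
python3 -m vllm.entrypoints.openai.api_server --model /models/my-model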