Skip to content

ESM-2 vLLM Inference

This recipe demonstrates running inference on ESM-2 TE checkpoints using vLLM (>= 0.14) as a pooling/embedding model.

The exported TE checkpoints on HuggingFace Hub are directly compatible with vLLM. No conversion scripts or weight renaming are needed:

from vllm import LLM

model = LLM(
    model="nvidia/esm2_t6_8M_UR50D",
    runner="pooling",
    trust_remote_code=True,
    enforce_eager=True,
    max_num_batched_tokens=1026,
)

prompts = ["MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLK"]
outputs = model.embed(prompts)
print(outputs[0].outputs.embedding[:5])

See tests/test_vllm.py for a full golden-value validation across vLLM, native HuggingFace, and the nvidia Hub reference model.

Installing vLLM in the container

There are two ways to get vLLM installed in the Docker image.

Option 1: Build-time installation via Dockerfile build arg

Pass --build-arg INSTALL_VLLM=true and --build-arg TORCH_CUDA_ARCH_LIST=<arch> when building the image. TORCH_CUDA_ARCH_LIST is required when INSTALL_VLLM=true (the Dockerfile will error if it is not set):

docker build -t esm2-vllm \
  --build-arg INSTALL_VLLM=true \
  --build-arg TORCH_CUDA_ARCH_LIST="9.0" .

Option 2: Post-build installation via install_vllm.sh

Build the base image normally, then run install_vllm.sh inside the container. The script auto-detects the GPU architecture, or you can pass an explicit arch argument:

docker build -t esm2 .
docker run --rm -it --gpus all esm2 bash -c "./install_vllm.sh"
# or with an explicit architecture:
docker run --rm -it --gpus all esm2 bash -c "./install_vllm.sh 9.0"