nemo_curator.utils.vllm_utils
Shared vLLM setup utilities.
These helpers centralise the boilerplate that every vLLM-based inference stage
needs: finding a free port, initialising a :class:vllm.LLM engine with
automatic port-collision retry, and resolving a HuggingFace model ID to a
local snapshot path.
They were extracted from the Nemotron-Parse inference stage, the first stage in NeMo Curator to be tested at scale (320x H100 GPUs). Future stages that use vLLM (video, text, audio) should import from here rather than duplicating this logic. See GitHub issue #1720 for the roadmap to wire these utilities into other modalities.
Module Contents
Functions
API
Create a :class:vllm.LLM instance with automatic port-collision retry.
vLLM selects a MASTER_PORT for the distributed backend at startup. On a
busy node the chosen port may already be in use, causing an
EADDRINUSE RuntimeError. This helper picks a fresh free port on
each attempt so that transient collisions are handled transparently.
Parameters
model_path:
Local path or HuggingFace model ID to load.
max_num_seqs:
Maximum number of sequences vLLM processes concurrently.
enforce_eager:
Disable CUDA graph capture (slower but uses less memory).
dtype:
Model weight dtype passed to vLLM (e.g. "bfloat16").
trust_remote_code:
Whether to trust remote code in the model repository.
limit_mm_per_prompt:
Multimodal token limits per prompt (e.g. {"image": 1}).
Defaults to {"image": 1} when None.
max_port_retries:
Number of port-pick attempts before re-raising the error.
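The retry behaviour described above can be sketched as follows. The names `build_llm_with_port_retry` and `_free_port` are hypothetical (the actual function names are not shown in this rendering), and the engine constructor is abstracted behind a `factory` callable so the sketch stays independent of the exact vllm.LLM keyword arguments:

```python
import os
import socket


def _free_port() -> int:
    # Ask the OS for an unused ephemeral port by binding to port 0.
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def build_llm_with_port_retry(factory, max_port_retries: int = 3):
    """Call ``factory`` (e.g. ``lambda: vllm.LLM(...)``) with a fresh
    MASTER_PORT, retrying transparently on port collisions."""
    last_err = None
    for _ in range(max_port_retries):
        # Pick a fresh free port before each attempt so a transient
        # collision on a busy node is not fatal.
        os.environ["MASTER_PORT"] = str(_free_port())
        try:
            return factory()
        except RuntimeError as err:
            if "EADDRINUSE" not in str(err):
                raise  # unrelated failure: surface it immediately
            last_err = err
    # All attempts collided: re-raise the last EADDRINUSE error.
    raise last_err
```

In practice the factory would construct the engine with the parameters listed above, e.g. `lambda: vllm.LLM(model=model_path, max_num_seqs=max_num_seqs, enforce_eager=enforce_eager, dtype=dtype, trust_remote_code=trust_remote_code, limit_mm_per_prompt=limit_mm_per_prompt or {"image": 1})`.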
Return a free TCP port on the local machine.
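A minimal implementation of this helper (the name `find_free_port` is assumed, not taken from the rendered signature) uses the standard trick of binding to port 0 so the kernel assigns an unused port:

```python
import socket


def find_free_port() -> int:
    # Binding to port 0 makes the OS choose an unused ephemeral port;
    # we read it back and let the socket close, freeing the port for
    # the caller to use immediately afterwards.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```

Note there is an inherent race: another process may grab the port between this call and its eventual use, which is exactly why the engine-creation helper above retries on collision.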
Resolve an HF model ID to a local snapshot path.
Uses local_files_only=True so that workers on compute nodes never
attempt to reach the internet. The model must be pre-downloaded (e.g.
via huggingface-cli download) before submitting the job.
Parameters
model_path:
HuggingFace model ID or local path. If the path is already a local directory it is returned unchanged.
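A sketch of this resolution logic, assuming huggingface_hub's `snapshot_download` (the real vLLM-facing helper's name is not shown here, so `resolve_model_path` is a placeholder):

```python
import os


def resolve_model_path(model_path: str) -> str:
    # Already a local directory: return it unchanged.
    if os.path.isdir(model_path):
        return model_path
    # Imported lazily so callers passing local paths never need
    # huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # local_files_only=True guarantees workers on compute nodes never
    # reach the internet; the snapshot must already be in the local
    # cache (e.g. via `huggingface-cli download <model_id>`), otherwise
    # this raises instead of downloading.
    return snapshot_download(model_path, local_files_only=True)
```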