Local vLLM


NeMo Gym can launch and manage the vLLM server for you using LocalVLLMModel (in responses_api_models/local_vllm_model). LocalVLLMModel is a subclass of VLLMModel that spawns the vLLM engine and auto-configures the model server to use it. The Chat Completions to Responses API conversion is inherited from VLLMModel. See VLLMModel for details.

A single LocalVLLMModel deployment can back multiple model servers, even when they need different request-time settings (for example, sampling parameters or reasoning on or off). See Local vLLM Proxy for this configuration.

To connect NeMo Gym to a vLLM server that you start and manage yourself, use VLLMModel directly. See VLLMModel for details.

Use LocalVLLMModel

Unlike VLLMModel, LocalVLLMModel does not require a separate vLLM install or a manual model download. vLLM is a transitive dependency of responses_api_models/local_vllm_model, and model weights are fetched from the Hugging Face Hub on first run (using HF_TOKEN from your environment if present). A single ng_run brings up both vLLM and NeMo Gym.

Several model configs ship with the server under responses_api_models/local_vllm_model/configs/. See the Qwen/, openai/, and nvidia/ subdirectories for ready-to-use examples. To launch a single-node 8-GPU run with Qwen3-30B-A3B-Instruct-2507:

config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\
responses_api_models/local_vllm_model/configs/Qwen/Qwen3-30B-A3B-Instruct-2507.yaml"
ng_run "+config_paths=[${config_paths}]"

Override the parallelism dimensions on the command line to match your node:

ng_run "+config_paths=[${config_paths}]" \
    ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
    ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2
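The override paths above follow one pattern: the server's top-level config key, then responses_api_models.local_vllm_model.vllm_serve_kwargs, then the key being set. As a minimal sketch, assuming you want to generate these arguments from a script, the helper below formats them. The name serve_kwarg_overrides is hypothetical and not part of NeMo Gym; it only builds strings.

```python
def serve_kwarg_overrides(server_name: str, **serve_kwargs) -> list[str]:
    """Format ng_run override arguments for vllm_serve_kwargs (hypothetical helper)."""
    prefix = f"++{server_name}.responses_api_models.local_vllm_model.vllm_serve_kwargs"
    # Values are rendered with str(), which is fine for the integer sizes used here.
    return [f"{prefix}.{key}={value}" for key, value in serve_kwargs.items()]


overrides = serve_kwarg_overrides(
    "Qwen3-30B-A3B-Instruct-2507",
    tensor_parallel_size=4,
    data_parallel_size=2,
)
for arg in overrides:
    print(arg)
```

The printed lines match the two override arguments in the command above and can be appended to an ng_run invocation.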

Once the servers are up, call the agent to verify everything works end-to-end:

python responses_api_agents/simple_agent/client.py

LocalVLLMModel configuration reference

LocalVLLMModel inherits all fields from VLLMModel (see VLLMModel configuration reference). It adds the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vllm_serve_kwargs | dict | | Required. Arguments passed through to vllm serve. See vllm_serve_kwargs below. |
| vllm_serve_env_vars | dict | | Required. Environment variables for the vLLM process. Must include VLLM_RAY_DP_PACK_STRATEGY. |
| hf_home | str | <cwd>/.cache/huggingface | Hugging Face cache directory. Set this if you have already downloaded weights elsewhere. |
| debug | bool | false | Print vLLM server logs to stderr. |
| show_vllm_engine_stats | bool | false | Periodically log vLLM engine throughput stats. |
| ray_worker_py_executable | str | sys.executable | Python interpreter Ray uses for worker processes. |

Two inherited fields are auto-populated by LocalVLLMModel after vLLM spawns and should not be set in your config:

  • base_url: assigned to the URL of the vLLM process once it binds a port. Defaults to [].
  • api_key: defaults to "dummy". vLLM does not authenticate local connections.

vllm_serve_kwargs

Required keys (asserted at startup):

  • data_parallel_size
  • tensor_parallel_size
  • pipeline_parallel_size

LocalVLLMModel injects the following keys automatically. Do not set them in your config:

  • distributed_executor_backend: ray
  • data_parallel_backend: ray
  • host: 0.0.0.0
  • port (chosen from the free-port pool)
  • download_dir (derived from hf_home)
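The two lists above can be turned into a pre-flight check that runs before launching. This is a minimal sketch, not LocalVLLMModel's own startup assertion: the function name check_serve_kwargs is hypothetical, and the key names are copied from the lists above.

```python
REQUIRED_KEYS = ("data_parallel_size", "tensor_parallel_size", "pipeline_parallel_size")
INJECTED_KEYS = (
    "distributed_executor_backend",
    "data_parallel_backend",
    "host",
    "port",
    "download_dir",
)


def check_serve_kwargs(vllm_serve_kwargs: dict) -> list[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in vllm_serve_kwargs:
            problems.append(f"missing required key: {key}")
    for key in INJECTED_KEYS:
        if key in vllm_serve_kwargs:
            problems.append(f"remove auto-injected key: {key}")
    return problems


# This mapping omits data_parallel_size and sets port by hand.
problems = check_serve_kwargs(
    {"tensor_parallel_size": 8, "pipeline_parallel_size": 1, "port": 8000}
)
for problem in problems:
    print(problem)
```

Returning a list rather than raising on the first problem lets a script report every issue in one pass.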

Commonly tuned keys (see the shipped configs for full examples):

vllm_serve_kwargs:
  data_parallel_size: 1
  tensor_parallel_size: 8
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.9
  trust_remote_code: true
  enable_auto_tool_choice: true
  tool_call_parser: hermes
  model_loader_extra_config:
    enable_multithread_load: true
    num_threads: 16

Any flag accepted by vllm serve can be set under vllm_serve_kwargs. See the official vLLM serve reference for the full list.
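To see how a key under vllm_serve_kwargs relates to a vllm serve flag, it can help to write the mapping out. The sketch below is illustrative only and is not how LocalVLLMModel is known to forward arguments: it assumes the usual CLI convention of hyphenated long options, bare flags for enabled booleans, and JSON strings for nested mappings. The name to_cli_args is hypothetical.

```python
import json


def to_cli_args(serve_kwargs: dict) -> list[str]:
    """Translate a kwargs mapping into vllm-serve-style arguments (illustrative only)."""
    args: list[str] = []
    for key, value in serve_kwargs.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # Emit the bare flag only when enabled; a false value is omitted,
            # which leaves the option at its default in this sketch.
            if value:
                args.append(flag)
        elif isinstance(value, dict):
            # Nested mappings are serialized as a JSON string.
            args.extend([flag, json.dumps(value)])
        else:
            args.extend([flag, str(value)])
    return args


cli_args = to_cli_args(
    {
        "tensor_parallel_size": 8,
        "trust_remote_code": True,
        "tool_call_parser": "hermes",
        "model_loader_extra_config": {"num_threads": 16},
    }
)
print(" ".join(cli_args))
```

The boolean check comes before the generic branch because bool is a subclass of int in Python, so True would otherwise be rendered as the string "True".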

vllm_serve_env_vars

Environment variables set in the vLLM process. VLLM_RAY_DP_PACK_STRATEGY is mandatory:

vllm_serve_env_vars:
  VLLM_RAY_DP_PACK_STRATEGY: strict

See Multi-node deployments for what strict and span mean.

Multi-node deployments

LocalVLLMModel uses Ray to place vLLM workers across nodes. The VLLM_RAY_DP_PACK_STRATEGY environment variable controls how worker groups are packed:

  • strict: each data-parallel replica must fit on a single node (tensor_parallel_size * pipeline_parallel_size ≤ GPUs per node). Use for single-node setups or when running multiple replicas, each constrained to one node.
  • span: a single model instance may span multiple nodes. Use when tensor_parallel_size * pipeline_parallel_size exceeds the GPU count of one node. When span is set, data_parallel_size_local is automatically unset.
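The rule in the two bullets above reduces to one comparison. As a minimal sketch, assuming every node has the same number of GPUs, the hypothetical helper below picks a value for VLLM_RAY_DP_PACK_STRATEGY from the parallelism sizes.

```python
def choose_pack_strategy(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    gpus_per_node: int,
) -> str:
    """Pick "strict" when one replica fits on a node, otherwise "span"."""
    gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
    return "strict" if gpus_per_replica <= gpus_per_node else "span"


# TP=8, PP=1 on 8-GPU nodes: one replica fits on a single node.
single_node = choose_pack_strategy(8, 1, gpus_per_node=8)
# TP=8, PP=2 needs 16 GPUs, more than one 8-GPU node provides.
multi_node = choose_pack_strategy(8, 2, gpus_per_node=8)
print(single_node, multi_node)
```

data_parallel_size does not appear in the comparison because the constraint applies to each replica, not to the deployment as a whole.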

Sample topologies

1 node, 1 instance (TP=8). The default for the shipped configs:

config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]"

1 node, 1 instance (DP=2, TP=4). Split one node into two data-parallel replicas:

config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]" \
    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2 \
    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4

1 node, 2 instances (TP=4 each). Chain two model configs into one run; each gets half the node:

config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]" \
    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
    ++gpt-oss-120b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4

2 nodes, 2 instances (TP=8 each). With strict packing, each replica stays on its own node:

config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]"
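The GPU arithmetic behind these topologies can be checked with a short script. This is a rough sketch under stated assumptions: nodes have 8 GPUs each, any dimension not named in a topology is 1, and the check only compares totals, so it does not model how Ray actually places workers. The helper names are hypothetical.

```python
def gpus_for_instance(
    data_parallel_size: int,
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
) -> int:
    """GPUs consumed by one instance: DP * TP * PP."""
    return data_parallel_size * tensor_parallel_size * pipeline_parallel_size


def fits(instances: list[tuple[int, int, int]], nodes: int, gpus_per_node: int = 8) -> bool:
    """True when the summed GPU demand does not exceed the cluster's GPU count."""
    total = sum(gpus_for_instance(*dims) for dims in instances)
    return total <= nodes * gpus_per_node


# Each tuple is (DP, TP, PP) for one instance.
print(fits([(1, 8, 1)], nodes=1))             # 1 node, 1 instance (TP=8)
print(fits([(2, 4, 1)], nodes=1))             # 1 node, 1 instance (DP=2, TP=4)
print(fits([(1, 4, 1), (1, 4, 1)], nodes=1))  # 1 node, 2 instances (TP=4 each)
print(fits([(1, 8, 1), (1, 8, 1)], nodes=1))  # 16 GPUs exceed a single 8-GPU node
print(fits([(1, 8, 1), (1, 8, 1)], nodes=2))  # 2 nodes, 2 instances (TP=8 each)
```

A passing total is necessary but not sufficient: with strict packing, each replica must also fit on a single node.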

Inherited features

The following capabilities work the same as in VLLMModel. See VLLMModel configuration reference for details.

  • chat_template_kwargs: override chat template behavior per model.
  • extra_body: pass vLLM-specific request parameters (for example, guided_json, reasoning.effort).
  • return_token_id_information: enable for training workflows that need prompt_token_ids, generation_token_ids, and generation_log_probs.

Multi-endpoint replicas (the base_url: list[str] pattern used with VLLMModel) do not apply to LocalVLLMModel: each LocalVLLMModel instance manages exactly one vLLM engine. To run multiple replicas, define each as a separate server instance in your config (chain configs in config_paths, or define multiple top-level keys in a single YAML).