Local vLLM | NeMo Gym

NeMo Gym can launch and manage the vLLM server for you using LocalVLLMModel (in responses_api_models/local_vllm_model). LocalVLLMModel is a subclass of VLLMModel that spawns the vLLM engine and auto-configures the model server to use it. The Chat Completions to Responses API conversion is inherited from VLLMModel; see VLLMModel for details.

A single LocalVLLMModel deployment can back multiple model servers, even when they need different request-time settings (for example, sampling parameters or reasoning on or off). See Local vLLM Proxy for this configuration.

If you want to connect NeMo Gym to a vLLM server you start and manage yourself, use VLLMModel directly. See VLLMModel for more details.

Use LocalVLLMModel

Unlike VLLMModel, LocalVLLMModel does not require a separate vLLM install or a manual model download. vLLM is a transitive dependency of responses_api_models/local_vllm_model, and model weights are fetched from the Hugging Face Hub on first run (using HF_TOKEN from your environment if present). A single ng_run brings up both vLLM and NeMo Gym.

Several model configs ship with the server under responses_api_models/local_vllm_model/configs/. See the Qwen/, openai/, and nvidia/ subdirectories for ready-to-use examples. To launch a single-node 8-GPU run with Qwen3-30B-A3B-Instruct-2507:

$ config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\
> responses_api_models/local_vllm_model/configs/Qwen/Qwen3-30B-A3B-Instruct-2507.yaml"
$ ng_run "+config_paths=[${config_paths}]"

Override the parallelism dimensions on the command line to match your node:

$ ng_run "+config_paths=[${config_paths}]" \
>     ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
>     ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2
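
If you script several launches, the override strings can be generated rather than typed by hand. The following is a minimal sketch that only reproduces the dotted path pattern used in the command above; `vllm_override` is a hypothetical helper, not part of NeMo Gym.

```python
def vllm_override(server_name: str, key: str, value) -> str:
    """Build one '++' override for a key under vllm_serve_kwargs.

    Hypothetical helper: it mirrors the dotted path shown in the
    command-line example and does not validate the key or value.
    """
    path = (
        f"{server_name}.responses_api_models.local_vllm_model"
        f".vllm_serve_kwargs.{key}"
    )
    return f"++{path}={value}"


overrides = [
    vllm_override("Qwen3-30B-A3B-Instruct-2507", "tensor_parallel_size", 4),
    vllm_override("Qwen3-30B-A3B-Instruct-2507", "data_parallel_size", 2),
]

# Print the overrides one per line, ready to paste after the ng_run command.
print(" \\\n    ".join(overrides))
```

The server name is the top-level key of the model config, so it changes with the config you load (for example, `gpt-oss-20b-reasoning-high` in the topologies later on this page).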

Once the servers are up, call the agent to verify everything works end-to-end:

$ python responses_api_agents/simple_agent/client.py

LocalVLLMModel configuration reference

LocalVLLMModel inherits all fields from VLLMModel (see VLLMModel configuration reference). It adds the following:

Parameter	Type	Default	Description
`vllm_serve_kwargs`	`dict`	—	Required. Arguments passed through to `vllm serve`. See `vllm_serve_kwargs` below.
`vllm_serve_env_vars`	`dict`	—	Required. Environment variables for the vLLM process. Must include `VLLM_RAY_DP_PACK_STRATEGY`.
`hf_home`	`str`	`<cwd>/.cache/huggingface`	Hugging Face cache directory. Set this if you have already downloaded weights elsewhere.
`debug`	`bool`	`false`	Print vLLM server logs to stderr.
`show_vllm_engine_stats`	`bool`	`false`	Periodically log vLLM engine throughput stats.
`ray_worker_py_executable`	`str`	`sys.executable`	Python interpreter Ray uses for worker processes.

Two inherited fields are auto-populated by LocalVLLMModel after vLLM spawns and should not be set in your config:

base_url: assigned to the URL of the vLLM process once it binds a port. Defaults to [].
api_key: defaults to "dummy". vLLM does not authenticate local connections.

`vllm_serve_kwargs`

Required keys (asserted at startup):

data_parallel_size
tensor_parallel_size
pipeline_parallel_size

LocalVLLMModel injects the following keys automatically. Do not set them in your config:

distributed_executor_backend: ray
data_parallel_backend: ray
host: 0.0.0.0
port (chosen from the free-port pool)
download_dir (derived from hf_home)
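
The two lists above can be checked before launch. This is a minimal sketch of such a pre-flight check, assuming the config has already been loaded into a plain dict; `check_vllm_serve_kwargs` is a hypothetical helper and is not the assertion LocalVLLMModel runs at startup.

```python
# Key names are taken from the two lists above.
REQUIRED_KEYS = {
    "data_parallel_size",
    "tensor_parallel_size",
    "pipeline_parallel_size",
}
INJECTED_KEYS = {
    "distributed_executor_backend",
    "data_parallel_backend",
    "host",
    "port",
    "download_dir",
}


def check_vllm_serve_kwargs(kwargs: dict) -> list[str]:
    """Return a list of problems; an empty list means no issue was found."""
    present = set(kwargs)
    problems = []
    for key in sorted(REQUIRED_KEYS - present):
        problems.append(f"missing required key: {key}")
    for key in sorted(INJECTED_KEYS & present):
        problems.append(f"remove auto-injected key: {key}")
    return problems


if __name__ == "__main__":
    candidate = {"tensor_parallel_size": 8, "port": 8000}
    for problem in check_vllm_serve_kwargs(candidate):
        print(problem)
```

The check only looks at key names. It does not verify that values are valid for `vllm serve`.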

Commonly tuned keys (see the shipped configs for full examples):

vllm_serve_kwargs:
  data_parallel_size: 1
  tensor_parallel_size: 8
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.9
  trust_remote_code: true
  enable_auto_tool_choice: true
  tool_call_parser: hermes
  model_loader_extra_config:
    enable_multithread_load: true
    num_threads: 16

Any flag accepted by vllm serve can be set under vllm_serve_kwargs. See the official vLLM serve reference for the full list.

`vllm_serve_env_vars`

Environment variables set in the vLLM process. VLLM_RAY_DP_PACK_STRATEGY is mandatory:

vllm_serve_env_vars:
  VLLM_RAY_DP_PACK_STRATEGY: strict

See Multi-node deployments for what strict and span mean.

Multi-node deployments

LocalVLLMModel uses Ray to place vLLM workers across nodes. The VLLM_RAY_DP_PACK_STRATEGY environment variable controls how worker groups are packed:

strict: each data-parallel replica must fit on a single node (tensor_parallel_size * pipeline_parallel_size ≤ GPUs per node). Use for single-node setups or when running multiple replicas, each constrained to one node.
span: a single model instance may span multiple nodes. Use when tensor_parallel_size * pipeline_parallel_size exceeds the GPU count of one node. When span is set, data_parallel_size_local is automatically unset.
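
The rule of thumb in the two definitions above reduces to a single comparison. The sketch below expresses it as a function; `pick_pack_strategy` and its `gpus_per_node` argument are hypothetical names introduced for illustration, and the function only encodes the guidance on this page, not vLLM's or Ray's placement logic.

```python
def pick_pack_strategy(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    gpus_per_node: int,
) -> str:
    """Suggest a VLLM_RAY_DP_PACK_STRATEGY value for one model instance.

    A replica needs tensor_parallel_size * pipeline_parallel_size GPUs.
    If that fits on one node, 'strict' applies; otherwise the replica
    has to cross node boundaries and 'span' is needed.
    """
    gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
    return "strict" if gpus_per_replica <= gpus_per_node else "span"


if __name__ == "__main__":
    # TP=8 on an 8-GPU node fits on one node.
    print(pick_pack_strategy(8, 1, gpus_per_node=8))
    # TP=8 with PP=2 needs 16 GPUs, more than one 8-GPU node provides.
    print(pick_pack_strategy(8, 2, gpus_per_node=8))
```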

Sample topologies

1 node, 1 instance (TP=8). The default for the shipped configs:

$ config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
$ ng_run "+config_paths=[${config_paths}]"

1 node, 1 instance (DP=2, TP=4). Split one node into two data-parallel replicas:

$ config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
$ ng_run "+config_paths=[${config_paths}]" \
>     ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2 \
>     ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4

1 node, 2 instances (TP=4 each). Chain two model configs into one run; each gets half the node:

$ config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
> responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
$ ng_run "+config_paths=[${config_paths}]" \
>     ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
>     ++gpt-oss-120b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4

2 nodes, 2 instances (TP=8 each). With strict packing, each replica stays on its own node:

$ config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
> responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
$ ng_run "+config_paths=[${config_paths}]"
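
A quick way to sanity-check a topology before launching is to add up the GPUs each instance needs. The sketch below assumes 8 GPUs per node (as in the single-node example earlier on this page) and treats any parallelism dimension not stated in a topology as 1; `gpus_needed` and `fits` are hypothetical helpers, not NeMo Gym functions.

```python
GPUS_PER_NODE = 8  # assumption: matches the 8-GPU nodes used in the examples


def gpus_needed(
    data_parallel_size: int = 1,
    tensor_parallel_size: int = 1,
    pipeline_parallel_size: int = 1,
) -> int:
    """GPUs one instance occupies: the product of its DP, TP, and PP sizes."""
    return data_parallel_size * tensor_parallel_size * pipeline_parallel_size


def fits(nodes: int, instances: list[dict]) -> tuple[int, bool]:
    """Return (total GPUs requested, whether the cluster has enough).

    This is a capacity check only; it does not model per-node packing.
    """
    total = sum(gpus_needed(**instance) for instance in instances)
    return total, total <= nodes * GPUS_PER_NODE


if __name__ == "__main__":
    topologies = {
        "1 node, 1 instance (TP=8)": (1, [{"tensor_parallel_size": 8}]),
        "1 node, 1 instance (DP=2, TP=4)": (
            1,
            [{"data_parallel_size": 2, "tensor_parallel_size": 4}],
        ),
        "1 node, 2 instances (TP=4 each)": (
            1,
            [{"tensor_parallel_size": 4}, {"tensor_parallel_size": 4}],
        ),
        "2 nodes, 2 instances (TP=8 each)": (
            2,
            [{"tensor_parallel_size": 8}, {"tensor_parallel_size": 8}],
        ),
    }
    for name, (nodes, instances) in topologies.items():
        total, ok = fits(nodes, instances)
        print(f"{name}: {total} GPUs, fits={ok}")
```

Under these assumptions, all four sample topologies use exactly the GPUs available, which is why the two-instance, single-node case needs the TP=4 overrides.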

Inherited features

The following capabilities work the same as in VLLMModel. See VLLMModel configuration reference for details.

chat_template_kwargs: override chat template behavior per model.
extra_body: pass vLLM-specific request parameters (for example, guided_json, reasoning.effort).
return_token_id_information: enable for training workflows that need prompt_token_ids, generation_token_ids, and generation_log_probs.

Multi-endpoint replicas (the base_url: list[str] pattern used with VLLMModel) do not apply to LocalVLLMModel: each LocalVLLMModel instance manages exactly one vLLM engine. To run multiple replicas, define each as a separate server instance in your config (chain configs in config_paths, or define multiple top-level keys in a single YAML).