Local vLLM
NeMo Gym can launch and manage the vLLM server for you using LocalVLLMModel (in responses_api_models/local_vllm_model).
LocalVLLMModel is a subclass of VLLMModel that spawns the vLLM engine and auto-configures the model server to use it.
The Chat Completions to Responses API conversion is inherited from VLLMModel. See VLLMModel for details.
A single LocalVLLMModel deployment can back multiple model servers, even when they need different request-time settings (for example, sampling parameters or reasoning on or off). See Local vLLM Proxy for this configuration.
If you want to connect NeMo Gym to a vLLM server you start and manage yourself, use VLLMModel directly. See VLLMModel for more details.
Use LocalVLLMModel
Unlike VLLMModel, LocalVLLMModel does not require a separate vLLM install or a manual model download. vLLM is a transitive dependency of responses_api_models/local_vllm_model, and model weights are fetched from the Hugging Face Hub on first run (using HF_TOKEN from your environment if present). A single ng_run brings up both vLLM and NeMo Gym.
Several model configs ship with the server under responses_api_models/local_vllm_model/configs/ — see the Qwen/, openai/, and nvidia/ subdirectories for ready-to-use examples. To launch a single-node 8-GPU run with Qwen3-30B-A3B-Instruct-2507:
Override the parallelism dimensions on the command line to match your node:
Once the servers are up, call the agent to verify everything works end-to-end:
LocalVLLMModel configuration reference
LocalVLLMModel inherits all fields from VLLMModel (see VLLMModel configuration reference). It adds the following:
Two inherited fields are auto-populated by LocalVLLMModel after vLLM spawns and should not be set in your config:
- base_url: assigned to the URL of the vLLM process once it binds a port. Defaults to [].
- api_key: defaults to "dummy". vLLM does not authenticate local connections.
vllm_serve_kwargs
Required keys (asserted at startup):
- data_parallel_size
- tensor_parallel_size
- pipeline_parallel_size
LocalVLLMModel injects the following keys automatically. Do not set them in your config:
- distributed_executor_backend: ray
- data_parallel_backend: ray
- host: 0.0.0.0
- port (chosen from the free-port pool)
- download_dir (derived from hf_home)
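The required and auto-injected key lists above can be checked before launch with a small script. This is a minimal sketch, not NeMo Gym's own startup validation: check_serve_kwargs is a hypothetical helper, and only the key names are taken from the lists above.

```python
# Hypothetical pre-launch check; not part of NeMo Gym.
# Key names are copied from the documentation above.
REQUIRED_KEYS = (
    "data_parallel_size",
    "tensor_parallel_size",
    "pipeline_parallel_size",
)
INJECTED_KEYS = (
    "distributed_executor_backend",
    "data_parallel_backend",
    "host",
    "port",
    "download_dir",
)


def check_serve_kwargs(vllm_serve_kwargs: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict looks fine."""
    problems = []
    # Required keys must be present.
    for key in REQUIRED_KEYS:
        if key not in vllm_serve_kwargs:
            problems.append(f"missing required key: {key}")
    # Auto-injected keys must not be set by the user.
    for key in INJECTED_KEYS:
        if key in vllm_serve_kwargs:
            problems.append(f"remove auto-injected key: {key}")
    return problems


if __name__ == "__main__":
    example = {"tensor_parallel_size": 8, "pipeline_parallel_size": 1, "port": 8000}
    for problem in check_serve_kwargs(example):
        print(problem)
```

Running a check like this on the parsed vllm_serve_kwargs mapping surfaces both kinds of mistake in one pass, before any GPUs are reserved.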
Commonly tuned keys (see the shipped configs for full examples):
Any flag accepted by vllm serve can be set under vllm_serve_kwargs. See the official vLLM serve reference for the full list.
vllm_serve_env_vars
Environment variables set in the vLLM process. VLLM_RAY_DP_PACK_STRATEGY is mandatory:
See Multi-node deployments for what strict and span mean.
Multi-node deployments
LocalVLLMModel uses Ray to place vLLM workers across nodes. The VLLM_RAY_DP_PACK_STRATEGY environment variable controls how worker groups are packed:
- strict: each data-parallel replica must fit on a single node (tensor_parallel_size * pipeline_parallel_size ≤ GPUs per node). Use for single-node setups or when running multiple replicas, each constrained to one node.
- span: a single model instance may span multiple nodes. Use when tensor_parallel_size * pipeline_parallel_size exceeds the GPU count of one node. When span is set, data_parallel_size_local is automatically unset.
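The choice between the two strategies follows directly from the size rule above. The sketch below encodes that rule only; pick_pack_strategy and pack_strategy_env are hypothetical helper names, and the GPU count per node is something you supply yourself.

```python
# Hypothetical helpers; they encode the rule stated above and nothing more.
def pick_pack_strategy(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    gpus_per_node: int,
) -> str:
    """Return "strict" if one replica fits on a node, otherwise "span"."""
    if min(tensor_parallel_size, pipeline_parallel_size, gpus_per_node) < 1:
        raise ValueError("all sizes must be positive integers")
    # One replica occupies TP * PP GPUs.
    gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
    return "strict" if gpus_per_replica <= gpus_per_node else "span"


def pack_strategy_env(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    gpus_per_node: int,
) -> dict[str, str]:
    """Build a mapping suitable for the vllm_serve_env_vars section."""
    strategy = pick_pack_strategy(
        tensor_parallel_size, pipeline_parallel_size, gpus_per_node
    )
    return {"VLLM_RAY_DP_PACK_STRATEGY": strategy}
```

For example, TP=8 with PP=1 on an 8-GPU node yields strict, while TP=8 with PP=2 on the same node needs 16 GPUs per replica and yields span.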
Sample topologies
1 node, 1 instance (TP=8). The default for the shipped configs:
1 node, 1 instance (DP=2, TP=4). Split one node into two data-parallel replicas:
1 node, 2 instances (TP=4 each). Chain two model configs into one run; each gets half the node:
2 nodes, 2 instances (TP=8 each). With strict packing, each replica stays on its own node:
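The GPU arithmetic behind these topologies can be sketched as follows. It assumes 8 GPUs per node and that one instance uses data_parallel_size * tensor_parallel_size * pipeline_parallel_size GPUs. Instance and fits are hypothetical names, and the check only counts GPUs; it does not model how Ray actually places workers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Parallelism sizes for one LocalVLLMModel instance (illustrative container)."""

    data_parallel_size: int = 1
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

    @property
    def gpus(self) -> int:
        # One replica uses TP * PP GPUs; DP multiplies the number of replicas.
        return (
            self.data_parallel_size
            * self.tensor_parallel_size
            * self.pipeline_parallel_size
        )


def fits(instances: list[Instance], nodes: int, gpus_per_node: int = 8) -> bool:
    """True if the combined GPU demand does not exceed the cluster's GPU count."""
    return sum(inst.gpus for inst in instances) <= nodes * gpus_per_node


if __name__ == "__main__":
    topologies = {
        "1 node, 1 instance (TP=8)": ([Instance(tensor_parallel_size=8)], 1),
        "1 node, 1 instance (DP=2, TP=4)": ([Instance(2, 4)], 1),
        "1 node, 2 instances (TP=4 each)": ([Instance(tensor_parallel_size=4)] * 2, 1),
        "2 nodes, 2 instances (TP=8 each)": ([Instance(tensor_parallel_size=8)] * 2, 2),
    }
    for name, (instances, nodes) in topologies.items():
        print(name, "fits" if fits(instances, nodes) else "does not fit")
```

A count like this is a quick sanity check before launch; a topology can pass it and still be rejected under strict packing if a single replica does not fit on one node.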
Inherited features
The following capabilities work the same as in VLLMModel. See VLLMModel configuration reference for details.
- chat_template_kwargs: override chat template behavior per model.
- extra_body: pass vLLM-specific request parameters (for example, guided_json, reasoning.effort).
- return_token_id_information: enable for training workflows that need prompt_token_ids, generation_token_ids, and generation_log_probs.
Multi-endpoint replicas (the base_url: list[str] pattern used with VLLMModel) do not apply to LocalVLLMModel: each LocalVLLMModel instance manages exactly one vLLM engine. To run multiple replicas, define each as a separate server instance in your config (chain configs in config_paths, or define multiple top-level keys in a single YAML).