# Local vLLM

> Gym-managed vLLM server deployment

NeMo Gym can launch and manage the vLLM server for you using LocalVLLMModel (in `responses_api_models/local_vllm_model`).
LocalVLLMModel is a subclass of VLLMModel that spawns the vLLM engine and auto-configures the model server to use it.
The Chat Completions to Responses API conversion is inherited from VLLMModel. See [VLLMModel](/model-server/vllm) for details.

A single LocalVLLMModel deployment can back multiple model servers, even when they need different request-time settings (for example, sampling parameters or reasoning on or off).
See [Local vLLM Proxy](/model-server/local-vllm-proxy) for this configuration.

To connect NeMo Gym to a vLLM server that you start and manage yourself, use VLLMModel directly. See [VLLMModel](/model-server/vllm) for details.

## Use LocalVLLMModel

Unlike VLLMModel, LocalVLLMModel does not require a separate vLLM install or a manual model download. vLLM is a transitive dependency of `responses_api_models/local_vllm_model`, and model weights are fetched from the Hugging Face Hub on first run (using `HF_TOKEN` from your environment if present). A single `ng_run` brings up both vLLM and NeMo Gym.

Several model configs ship with the server under `responses_api_models/local_vllm_model/configs/`; the `Qwen/`, `openai/`, and `nvidia/` subdirectories contain ready-to-use examples. To launch a single-node 8-GPU run with `Qwen3-30B-A3B-Instruct-2507`:

```bash
config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\
responses_api_models/local_vllm_model/configs/Qwen/Qwen3-30B-A3B-Instruct-2507.yaml"
ng_run "+config_paths=[${config_paths}]"
```

Override the parallelism dimensions on the command line to match your node:

```bash
ng_run "+config_paths=[${config_paths}]" \
    ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
    ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2
```
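The overrides above follow the pattern `++<config key>.responses_api_models.local_vllm_model.vllm_serve_kwargs.<key>=<value>`. For scripted launches, a small helper can generate these strings. The following is a minimal sketch: `build_overrides` is a hypothetical name used for illustration, not part of NeMo Gym, and it assumes the top-level config key matches the one used in the command above.

```python
def build_overrides(config_key: str, serve_kwargs: dict) -> list[str]:
    """Build `++` override strings for vllm_serve_kwargs.

    Hypothetical helper for illustration; not a NeMo Gym API.
    """
    prefix = (
        f"++{config_key}.responses_api_models.local_vllm_model.vllm_serve_kwargs"
    )
    return [f"{prefix}.{key}={value}" for key, value in serve_kwargs.items()]


overrides = build_overrides(
    "Qwen3-30B-A3B-Instruct-2507",
    {"tensor_parallel_size": 4, "data_parallel_size": 2},
)

# Print one override per line, ready to paste after the ng_run command.
print(" \\\n    ".join(overrides))
```

The generated strings can be appended to the `ng_run` invocation exactly as in the hand-written command.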

Once the servers are up, call the agent to verify everything works end-to-end:

```bash
python responses_api_agents/simple_agent/client.py
```

## LocalVLLMModel configuration reference

LocalVLLMModel inherits all fields from VLLMModel (see [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference)). It adds the following:

| Parameter                  | Type   | Default                    | Description                                                                                         |
| -------------------------- | ------ | -------------------------- | --------------------------------------------------------------------------------------------------- |
| `vllm_serve_kwargs`        | `dict` | —                          | **Required.** Arguments passed through to `vllm serve`. See `vllm_serve_kwargs` below.              |
| `vllm_serve_env_vars`      | `dict` | —                          | **Required.** Environment variables for the vLLM process. Must include `VLLM_RAY_DP_PACK_STRATEGY`. |
| `hf_home`                  | `str`  | `<cwd>/.cache/huggingface` | Hugging Face cache directory. Set this if you have already downloaded weights elsewhere.            |
| `debug`                    | `bool` | `false`                    | Print vLLM server logs to stderr.                                                                   |
| `show_vllm_engine_stats`   | `bool` | `false`                    | Periodically log vLLM engine throughput stats.                                                      |
| `ray_worker_py_executable` | `str`  | `sys.executable`           | Python interpreter Ray uses for worker processes.                                                   |

Two inherited fields are auto-populated by LocalVLLMModel after vLLM spawns and should **not** be set in your config:

* `base_url`: set to the URL of the vLLM server once it binds a port. Defaults to `[]` until then.
* `api_key`: defaults to `"dummy"`. vLLM does not authenticate local connections.

### `vllm_serve_kwargs`

Required keys (asserted at startup):

* `data_parallel_size`
* `tensor_parallel_size`
* `pipeline_parallel_size`

LocalVLLMModel injects the following keys automatically. Do not set them in your config:

* `distributed_executor_backend: ray`
* `data_parallel_backend: ray`
* `host: 0.0.0.0`
* `port` (chosen from the free-port pool)
* `download_dir` (derived from `hf_home`)
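The two lists above can be checked before launch. The following is a minimal pre-flight sketch, assuming your `vllm_serve_kwargs` is already loaded as a Python `dict`. `check_serve_kwargs` is a hypothetical helper, and the checks LocalVLLMModel performs at startup may differ.

```python
# Keys taken from the two lists above.
REQUIRED_KEYS = {
    "data_parallel_size",
    "tensor_parallel_size",
    "pipeline_parallel_size",
}
INJECTED_KEYS = {
    "distributed_executor_backend",
    "data_parallel_backend",
    "host",
    "port",
    "download_dir",
}


def check_serve_kwargs(serve_kwargs: dict) -> list[str]:
    """Return a list of problems found in a vllm_serve_kwargs mapping.

    Hypothetical helper for illustration; an empty list means no problems.
    """
    present = set(serve_kwargs)
    problems = []
    for key in sorted(REQUIRED_KEYS - present):
        problems.append(f"missing required key: {key}")
    for key in sorted(INJECTED_KEYS & present):
        problems.append(f"key is injected automatically, remove it: {key}")
    return problems


problems = check_serve_kwargs(
    {"tensor_parallel_size": 8, "pipeline_parallel_size": 1, "port": 8000}
)
for problem in problems:
    print(problem)
```

In this example the mapping lacks `data_parallel_size` and sets `port`, so both problems are reported.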

Commonly tuned keys (see the shipped configs for full examples):

```yaml
vllm_serve_kwargs:
  data_parallel_size: 1
  tensor_parallel_size: 8
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.9
  trust_remote_code: true
  enable_auto_tool_choice: true
  tool_call_parser: hermes
  model_loader_extra_config:
    enable_multithread_load: true
    num_threads: 16
```

Any flag accepted by `vllm serve` can be set under `vllm_serve_kwargs`. See the [official vLLM serve reference](https://docs.vllm.ai/en/latest/cli/serve/) for the full list.

### `vllm_serve_env_vars`

Environment variables set in the vLLM process. `VLLM_RAY_DP_PACK_STRATEGY` is mandatory:

```yaml
vllm_serve_env_vars:
  VLLM_RAY_DP_PACK_STRATEGY: strict
```

See [Multi-node deployments](#multi-node-deployments) for what `strict` and `span` mean.

## Multi-node deployments

LocalVLLMModel uses Ray to place vLLM workers across nodes. The `VLLM_RAY_DP_PACK_STRATEGY` environment variable controls how worker groups are packed:

* **`strict`**: each data-parallel replica must fit on a single node (`tensor_parallel_size * pipeline_parallel_size ≤ GPUs per node`). Use for single-node setups or when running multiple replicas, each constrained to one node.
* **`span`**: a single model instance may span multiple nodes. Use when `tensor_parallel_size * pipeline_parallel_size` exceeds the GPU count of one node. When `span` is set, `data_parallel_size_local` is automatically unset.
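The sizing rule behind the two strategies can be written as a short check. The following is a minimal sketch that encodes only the `tensor_parallel_size * pipeline_parallel_size` comparison described above. `pick_pack_strategy` is a hypothetical helper, not part of NeMo Gym or vLLM.

```python
def pick_pack_strategy(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    gpus_per_node: int,
) -> str:
    """Suggest a VLLM_RAY_DP_PACK_STRATEGY value from the sizing rule.

    Hypothetical helper for illustration only.
    """
    # GPUs used by one data-parallel replica.
    gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
    # A replica that fits on one node can use `strict`; otherwise it must span nodes.
    return "strict" if gpus_per_replica <= gpus_per_node else "span"


print(pick_pack_strategy(8, 1, gpus_per_node=8))
print(pick_pack_strategy(8, 2, gpus_per_node=8))
```

The first call describes a replica that fills exactly one 8-GPU node; the second needs 16 GPUs and therefore has to span two nodes.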

### Sample topologies

**1 node, 1 instance (TP=8).** The default for the shipped configs:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]"
```

**1 node, 1 instance (DP=2, TP=4).** Split one node into two data-parallel replicas:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]" \
    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2 \
    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4
```

**1 node, 2 instances (TP=4 each).** Chain two model configs into one run; each gets half the node:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]" \
    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
    ++gpt-oss-120b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4
```

**2 nodes, 2 instances (TP=8 each).** With `strict` packing, each instance stays on its own node:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]"
```
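When chaining several instances into one run, it helps to confirm that their combined GPU demand fits the cluster. The following is a minimal sketch, assuming each instance uses `data_parallel_size * tensor_parallel_size * pipeline_parallel_size` GPUs and that instances do not share GPUs. The function name and the `instances` mapping are hypothetical.

```python
def gpus_needed(serve_kwargs: dict) -> int:
    """GPUs used by one instance: DP x TP x PP (hypothetical helper)."""
    return (
        serve_kwargs["data_parallel_size"]
        * serve_kwargs["tensor_parallel_size"]
        * serve_kwargs["pipeline_parallel_size"]
    )


# Parallelism settings for the "1 node, 2 instances (TP=4 each)" topology.
instances = {
    "gpt-oss-20b-reasoning-high": {
        "data_parallel_size": 1,
        "tensor_parallel_size": 4,
        "pipeline_parallel_size": 1,
    },
    "gpt-oss-120b-reasoning-high": {
        "data_parallel_size": 1,
        "tensor_parallel_size": 4,
        "pipeline_parallel_size": 1,
    },
}

total_gpus = sum(gpus_needed(kwargs) for kwargs in instances.values())
available_gpus = 8  # one 8-GPU node

print(f"needed={total_gpus}, available={available_gpus}")
if total_gpus > available_gpus:
    raise SystemExit("Topology needs more GPUs than the cluster provides")
```

The same check applies to the two-node topology by setting `tensor_parallel_size` to 8 for each instance and `available_gpus` to 16.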

## Inherited features

The following capabilities work the same as in VLLMModel. See [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference) for details.

* **`chat_template_kwargs`**: override chat template behavior per model.
* **`extra_body`**: pass vLLM-specific request parameters (for example, `guided_json`, `reasoning.effort`).
* **`return_token_id_information`**: enable for training workflows that need `prompt_token_ids`, `generation_token_ids`, and `generation_log_probs`.

Multi-endpoint replicas (the `base_url: list[str]` pattern used with VLLMModel) do not apply to LocalVLLMModel: each LocalVLLMModel instance manages exactly one vLLM engine. To run multiple replicas, define each as a separate server instance in your config (chain configs in `config_paths`, or define multiple top-level keys in a single YAML).