# vLLM

[vLLM](https://docs.vllm.ai/) is a popular LLM inference engine. The NeMo Gym VLLMModel server wraps vLLM's Chat Completions endpoint and converts requests and responses to NeMo Gym's native format, the OpenAI [Responses API](https://platform.openai.com/docs/api-reference/responses) schema.

Most open-source models use Chat Completions format, while NeMo Gym uses the Responses API natively. VLLMModel bridges this gap by converting between the two formats automatically. For background on why NeMo Gym chose the Responses API and how the two schemas differ, see [Responses API evolution](/latest/infrastructure/engineering-notes/responses-api-evolution).

VLLMModel provides a Responses API to Chat Completions mapping middleware layer via `responses_api_models/vllm_model`. It assumes you are pointing to a vLLM instance since it relies on vLLM-specific endpoints like `/tokenize` and vLLM-specific arguments like `return_tokens_as_token_ids`.
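
To make the mapping concrete, here is a rough, illustrative sketch of the two request shapes (field names follow the public OpenAI APIs; the actual conversion logic lives inside `responses_api_models/vllm_model` and is not reproduced here):

```python
# Illustrative only: minimal request shapes for the two schemas that
# VLLMModel translates between.

# Responses API request (NeMo Gym's native format): a list of input items.
responses_request = {
    "model": "Qwen/Qwen3-4B-Thinking-2507",
    "input": [{"role": "user", "content": "What is 2 + 2?"}],
}

# Equivalent Chat Completions request (what the vLLM server actually serves):
# a list of chat messages.
chat_completions_request = {
    "model": "Qwen/Qwen3-4B-Thinking-2507",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
}
```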

**To use VLLMModel, just change the `responses_api_models/openai_model/configs/openai_model.yaml` in your config paths to `responses_api_models/vllm_model/configs/vllm_model.yaml`!**

```bash
config_paths="resources_servers/example_multi_step/configs/example_multi_step.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"
```

## Use VLLMModel

Below is an end-to-end example of how to spin up an OpenAI-compatible vLLM Chat Completions server for NeMo Gym and run rollout collection with it.

### Install vLLM

Please run the steps below in a separate terminal from your NeMo Gym terminal! The installation will take a few minutes.

```bash
uv venv --python 3.12 --seed .venv
source .venv/bin/activate
# hf_transfer for faster model download
uv pip install hf_transfer vllm --torch-backend=auto
```

### Download the model

This download will take a few minutes.

```bash
# Qwen/Qwen3-4B-Thinking-2507, usable in NeMo RL!
HF_HOME=.cache/ \
HF_HUB_ENABLE_HF_TRANSFER=1 \
    hf download Qwen/Qwen3-4B-Thinking-2507
```

<Tip>
  If you get errors relating to HuggingFace rate limits, please provide your HF token to the command above.

  ```bash
  HF_TOKEN=... \
  HF_HOME=.cache/ \
  HF_HUB_ENABLE_HF_TRANSFER=1 \
      hf download Qwen/Qwen3-4B-Thinking-2507
  ```

  If you do not have a HuggingFace token, please follow the instructions [here](https://huggingface.co/docs/hub/en/security-tokens) to create one!
</Tip>

### Spin up a vLLM server

Keep the following in mind when configuring the vLLM server:

* If you want to use tools, find the appropriate vLLM arguments for the tool call parser to use. In this example, we use `Qwen/Qwen3-4B-Thinking-2507`, for which the `hermes` tool call parser is recommended.
* If you are using a reasoning model, find the appropriate vLLM arguments for the reasoning parser to use. For `Qwen/Qwen3-4B-Thinking-2507`, the `deepseek_r1` reasoning parser is recommended.
* The example below uses `--tensor-parallel-size 1`, which requires one GPU.

The spin-up step will take a few minutes.

```bash
HF_HOME=.cache/ \
HOME=. \
vllm serve \
    Qwen/Qwen3-4B-Thinking-2507 \
    --tensor-parallel-size 1 \
    --enable-auto-tool-choice --tool-call-parser hermes \
    --reasoning-parser deepseek_r1 \
    --host 0.0.0.0 \
    --port 10240
```

### Configure NeMo Gym to use the local vLLM server

In a second terminal on the same GPU node that was used to spin up the vLLM server, enter the NeMo Gym Python environment, and start the NeMo Gym servers.

```bash
config_paths="resources_servers/example_multi_step/configs/example_multi_step.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]" \
    ++policy_base_url=http://0.0.0.0:10240/v1 \
    ++policy_model_name=Qwen/Qwen3-4B-Thinking-2507 \
    ++policy_api_key=dummy_key
```

<Tip>
  If you want to run NeMo Gym on a separate machine from the one used to spin up the vLLM server, first get the IP address (or hostname) of the machine running the vLLM server.

  ```bash
  hostname -i
  ```

  Then replace `policy_base_url=http://0.0.0.0:10240/v1` with `policy_base_url=http://{hostname}:10240/v1`, substituting the address you found above.
</Tip>

### Run rollout collection

In a third terminal on the same GPU node that was used to spin up the vLLM server, enter the NeMo Gym Python environment, and run rollout collection.

```bash
ng_collect_rollouts +agent_name=example_multi_step_simple_agent \
    +input_jsonl_fpath=resources_servers/example_multi_step/data/example.jsonl \
    +output_jsonl_fpath=results/example_multi_step_rollouts.jsonl
```

## VLLMModel configuration reference

| Parameter                            | Type                 | Default | Description                                                                                 |
| ------------------------------------ | -------------------- | ------- | ------------------------------------------------------------------------------------------- |
| `base_url`                           | `str` or `list[str]` | —       | **Required.** vLLM server endpoint(s). Supports list for load balancing.                    |
| `api_key`                            | `str`                | —       | **Required.** API key matching vLLM's `--api-key` flag.                                     |
| `model`                              | `str`                | —       | **Required.** Model name as registered in vLLM.                                             |
| `return_token_id_information`        | `bool`               | —       | **Required.** Set `true` for training (token IDs + log probs), `false` for inference only.  |
| `uses_reasoning_parser`              | `bool`               | —       | **Required.** Set `true` for reasoning models (extracts `<think>` tags), `false` otherwise. |
| `replace_developer_role_with_system` | `bool`               | `false` | Convert "developer" role to "system" for models that don't support developer role.          |
| `chat_template_kwargs`               | `dict`               | `null`  | Override chat template parameters (e.g., `add_generation_prompt`).                          |
| `extra_body`                         | `dict`               | `null`  | Pass additional vLLM-specific parameters (e.g., `guided_json`).                             |

### Advanced: `chat_template_kwargs`

Override chat template behavior for specific models:

```yaml
chat_template_kwargs:
  enable_thinking: false  # Model-specific
```

### Advanced: `extra_body`

Pass vLLM-specific parameters not in the standard OpenAI API:

```yaml
extra_body:
  guided_json: '{"type": "object", "properties": {...}}'
  min_tokens: 10
  repetition_penalty: 1.1
```

## Use VLLMModel with multiple replicas of a model endpoint

The vLLM model server supports multiple endpoints for horizontal scaling:

```yaml
base_url:
  - http://gpu-node-1:8000/v1
  - http://gpu-node-2:8000/v1
  - http://gpu-node-3:8000/v1
```

**How it works**:

1. **Initial assignment**: New sessions are assigned to endpoints using round-robin (session 1 → endpoint 1, session 2 → endpoint 2, etc.)
2. **Session affinity**: Once assigned, a session always uses the same endpoint (tracked via HTTP session cookies)
3. **Why affinity?** Multi-turn conversations and agentic workflows that call the model multiple times in one trajectory need to reach the same model endpoint so they can benefit from the prefix cache, which significantly speeds up the prefill phase of model inference.
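
The sketch below illustrates the idea with a plain Python stand-in (this is not NeMo Gym's implementation; NeMo Gym tracks affinity via HTTP session cookies, which a session ID stands in for here):

```python
from itertools import cycle

# Hypothetical sketch of round-robin endpoint assignment with session affinity.
endpoints = [
    "http://gpu-node-1:8000/v1",
    "http://gpu-node-2:8000/v1",
    "http://gpu-node-3:8000/v1",
]
_round_robin = cycle(endpoints)
_session_to_endpoint: dict[str, str] = {}

def endpoint_for_session(session_id: str) -> str:
    # First request of a session: take the next endpoint in round-robin order.
    # Later requests in the same session reuse that endpoint, so the whole
    # trajectory keeps hitting the same vLLM prefix cache.
    if session_id not in _session_to_endpoint:
        _session_to_endpoint[session_id] = next(_round_robin)
    return _session_to_endpoint[session_id]
```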

## Context Length Exceeded Handling

When a conversation exceeds the vLLM model's maximum context length (`max_seq_length`), VLLMModel handles the error gracefully instead of crashing the entire rollout collection.

### How it works

1. **vLLM rejects the request**: vLLM returns an HTTP 400 error with a message like `"This model's maximum context length is 32768 tokens. However, you requested 32818 tokens..."`.
2. **VLLMModel catches the error**: Instead of propagating the exception, VLLMModel returns an empty response with `finish_reason: "length"`.
3. **Responses API mapping**: The `finish_reason: "length"` is converted to `incomplete_details: { reason: "max_output_tokens" }` in the Responses API response returned to the agent.

This is particularly important for multi-turn agentic rollouts where conversation length can grow unpredictably across tool-call turns.

### How to detect truncated responses

Downstream consumers (agents, RL training frameworks) can check the `incomplete_details` field on the response:

```python
response = await model_server.responses(request, body)

if response.incomplete_details and response.incomplete_details["reason"] == "max_output_tokens":
    # The conversation exceeded the model's context window.
    # The response output will be empty — handle accordingly.
    pass
```

<Note>
  When `incomplete_details.reason == "max_output_tokens"`, the response output is empty because vLLM rejected the request before generation began. This differs from a normal `max_output_tokens` truncation where the model generates up to the token limit — in this case, the input itself was too long.
</Note>

### Implications for training

When using NeMo Gym with NeMo RL or another training framework, responses with `incomplete_details.reason == "max_output_tokens"` indicate that the full conversation (prompt + prior generations) exceeded `max_seq_length`. Training frameworks should filter or handle these responses appropriately since they contain no generated tokens.
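
For example, a minimal filtering pass over collected rollouts might look like the sketch below. The record layout assumed here (a `responses` list per JSONL record) is illustrative; adapt the lookups to your actual rollout schema:

```python
import json

# Hypothetical sketch: drop rollouts whose model response was rejected for
# exceeding the context window (no generated tokens to train on).
def keep_rollout(record: dict) -> bool:
    for response in record.get("responses", []):
        details = response.get("incomplete_details") or {}
        if details.get("reason") == "max_output_tokens":
            return False
    return True

with open("results/example_multi_step_rollouts.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

kept = [r for r in rollouts if keep_rollout(r)]
print(f"Kept {len(kept)} of {len(rollouts)} rollouts")
```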

## Training vs Offline Inference

By default, VLLMModel does not track any token IDs. However, token IDs are necessary when using NeMo Gym with a training framework in order to train a model. For training workflows, use the training-dedicated config, which enables token ID tracking:

```yaml
# Use vllm_model_for_training.yaml
return_token_id_information: true
```

This enables:

* `prompt_token_ids`: Token IDs for the input prompt
* `generation_token_ids`: Token IDs for generated text
* `generation_log_probs`: Log probabilities for each generated token
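
As a rough sketch of how a consumer might sanity-check these fields (the field names come from the list above, but exactly where they appear on the response object is an assumption; plain dictionary access is used here):

```python
# Hypothetical sketch: validate token ID information on a single response
# collected with return_token_id_information: true.
def check_token_info(response: dict) -> None:
    prompt_ids = response["prompt_token_ids"]        # input prompt token IDs
    gen_ids = response["generation_token_ids"]       # generated token IDs
    gen_logprobs = response["generation_log_probs"]  # per-token log probs
    # Every generated token should have a matching log probability.
    assert len(gen_ids) == len(gen_logprobs)
    print(f"{len(prompt_ids)} prompt tokens, {len(gen_ids)} generated tokens")
```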