vLLM is a popular LLM inference engine. The NeMo Gym VLLMModel server wraps vLLM’s Chat Completions endpoint and converts requests and responses to NeMo Gym’s native format, the OpenAI Responses API schema.
Most open-source models use Chat Completions format, while NeMo Gym uses the Responses API natively. VLLMModel bridges this gap by converting between the two formats automatically. For background on why NeMo Gym chose the Responses API and how the two schemas differ, see responses-api-evolution.
VLLMModel provides a Responses API to Chat Completions mapping middleware layer via responses_api_models/vllm_model. It assumes you are pointing to a vLLM instance since it relies on vLLM-specific endpoints like /tokenize and vLLM-specific arguments like return_tokens_as_token_ids.
To use VLLMModel, just change the responses_api_models/openai_model/configs/openai_model.yaml in your config paths to responses_api_models/vllm_model/configs/vllm_model.yaml!
Below is an e2e example of how to spin up a NeMo Gym compatible vLLM Chat Completions OpenAI server and run rollout collection with it.
Please run the steps below in a separate terminal than your NeMo Gym terminal! The installation will take a few minutes.
This download will take a few minutes.
If you get errors relating to HuggingFace rate limits, please provide your HF token to command above.
If you do not have a HuggingFace token, please follow the instructions here to create one!
vLLM server configuration
Qwen/Qwen3-4B-Thinking-2507, which is suggested to use the hermes tool call parser.Qwen/Qwen3-4B-Thinking-2507, which is suggested to use the deepseek_r1 reasoning parser.--tensor-parallel-size 1 which requires 1 GPU.The spinup step will take a few minutes.
In a second terminal on the same GPU node that was used to spin up the vLLM server, enter the NeMo Gym Python environment, and start the NeMo Gym servers.
If you want to run NeMo Gym on a separate machine from the one used to spin up the vLLM server, please get the hostname of the machine used to run the vLLM server.
Then replace the policy_base_url=http://0.0.0.0:10240/v1 to point to the hostname policy_base_url=http://{hostname}:10240/v1.
In a third terminal on the same GPU node that was used to spin up the vLLM server, enter the NeMo Gym Python environment, and run rollout collection.
chat_template_kwargsOverride chat template behavior for specific models:
extra_bodyPass vLLM-specific parameters not in the standard OpenAI API:
The vLLM model server supports multiple endpoints for horizontal scaling:
How it works:
When a conversation exceeds the vLLM model’s maximum context length (max_seq_length), VLLMModel handles the error gracefully instead of crashing the entire rollout collection.
"This model's maximum context length is 32768 tokens. However, you requested 32818 tokens...".finish_reason: "length".finish_reason: "length" is converted to incomplete_details: { reason: "max_output_tokens" } in the Responses API response returned to the agent.This is particularly important for multi-turn agentic rollouts where conversation length can grow unpredictably across tool-call turns.
Downstream consumers (agents, RL training frameworks) can check the incomplete_details field on the response:
When incomplete_details.reason == "max_output_tokens", the response output is empty because vLLM rejected the request before generation began. This differs from a normal max_output_tokens truncation where the model generates up to the token limit — in this case, the input itself was too long.
When using NeMo Gym with NeMo RL or another training framework, responses with incomplete_details.reason == "max_output_tokens" indicate that the full conversation (prompt + prior generations) exceeded max_seq_length. Training frameworks should filter or handle these responses appropriately since they contain no generated tokens.
By default, VLLMModel will not track any token IDs explicitly. However, token IDs are necessary when using NeMo Gym in conjunction with a training framework in order to train a model. For training workflows, use the training-dedicated config which enables token ID tracking:
This enables:
prompt_token_ids: Token IDs for the input promptgeneration_token_ids: Token IDs for generated textgeneration_log_probs: Log probabilities for each generated token