# NeMo Gym Integration
This document describes how NeMo RL integrates with NeMo Gym for multi-step and multi-turn reinforcement learning training.
## Overview
NeMo Gym provides HTTP-based training environments for LLMs. NeMo Gym is CPU-only—it runs no inference engines and holds no GPU memory. NeMo RL exposes its vLLM generation engine as an OpenAI-compatible HTTP server, which NeMo Gym calls during rollouts, enabling:
- **Decoupled architecture**: Environments don't need direct access to model internals
- **Multi-step/multi-turn support**: Agents can orchestrate complex interactions with tools
- **Refit compatibility**: NeMo RL's weight synchronization works transparently
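Because the generation engine is exposed as an OpenAI-compatible HTTP server, anything an environment sends to it is a plain JSON request. The sketch below builds such a payload; the base URL is a placeholder (in practice NeMo RL reports one URL per data-parallel vLLM worker), and the helper function is illustrative, not part of either codebase.

```python
import json

# Placeholder; the real URLs are reported by NeMo RL, one per DP rank.
BASE_URL = "http://localhost:8000"

def build_chat_request(messages, temperature=1.0, top_p=1.0):
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }

payload = build_chat_request([{"role": "user", "content": "What is 2 + 2?"}])
body = json.dumps(payload)  # what would be POSTed to BASE_URL + "/v1/chat/completions"
```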
## Configuration
To enable NeMo Gym integration, add the following to your NeMo RL config:
```yaml
policy:
  generation:
    backend: vllm
    vllm_cfg:
      async_engine: true        # Required: enables the async vLLM worker
      expose_http_server: true  # Required: starts the OpenAI-compatible HTTP server
env:
  should_use_nemo_gym: true     # Enables NeMo Gym integration
  nemo_gym:
    # NeMo Gym config paths and settings
    config_paths:
      - resources_servers/math/configs/math.yaml
      - responses_api_agents/simple_agent/configs/simple_agent.yaml
```
For a complete example, see `examples/nemo_gym/` and its associated configs.
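Both vLLM flags must be set together for the integration to work. A minimal validation sketch, assuming the key names from the YAML snippet above (the function itself is hypothetical, not part of NeMo RL):

```python
def check_gym_config(cfg: dict) -> None:
    """Raise if the settings required for NeMo Gym integration are missing."""
    vllm = cfg["policy"]["generation"]["vllm_cfg"]
    # Both flags are required: async_engine enables the async worker,
    # expose_http_server starts the OpenAI-compatible HTTP server.
    if not (vllm.get("async_engine") and vllm.get("expose_http_server")):
        raise ValueError("async_engine and expose_http_server must both be true")
    if not cfg["env"].get("should_use_nemo_gym"):
        raise ValueError("env.should_use_nemo_gym must be true")

cfg = {
    "policy": {"generation": {"backend": "vllm",
                              "vllm_cfg": {"async_engine": True,
                                           "expose_http_server": True}}},
    "env": {"should_use_nemo_gym": True,
            "nemo_gym": {"config_paths": ["resources_servers/math/configs/math.yaml"]}},
}
check_gym_config(cfg)  # valid config: no exception raised
```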
## Version Requirements
NeMo Gym runs as a Ray actor within NeMo RL’s Ray cluster, so the same Ray and Python versions must be used in both environments.
## Architecture Overview
```mermaid
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL["NeMo RL"]
        GRPO["GRPO Loop"]
        vLLM["vLLM + HTTP"]
        Bridge["NemoGym Actor"]
    end
    subgraph Gym["NeMo Gym"]
        Agent["Agent"]
        Model["Model (Proxy)"]
        Resources["Resources"]
    end
    GRPO -->|refit| vLLM
    GRPO -->|run_rollouts| Bridge
    Bridge -->|spawns| Gym
    Agent <--> Model
    Agent <--> Resources
    Model -->|HTTP| vLLM
    style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```
Color coding:

- Blue = NeMo RL code (`nemo_rl/`)
- Orange = NeMo Gym code (`3rdparty/Gym-workspace/Gym/nemo_gym/`)
## The NemoGym Actor

The integration is handled by the NemoGym Ray actor at `nemo_rl/environments/nemo_gym.py`. The actor:

- Is created by NeMo RL during training setup via `NemoGym.remote(config)`
- Joins the existing Ray cluster that NeMo RL already initialized
- Spawns NeMo Gym servers as OS subprocesses (Head, Agent, Model, Resources)
- Injects vLLM base URLs so NeMo Gym's Model Server knows where to proxy requests
- Exposes `run_rollouts()` as the entry point for the training loop to call
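The actor's lifecycle can be sketched as a plain Python class. This is an illustrative stand-in, not the real implementation in `nemo_rl/environments/nemo_gym.py` (which is a Ray actor created with `NemoGym.remote(config)`); the class and method bodies here are assumptions that only mirror the steps listed above.

```python
class NemoGymBridge:
    """Illustrative stand-in for the NemoGym Ray actor (not the real class)."""

    def __init__(self, config: dict, vllm_base_urls: list):
        self.config = config
        # vLLM base URLs are injected so the Model Server knows where to proxy.
        self.vllm_base_urls = vllm_base_urls
        self.servers = self._spawn_servers()

    def _spawn_servers(self) -> list:
        # The real actor launches these as OS subprocesses; here we only
        # record which servers would be started.
        return ["head", "agent", "model", "resources"]

    def run_rollouts(self, batch: list) -> list:
        # Entry point the GRPO loop calls (via .remote() on the real actor).
        # Each result keeps its batch index so order can be restored later.
        return [{"index": i, "example": ex} for i, ex in enumerate(batch)]

bridge = NemoGymBridge({"config_paths": []}, ["http://localhost:8000"])
results = bridge.run_rollouts([{"prompt": "2 + 2 = ?"}])
```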
```mermaid
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL["NeMo RL"]
        GRPO["GRPO Loop"]
        Actor["NemoGym Actor"]
    end
    subgraph Gym["NeMo Gym"]
        RCH["RolloutCollectionHelper"]
        Agent["Agent Server"]
    end
    GRPO --> Actor
    Actor --> Agent
    Agent --> RCH
    RCH --> Actor
    Actor --> GRPO
    style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```
The flow is:

1. The GRPO Loop calls `run_rollouts.remote(batch)` on the NemoGym actor
2. The actor sends `POST /run` to the Agent Server
3. The Agent Server orchestrates the rollout via RolloutCollectionHelper
4. Results return to the actor
5. The actor returns results to the training loop
## vLLM HTTP Server
NeMo Gym does not run its own vLLM engine. The Model Server is purely an HTTP proxy:
| Aspect | NeMo RL vLLM Worker | NeMo Gym Model Server |
|---|---|---|
| Engine | Runs actual vLLM | No engine; HTTP proxy only |
| GPU | Holds model weights | No GPU required |
| Endpoints | `/v1/chat/completions`, `/tokenize` | `/v1/responses` |
| Role | Inference | API translation, forwards requests |
Data parallel vLLM workers each expose their own HTTP server. NeMo Gym’s Model Server load-balances requests across them.
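The exact balancing strategy is not specified here; round-robin over the per-rank base URLs is one simple possibility, sketched below (the class name and method are hypothetical):

```python
from itertools import cycle

class RoundRobinProxy:
    """Sketch of load-balancing across per-DP-rank vLLM HTTP servers.

    Round-robin is an assumption for illustration; the real Model Server
    may use a different strategy.
    """

    def __init__(self, base_urls):
        self._urls = cycle(base_urls)

    def next_target(self, path="/v1/chat/completions"):
        # Each call picks the next DP worker's server.
        return next(self._urls) + path

proxy = RoundRobinProxy(["http://host-a:8000", "http://host-b:8000"])
targets = [proxy.next_target() for _ in range(3)]
# Requests alternate across the two data-parallel workers.
```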
## Initialization Sequence
```mermaid
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
    autonumber
    box rgb(227, 242, 253) NeMo RL
        participant RL as Training Script
        participant Ray as Ray Cluster
        participant vLLM as vLLM Workers
        participant Bridge as NemoGym Actor
    end
    box rgb(255, 243, 224) NeMo Gym
        participant Servers as NeMo Gym Servers
    end
    RL->>Ray: Initialize Ray cluster
    RL->>vLLM: Create vLLM workers with HTTP servers
    vLLM-->>RL: Return base URLs (one per DP rank)
    RL->>Bridge: NemoGym.remote(config, base_urls)
    Note over Bridge: Reuses existing Ray cluster
    Bridge->>Servers: Spawn subprocess servers
    Servers-->>Bridge: Health check OK
    Bridge-->>RL: Ready for rollouts
```
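Step 7 of the sequence (waiting for the spawned servers' health checks) boils down to polling with a deadline. A minimal sketch, where `probe` stands in for the actual HTTP health request (the function and its parameters are illustrative):

```python
import time

def wait_until_healthy(probe, timeout_s=30.0, interval_s=0.5):
    """Poll `probe()` until it returns True or the timeout expires.

    `probe` stands in for an HTTP request to a spawned server's health
    endpoint, so the retry logic is visible on its own.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Simulate a server that reports healthy on the third poll.
attempts = iter([False, False, True])
ok = wait_until_healthy(lambda: next(attempts), timeout_s=5.0, interval_s=0.0)
```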
## Training Loop Control Flow
```mermaid
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
    autonumber
    box rgb(227, 242, 253) NeMo RL
        participant GRPO as GRPO Loop
        participant Policy as Policy Workers
        participant vLLM as vLLM HTTP
        participant Bridge as NemoGym Actor
    end
    box rgb(255, 243, 224) NeMo Gym
        participant Agent as Agent Server
        participant Model as Model Server
        participant Resource as Resource Server
    end
    GRPO->>Policy: Refit (trigger weight sync)
    Policy->>vLLM: Sync weights to vLLM
    GRPO->>Bridge: run_rollouts.remote(batch)
    Bridge->>Agent: POST /run
    Agent->>Model: POST /v1/responses
    Model->>vLLM: POST /v1/chat/completions
    vLLM-->>Model: Response
    Model-->>Agent: Responses API format
    Agent->>Resource: Execute tool / compute reward
    Resource-->>Agent: Tool result / reward
    Agent-->>Bridge: Results + rewards
    Bridge-->>GRPO: Token IDs, logprobs, rewards
    GRPO->>Policy: Compute loss and train
```
NeMo Gym server types (see Core Components):
- **Agent Server**: Orchestrates the rollout loop
- **Model Server**: HTTP proxy to vLLM; translates Responses API ↔ Chat Completions
- **Resource Server**: Provides tools and rewards
### Key Steps
| Step | Location | Description |
|---|---|---|
| Refit | NeMo RL | Synchronizes policy weights to vLLM workers. For async RL, refit timing may differ; see Generation Interface for details. |
| `run_rollouts.remote()` | NeMo RL | Ray remote call from the GRPO loop to the NemoGym actor |
| `POST /run` | NeMo RL → NeMo Gym | HTTP request from the NemoGym actor to the Agent Server subprocess |
| Rollout orchestration | NeMo Gym | Agent calls the Model Server and Resources Server via HTTP |
| `POST /v1/chat/completions` | NeMo Gym → NeMo RL | Model Server proxies to NeMo RL's vLLM HTTP endpoint |
| Result processing | NeMo RL | NemoGym actor extracts token IDs, logprobs, and rewards |
## Async Result Processing
The NemoGym actor uses an as-completed pattern to overlap waiting with post-processing:
- **Results return out of order**: Single steps of the rollouts (the "assistant" + "tool" turns) complete at different times depending on conversation length and tool calls. Rather than waiting for all results, the actor processes each result as soon as it completes. Note: this is pipelining within NeMo Gym, not asynchronous processing of global batch steps by NeMo RL.
- **Immediate post-processing**: As each rollout completes, the actor immediately extracts token IDs and logprobs. This overlaps CPU work with network I/O from slower rollouts still in flight.
- **Reordering at the end**: Each example carries an index. After all results are collected, results are reordered to match the original batch order before returning to the training loop.
This pattern maximizes throughput by keeping the CPU busy while waiting for network responses.
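The pattern can be sketched with `asyncio.as_completed`; the sleeps stand in for HTTP round trips, and the per-result dict is a placeholder for the real extracted token IDs and logprobs:

```python
import asyncio

async def collect_rollouts(delays):
    """As-completed collection with index-based reordering (illustrative)."""
    async def run_one(index, delay):
        await asyncio.sleep(delay)                  # stands in for HTTP round trips
        return {"index": index, "tokens": [index]}  # per-result post-processing

    tasks = [asyncio.create_task(run_one(i, d)) for i, d in enumerate(delays)]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)       # process immediately, out of order
    results.sort(key=lambda r: r["index"])   # restore original batch order
    return results

# Longer "conversations" finish later, yet output order matches the input batch.
ordered = asyncio.run(collect_rollouts([0.03, 0.01, 0.02]))
```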
## Data Format Translation
```mermaid
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL1["NeMo RL Input"]
        Datum["DatumSpec"]
    end
    subgraph Gym["NeMo Gym"]
        Example["Example Dict"]
        ReqResp["Responses API"]
        ReqChat["Chat Completions"]
    end
    subgraph RL2["NeMo RL Output"]
        Result["Result"]
    end
    Datum --> Example
    Example --> ReqResp
    ReqResp --> ReqChat
    ReqChat --> ReqResp
    ReqResp --> Example
    Example --> Result
    style RL1 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style RL2 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```
Formats:

- **DatumSpec** (NeMo RL): Training-focused format with `prompt`, `prompt_token_ids`, and task metadata
- **Example Dict** (NeMo Gym): Environment-focused format containing `responses_create_params` and the expected answer
- **Responses API** (NeMo Gym): OpenAI Responses API format with `input`, `tools`, and multi-turn conversation
- **Chat Completions** (vLLM): OpenAI Chat Completions format for the actual inference call
Data flow: DatumSpec is converted to Example Dict, which passes through to the Responses API with generation parameters (temperature, top_p) added for on-policy sampling. The Model Server translates Responses API ↔ Chat Completions (converting message formats, extracting reasoning content, attaching token IDs). Results flow back with token IDs and logprobs extracted into the final Result.
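A minimal sketch of the Responses API → Chat Completions direction of that translation. Field names follow the public OpenAI APIs; the real Model Server also handles tools, reasoning content, and token-ID attachment, all omitted here, and the function itself is hypothetical:

```python
def responses_to_chat(responses_params: dict) -> dict:
    """Translate a Responses API request into a Chat Completions request (sketch)."""
    user_input = responses_params.get("input", "")
    if isinstance(user_input, str):
        messages = [{"role": "user", "content": user_input}]
    else:
        # Already a list of message-like items from a multi-turn conversation.
        messages = list(user_input)
    chat_request = {"messages": messages}
    # Generation parameters are passed through for on-policy sampling.
    for key in ("temperature", "top_p"):
        if key in responses_params:
            chat_request[key] = responses_params[key]
    return chat_request

req = responses_to_chat({"input": "Solve 2 + 2.", "temperature": 1.0})
```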
## Tokenization and On-Policy Corrections
Token IDs are extracted at the NeMo RL vLLM layer via the `/tokenize` endpoint. This ensures:

- Tokenization matches the exact model and tokenizer used for generation
- No re-tokenization drift between generation and training
For details on on-policy token ID handling, see Environments for GRPO Training and the NeMo Gym on-policy corrections documentation.
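A sketch of the request/response shapes involved. The field names (`prompt`, `add_special_tokens`, `tokens`) follow vLLM's OpenAI-compatible server as the author understands it; verify against your vLLM version's docs, since the schema has changed across releases. The response dict below is fabricated for illustration only.

```python
def build_tokenize_request(text: str, add_special_tokens: bool = False) -> dict:
    """Build a payload for the vLLM server's /tokenize endpoint (assumed schema)."""
    return {"prompt": text, "add_special_tokens": add_special_tokens}

def extract_token_ids(response: dict) -> list:
    # Assumed: the tokenize response carries the ids under "tokens".
    return response["tokens"]

payload = build_tokenize_request("2 + 2 = 4")
# Fabricated response for illustration; real ids come from the server,
# guaranteeing they match the generation-time tokenizer exactly.
ids = extract_token_ids({"tokens": [17, 489, 220], "count": 3})
```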