NeMo Gym Integration#

This document describes how NeMo RL integrates with NeMo Gym for multi-step and multi-turn reinforcement learning training.

Overview#

NeMo Gym provides HTTP-based training environments for LLMs. NeMo Gym is CPU-only—it runs no inference engines and holds no GPU memory. NeMo RL exposes its vLLM generation engine as an OpenAI-compatible HTTP server, which NeMo Gym calls during rollouts, enabling:

  • Decoupled architecture: Environments don’t need direct access to model internals

  • Multi-step/multi-turn support: Agents can orchestrate complex interactions with tools

  • Refit compatibility: NeMo RL’s weight synchronization works transparently

Configuration#

To enable NeMo Gym integration, add the following to your NeMo RL config:

policy:
  generation:
    backend: vllm
    vllm_cfg:
      async_engine: true          # Both required for HTTP server support:
      expose_http_server: true    # async_engine enables the async worker; expose_http_server starts the server

env:
  should_use_nemo_gym: true       # Enables NeMo Gym integration
  nemo_gym:
    # NeMo Gym config paths and settings
    config_paths:
      - resources_servers/math/configs/math.yaml
      - responses_api_agents/simple_agent/configs/simple_agent.yaml

For a complete example, see examples/nemo_gym/ and its associated configs.

Version Requirements#

NeMo Gym runs as a Ray actor within NeMo RL’s Ray cluster, so the same Ray and Python versions must be used in both environments.
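
For example, running the same quick check in both environments should print identical versions:

import sys

import ray

# Compare this output between the NeMo RL and NeMo Gym environments; the
# Python and Ray versions must match on both sides.
print("python", sys.version.split()[0], "| ray", ray.__version__)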

Architecture Overview#

%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL["NeMo RL"]
        GRPO["GRPO Loop"]
        vLLM["vLLM + HTTP"]
        Bridge["NemoGym Actor"]
    end
    
    subgraph Gym["NeMo Gym"]
        Agent["Agent"]
        Model["Model (Proxy)"]
        Resources["Resources"]
    end
    
    GRPO -->|refit| vLLM
    GRPO -->|run_rollouts| Bridge
    Bridge -->|spawns| Gym
    Agent <--> Model
    Agent <--> Resources
    Model -->|HTTP| vLLM

    style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    

Color coding:

  • Blue = NeMo RL code (nemo_rl/)

  • Orange = NeMo Gym code (3rdparty/Gym-workspace/Gym/nemo_gym/)

The NemoGym Actor#

The integration is handled by the NemoGym Ray actor at nemo_rl/environments/nemo_gym.py (a usage sketch follows this list):

  1. Created by NeMo RL during training setup via NemoGym.remote(config)

  2. Joins the existing Ray cluster that NeMo RL already initialized

  3. Spawns NeMo Gym servers as OS subprocesses (Head, Agent, Model, Resources)

  4. Injects vLLM base URLs so NeMo Gym’s Model Server knows where to proxy requests

  5. Exposes run_rollouts() as the entry point for the training loop to call
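
A minimal sketch of this lifecycle, assuming the import path matches the file location given above; the config keys and batch contents are illustrative, not the exact schema:

import ray

from nemo_rl.environments.nemo_gym import NemoGym

# Step 2: join the Ray cluster that NeMo RL already initialized.
ray.init(address="auto")

# Mirrors the env.nemo_gym section of the config above; exact keys may differ.
config = {
    "config_paths": [
        "resources_servers/math/configs/math.yaml",
        "responses_api_agents/simple_agent/configs/simple_agent.yaml",
    ],
}

# Step 1 creates the actor; steps 3-4 (spawning the server subprocesses and
# injecting the vLLM base URLs) happen during its initialization.
gym = NemoGym.remote(config)

# Step 5: the training loop drives rollouts through one remote entry point.
batch = []  # placeholder for the DatumSpec batch that GRPO assembles
results = ray.get(gym.run_rollouts.remote(batch))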

%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL["NeMo RL"]
        GRPO["GRPO Loop"]
        Actor["NemoGym Actor"]
    end
    
    subgraph Gym["NeMo Gym"]
        RCH["RolloutCollectionHelper"]
        Agent["Agent Server"]
    end
    
    GRPO --> Actor
    Actor --> Agent
    Agent --> RCH
    RCH --> Actor
    Actor --> GRPO

    style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    

The flow is as follows (step 2 is sketched in code after the list):

  1. GRPO Loop calls run_rollouts.remote(batch) on the NemoGym Actor

  2. Actor sends POST /run to the Agent Server

  3. Agent Server orchestrates the rollout via RolloutCollectionHelper

  4. Results return to the Actor

  5. Actor returns results to the training loop
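
Step 2 is an ordinary HTTP request from the actor to the Agent Server subprocess. A rough sketch with requests, where the address, port, and /run payload shape are assumptions; "responses_create_params" comes from NeMo Gym's example-dict format (see Data Format Translation below):

import requests

# Hypothetical example shape for one batch element.
examples = [{"responses_create_params": {"input": "What is 2 + 2?"}}]

# Step 2: hand the batch to the Agent Server. Steps 3-4 happen server-side,
# and the rollout results come back in the response body.
resp = requests.post(
    "http://127.0.0.1:11000/run",  # placeholder Agent Server address
    json={"examples": examples},
    timeout=600,
)
resp.raise_for_status()
rollouts = resp.json()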

vLLM HTTP Server#

NeMo Gym does not run its own vLLM engine. The Model Server is purely an HTTP proxy:

| Aspect | NeMo RL vLLM Worker | NeMo Gym Model Server |
| --- | --- | --- |
| Engine | Runs actual vLLM AsyncLLM | No engine; HTTP proxy only |
| GPU | Holds model weights | No GPU required |
| Endpoints | /v1/chat/completions, /tokenize | /v1/responses |
| Role | Inference | API translation; forwards requests |
Each data-parallel vLLM worker exposes its own HTTP server, and NeMo Gym's Model Server load-balances requests across them.
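
Because every worker speaks the standard OpenAI surface, any HTTP client can call it directly, which is also a convenient way to smoke-test the setup. The sketch below mimics the Model Server's load balancing with a naive round-robin; the base URLs and model name are placeholders:

import itertools

import requests

# One base URL per data-parallel vLLM worker (placeholders; NeMo RL reports
# the real URLs when it creates the workers).
workers = itertools.cycle(["http://10.0.0.1:8000", "http://10.0.0.2:8000"])

def chat(messages: list[dict]) -> str:
    # Naive round-robin, similar in spirit to the Model Server's balancing.
    url = f"{next(workers)}/v1/chat/completions"
    resp = requests.post(url, json={"model": "policy", "messages": messages})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(chat([{"role": "user", "content": "Hello!"}]))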

Initialization Sequence#

%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
    autonumber
    box rgb(227, 242, 253) NeMo RL
        participant RL as Training Script
        participant Ray as Ray Cluster
        participant vLLM as vLLM Workers
        participant Bridge as NemoGym Actor
    end
    box rgb(255, 243, 224) NeMo Gym
        participant Servers as NeMo Gym Servers
    end
    
    RL->>Ray: Initialize Ray cluster
    RL->>vLLM: Create vLLM workers with HTTP servers
    vLLM-->>RL: Return base URLs (one per DP rank)
    RL->>Bridge: NemoGym.remote(config, base_urls)
    Note over Bridge: Reuses existing Ray cluster
    Bridge->>Servers: Spawn subprocess servers
    Servers-->>Bridge: Health check OK
    Bridge-->>RL: Ready for rollouts
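
Messages 6-7 (spawn, then wait for health) can be pictured as below; the server command, port, and /health route are illustrative assumptions, not NeMo Gym's actual interface:

import subprocess
import time

import requests

# Hypothetical launch of one NeMo Gym server subprocess; the real commands
# and ports come from the NeMo Gym configs.
proc = subprocess.Popen(["python", "-m", "nemo_gym.server", "--port", "8080"])

# Poll until the server answers (message 7: "Health check OK").
deadline = time.time() + 60
while time.time() < deadline:
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=1).ok:
            break
    except requests.ConnectionError:
        pass
    time.sleep(0.5)
else:
    proc.kill()
    raise RuntimeError("NeMo Gym server never became healthy")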
    

Training Loop Control Flow#

%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
    autonumber
    box rgb(227, 242, 253) NeMo RL
        participant GRPO as GRPO Loop
        participant Policy as Policy Workers
        participant vLLM as vLLM HTTP
        participant Bridge as NemoGym Actor
    end
    box rgb(255, 243, 224) NeMo Gym
        participant Agent as Agent Server
        participant Model as Model Server
        participant Resource as Resource Server
    end
    
    GRPO->>Policy: Refit (trigger weight sync)
    Policy->>vLLM: Sync weights to vLLM
    GRPO->>Bridge: run_rollouts.remote(batch)
    Bridge->>Agent: POST /run
    Agent->>Model: POST /v1/responses
    Model->>vLLM: POST /v1/chat/completions
    vLLM-->>Model: Response
    Model-->>Agent: Responses API format
    Agent->>Resource: Execute tool / compute reward
    Resource-->>Agent: Tool result / reward
    Agent-->>Bridge: Results + rewards
    Bridge-->>GRPO: Token IDs, logprobs, rewards
    GRPO->>Policy: Compute loss and train
    

NeMo Gym server types (see Core Components):

  • Agent Server: Orchestrates the rollout loop

  • Model Server: HTTP proxy to vLLM; translates Responses API ↔ Chat Completions (a sketch of this translation follows the list)

  • Resource Server: Provides tools and rewards
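
A stripped-down picture of the Model Server's request-side translation (Responses API request → Chat Completions request). The real translation also carries tools, reasoning content, and token IDs; the field handling here is deliberately simplified:

def responses_to_chat(responses_request: dict) -> dict:
    """Map a simplified Responses API request onto Chat Completions fields."""
    chat_request = {"messages": []}
    items = responses_request.get("input", [])
    if isinstance(items, str):  # the Responses API also accepts a bare string
        items = [{"role": "user", "content": items}]
    for item in items:
        chat_request["messages"].append(
            {"role": item["role"], "content": item["content"]}
        )
    # Pass sampling parameters through unchanged (e.g. temperature, top_p).
    for key in ("model", "temperature", "top_p"):
        if key in responses_request:
            chat_request[key] = responses_request[key]
    return chat_request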

Key Steps#

| Step | Location | Description |
| --- | --- | --- |
| Refit | NeMo RL | Synchronizes policy weights to vLLM workers. For async RL, refit timing may differ; see Generation Interface for details. |
| run_rollouts.remote() | NeMo RL | Ray remote call from the GRPO loop to the NemoGym actor |
| POST /run | NeMo RL → NeMo Gym | HTTP request from the NemoGym actor to the Agent Server subprocess |
| Rollout orchestration | NeMo Gym | Agent calls the Model Server and Resource Server over HTTP |
| POST /v1/chat/completions | NeMo Gym → NeMo RL | Model Server proxies to NeMo RL's vLLM HTTP endpoint |
| Result processing | NeMo RL | NemoGym actor extracts token IDs, logprobs, and rewards |

Async Result Processing#

The NemoGym actor uses an as-completed pattern to overlap waiting with post-processing:

  1. Results return out of order: Individual rollout steps (the “assistant” + “tool” turns) complete at different times depending on conversation length and tool calls. Rather than waiting for the whole batch, the actor processes each result as soon as it completes. Note: this is pipelining within NeMo Gym, not asynchronous processing of global batch steps by NeMo RL.

  2. Immediate post-processing: As each rollout completes, the actor immediately extracts token IDs and logprobs. This overlaps CPU work with network I/O from slower rollouts still in flight.

  3. Reordering at the end: Each example carries an index. After all results are collected, they are reordered to match the original batch order before being returned to the training loop.

This pattern maximizes throughput by keeping the CPU busy while waiting for network responses.
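
A self-contained sketch of the as-completed pattern; extract_tokens_and_logprobs is a stand-in for the actor's real post-processing:

import asyncio

def extract_tokens_and_logprobs(raw):
    # Stand-in for the actor's real extraction of token IDs and logprobs.
    return raw

async def _tag(index, coro):
    # Attach the original batch index so order can be restored at the end.
    return index, await coro

async def collect(rollout_coros):
    results = [None] * len(rollout_coros)
    tagged = [_tag(i, c) for i, c in enumerate(rollout_coros)]
    for fut in asyncio.as_completed(tagged):
        i, raw = await fut  # whichever rollout finished first
        results[i] = extract_tokens_and_logprobs(raw)  # overlap CPU with I/O
    return results  # original batch order restored via the indices

async def _demo_rollout(seconds):
    await asyncio.sleep(seconds)  # pretend this is a full rollout
    return {"latency": seconds}

print(asyncio.run(collect([_demo_rollout(s) for s in (0.3, 0.1, 0.2)])))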

Data Format Translation#

%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL1["NeMo RL Input"]
        Datum["DatumSpec"]
    end
    
    subgraph Gym["NeMo Gym"]
        Example["Example Dict"]
        ReqResp["Responses API"]
        ReqChat["Chat Completions"]
    end
    
    subgraph RL2["NeMo RL Output"]
        Result["Result"]
    end
    
    Datum --> Example
    Example --> ReqResp
    ReqResp --> ReqChat
    ReqChat --> ReqResp
    ReqResp --> Example
    Example --> Result

    style RL1 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style RL2 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    

Formats:

  • DatumSpec (NeMo RL): Training-focused format with prompt, prompt_token_ids, and task metadata

  • Example Dict (NeMo Gym): Environment-focused format containing responses_create_params and expected answer

  • Responses API (NeMo Gym): OpenAI Responses API format with input, tools, and multi-turn conversation

  • Chat Completions (vLLM): OpenAI Chat Completions format for the actual inference call

Data flow: DatumSpec is converted to Example Dict, which passes through to the Responses API with generation parameters (temperature, top_p) added for on-policy sampling. The Model Server translates Responses API ↔ Chat Completions (converting message formats, extracting reasoning content, attaching token IDs). Results flow back with token IDs and logprobs extracted into the final Result.
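
As a rough illustration of the first hop (DatumSpec → Example Dict), with every field name beyond those listed above treated as hypothetical:

def datum_to_example(datum: dict, temperature: float, top_p: float) -> dict:
    # Illustrative only; the real conversion lives in the NemoGym actor.
    return {
        "responses_create_params": {
            "input": datum["prompt"],    # prompt text from the DatumSpec
            "temperature": temperature,  # sampling params for on-policy rollouts
            "top_p": top_p,
        },
        # Environment-side fields (e.g. the expected answer) ride along so the
        # Resource Server can score the rollout; this key name is hypothetical.
        "expected_answer": datum.get("expected_answer"),
    }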

Tokenization and On-Policy Corrections#

Token IDs are extracted at the NeMo RL vLLM layer via the /tokenize endpoint (a direct call is sketched after the list below). This ensures:

  • Tokenization matches the exact model and tokenizer used for generation

  • No re-tokenization drift between generation and training
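
For example, a direct call to a worker's /tokenize route; the base URL and model name are placeholders, and the "tokens" field follows vLLM's tokenize response format:

import requests

# Tokenize with the same server (and therefore the same tokenizer) that
# generated the text, avoiding re-tokenization drift.
resp = requests.post(
    "http://10.0.0.1:8000/tokenize",  # placeholder vLLM worker base URL
    json={"model": "policy", "prompt": "2 + 2 = 4"},
)
resp.raise_for_status()
token_ids = resp.json()["tokens"]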

For details on on-policy token ID handling, see Environments for GRPO Training and the NeMo Gym on-policy corrections documentation.