Architecture#
This section describes how NeMo Gym components interact during startup and execution. For an overview of the three server types (Model, Resources, Agent), see Core Components.
Control Plane: Server Startup#
When you run ng_run, the system starts up in four phases:
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333', 'primaryColor': '#e3f2fd', 'secondaryColor': '#f5f5f5'}}}%%
flowchart LR
A[Parse CLI] --> B[Load Configs]
B --> C[Init Ray]
C --> D[Start Servers]
Phase 1: Parse CLI#
The ng_run command uses Hydra to parse command-line arguments. Users specify configuration files via +config_paths:
ng_run "+config_paths=[resources_servers/math/configs/math.yaml, responses_api_models/openai_model/configs/openai_model.yaml]"
Phase 2: Load and Merge Configs#
Configuration is loaded from multiple sources in order of priority (later sources override earlier):
YAML files specified in
config_pathsLocal
env.yamlfile (for sensitive values like API keys)Command-line arguments (highest priority)
Port allocation: Users can explicitly specify host and port in their config. If not provided, the framework automatically allocates ports from available system ports, tracking used ports to prevent conflicts.
Phase 3: Initialize Ray#
The system initializes a Ray cluster for distributed coordination. If ray_head_node_address is specified in the config, it connects to an existing cluster; otherwise, it starts a new one.
Phase 4: Start Servers#
Servers are started in two stages:
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart TB
Main["Main Process<br/>ng_run"]
Head["Head Server<br/>Port 11000"]
Model["Model Server<br/>Own venv + port"]
Agent["Agent Server<br/>Own venv + port"]
Resources["Resources Server<br/>Own venv + port"]
Main --> Head
Main -->|spawn| Model
Main -->|spawn| Agent
Main -->|spawn| Resources
Head Server: Started as a background thread in the main process. Provides endpoints for config discovery (
/global_config_dict_yaml) and server instance listing (/server_instances).Server Subprocesses: Each configured server is spawned as an independent OS process:
Each server has its own Python virtual environment in order to isolate dependencies.
Each runs uvicorn with a FastAPI application listening on
http://{host}:{port}.The global config is passed via environment variable
NEMO_GYM_CONFIG_DICT.The specific server identity is passed via
NEMO_GYM_CONFIG_PATH.Server URLs are registered in the global config, allowing other servers to discover and call them.
Health Check: The main process polls each server’s HTTP endpoint until all return 200, then reports “All servers ready!”
Running State#
Once all servers are healthy, the system enters steady state:
The main process sleeps and periodically polls subprocess health
Each server process runs its own uvicorn event loop, handling requests asynchronously
Servers communicate with each other only via HTTP (no shared memory)
Session state is maintained via cookies for multi-step rollouts
Shutdown#
When the user presses Ctrl+C (or the process receives SIGINT):
SIGINT is forwarded to all server subprocesses
Main process waits for subprocesses to terminate (with timeout)
Head server thread is stopped
Process exits cleanly
HTTP Request Flow: Example#
During a single rollout, servers communicate via HTTP. This example shows a math problem with tool use:
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
autonumber
participant Agent as Agent
participant Model as Model
participant Resources as Resources
Note over Agent,Resources: Initialize
Agent->>Resources: POST /seed_session
Resources-->>Agent: OK
Note over Agent,Resources: First generation
Agent->>Model: POST /v1/responses
Model-->>Agent: function_call: calculate
Note over Agent,Resources: Tool execution
Agent->>Resources: POST /calculate
Resources-->>Agent: "4"
Note over Agent,Resources: Second generation
Agent->>Model: POST /v1/responses
Model-->>Agent: message: "The answer is 4"
Note over Agent,Resources: Verification
Agent->>Resources: POST /verify
Resources-->>Agent: reward: 1.0
Key Design Points:
HTTP-only communication: All servers communicate via HTTP, enabling language-agnostic implementations and deployment flexibility
Stateless model servers: Model servers perform single-call generation without memory; the agent maintains conversation state
Session state in resources: Resources servers use session cookies to maintain per-rollout state across multiple tool calls
OpenAI API compatibility: Model servers expose
/v1/responsesendpoints compatible with the OpenAI Responses APIuvicorn + FastAPI: All servers use uvicorn as the ASGI server with FastAPI for HTTP routing and request handling
Data Plane: Rollout Collection#
When you run ng_collect_rollouts, the system collects training data by executing rollouts in parallel:
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
autonumber
participant Client as Client
participant Head as Head Server
participant Agent as Agent
participant Model as Model
participant Resources as Resources
Client->>Head: GET /global_config_dict_yaml
Head-->>Client: Server addresses
Client->>Agent: POST /run
loop Each prompt in parallel
Agent->>Resources: Seed session
Resources-->>Agent: OK
loop Generation loop
Agent->>Model: Generate response
Model-->>Agent: Output
opt Tool calls
Agent->>Resources: Execute tool
Resources-->>Agent: Result
end
end
Agent->>Resources: Verify and score
Resources-->>Agent: Reward
end
Agent-->>Client: Results with rewards
The client first queries the Head Server to discover server addresses from the global config, then reads input JSONL and dispatches prompts to the Agent. Completed rollouts are written to output JSONL.
Concurrency behavior differs by use case:
Standalone rollout collection (
ng_collect_rollouts): A semaphore gates concurrency vianum_samples_in_parallelto control load.Training framework integration (e.g., NeMo RL): All requests are sent without gating; the training framework manages concurrency externally.