New Environment

An environment consists of three components: Agents, Models, and Resources. Most contributors will create a new resources server while reusing existing agent server and model server implementations. If you need to create custom agents or models, reference the implementations in responses_api_agents/ and responses_api_models/.

This guide focuses on the resources server contribution process.

For a guide to building your first resources server, refer to Single Step Environment.

Guiding Principles

Adding a new training environment is similar to Adding A Benchmark with respect to reward profiling and software correctness. However, for a new training environment you must additionally run training with your environment in isolation to validate the environment's functionality and correctness.

The expected training flow is the GRPO RL algorithm using NeMo RL, with 64 prompts per step and 16 rollouts per prompt. This is a suggestion that can vary with the environment itself (for example, some environments require more rollouts per prompt) and with the available compute. The goal is an experimental setup that lets us establish a causal relationship between the data generated by the training environment and improvements in the targeted model capability.
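
As a rough sketch, the corresponding GRPO settings in a NeMo RL config might look like the following; key names vary across NeMo RL versions, so treat them as illustrative:

```yaml
grpo:
  num_prompts_per_step: 64        # prompts sampled per training step
  num_generations_per_prompt: 16  # rollouts per prompt
```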

Specifically, you should run training using a model that achieves meaningful performance in your reward profiling. As with the benchmark best practices, you should train using both an instruct and a thinking model to ensure infrastructure correctness. The two models may perform differently, but there should be no notable or unexplainable difference between them. The benchmark best practices include a list of recommended models.

When making a training environment PR, include W&B links and screenshots of the training and validation curves.

Required Files

Your resources server must include these files:

| File | Description |
| --- | --- |
| app.py | Main server implementation with verify function |
| configs/*.yaml | Configuration with valid domain field |
| tests/test_app.py | At least one unit test |
| data/example.jsonl | At least five example inputs |
| data/example_rollouts.jsonl | Pre-generated rollouts from example data (generate before submitting PR) |
| requirements.txt | Python dependencies |
| README.md | Documentation with licensing information |

Contribution Workflow

Contributing a resources server follows this sequence:

| Step | Phase | Description |
| --- | --- | --- |
| 1 | Curate Tasks | Collect or generate training tasks and create example data |
| 2 | Implementation | Build resources server with verification logic |
| 3 | Testing | Write and run unit tests |
| 4 | Example Rollouts | Generate example rollouts to verify functionality |
| 5 | Reward Profiling | Validate reward distribution with inference runs |
| 6 | Training Validation | Train with GRPO to ensure meaningful training signal |
| 7 | Submit PR | Submit pull request with all required information |
| 8 | Review | Address reviewer feedback and verify reproducibility |

Detailed Steps

1. Curate Training Tasks

Prepare the dataset for your environment:

  • Collect or generate prompts/tasks for your environment
  • Create data/example.jsonl with at least 5 representative task examples (an illustrative line follows this list)
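
For illustration, one line of data/example.jsonl for a math-style task might look like the following; the exact fields depend on your server's schema, so treat these names as hypothetical:

```json
{"responses_create_params": {"input": [{"role": "user", "content": "Compute 17 * 24."}]}, "expected_answer": "408"}
```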

2. Resources Server Implementation

Build your resources server:

  • Run ng_init_resources_server +entrypoint=resources_servers/my_server to scaffold the new resources server
  • Follow the Single Step Environment guide to implement your specific logic
  • Implement verification logic for your tasks by defining the verify() function (see the grading sketch after this list)
  • Set the domain field in your resources server configuration (see Domain)
  • Complete the auto-generated README.md with licensing information
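
As an illustration, verification for a math-style task might reduce to a small grading helper like the one below. This is a minimal, self-contained sketch; the boxed-answer convention and the function name are assumptions for this example, not NeMo Gym API:

```python
import re


def grade_answer(model_text: str, expected_answer: str) -> float:
    """Return a binary reward: 1.0 if the final boxed answer matches."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", model_text)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == expected_answer.strip() else 0.0


# Example: grade_answer("So the result is \\boxed{408}.", "408") -> 1.0
```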

3. Testing

Write and run tests for your resources server:

  • At least one test per server is required for PR approval
  • You are responsible for ensuring your tests adequately cover your server’s functionality (a minimal pytest sketch follows this list)
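
A minimal pytest sketch, assuming the grading helper from the previous step lives in your server's app.py (the import path and function name are illustrative):

```python
# tests/test_app.py
from app import grade_answer


def test_correct_answer_gets_full_reward():
    assert grade_answer("The result is \\boxed{408}.", "408") == 1.0


def test_wrong_answer_gets_zero_reward():
    assert grade_answer("The result is \\boxed{400}.", "408") == 0.0
```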

4. Generate Example Rollouts

Verify basic functionality and generate example rollouts:

  • Document the command used to start your server, for example, ng_run +entrypoint=resources_servers/my_server
  • Generate rollouts and save 5 example outputs to data/example_rollouts.jsonl to demonstrate correct reward signals

5. Reward Profiling

Run inference to validate reward distribution:

  • Use a ~500 sample subset (minimum)
  • Use Qwen3-4B, Qwen3-30B-A3B, or an equivalent model
  • Generate 16 responses per prompt
  • Report the reward distribution (a summary helper sketch follows this list)
  • For tool calling: Provide tool call metrics and correlation with rewards
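
For example, a small self-contained helper can summarize the reward distribution from a rollouts file; the top-level "reward" field name is an assumption to match against your rollout schema:

```python
import json
from statistics import mean, pstdev


def summarize_rewards(path: str) -> None:
    # Assumes one JSON object per line with a top-level "reward" field.
    with open(path) as f:
        rewards = [json.loads(line)["reward"] for line in f]
    print(f"n={len(rewards)} mean={mean(rewards):.3f} std={pstdev(rewards):.3f}")
    print(f"zero-reward fraction: {sum(r == 0 for r in rewards) / len(rewards):.2%}")


summarize_rewards("data/example_rollouts.jsonl")
```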

6. Training-Based Validation

Validate with actual training:

  • Train with GRPO on Qwen3-4B, Qwen3-30B-A3B-Instruct, or an equivalent model
  • Include training accuracy curve
  • Include test benchmark accuracy curve (if applicable)

7. Submit PR

Include the following in your pull request description:

  • Description of the environment
  • Description of the verification logic
  • Description of the prompts/tasks: What is the source? Which domain does it cover?
  • Provide relevant license information for data and software. If models were used for synthetic data generation, note this in your PR description

8. PR Review Process

After submitting your PR:

  1. A team member will be assigned to review and reproduce your environment
  2. The reviewer will verify all steps and check correctness of the 5 example rollouts
  3. The reviewer will re-run the procedure to ensure reproducibility
  4. Address any feedback from reviewers
  5. After approval, maintainers will merge your contribution

Design Patterns

For optimal performance and scalability, we recommend following these design patterns:

Async-First Design

Endpoint handlers should be asynchronous to handle concurrent requests efficiently during training:

```python
# Recommended: async function
async def verify(self, body: BaseVerifyRequest) -> BaseVerifyResponse:
    return BaseVerifyResponse(**body.model_dump(), reward=1.0)
```

Avoid spawning additional threads or processes unless necessary. A single Gym instance can handle tens of thousands of concurrent requests when properly implemented.
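
If a verifier genuinely must call blocking code (for example, a sandboxed checker), a generic asyncio pattern is to offload it to a worker thread rather than block the event loop; a sketch:

```python
import asyncio


async def run_blocking_checker(check_fn, payload):
    # Offload a blocking call to a worker thread so the event loop
    # keeps serving other concurrent requests.
    return await asyncio.to_thread(check_fn, payload)
```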

NeMo Gym OpenAI Client

We recommend using the NeMo Gym OpenAI client from nemo_gym.openai_utils:

```python
from nemo_gym.openai_utils import (
    NeMoGymAsyncOpenAI,
    NeMoGymResponse,
    NeMoGymResponseCreateParamsNonStreaming,
)
```

The NeMo Gym client is optimized for scale and provides consistent behavior. External clients like LiteLLM often preprocess or postprocess inputs and outputs in ways that can interfere with training data collection.
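
A usage sketch reusing the imports shown above, assuming the client mirrors the async OpenAI Responses API; the constructor arguments, model name, and endpoint here are assumptions rather than documented NeMo Gym behavior:

```python
async def sample_once() -> NeMoGymResponse:
    # Hypothetical constructor arguments; adjust to your deployment.
    client = NeMoGymAsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    params = NeMoGymResponseCreateParamsNonStreaming(
        model="my-model",
        input=[{"role": "user", "content": "Compute 17 * 24."}],
    )
    return await client.responses.create(**params)
```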

Pydantic Models

Consider using Pydantic models for request and response validation by extending base classes from nemo_gym.base_resources_server:

```python
from pydantic import BaseModel
from nemo_gym.base_resources_server import BaseVerifyRequest, BaseVerifyResponse


class MyVerifyRequest(BaseVerifyRequest):
    expected_result: str
    difficulty: int
```

Error Handling

Tool execution errors should be propagated back to the model rather than crashing the server, enabling the model to learn from mistakes:

```python
async def execute_tool(self, path: str, body: ToolRequest) -> ToolResponse:
    try:
        result = self.tool_functions[path](**body.model_dump())
        return ToolResponse(output=result)
    except Exception as e:
        # Return error to model so it can correct itself
        return ToolResponse(output=f"Error executing tool '{path}': {str(e)}")
```

Configuration

Pass configuration through NeMo Gym config files rather than environment variables for better reproducibility:

```yaml
# configs/my_server.yaml
host: 0.0.0.0
port: 8000
domain: agent
```

Multi-Step Rollouts

For multi-step scenarios, the model returns training information on response messages (prompt_token_ids, generation_token_ids, generation_log_probs). When constructing messages for subsequent model calls, propagate this information from previous responses to maintain the training data chain.
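
A schematic sketch of carrying these fields forward between turns; the message structure and key names below are assumptions, so match them to your model server's actual response format:

```python
def extend_messages(messages: list[dict], response_message: dict) -> list[dict]:
    # Build the next turn's assistant message, preserving the training
    # metadata emitted by the model server (assumed key names).
    next_message = {
        "role": "assistant",
        "content": response_message["content"],
        "prompt_token_ids": response_message.get("prompt_token_ids"),
        "generation_token_ids": response_message.get("generation_token_ids"),
        "generation_log_probs": response_message.get("generation_log_probs"),
    }
    return [*messages, next_message]
```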

Reference