An environment consists of three components: Agents, Models, and Resources. Most contributors will create new resources servers while using existing agent server and model server implementations. If you need to create custom agents or models, you can reference the implementations in responses_api_agents/ and responses_api_models/.
This guide focuses on the resources server contribution process.
For a guide to building your first resources server, refer to Single Step Environment.
Adding a new training environment is similar to Adding A Benchmark with respect to the reward profiling and software correctness aspects. However, for adding a new training environment, you must additionally run training with your environment in isolation in order to determine the environment functionality and correctness.
The typical training flow that is expected is the GRPO RL algorithm using NeMo RL, with 64 prompts per step and 16 rollouts per prompt. This is a suggestion that can vary depending on the environment capability itself (for example, it can require higher rollouts per prompt) and available compute. We just need to design an experimental setup that will help us determine the causation relationship of the data generated by the training environment on improvements on targeted model capability.
Specifically, you should run training using a model that achieves meaningful performance on your reward profiling. And similar to the benchmark best practices, you should train using an instruct and thinking model to ensure infra correctness. The models can have different performances, but there should be no notable or unexplainable difference between the two types of models. The benchmark best practices includes a list of recommended models.
When making a training environment PR, include W&B links and screenshots of the train and validation curves.
Your resources server must include these files:
Contributing a resources server follows this sequence:
Prepare the dataset for your environment:
data/example.jsonl with at least 5 representative task examplesBuild your resources server:
ng_init_resources_server +entrypoint=resources_servers/my_server to scaffold the new resources serververify() functiondomain field in your resources server configuration (see Domain).README.md with licensing informationWrite and run tests for your resources server:
Verify basic functionality and generate example rollouts:
ng_run +entrypoint=resources_servers/my_serverdata/example_rollouts.jsonl to demonstrate correct reward signalsRun inference to validate reward distribution:
Validate with actual training:
Include the following in your pull request description:
After submitting your PR:
For optimal performance and scalability, we recommend following these design patterns:
Endpoint handlers should be asynchronous to handle concurrent requests efficiently during training:
Avoid spawning additional threads or processes unless necessary. A single Gym instance can handle tens of thousands of concurrent requests when properly implemented.
We recommend using the NeMo Gym OpenAI client from nemo_gym.openai_utils:
The NeMo Gym client is optimized for scale and provides consistent behavior. External clients like LiteLLM often preprocess or postprocess inputs and outputs in ways that can interfere with training data collection.
Consider using Pydantic models for request and response validation by extending base classes from nemo_gym.base_resources_server:
Tool execution errors should be propagated back to the model rather than crashing the server, enabling the model to learn from mistakes:
Pass configuration through NeMo Gym config files rather than environment variables for better reproducibility:
For multi-step scenarios, the model returns training information on response messages (prompt_token_ids, generation_token_ids, generation_log_probs). When constructing messages for subsequent model calls, propagate this information from previous responses to maintain the training data chain.