Environments for GRPO Training#
GRPO training in NeMo RL supports multiple environments, each exposing a standard interface for reward computation and evaluation.
Math Environment#
The Math Environment is designed for mathematical reasoning tasks. It evaluates responses to math problems using math-verify and provides rewards based on correctness.
Key Features#
Evaluates mathematical reasoning
Supports multiple mathematical domains
Provides detailed feedback on solution correctness
Usage#
from nemo_rl.environments.math_environment import MathEnvironment
env_config = {
    "num_workers": 2,
}
math_env = MathEnvironment.remote(env_config)
Multi-reward support#
To enable GDPO support, the math environment needs to return each reward component separately:
rewards = torch.tensor(results).T.cpu()  # Shape: (batch_size, num_reward_components)
return EnvironmentReturn(
    observations=observations,
    metadata=metadata,
    next_stop_strings=next_stop_strings,
    rewards=rewards,
    terminateds=done,
    answers=extracted_answers,
)
As a result, the batch returned by run_multi_turn_rollout in rollouts.py has extra entries that store each reward component separately:
# Add total rewards to the final batch
current_batch["total_reward"] = total_rewards
current_batch["truncated"] = sample_truncated
# Expose per-component rewards (reward1, reward2, ...) for multi-reward envs only; GRPO uses total_reward
if multi_rewards is not None:
    num_reward_components = multi_rewards.shape[1]
    for i in range(num_reward_components):
        current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()
Multi-reward support (GDPO)#
Environments can expose a single reward (standard GRPO) or multiple reward components (for GDPO).
Single-reward: Your env’s step returns rewards with shape (batch_size,). The rollout stores only total_reward. Use grpo.adv_estimator.name: "grpo" (default).
Multi-reward: Your env’s step returns rewards with shape (batch_size, num_components) (e.g. one column per objective). The rollout stores total_reward (sum across components) and per-component keys reward1, reward2, … so GDPO can compute per-component baselines and combine advantages.
Returning multi-reward from the environment
Return a 2D tensor of shape (batch_size, num_reward_components):
# rewards: shape (batch_size, num_reward_components), e.g. (N, 3) for three objectives
rewards = torch.tensor(results).T.cpu()
return EnvironmentReturn(
    observations=observations,
    metadata=metadata,
    next_stop_strings=next_stop_strings,
    rewards=rewards,
    terminateds=done,
    answers=extracted_answers,
)
How the rollout uses it
When the environment returns 2D rewards, run_multi_turn_rollout in rollouts.py keeps total_reward and also exposes each component as reward1, reward2, … in the batch. Single-reward envs do not get reward1 keys; only total_reward is stored:
# Add total rewards to the final batch
current_batch["total_reward"] = total_rewards
current_batch["truncated"] = sample_truncated
# Expose per-component rewards (reward1, reward2, ...) for multi-reward envs only; GRPO uses total_reward
if multi_rewards is not None:
    num_reward_components = multi_rewards.shape[1]
    for i in range(num_reward_components):
        current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()
For instance, when running examples/configs/gdpo_math_1B.yaml, reward1 maps to correctness_reward, reward2 to int_reward, and reward3 to format_reward. More details can be found in HFMultiRewardVerifyWorker. Users can also implement their own environments that support multi-reward GDPO training by following this example.
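To make the shapes concrete, here is a minimal pure-Python sketch of the same bookkeeping, with nested lists standing in for tensors. The helper name split_rewards and the sample values are hypothetical and are not part of NeMo RL; the sketch only mirrors the transpose, the sum across components, and the reward1, reward2, … keys described above.

```python
def split_rewards(results):
    """results: one list per reward component, each of length batch_size."""
    # Transpose (num_components, batch_size) -> (batch_size, num_components),
    # mirroring torch.tensor(results).T in the environment.
    per_sample = [list(row) for row in zip(*results)]
    # total_reward is the sum across components for each sample.
    batch = {"total_reward": [sum(row) for row in per_sample]}
    num_components = len(results)
    for i in range(num_components):
        # Component columns are exposed under 1-based keys: reward1, reward2, ...
        batch[f"reward{i + 1}"] = [row[i] for row in per_sample]
    return batch


# Three components (e.g. correctness, integer, format) for a batch of two samples.
example = split_rewards([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(example)
```

Each rewardN list has one entry per sample, so a per-component baseline can be computed over one list without touching the others.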
Code Environment#
The Code Environment is designed for code generation and execution tasks. It provides a sandboxed environment for executing Python code and evaluating the results.
Usage#
from nemo_rl.environments.code_environment import CodeEnvironment
env_config = {
    "num_workers": 2,
    "terminate_on_evaluation": True,  # Terminate after code execution
}
code_env = CodeEnvironment.remote(env_config)
Configuration#
num_workers: Number of parallel workers for code execution.
terminate_on_evaluation: Whether to terminate after code execution (True for single-turn, False for multi-turn).
We are tracking an end-to-end example of this environment in #858. Add a 👍 to show your interest.
Code Jaccard Environment#
The Code Jaccard Environment evaluates code (or text) responses by measuring Jaccard-based similarity against ground-truth answers. This is a lightweight, text-similarity reward useful when an execution sandbox is unnecessary or unavailable.
How It Works#
Extracts the assistant’s response text from each conversation.
Computes a Jaccard similarity score between the response and ground truth:
Tokenizes both texts by whitespace, computes intersection/union, then applies a length ratio penalty.
Scores are in [0, 1]. Observations label responses as “aligned/misaligned” using a 0.5 threshold.
Returns:
observations: Environment feedback strings.
rewards: Tensor of similarity scores.
terminateds: All ones (single-step episodes).
answers: The response text when requested (optional).
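The scoring steps above can be sketched in a few lines of plain Python. This is an illustrative approximation, not the environment's actual implementation: the function names are hypothetical, tokens are compared as sets, the length ratio penalty is assumed to be the shorter token count divided by the longer one, and a score exactly at the threshold is assumed to count as aligned.

```python
def jaccard_reward(response: str, ground_truth: str) -> float:
    """Whitespace-token Jaccard similarity scaled by an assumed length ratio penalty."""
    resp_tokens = response.split()
    truth_tokens = ground_truth.split()
    if not resp_tokens or not truth_tokens:
        return 0.0
    resp_set, truth_set = set(resp_tokens), set(truth_tokens)
    jaccard = len(resp_set & truth_set) / len(resp_set | truth_set)
    # Assumption: penalize length mismatch with min/max of the token counts.
    shorter = min(len(resp_tokens), len(truth_tokens))
    longer = max(len(resp_tokens), len(truth_tokens))
    return jaccard * (shorter / longer)


def label(score: float, threshold: float = 0.5) -> str:
    """Map a score to the aligned/misaligned label using the 0.5 threshold."""
    # Assumption: a score exactly at the threshold is treated as aligned.
    return "aligned" if score >= threshold else "misaligned"


score = jaccard_reward("return a + b", "return a + b")
print(score, label(score))
```

Because both factors lie in [0, 1], their product also stays in [0, 1], which matches the score range stated above.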
Usage#
from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvironment
env_config = {
    "num_workers": 2,
    # Optional default stop strings (unused in scoring but available for consistency)
    "stop_strings": None,
}
code_jaccard_env = CodeJaccardEnvironment.remote(env_config)
Configuration#
num_workers (int): Number of parallel verification workers.
stop_strings (list[str] | None): Optional default stop strings (propagated downstream; not required for scoring).
Sample GRPO Config#
env:
  code_jaccard:
    num_workers: 2
    stop_strings: null
data:
  env_name: code_jaccard
Reward Model Environment#
The Reward Model Environment uses pre-trained reward models to score conversation quality.
Usage#
from nemo_rl.environments.reward_model_environment import RewardModelEnvironment
env_config = {
    "enabled": True,
    "model_name": "Skywork/Skywork-Reward-V2-Qwen3-0.6B",
    "tokenizer": {"name": "Skywork/Skywork-Reward-V2-Qwen3-0.6B"},
    "precision": "bfloat16",
    "batch_size": 32,
    "resources": {"gpus_per_node": 1, "num_nodes": 1},
    "reward_model_cfg": {
        "enabled": True,
        "reward_model_type": "bradley_terry",
    },
}
reward_env = RewardModelEnvironment.remote(env_config)
Resource Allocation in GRPO Training#
In GRPO training, resources are allocated across three main components:
Policy Actor: The trained model.
Generation Actor: Used for generating responses during rollouts (can be colocated with policy or on separate nodes/GPUs).
Reward Model Environment Actor: Evaluates generated responses and computes rewards.
The resource allocation logic works as follows:
Single-Node Setup (num_nodes: 1)#
All components share the same node
GPUs are divided among policy training, generation, and the reward model
Example:
Policy and generation colocated: 8 GPUs total = 4 for colocated policy and generation + 4 for reward model
Policy and generation non-colocated: 8 GPUs total = 2 for policy + 2 for generation + 4 for reward model
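The arithmetic in the two examples can be written out as a small helper. This is only a sketch of the bookkeeping: split_single_node_gpus and its parameters are hypothetical names, and NeMo RL derives these numbers from the config rather than from a function like this.

```python
def split_single_node_gpus(total_gpus, reward_model_gpus, generation_gpus=None):
    """Return a GPU count per component on one node.

    generation_gpus=None means policy and generation are colocated and share GPUs.
    """
    remaining = total_gpus - reward_model_gpus
    if remaining <= 0:
        raise ValueError("reward model leaves no GPUs for policy training")
    if generation_gpus is None:
        # Colocated: policy and generation share everything the reward model does not use.
        return {"policy_and_generation": remaining, "reward_model": reward_model_gpus}
    policy_gpus = remaining - generation_gpus
    if policy_gpus <= 0:
        raise ValueError("generation leaves no GPUs for policy training")
    return {
        "policy": policy_gpus,
        "generation": generation_gpus,
        "reward_model": reward_model_gpus,
    }


colocated = split_single_node_gpus(8, 4)
non_colocated = split_single_node_gpus(8, 4, generation_gpus=2)
print(colocated)
print(non_colocated)
```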
Multi-Node Setup (num_nodes > 1)#
Policy training, generation, and reward model environment can be distributed across different nodes.
Reward model gets dedicated resources as specified in env.reward_model.resources.
Generation gets dedicated resources as specified in policy.generation.colocated.resources.
Remaining nodes are allocated to policy training.
In the future, the resource control part will be refactored to enable fine-grained resource configuration for each actor. For detailed resource management and optimization strategies, see #1100.
Complete GRPO Training with Reward Model Environments#
See examples/run_grpo.py with examples/configs/grpo_rm_1B.yaml for a complete example of using the reward model environment with GRPO training.
uv run examples/run_grpo.py --config examples/configs/grpo_rm_1B.yaml
Registering Custom Environments#
NeMo RL provides a flexible environment registration mechanism that allows you to add custom environments without modifying the source code.
Using the register_env Interface#
Call the register_env function from your own code to register a new environment at runtime.
Function Signature
from nemo_rl.environments.utils import register_env
register_env(env_name: str, actor_class_fqn: str) -> None
Parameters:
env_name: Unique identifier name for the environment (string)
actor_class_fqn: Fully Qualified Name of the environment Actor class, in the format 'module.path.ClassName'
Example: Registering a Custom Environment#
Suppose you’ve created a custom reinforcement learning environment for code generation tasks:
1. Create Your Custom Environment Actor Class
# File: my_custom_envs/code_gen_env.py
import ray
from nemo_rl.environments.interfaces import EnvironmentInterface


@ray.remote
class CodeGenEnvironmentActor(EnvironmentInterface):
    """Custom code generation environment."""

    def __init__(self, config):
        self.config = config
        # Initialize your environment

    async def reset(self):
        # Reset environment logic
        initial_state = {}  # Placeholder: replace with your initial state
        return initial_state

    async def step(self, action):
        # Execute action, return reward, etc.
        observation, reward, done, info = None, 0.0, True, {}  # Placeholders
        return observation, reward, done, info

    # Implement other required interface methods...
2. Register the Environment in Your Training Script
# File: train.py
from nemo_rl.environments.utils import register_env
# Register your custom environment
register_env(
    env_name="code_gen",
    actor_class_fqn="my_custom_envs.code_gen_env.CodeGenEnvironmentActor",
)
# Now you can use "code_gen" in your config
# Training code...
3. Use the Registered Environment in Your Config
# config.yaml
env:
  code_gen:
    num_workers: 2
    max_code_length: 512
    test_cases_per_problem: 5
data:
  env_name: code_gen  # Use your registered environment name