Environments for GRPO Training#

NeMo RL provides multiple environments for GRPO training, each offering a standard interface for reward computation and evaluation.

Math Environment#

The Math Environment is designed for mathematical reasoning tasks. It evaluates responses to math problems using math-verify and provides rewards based on correctness.

Key Features#

  • Evaluates mathematical reasoning

  • Supports multiple mathematical domains

  • Provides detailed feedback on solution correctness

Usage#

from nemo_rl.environments.math_environment import MathEnvironment

env_config = {
    "num_workers": 2,
}

math_env = MathEnvironment.remote(env_config)

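In GRPO training runs, the math environment is usually selected through the YAML config rather than constructed directly. A minimal sketch, mirroring the Code Jaccard sample config further below and assuming the environment is registered under the name math:

env:
  math:
    num_workers: 2
data:
  env_name: math
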
Multi-reward support (GDPO)#

Environments can expose a single reward (standard GRPO) or multiple reward components (for GDPO).

  • Single-reward: Your env’s step returns rewards with shape (batch_size,). The rollout stores only total_reward. Use grpo.adv_estimator.name: "grpo" (default).

  • Multi-reward: Your env’s step returns rewards with shape (batch_size, num_components) (e.g. one column per objective). The rollout stores total_reward (sum across components) and per-component keys reward1, reward2, … so GDPO can compute per-component baselines and combine advantages. A minimal sketch of the two cases follows this list.

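The snippet below is an illustrative sketch (not code from the repository); it only mirrors the reward shapes and batch keys described above:

import torch

batch_size, num_components = 4, 3

# Single-reward env: one scalar reward per sample.
single_rewards = torch.rand(batch_size)                  # shape (batch_size,)

# Multi-reward env: one column per objective.
multi_rewards = torch.rand(batch_size, num_components)   # shape (batch_size, num_components)

# The rollout keeps the summed reward plus each component under reward1, reward2, ...
batch = {"total_reward": multi_rewards.sum(dim=1)}       # shape (batch_size,)
for i in range(num_components):
    batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()
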
Returning multi-reward from the environment

Return a 2D tensor of shape (batch_size, num_reward_components):

# rewards: shape (batch_size, num_reward_components), e.g. (N, 3) for three objectives
rewards = torch.tensor(results).T.cpu()

return EnvironmentReturn(
    observations=observations,
    metadata=metadata,
    next_stop_strings=next_stop_strings,
    rewards=rewards,
    terminateds=done,
    answers=extracted_answers,
)

How the rollout uses it

When the environment returns 2D rewards, run_multi_turn_rollout in rollouts.py keeps total_reward and also exposes each component as reward1, reward2, … in the batch. Single-reward envs do not get reward1 keys; only total_reward is stored:

# Add total rewards to the final batch
current_batch["total_reward"] = total_rewards
current_batch["truncated"] = sample_truncated
# Expose per-component rewards (reward1, reward2, ...) for multi-reward envs only; GRPO uses total_reward
if multi_rewards is not None:
    num_reward_components = multi_rewards.shape[1]
    for i in range(num_reward_components):
        current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()

For instance, when running examples/configs/gdpo_math_1B.yaml, reward1 maps to correctness_reward, reward2 to int_reward, and reward3 to format_reward. More details can be found in HFMultiRewardVerifyWorker. Users can also implement their own environments that support multi-reward GDPO training by following this example.
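
For example, after a rollout with that config, the per-component rewards can be read back out of the batch as sketched below (a hedged illustration; component_names is a hypothetical helper, and the key order follows the mapping above):

# reward1 = correctness_reward, reward2 = int_reward, reward3 = format_reward
component_names = ["correctness_reward", "int_reward", "format_reward"]
per_component = {
    name: current_batch[f"reward{i + 1}"] for i, name in enumerate(component_names)
}
print({name: values.mean().item() for name, values in per_component.items()})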

Code Environment#

The Code Environment is designed for code generation and execution tasks. It provides a sandboxed environment for executing Python code and evaluating the results.

Usage#

from nemo_rl.environments.code_environment import CodeEnvironment

env_config = {
    "num_workers": 2,
    "terminate_on_evaluation": True,  # Terminate after code execution
}

code_env = CodeEnvironment.remote(env_config)

Configuration#

  • num_workers: Number of parallel workers for code execution

  • terminate_on_evaluation: Whether to terminate after code execution (True for single-turn, False for multi-turn). A sample config fragment is shown below.

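Analogous to the Code Jaccard sample config below, the code environment can also be selected from the GRPO YAML config. This is a sketch only; the section name code under env and the matching data.env_name value are assumptions, so check the environment name registered in your setup:

env:
  code:
    num_workers: 2
    terminate_on_evaluation: true
data:
  env_name: code
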
We are tracking an end-to-end example of this environment in #858. Add a 👍 to show your interest.

Code Jaccard Environment#

The Code Jaccard Environment evaluates code (or text) responses by measuring Jaccard-based similarity against ground-truth answers. This is a lightweight, text-similarity reward useful when an execution sandbox is unnecessary or unavailable.

How It Works#

  • Extracts the assistant’s response text from each conversation.

  • Computes a Jaccard similarity score between the response and ground truth:

    • Tokenizes both texts by whitespace, computes intersection/union, then applies a length ratio penalty (a sketch of this scoring follows the list).

    • Scores are in [0, 1]. Observations label responses as “aligned/misaligned” using a 0.5 threshold.

  • Returns:

    • observations: Environment feedback strings.

    • rewards: Tensor of similarity scores.

    • terminateds: All ones (single-step episodes).

    • answers: The response text when requested (optional).

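The snippet below is an illustrative sketch of the scoring described above, not the environment’s exact implementation; in particular, the exact form of the length ratio penalty is an assumption:

def jaccard_score(response: str, ground_truth: str) -> float:
    """Whitespace-token Jaccard similarity with an (assumed) length-ratio penalty."""
    resp_tokens, gt_tokens = set(response.split()), set(ground_truth.split())
    if not resp_tokens or not gt_tokens:
        return 0.0
    jaccard = len(resp_tokens & gt_tokens) / len(resp_tokens | gt_tokens)
    # Assumed penalty: down-weight responses much shorter or longer than the ground truth.
    length_penalty = min(len(response), len(ground_truth)) / max(len(response), len(ground_truth))
    return jaccard * length_penalty  # in [0, 1]; scores >= 0.5 are labeled "aligned"
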
Usage#

from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvironment

env_config = {
    "num_workers": 2,
    # Optional default stop strings (unused in scoring but available for consistency)
    "stop_strings": None,
}

code_jaccard_env = CodeJaccardEnvironment.remote(env_config)

Configuration#

  • num_workers (int): Number of parallel verification workers.

  • stop_strings (list[str] | None): Optional default stop strings (propagated downstream; not required for scoring).

Sample GRPO Config#

env:
  code_jaccard:
    num_workers: 2
    stop_strings: null
data:
  env_name: code_jaccard

Reward Model Environment#

The Reward Model Environment uses pre-trained reward models to score conversation quality.

Usage#

from nemo_rl.environments.reward_model_environment import RewardModelEnvironment

env_config = {
    "enabled": True,
    "model_name": "Skywork/Skywork-Reward-V2-Qwen3-0.6B",
    "tokenizer": {"name": "Skywork/Skywork-Reward-V2-Qwen3-0.6B"},
    "precision": "bfloat16",
    "batch_size": 32,
    "resources": {"gpus_per_node": 1, "num_nodes": 1},
    "reward_model_cfg": {
        "enabled": True,
        "reward_model_type": "bradley_terry",
    },
}

reward_env = RewardModelEnvironment.remote(env_config)

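The same settings can also be expressed in a GRPO YAML config. A sketch, assuming the keys sit under env.reward_model (the path referenced in the resource-allocation section below); see examples/configs/grpo_rm_1B.yaml for the authoritative layout:

env:
  reward_model:
    enabled: true
    model_name: "Skywork/Skywork-Reward-V2-Qwen3-0.6B"
    tokenizer:
      name: "Skywork/Skywork-Reward-V2-Qwen3-0.6B"
    precision: "bfloat16"
    batch_size: 32
    resources:
      gpus_per_node: 1
      num_nodes: 1
    reward_model_cfg:
      enabled: true
      reward_model_type: "bradley_terry"
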
Resource Allocation in GRPO Training#

In GRPO training, resources are allocated across three main components:

  • Policy Actor: The trained model.

  • Generation Actor: Used for generating responses during rollouts (can be colocated with policy or on separate nodes/GPUs).

  • Reward Model Environment Actor: Evaluates generated responses and computes rewards.

The resource allocation logic works as follows:

Single-Node Setup (num_nodes: 1)#

  • All components share the same node

  • GPUs are divided between policy training, generation, and reward model

  • Example:

    1. Policy and generation colocated: 8 GPUs total = 4 for colocated policy and generation + 4 for reward model

    2. Policy and generation non-colocated: 8 GPUs total = 2 for policy + 2 for generation + 4 for reward model (a config sketch for this split follows)

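A sketch of how the non-colocated single-node split in example 2 could be expressed. Only env.reward_model.resources and policy.generation.colocated.resources are taken from this section; the surrounding keys (cluster, colocated.enabled) are assumptions:

cluster:
  num_nodes: 1
  gpus_per_node: 8
policy:
  generation:
    colocated:
      enabled: false
      resources:
        num_nodes: 1
        gpus_per_node: 2
env:
  reward_model:
    resources:
      num_nodes: 1
      gpus_per_node: 4
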
Multi-Node Setup (num_nodes > 1)#

  • Policy training, generation, and reward model environment can be distributed across different nodes.

  • Reward model gets dedicated resources as specified in env.reward_model.resources.

  • Generation gets dedicated resources as specified in policy.generation.colocated.resources.

  • Remaining nodes are allocated to policy training.

In the future, the resource-allocation logic will be refactored to enable fine-grained resource configuration for each actor. For detailed resource management and optimization strategies, see #1100.

Complete GRPO Training with Reward Model Environments#

See examples/run_grpo.py with examples/configs/grpo_rm_1B.yaml for a complete example of using the reward model environment with GRPO training.

uv run examples/run_grpo.py --config examples/configs/grpo_rm_1B.yaml

Registering Custom Environments#

NeMo RL provides a flexible environment registration mechanism that allows you to add custom environments without modifying the source code.

Using the register_env Interface#

You can use the register_env function to dynamically register new environments without modifying NeMo RL’s internal code.

Function Signature

from nemo_rl.environments.utils import register_env

register_env(env_name: str, actor_class_fqn: str) -> None

Parameters:

  • env_name: Unique identifier name for the environment (string)

  • actor_class_fqn: Fully Qualified Name of the environment Actor class, in the format 'module.path.ClassName'

Example: Registering a Custom Environment#

Suppose you’ve created a custom reinforcement learning environment for code generation tasks:

1. Create Your Custom Environment Actor Class

# File: my_custom_envs/code_gen_env.py
import ray
import torch

from nemo_rl.environments.interfaces import EnvironmentInterface, EnvironmentReturn


@ray.remote
class CodeGenEnvironmentActor(EnvironmentInterface):
    """Custom code generation environment."""

    def __init__(self, config):
        self.config = config
        # Initialize your environment (sandbox, test cases, ...)

    def step(self, message_log_batch, metadata):
        # Score each response in the batch and return one reward per sample
        # (or a (batch_size, num_components) tensor for multi-reward GDPO training).
        batch_size = len(message_log_batch)
        rewards = torch.zeros(batch_size)
        observations = [
            {"role": "environment", "content": "Environment feedback"}
            for _ in range(batch_size)
        ]
        return EnvironmentReturn(
            observations=observations,
            metadata=metadata,
            next_stop_strings=[None] * batch_size,
            rewards=rewards,
            terminateds=torch.ones(batch_size, dtype=torch.bool),
        )

    # Implement other required interface methods...

2. Register the Environment in Your Training Script

# File: train.py
from nemo_rl.environments.utils import register_env

# Register your custom environment
register_env(
    env_name="code_gen",
    actor_class_fqn="my_custom_envs.code_gen_env.CodeGenEnvironmentActor"
)

# Now you can use "code_gen" in your config
# Training code...

3. Use the Registered Environment in Your Config

# config.yaml
env:
  code_gen:
    num_workers: 2
    max_code_length: 512
    test_cases_per_problem: 5

data:
  env_name: code_gen  # Use your registered environment name