nemo_microservices.types.customization.grpo_parameters_param#

Module Contents#

Classes#

GrpoParametersParam

Data#

Environment

API#

nemo_microservices.types.customization.grpo_parameters_param.Environment: typing_extensions.TypeAlias#

None

class nemo_microservices.types.customization.grpo_parameters_param.GrpoParametersParam#

Bases: typing_extensions.TypedDict

environment: typing_extensions.Required[nemo_microservices.types.customization.grpo_parameters_param.Environment]#

None

Task-specific environment configuration defining the training context, including the dataset specification and reward function.
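A minimal construction sketch (illustrative, not a verified configuration): every key below is a field documented on this page, but the schema of the required environment value is defined by the Environment type alias, which is not expanded here, so an empty placeholder stands in for it.

```python
from nemo_microservices.types.customization.grpo_parameters_param import GrpoParametersParam

# Sketch only: the Environment schema (dataset + reward function) is not shown
# on this page, so an empty placeholder dict stands in for the real value.
params: GrpoParametersParam = {
    "environment": {},               # required; fill in per the Environment type
    "num_prompts_per_step": 16,
    "num_generations_per_prompt": 8,
    "generation_temperature": 1.0,
    "generation_top_p": 0.9,
    "ratio_clip_min": 0.2,
    "ratio_clip_max": 0.2,
    "ref_policy_kl_penalty": 0.01,
}
```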

generation_batch_size: int#

None

Batch size for generation during rollouts.

Controls how many sequences are generated in parallel.

generation_pipeline_parallel_size: int#

None

Number of GPUs to use for pipeline parallelism during generation, splitting model layers across devices (inter-layer parallelism).

generation_temperature: float#

None

Sampling temperature for generation.

Higher values (e.g., 1.0) increase randomness; lower values (e.g., 0.1) make output more deterministic. A temperature of 0 is equivalent to greedy sampling (always selecting the most likely token).

generation_top_k: int#

None

Top-k sampling parameter.

Only the k most likely tokens are considered at each step. None means no top-k filtering is applied. Typically used with values like 50 to balance diversity and quality.

generation_top_p: float#

None

Nucleus sampling parameter (top-p).

Sampling is restricted to the smallest set of most likely tokens whose cumulative probability reaches top_p. 1.0 means no filtering; lower values (e.g., 0.9) improve quality by filtering out unlikely tokens.
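The three generation knobs above compose in the usual order: temperature rescales the distribution, top-k truncates it to the k most likely tokens, and top-p keeps the smallest prefix of that ranking whose cumulative probability reaches the threshold. A toy sketch of that interaction (illustrative only; the actual sampling happens in the inference backend):

```python
import math

# Toy next-token distribution; not the service's implementation.
logits = {"the": 2.0, "a": 1.5, "cat": 0.5, "zzz": -3.0}
temperature, top_k, top_p = 0.7, 3, 0.9

# Temperature: divide logits before softmax; lower values sharpen the distribution.
scaled = {tok: l / temperature for tok, l in logits.items()}
z = sum(math.exp(v) for v in scaled.values())
probs = {tok: math.exp(v) / z for tok, v in scaled.items()}

# Top-k: keep only the k most likely tokens.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
kept, cumulative = [], 0.0
for token, p in ranked:
    kept.append(token)
    cumulative += p
    if cumulative >= top_p:
        break
print(kept)  # candidates remaining after all three filters
```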

logprob_chunk_size: int#

None

Chunk size for processing logprobs in distributed settings.

Larger values improve efficiency but require more memory. Used for chunked distributed operations during loss computation.

max_grad_norm: float#

None

Maximum gradient norm for gradient clipping during training.

Prevents exploding gradients by scaling down gradients that exceed this threshold. Lower this value (e.g., 0.5) if you observe training instability, NaN losses, or erratic loss spikes. Increase it (e.g., 5.0) if training seems overly conservative or progress is too slow. Typical values range from 0.5 to 5.0.

normalize_rewards: bool#

None

Normalize advantages by dividing by their standard deviation across responses to each prompt. Default is True for improved training stability and consistent gradient magnitudes regardless of reward scale variations. This prevents prompts with high reward variance from dominating updates. Disable (False) only if: (1) rewards are already well-scaled and consistent, (2) you want to preserve reward magnitude information where higher-value tasks should have stronger learning signals, or (3) using very few generations per prompt (<4) where standard deviation estimates are noisy. Recommended: keep enabled for most use cases.
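A sketch of the per-prompt normalization described above, assuming it is applied to the advantages of all responses generated for a single prompt (the exact placement relative to baseline subtraction is an implementation detail of the training service):

```python
import statistics

def normalized_advantages(rewards: list[float]) -> list[float]:
    """rewards: rewards of every response generated for one prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

print(normalized_advantages([0.0, 1.0, 1.0, 0.0]))  # [-1.0, 1.0, 1.0, -1.0]
```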

num_generations_per_prompt: int#

None

Number of responses to generate for each prompt.

Used to compute the advantage baseline by comparing multiple responses to the same prompt. Higher values (e.g., 4-8) provide better advantage estimates but increase computational cost. Typical range: 4-16.

num_prompts_per_step: int#

None

Number of unique prompts to process per training step.

This controls the batch size for sampling prompts from the dataset. Total samples per step = num_prompts_per_step * num_generations_per_prompt. Increase for better gradient estimates and training stability (at the cost of memory). Typical values: 8-64 depending on available GPU memory.
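For example, with 16 prompts per step and 8 generations per prompt, each step processes 128 rollout samples:

```python
num_prompts_per_step = 16
num_generations_per_prompt = 8

# Total rollout samples processed per training step.
total_samples_per_step = num_prompts_per_step * num_generations_per_prompt
print(total_samples_per_step)  # 128
```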

overlong_filtering: bool#

None

Exclude truncated sequences (those that hit max_total_sequence_length without producing end-of-text) from loss computation. Truncated samples still contribute to advantage baseline calculations but don’t receive gradient updates. Enable (True) for long-form tasks like mathematical proofs or extended reasoning where correct answers may legitimately exceed length limits and shouldn’t be penalized for incompleteness. Default is False to maintain standard GRPO behavior where the model learns to complete responses within sequence limits, which is appropriate for most tasks and production systems with length constraints.

ratio_clip_c: float#

None

Dual-clipping parameter that adds extra protection against large policy updates when rewards are negative. Must be greater than 1 (typically 3). Set to None to disable. This helps prevent the policy from changing too aggressively on poor-performing samples.

ratio_clip_max: float#

None

Upper bound for clipping the policy update ratio in GRPO loss.

Limits how much the policy can change per update, preventing instability. Standard value: 0.2 (clips to [0.8, 1.2]). Usually set equal to ratio_clip_min (symmetric clipping), but can differ for asymmetric clipping strategies where you want to limit increases differently than decreases.

ratio_clip_min: float#

None

Lower bound for clipping the policy update ratio in GRPO loss.

Limits how much the policy can change per update, preventing instability. The policy ratio is clipped to stay within [1-epsilon, 1+epsilon]. Standard value: 0.2 (clips to [0.8, 1.2]). Lower values (e.g., 0.1) make training more conservative; higher values (e.g., 0.3) allow larger updates. Typically set equal to ratio_clip_max for symmetric clipping.
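A conceptual sketch of how the three clip parameters (ratio_clip_min, ratio_clip_max, ratio_clip_c) enter the surrogate objective for a single sample, following the standard clipped-ratio form with dual clipping applied only when the advantage is negative; the real loss is computed by the training service:

```python
def clipped_objective(ratio: float, advantage: float,
                      clip_min: float = 0.2, clip_max: float = 0.2,
                      clip_c: float | None = 3.0) -> float:
    # ratio = pi_new(token) / pi_old(token) for the sampled token.
    clipped = min(max(ratio, 1.0 - clip_min), 1.0 + clip_max)
    surrogate = min(ratio * advantage, clipped * advantage)
    if clip_c is not None and advantage < 0:
        # Dual clipping (ratio_clip_c): bound the objective on poor samples.
        surrogate = max(surrogate, clip_c * advantage)
    return surrogate

print(clipped_objective(1.5, 1.0))   # 1.2  -> upper clip bound limits the update
print(clipped_objective(5.0, -1.0))  # -3.0 -> dual clip (ratio_clip_c) caps the penalty
```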

ref_policy_kl_penalty: float#

None

KL divergence penalty coefficient (β) that controls how strongly the trained policy is penalized for deviating from the reference policy. Higher values (e.g., 0.05-0.1) encourage the policy to stay closer to the reference (more conservative learning), while lower values (e.g., 0.001-0.01) allow more freedom to explore user-preferred behavior. Typical range: 0.001-0.1. Also known as ‘beta’ in the original GRPO paper and ‘kl_penalty_coefficient’ in some implementations.
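In the standard GRPO formulation the coefficient scales an added KL term, roughly loss = -(clipped surrogate) + β · KL(π‖π_ref) per token; the exact estimator and where the term is applied are implementation details of the training service. A purely arithmetic illustration:

```python
beta = 0.01               # ref_policy_kl_penalty
surrogate = 1.2           # e.g. the clipped objective from the sketch above
kl_to_reference = 0.8     # illustrative per-token KL estimate

loss = -surrogate + beta * kl_to_reference
print(loss)  # -1.192
```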

token_level_loss: bool#

None

Whether to compute loss at token level (True) or sequence level (False).

Token-level loss averages over all tokens in the batch; sequence-level loss averages each sequence's loss first, then averages across sequences. Sequence-level loss is used for GSPO-style training.
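A small sketch contrasting the two averaging schemes over per-token losses for a batch of two sequences of different lengths:

```python
batch = [
    [0.2, 0.4, 0.6],  # per-token losses, sequence 1
    [0.1, 0.3],       # per-token losses, sequence 2
]

# token_level_loss=True: average over every token in the batch.
token_level = sum(sum(seq) for seq in batch) / sum(len(seq) for seq in batch)

# token_level_loss=False (GSPO-style): average within each sequence, then across sequences.
sequence_level = sum(sum(seq) / len(seq) for seq in batch) / len(batch)

print(token_level, sequence_level)  # ≈0.32, ≈0.3
```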

use_importance_sampling_correction: bool#

None

Correct for numerical differences between the inference backend (used for generation) and training framework (used for learning). This accounts for precision differences, backend variations, etc. that can cause the same model to produce slightly different probabilities. Recommended for async GRPO and when using FP8 inference.
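Conceptually, the correction weights each token by the ratio of the probability the training framework assigns to the sampled token to the probability the inference backend reported when sampling it. A minimal sketch (the service may additionally truncate or smooth this weight):

```python
import math

def importance_weight(train_logprob: float, inference_logprob: float) -> float:
    """Ratio pi_train(token) / pi_inference(token) from per-token logprobs."""
    return math.exp(train_logprob - inference_logprob)

print(importance_weight(-1.02, -1.00))  # ~0.98: the two backends nearly agree
```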

use_on_policy_kl_approximation: bool#

None

Use importance-weighted KL divergence estimation between current and reference policies. This provides a more accurate, always-positive estimate of how much the policy has changed by accounting for the difference between the policy used for sampling and the current policy being trained. Enable when you need precise KL tracking. Default: False for efficiency.

use_rloo: bool#

None

Use leave-one-out baseline (REINFORCE Leave-One-Out) for computing advantages. When True, each sample's baseline excludes its own reward, providing an unbiased estimate of expected reward. Default is True as it's theoretically correct and works well with typical num_generations_per_prompt values (4-8). Disable (False) for: (1) very few generations per prompt (≤3) where leave-one-out baselines become too noisy, (2) faster training by avoiding per-sample baseline computation, or (3) replicating the original GRPO paper. The tradeoff: True gives unbiased but higher variance estimates; False gives biased but lower variance, which can improve stability with small generation counts.
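A sketch of the two baseline choices for one prompt's group of rewards, computed as described above:

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    # use_rloo=True: each sample's baseline is the mean reward of the *other* samples.
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

def group_mean_advantages(rewards: list[float]) -> list[float]:
    # use_rloo=False: every sample shares the plain group-mean baseline.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

rewards = [1.0, 0.0, 0.0, 0.0]
print(rloo_advantages(rewards))        # [1.0, -0.33..., -0.33..., -0.33...]
print(group_mean_advantages(rewards))  # [0.75, -0.25, -0.25, -0.25]
```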