Router Replay#
Router Replay, or R3, records MoE router choices made during rollout generation and replays those choices in Megatron forward passes. This keeps each generated token’s expert assignment consistent across rollout, logprob, and training stages. Without replay, two valid router implementations can choose different experts for the same token, which introduces train-vs-rollout logprob mismatch that is unrelated to the policy update.
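The mismatch described above can be illustrated with a minimal sketch. The logits below are hypothetical, and `top_k_experts` is an illustrative helper rather than the vLLM or Megatron router; it only shows how a tiny numerical difference between two implementations can flip a near-tied top-k choice, and how replaying the rollout choice removes the disagreement.

```python
def top_k_experts(logits, k):
    """Return the indices of the k largest router logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


# Hypothetical router logits for one token over four experts.
# Experts 1 and 2 are nearly tied.
rollout_logits = [0.10, 1.2000, 1.2001, -0.50]
# Same token and weights, but a different kernel or reduction order
# nudges the values slightly in the training framework.
train_logits = [0.10, 1.2001, 1.2000, -0.50]

rollout_choice = top_k_experts(rollout_logits, k=1)
train_choice = top_k_experts(train_logits, k=1)

# Without replay the two stages disagree on the expert for this token.
routes_disagree = rollout_choice != train_choice

# With replay, training reuses the recorded rollout choice instead of
# recomputing top-k, so the expert assignment is identical by construction.
replayed_choice = list(rollout_choice)
```

Here `routes_disagree` is true even though both routers are valid, while `replayed_choice` always equals `rollout_choice`.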
Router Replay is disabled by default. It is not needed for dense models. In the current NeMo RL integration, Router Replay is wired and tested for Megatron MoE policy training with vLLM rollout generation. Other inference/generation backends are not wired into this path and have not been tested with Router Replay.
Configuration#
Set policy.router_replay.enabled=true in the training config:
policy:
  router_replay:
    enabled: true
When Router Replay is enabled, NeMo RL configures vLLM rollout generation to
return routed expert indices by setting enable_return_routed_experts=True in
the vLLM kwargs. The generation payload is then carried through the normal
rollout and policy data path as the routed_experts field.
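The effect of the flag on the vLLM kwargs can be sketched as follows. This is an illustration only: the nested `generation` / `vllm_kwargs` layout of the config dictionary and the function name `resolve_vllm_kwargs` are assumptions made for this example, not the NeMo RL implementation. Only the `policy.router_replay.enabled` key and the `enable_return_routed_experts` kwarg come from the text above.

```python
def resolve_vllm_kwargs(config):
    """Return vLLM kwargs with routed-expert output enabled when
    policy.router_replay.enabled is true.

    The location of "vllm_kwargs" inside the config is an assumption
    made for this sketch.
    """
    policy = config.get("policy", {})
    # Router Replay is disabled by default.
    enabled = policy.get("router_replay", {}).get("enabled", False)
    # Copy so the caller's config is not mutated.
    vllm_kwargs = dict(policy.get("generation", {}).get("vllm_kwargs", {}))
    if enabled:
        vllm_kwargs["enable_return_routed_experts"] = True
    return vllm_kwargs


# Hypothetical config equivalent to the YAML fragment above.
example_config = {"policy": {"router_replay": {"enabled": True}}}
example_kwargs = resolve_vllm_kwargs(example_config)
```

With the flag left unset, the returned kwargs are unchanged, which matches the disabled-by-default behavior.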
An example recipe is available at:
examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n8g-megatron-cp2-r3.yaml
Validation#
Router Replay validation covers two end-to-end questions:
whether rollout routes are carried through TransferQueue, packing, context parallel slicing, and Megatron replay without changing token identity;
whether matched R3-on runs reduce train-vs-rollout mismatch relative to matched R3-off controls.
Trace Debugging#
Router Replay can emit JSONL traces for a small number of training steps. This is intended for correctness debugging, not long training runs.
| Environment variable | Default | Meaning |
|---|---|---|
| NRL_R3_TRACE | | Master switch for R3 JSONL trace emission. |
| NRL_R3_TRACE_STEPS | | Number of training steps to trace. |
| NRL_R3_TRACE_SAMPLES | | Number of samples with full tensor previews. |
| NRL_R3_TRACE_DIR | | Trace output directory. |
| NRL_R3_TRACE_MICROBATCHES | | Number of microbatches to trace per stage. |
| NRL_R3_TRACE_VERIFY_FORWARD | | Verifies replayed top-k indices against the installed replay tensor during forward. |
Example:
export NRL_R3_TRACE=1
export NRL_R3_TRACE_VERIFY_FORWARD=1
export NRL_R3_TRACE_STEPS=1
export NRL_R3_TRACE_SAMPLES=1
export NRL_R3_TRACE_MICROBATCHES=1
export NRL_R3_TRACE_DIR=/path/to/run/r3_trace
After the run, validate the emitted trace:
python tools/check_r3_trace.py /path/to/run/r3_trace \
    --require-forward-verify \
    --require-cp-identity
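For a quick manual look at a trace before or after running the checker, a small reader like the one below can help. It is a sketch under two assumptions that the text above does not confirm: that trace files in the directory use a `.jsonl` extension, and that each non-empty line is a JSON object. It makes no assumption about the field names; it only counts which top-level keys appear.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_trace(trace_dir):
    """Count how often each top-level key appears across all JSONL records.

    Assumes files are named *.jsonl and each non-empty line is a JSON object.
    """
    key_counts = Counter()
    for path in sorted(Path(trace_dir).glob("*.jsonl")):
        with path.open() as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                key_counts.update(record.keys())
    return key_counts
```

Calling `summarize_trace("/path/to/run/r3_trace")` returns a `Counter` that shows which fields the emitted records actually contain.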
The checker verifies that:
rollout payload samples include both input_ids and routed_experts;
TransferQueue fetches match the rollout payload;
context-parallel slicing preserves token identity for routed experts;
Router Replay assignments are installed for prev-logprob and train stages;
forward verification reports that replayed routes match the installed tensor.
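The context-parallel identity check in the list above can be illustrated with a small sketch. It assumes the load-balanced layout commonly used for causal attention with context parallelism, where the sequence is split into `2 * cp_size` chunks and rank `r` owns chunks `r` and `2 * cp_size - 1 - r`. The function names are hypothetical and this is not the logic of `tools/check_r3_trace.py`. The point is that routed experts must be sliced with exactly the same positions as the tokens.

```python
def cp_slice_indices(seq_len, cp_size, cp_rank):
    """Token positions owned by one context-parallel rank.

    Assumes the load-balanced layout: 2 * cp_size chunks, with rank r
    owning chunk r and its mirror chunk (2 * cp_size - 1 - r).
    """
    if seq_len % (2 * cp_size) != 0:
        raise ValueError("seq_len must be divisible by 2 * cp_size")
    chunk = seq_len // (2 * cp_size)
    mirror = 2 * cp_size - 1 - cp_rank
    first = range(cp_rank * chunk, (cp_rank + 1) * chunk)
    second = range(mirror * chunk, (mirror + 1) * chunk)
    return list(first) + list(second)


def shard(values, cp_size, cp_rank):
    """Slice a per-token list with the rank's token positions."""
    return [values[i] for i in cp_slice_indices(len(values), cp_size, cp_rank)]


def routes_follow_tokens(full_routes, route_shard, cp_size, cp_rank):
    """True if a rank's route shard matches the routes of the tokens it owns."""
    return route_shard == shard(full_routes, cp_size, cp_rank)


# Hypothetical top-2 routes for an 8-token sample, one entry per token.
routes = [[t, t + 1] for t in range(8)]

correct_shard = shard(routes, cp_size=2, cp_rank=0)  # positions 0, 1, 6, 7
naive_shard = routes[0:4]                            # contiguous slice, wrong layout

ok = routes_follow_tokens(routes, correct_shard, cp_size=2, cp_rank=0)
broken = routes_follow_tokens(routes, naive_shard, cp_size=2, cp_rank=0)
```

In this example `ok` is true and `broken` is false: a contiguous slice would attach routes for tokens 2 and 3 to a rank that actually holds tokens 6 and 7.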
Effectiveness Check#
Run matched R3-off controls to check that the Router Replay integration does not regress existing packed-sequence and context-parallel Megatron training paths.
Run matched R3-on/R3-off pairs to measure whether Router Replay reduces train-vs-rollout mismatch under the intended rollout settings.
The main metrics to inspect are:
train/token_mult_prob_error
train/js_divergence_error
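To make the idea of train-vs-rollout mismatch concrete, the sketch below computes one plausible multiplicative probability error: the masked mean of `exp(|logprob_train - logprob_rollout|)` over generated tokens. The exact formulas behind the logged metrics are not given here, so this definition and the function name are assumptions for illustration, and the sketch does not cover the JS divergence metric.

```python
import math


def mult_prob_error(train_logprobs, rollout_logprobs, mask):
    """Masked mean of exp(|train - rollout|) over per-token logprobs.

    A value of 1.0 means the two stages assign identical probabilities;
    larger values mean larger mismatch. Illustrative definition only.
    """
    total = 0.0
    count = 0
    for lp_train, lp_rollout, keep in zip(train_logprobs, rollout_logprobs, mask):
        if keep:
            total += math.exp(abs(lp_train - lp_rollout))
            count += 1
    # With no valid tokens there is no mismatch to report.
    return total / count if count else 1.0


# Hypothetical per-token logprobs; the last position is padding.
rollout = [-0.50, -1.20, -0.30, 0.0]
train_matched = [-0.50, -1.20, -0.30, -9.0]
train_drifted = [-0.50, -1.70, -0.30, -9.0]
mask = [1, 1, 1, 0]

matched_error = mult_prob_error(train_matched, rollout, mask)
drifted_error = mult_prob_error(train_drifted, rollout, mask)
```

In a matched R3-on/R3-off comparison, the expectation is that a metric of this kind sits closer to its no-mismatch value with Router Replay enabled.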
Validation report: https://api.wandb.ai/links/nvidia-nemo-fw-public/lxoovk60
Other Notes#
Fallback for Missing Routes#
In rare cases, vLLM can return fewer routed-expert entries than expected for a
sample. NeMo RL represents each missing token route with a sentinel whose entries are all -1.
Megatron then uses its normal router only for those missing token routes, while
all returned vLLM routes are still replayed exactly.
The fallback is intentionally route-local: it does not disable Router Replay for the whole batch or sample.
When fallback is used, NeMo RL logs
r3/routed_experts_fallback_token_route_fraction. This metric should normally
be zero or near-zero. A nonzero value means some token routes used Megatron’s
normal router instead of replay.
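The route-local fallback and the logged fraction can be sketched together. This is a simplified illustration on plain Python lists, not the Megatron or NeMo RL code: the function names are hypothetical, and the denominator used for the fraction (all token routes passed in) is an assumption about how the metric is normalized.

```python
SENTINEL = -1  # every entry of a missing token route is set to this value


def is_missing(route):
    """A token route is missing when all of its entries are the sentinel."""
    return len(route) > 0 and all(expert == SENTINEL for expert in route)


def resolve_routes(replay_routes, router_routes):
    """Pick the route to use for each token.

    Returned vLLM routes are replayed exactly; only token routes marked
    with the all -1 sentinel fall back to the router's own choice.
    Returns (resolved_routes, fallback_fraction).
    """
    resolved = []
    fallbacks = 0
    for replayed, computed in zip(replay_routes, router_routes):
        if is_missing(replayed):
            resolved.append(list(computed))
            fallbacks += 1
        else:
            resolved.append(list(replayed))
    fraction = fallbacks / len(resolved) if resolved else 0.0
    return resolved, fraction


# Hypothetical top-2 routes for four tokens; the second route is missing.
replay = [[3, 5], [-1, -1], [0, 7], [2, 4]]
router = [[1, 5], [6, 2], [0, 7], [2, 3]]

resolved, fraction = resolve_routes(replay, router)
```

Only the second token takes the router's choice `[6, 2]`; the other three keep their recorded routes even where the router would have chosen differently, and `fraction` is 0.25 for this sample.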