verl E2E Testing#

Validate a Megatron-Bridge model addition through verl’s Megatron backend. This catches integration issues that Bridge-only conversion tests miss: provider configuration, HF import through Bridge, PEFT wrapping, DDP wrapping, optimizer setup, rollout/ref wiring, and checkpoint ownership by an external RL loop.

Use this as an external compatibility smoke test after the Bridge unit and functional tests for a new model provider are green.

This is not a replacement for Bridge model parity tests. The default verl PPO run proves that the provider can survive an external RL training loop; architecture-specific correctness still comes from Bridge import/export, logits/roundtrip, and model-specific inference tests.

Scope#

Think in coverage levels. Start with Level 0 and add only the levels justified by the change.

Level

Required when

What it proves

0: LoRA + DDP smoke

Any new provider or provider config change that claims verl compatibility

verl can import the local Bridge provider, apply PEFT, wrap with Megatron DDP, build optimizer state, run rollout/ref/critic wiring, and finish one PPO step

1: Save/resume

PEFT, checkpointing, HF export, adapter export, optimizer state, or resume behavior changed

verl-owned checkpoint scheduling can save and reload Bridge-built model state

2: Parallelism stress

Provider finalization, mpu-derived settings, TP/PP/CP/EP, sequence parallel, or dispatcher behavior changed

provider settings remain correct under non-trivial Megatron parallel state

3: Optional Megatron-FSDP

Only when downstream explicitly asks for verl Megatron-FSDP coverage or the change directly touches that integration path

the same provider works when verl selects Megatron-FSDP instead of DDP

4: Architecture-specific e2e

VLM, MoE, MTP, QAT/ModelOpt, quantized weights, or custom layer behavior is involved

the part of the architecture not exercised by text-only GSM8K also has a targeted runtime check

5: Convergence / learning signal

Optimizer, scheduler, loss, reward, PEFT trainability, gradient flow, or model-specific training stability changed

metrics move in the expected direction over a short run and do not silently produce zero/NaN/unstable updates

The default Level 0 target is a short, non-vanilla Bridge run in verl with LoRA enabled and Megatron DDP selected:

USE_MBRIDGE=True
VANILLA_MBRIDGE=False
VALUE_VANILLA_MBRIDGE=False
LORA_RANK=4
USE_MEGATRON_FSDP=False
TOTAL_TRAIN_STEPS=1

This is intentionally small. It exercises the Bridge-facing path in verl without making Megatron-Bridge own rollout scheduling, reward handling, optimizer scheduling, or checkpoint orchestration.

Level 0 is not a convergence test. It only proves the training loop can complete one update. Use Level 5 when the question is whether the model actually learns under verl.

Megatron-FSDP is not part of the default validation expected for current provider compatibility work. Run it only for Level 3 coverage when FSDP is explicitly in scope:

USE_MEGATRON_FSDP=True
ALL_OFFLOAD=False
COMMON_PP=1
COMMON_VPP=null
COMMON_CP=1
COMMON_TP=1
INFER_TP=1

Repos#

Use explicit repo variables. Do not rely on an installed megatron-bridge wheel; the purpose is to test the current Bridge checkout.

Use the upstream verl repository as the default source:

https://github.com/verl-project/verl

If a checkout is not already available, clone it next to the Bridge checkout or into the site’s standard workspace:

git clone https://github.com/verl-project/verl.git /path/to/verl
export BRIDGE_REPO=${BRIDGE_REPO:-/path/to/Megatron-Bridge}
export VERL_REPO=${VERL_REPO:-/path/to/verl}
export PYTHONPATH="${BRIDGE_REPO}/src:${BRIDGE_REPO}/3rdparty/Megatron-LM:${VERL_REPO}:${PYTHONPATH:-}"

Before running, record both states:

git -C "$BRIDGE_REPO" status --short
git -C "$VERL_REPO" status --short
git -C "$BRIDGE_REPO" rev-parse --short HEAD
git -C "$VERL_REPO" rev-parse --short HEAD

If testing on a remote GPU machine, sync the exact local changes first. Do not reset or overwrite unrelated changes in either tree.

Verify that Python imports the checkout under test:

python - <<'PY'
import megatron.bridge
print(megatron.bridge.__file__)
PY

The printed path must live under $BRIDGE_REPO/src. If it points at site-packages, fix PYTHONPATH before trusting any result.

When running against an existing Ray cluster, the driver import is not enough. Ray workers may not inherit the shell PYTHONPATH, and the run can fail later with ModuleNotFoundError: No module named 'megatron.bridge' even though the driver import passed. Verify a worker import:

python - <<'PY'
import os
import ray

ray.init(address="auto", runtime_env={"env_vars": {"PYTHONPATH": os.environ["PYTHONPATH"]}})

@ray.remote
def bridge_path():
    import megatron.bridge
    return megatron.bridge.__file__

print(ray.get(bridge_path.remote()))
PY

If the worker import fails, pass the checkout path through Ray’s runtime environment when launching the wrapper:

bash tests/special_e2e/run_ppo_trainer_megatron.sh \
  "++ray_kwargs.ray_init.runtime_env.env_vars.PYTHONPATH=${PYTHONPATH}"

If this import fails before model construction, fix the runtime environment first. The official verl image may not contain every Bridge dependency; for example, Bridge imports modelopt through AutoBridge, so a missing nvidia-modelopt can fail the smoke before verl exercises the provider:

python -m pip show nvidia-modelopt || \
  python -m pip install --extra-index-url https://pypi.nvidia.com nvidia-modelopt

Treat ad-hoc installs as container setup evidence, not repository changes. If the image lacks uv, run focused Bridge unit tests in a Bridge development environment instead of forcing them through the verl container.

Model Choice#

Prefer the smallest public HF checkpoint that uses the changed provider family. For example, use a 0.5B or 0.6B dense checkpoint for dense provider changes before testing larger variants.

If there is no small public checkpoint for the new architecture, use verl’s dummy-model path with a minimal HF config from that architecture:

USE_DUMMY_MODEL=True
DUMMY_MODEL_CONFIG_PATH=/path/to/minimal_config.json
MODEL_ID=<org>/<representative-model-name>

Report dummy-model results carefully: they validate model construction and training mechanics, not pretrained weight compatibility.

For VLMs, the generic GSM8K PPO run is text-only. It can validate the language-side Bridge provider and external-loop wrapping, but it does not prove image/video preprocessing or vision encoder execution. Pair it with the VLM conversion/inference tests from @skills/adding-model-support/tests-and-examples.md, or use a verl multimodal training command if one exists for the model family.

For MoE models, Level 0 with COMMON_EP=1 still catches many provider and PEFT issues, but it does not stress expert parallel routing. Add a Level 2 run with expert parallelism when the change touches expert layout, dispatcher config, router replay, or expert tensor parallelism.

For MTP, QAT/ModelOpt, or quantized checkpoint support, the generic wrapper may not activate the feature. Use the closest verl example or model-family script that turns the feature on, and record the extra Hydra overrides in the report.

Bridge Checks First#

Run focused Bridge tests before the external verl e2e. Include any model-specific tests added by the change.

cd "$BRIDGE_REPO"
uv run python -m pytest -q \
  tests/unit_tests/models/test_model_provider_mixin.py \
  tests/unit_tests/models/test_param_mapping.py \
  tests/unit_tests/training/test_integration.py \
  <model-specific-test-paths>

For a new model family, also run the relevant conversion or roundtrip test from the model’s PR. See @skills/adding-model-support/tests-and-examples.md for model-test patterns.

Minimum Bridge-side evidence for a new model/provider:

  • provider/config unit tests

  • parameter mapping tests

  • HF to Megatron import or roundtrip on a small model

  • model-specific generation or logits comparison when available

  • this verl external-loop smoke after the above pass

verl Data Setup#

verl’s Megatron PPO smoke wrapper expects GSM8K parquet files by default. Prepare them once from the verl checkout if they are missing:

cd "$VERL_REPO"
export PYTHONPATH="$VERL_REPO:${PYTHONPATH:-}"
python3 examples/data_preprocess/gsm8k.py \
  --local_save_dir "${GSM8K_DIR:-$HOME/data/gsm8k}"

Use --local_dataset_path "$GSM8K_SOURCE_DIR" only when that raw local dataset path actually exists. Otherwise let datasets fetch openai/gsm8k.

Set explicit paths when running in a container or shared filesystem:

export TRAIN_FILES=/path/to/gsm8k/train.parquet
export VAL_FILES=/path/to/gsm8k/test.parquet

The wrapper also enables a reward model by default. Ensure the default reward model path exists, or set:

export RM_MODEL_PATH=/path/to/local/reward/model

For a Level 0 rule-reward smoke, it is acceptable to disable the reward-model rollout when no local reward model is available:

bash tests/special_e2e/run_ppo_trainer_megatron.sh \
  reward.reward_model.enable=False

Report this as a limitation; it still tests Bridge actor/ref/critic construction, LoRA, DDP wrapping, rollout, and one PPO update, but not reward-model serving.

Minimal verl Run#

Use verl’s maintained wrapper rather than constructing a long Hydra command manually:

cd "$VERL_REPO"
ray stop --force || true

export MODEL_ID=<small-compatible-hf-model>
export TRAIN_FILES=${TRAIN_FILES:-/path/to/gsm8k/train.parquet}
export VAL_FILES=${VAL_FILES:-/path/to/gsm8k/test.parquet}
export RM_MODEL_PATH=${RM_MODEL_PATH:-/path/to/local/reward/model}
export ENGINE=vllm
export USE_MBRIDGE=True
export VANILLA_MBRIDGE=False
export VALUE_VANILLA_MBRIDGE=False
export LORA_RANK=4
export USE_MEGATRON_FSDP=False
export COMMON_PP=1
export COMMON_VPP=null
export COMMON_CP=1
export COMMON_TP=1
export INFER_TP=1
export ALL_OFFLOAD=False
export TOTAL_TRAIN_STEPS=1
export SAVE_FREQ=-1
export VAL_BEFORE_TRAIN=False
export TEST_FREQ=-1

bash tests/special_e2e/run_ppo_trainer_megatron.sh

Use MODEL_ID when the checkpoint is available through the wrapper’s default cache layout. Add MODEL_PATH=/path/to/local/hf/model only when testing a local or converted checkpoint.

When $HOME is small or shared slowly, put HF caches and downloaded checkpoints on a larger shared or node-local scratch path and pass MODEL_PATH explicitly. Pre-download large models once in the same container environment to avoid Ray workers racing the cache:

export HF_HOME=${HF_HOME:-/scratch/$USER/verl_hf}
export HF_HUB_CACHE=$HF_HOME/hub
MODEL_PATH=${MODEL_PATH:-/scratch/$USER/models/<org>/<model>}
hf download <org>/<model> --local-dir "$MODEL_PATH"

Capture logs to a file for review:

mkdir -p "${LOG_DIR:-$PWD/verl_e2e_logs}"
LOG_FILE="${LOG_DIR:-$PWD/verl_e2e_logs}/verl_lora_ddp_$(date +%Y%m%d_%H%M%S).log"
bash tests/special_e2e/run_ppo_trainer_megatron.sh \
  "++ray_kwargs.ray_init.runtime_env.env_vars.PYTHONPATH=${PYTHONPATH}" \
  2>&1 | tee "$LOG_FILE"
grep -E "Training Progress|VANILLA_MBRIDGE|Traceback|RuntimeError|KeyError|ValueError" "$LOG_FILE"

Prefer a saved log over a pasted terminal excerpt in PR descriptions.

For time-limited smoke runs where vLLM CUDA graph capture dominates setup and the goal is Bridge provider validation, it is acceptable to add:

bash tests/special_e2e/run_ppo_trainer_megatron.sh \
  actor_rollout_ref.rollout.enforce_eager=True

Report this override as a limitation. It still validates Bridge import, HF import, LoRA, Megatron DDP, rollout wiring, and one PPO update, but not vLLM CUDA graph capture.

Save/Resume Coverage#

After the minimal run passes, add checkpoint coverage if the change touches PEFT, checkpointing, export, or optimizer state:

# Save once.
SAVE_FREQ=1 TOTAL_TRAIN_STEPS=1 \
bash tests/special_e2e/run_ppo_trainer_megatron.sh

# Resume and train one more step.
RESUME_MODE=auto SAVE_FREQ=1 TOTAL_TRAIN_STEPS=2 \
bash tests/special_e2e/run_ppo_trainer_megatron.sh

Remove stale verl checkpoints/ output between unrelated experiments. Keep it for resume validation.

Parallelism Stress#

Use Level 2 when the provider reads or mutates parallelism-related fields, or when the change touches provider.configure(...), Megatron mpu, sequence parallel, context parallel, MoE dispatcher behavior, or tensor/expert tensor parallel settings.

The variants below assume the Level 0 exports above are still in the shell; each command overrides only the values being tested.

Example dense stress variant:

COMMON_TP=2 \
COMMON_PP=2 \
COMMON_VPP=null \
COMMON_CP=1 \
INFER_TP=2 \
USE_MEGATRON_FSDP=False \
bash tests/special_e2e/run_ppo_trainer_megatron.sh

Example MoE stress variant, only for compatible MoE checkpoints:

COMMON_EP=2 \
COMMON_ETP=1 \
ROUTING_REPLAY_MODE=disabled \
bash tests/special_e2e/run_ppo_trainer_megatron.sh

Keep these as follow-up runs. Do not make them the first debugging surface for a new provider.

Optional Megatron-FSDP Variant#

Use Level 3 after Level 0 passes only when downstream explicitly requests Megatron-FSDP coverage or the Bridge change directly touches FSDP wrapping, sharding, checkpoint format, or distributed optimizer behavior:

USE_MEGATRON_FSDP=True \
ALL_OFFLOAD=False \
COMMON_PP=1 \
COMMON_VPP=null \
COMMON_CP=1 \
COMMON_TP=1 \
INFER_TP=1 \
bash tests/special_e2e/run_ppo_trainer_megatron.sh \
  ++actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=False \
  ++actor_rollout_ref.ref.megatron.override_transformer_config.gradient_accumulation_fusion=False \
  ++critic.megatron.override_transformer_config.gradient_accumulation_fusion=False

For Bridge-native FSDP behavior and constraints, also read @skills/perf-megatron-fsdp/SKILL.md.

Convergence / Learning Signal#

Use Level 5 only when the change affects trainability or when downstream validation explicitly asks for convergence. Do not require it for every provider-only PR; RL convergence is slower, noisier, and more environment-dependent than the compatibility smoke.

The goal is a short learning-signal run, not a full benchmark. Prefer a small model, fixed data, fixed seed when available, and enough steps to observe non-random metric movement:

TOTAL_TRAIN_STEPS=20 \
SAVE_FREQ=-1 \
VAL_BEFORE_TRAIN=True \
TEST_FREQ=10 \
LORA_RANK=4 \
USE_MBRIDGE=True \
VANILLA_MBRIDGE=False \
VALUE_VANILLA_MBRIDGE=False \
USE_MEGATRON_FSDP=False \
ENGINE=vllm \
bash tests/special_e2e/run_ppo_trainer_megatron.sh

For a stronger signal, run 50-100 steps if GPU time allows. Keep rollout, reward model, dataset, batch sizes, and model checkpoint fixed between baseline and candidate runs.

Acceptable convergence evidence depends on the task, but the report should include at least:

  • no NaNs or infs in loss, reward, KL, entropy, grad norm, or logprob metrics

  • nonzero trainable parameter count when PEFT is enabled

  • actor/critic losses and reward-related metrics logged for multiple steps

  • validation or reward trend compared against the starting point or a known-good baseline

  • no repeated zero gradients, frozen LoRA adapters, or constant logprobs unless expected

Do not call a 20-step run “converged” in the benchmark sense. Call it “learning-signal passed” unless it reaches a pre-agreed metric threshold.

Container Image#

Use the official verl Docker images as the default source:

https://hub.docker.com/r/verlai/verl

For this skill’s default PPO smoke path, pick a vLLM-flavored verlai/verl image tag unless the test intentionally changes the rollout engine. The maintained wrapper defaults to vLLM, and the command should make that explicit with:

ENGINE=vllm

Avoid using sglang, TRT-LLM, or generic images for the default Level 0 run unless the point of the test is to validate that rollout backend. A backend-specific image can fail before Bridge model construction, which makes the result a poor signal for a Megatron-Bridge provider change.

Pin the exact image tag in the test log or PR description:

export VERL_IMAGE=${VERL_IMAGE:-verlai/verl:<vllm-compatible-tag>}

If the cluster requires Enroot/SquashFS images, convert or mirror the selected verlai/verl tag through the site’s normal process, but keep the source tag visible in the report.

Slurm or Container Runs#

Use the cluster’s standard container and mount both checkouts into the container. Keep setup and the actual PPO run in the same container step when using node-local paths such as /tmp; node-local model caches and ad-hoc pip installs disappear when a fresh container step starts. Keep paths generic in scripts committed to Megatron-Bridge:

export VERL_IMAGE=${VERL_IMAGE:-verlai/verl:<vllm-compatible-tag>}

srun <site-specific-slurm-options> \
  --container-image="${VERL_IMAGE}" \
  --container-mounts="${BRIDGE_REPO}:/workspace/Megatron-Bridge,${VERL_REPO}:/workspace/verl,<data-root>:<data-root>" \
  --container-workdir=/workspace/verl \
  bash -lc '
    export BRIDGE_REPO=/workspace/Megatron-Bridge
    export VERL_REPO=/workspace/verl
    export PYTHONPATH=$BRIDGE_REPO/src:$BRIDGE_REPO/3rdparty/Megatron-LM:$VERL_REPO
    ray stop --force || true
    MODEL_ID=<small-compatible-hf-model> \
    ENGINE=vllm \
    USE_MBRIDGE=True VANILLA_MBRIDGE=False VALUE_VANILLA_MBRIDGE=False \
    LORA_RANK=4 USE_MEGATRON_FSDP=False TOTAL_TRAIN_STEPS=1 SAVE_FREQ=-1 \
    bash tests/special_e2e/run_ppo_trainer_megatron.sh
  '

Use a persistent log directory on a shared filesystem or $HOME for long Slurm jobs. Logs written only to node-local /tmp can disappear when the allocation expires or is canceled. If a cluster attach helper runs the command through srun --pty, do not background the workload inside that attached shell; the step cleanup can terminate it immediately. To detach safely, background the attach helper itself from the login node:

mkdir -p "$HOME/verl_e2e_logs"
nohup env COMMAND='bash /path/to/run_verl_e2e.sh' \
  bash /path/to/<jobid>-attach.sh \
  > "$HOME/verl_e2e_logs/attach_driver_$(date +%Y%m%d_%H%M%S).out" 2>&1 &

If an attach helper enters a container that no longer sees the expected checkouts or log directory, treat that helper as stale. Start a fresh srun step against the existing allocation with explicit --container-image, --container-mounts, and --container-workdir.

On CUDA/H100 clusters, some launchers set both CUDA_VISIBLE_DEVICES and ROCm variables such as ROCR_VISIBLE_DEVICES. If verl workers fail before model construction with Please don't set ROCR_VISIBLE_DEVICES when HIP/CUDA_VISIBLE_DEVICES is set, fix the launcher/container environment or apply a temporary local verl workaround that drops ROCR_VISIBLE_DEVICES when CUDA is already set. Do not report this as a Bridge provider failure.

For general Slurm debugging and multi-node patterns, read @skills/multi-node-slurm/SKILL.md.

Pass Criteria#

A useful pass has all of the following:

  • Focused Bridge tests pass for provider/config/mapping behavior.

  • verl uses the local Bridge checkout through PYTHONPATH.

  • The verl log shows VANILLA_MBRIDGE=False.

  • One training step reaches completion, for example Training Progress: 100%|1/1|.

  • No exception occurs during Bridge provider setup, HF import, LoRA wrapping, DDP wrapping, optional FSDP wrapping when enabled, optimizer setup, checkpoint manager setup, or the training step.

Ray shutdown, Python resource-tracker warnings, or post-completion DataLoader worker termination can be acceptable if the training step completed, metrics for training/global_step:1 were logged, and the process exits successfully. Mention them as residual log noise.

Do not claim full model e2e if the run used a dummy config, text-only data for a VLM, COMMON_EP=1 for an expert-parallel change, or disabled save/resume for a checkpointing change. Call it the exact level that passed.

Do not claim convergence from Level 0. Claim convergence only from Level 5, and distinguish “learning signal” from “benchmark convergence” in the report.

Failure Triage#

If model construction fails, check whether the Bridge provider is finalized with the same parallel sizes that verl initialized through Megatron mpu.

If LoRA fails, check target module names and whether the provider path uses the non-vanilla Bridge PEFT helpers expected by verl.

If checkpoint save/load fails, first rerun without save/resume (SAVE_FREQ=-1) to separate model construction from checkpoint behavior.

If rollout fails before actor construction, this may be a verl rollout engine issue rather than a Bridge provider issue. Report the boundary clearly.

If the log shows the wrong Bridge path, stop. Any later failure or pass is not evidence for the local Bridge change.

If the baseline fails before model build because of data, reward model, Ray, vLLM, or package setup, fix the environment first and do not report it as a provider failure.

If model download fails with No space left on device, move HF_HOME, HF_HUB_CACHE, and MODEL_PATH to a larger shared or node-local path, then rerun with the explicit local MODEL_PATH.

Summary Format#

End every run with a short user-facing summary that answers “Did the requested deliverables pass?” before adding details. Use Pass, Fail, Skipped, or Blocked for each deliverable, and do not report an overall Pass unless the pass criteria for the requested coverage level were met.

Result: <Pass/Fail/Blocked> - <one sentence stating what was validated>
Requested coverage: <Level 0/1/2/3/4/5 and requested variants>
Model: <MODEL_ID or MODEL_PATH>

Deliverables:
- Bridge-side checks: <Pass/Fail/Skipped> - <test command or skipped reason>
- Local Bridge import in verl: <Pass/Fail> - <PYTHONPATH or imported Bridge path>
- verl Megatron backend run: <Pass/Fail/Skipped> - <LoRA + DDP or requested variant>
- Requested variants: <Pass/Fail/Skipped/Not requested> - <save/resume, parallelism stress, Megatron-FSDP, architecture-specific run, or learning-signal/convergence>
- Log capture: <Pass/Fail> - <log path>

Evidence:
- Bridge repo: <commit> plus dirty files
- verl repo: <commit> plus dirty files
- Command: <exact command or script path>
- Key lines: <VANILLA_MBRIDGE=False, Training Progress completion, training/global_step:1, or the first relevant error>

Limitations:
- <dummy model, skipped save/resume, COMMON_EP=1, text-only data for VLM, no convergence claim, known shutdown warnings, etc.>

Follow-ups:
- <needed rerun, environment fix, provider fix, or "none">

If the job is blocked before Bridge model/provider construction by data, reward model, Ray, vLLM, dependency, disk, or cluster setup, mark the overall result as Blocked, not Fail, and state that it is not evidence against the Bridge provider.

If any requested deliverable was not run, mark it Skipped or Not requested with the reason. Do not leave it implicit in the limitations.