Bump Dependency#

End-to-end workflow for shipping a dependency bump in Megatron Bridge. Optimised for the case where TE, MCore, or another GPU-heavy pin moves forward — which often surfaces flakes that have to be quarantined before the PR can land.

The pipeline is always: edit → relock → push → /ok to test → watchdog → quarantine on red → re-trigger → repeat until green.

When to reach for this skill#

Bumping a git-source pin in pyproject.toml override-dependencies (e.g. transformer-engine @ git+...@<ref>).
Bumping the 3rdparty/Megatron-LM submodule.
Any change that touches uv.lock and needs the full L0 + L1 matrix to prove out before merge.

For pure dep additions/removals without a CI loop, the build-and-dependency skill is enough.

Required context#

Read first, then follow the steps below:

@CONTRIBUTING.md — PR title/label policy, DCO sign-off
@skills/build-and-dependency/SKILL.md — uv lock mechanics, container choice
@skills/cicd/SKILL.md — how copy-pr-bot and /ok to test work
@skills/testing/SKILL.md — active/ vs flaky/ directory layout, git mv quarantine recipe

Step 1 — Worktree and edit#

Create a worktree off main per @CLAUDE.md. Then, before any uv lock:

git submodule update --init 3rdparty/Megatron-LM

The submodule must be initialised in the worktree or uv lock errors with “not a Python project” on the MCore path.

Edit the pin. For TE the canonical knob is the override line in pyproject.toml:

override-dependencies = [
    ...
    "transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@<new-ref>",
    ...
]

Use a branch name (release_v2.15) only when you want to track a moving tip; use a full SHA for reproducibility. TE branches use release_vX.Y (underscore), not release/vX.Y. Verify with git ls-remote https://github.com/NVIDIA/TransformerEngine.git.

Step 2 — Regenerate the lockfile#

Run uv lock inside the project container per @skills/build-and-dependency/SKILL.md “Regenerating uv.lock”. Then confirm only the intended packages moved:

git diff --stat pyproject.toml uv.lock

If the diff carries changes you didn’t ask for (transitive movements you can’t explain), stop and investigate before pushing. Note that override-dependencies carries CVE floors that float — unrelated packages bumping by a patch version is expected; accept those, don’t revert them.

Step 3 — Commit and push#

Sign-off + signed-commit + PR title format per @CONTRIBUTING.md and @skills/cicd/SKILL.md “Commit and PR Workflow”. For a bump:

git add pyproject.toml uv.lock
git commit -S -s -m "[build] chore: bump <package> to <ref>"
git push -u origin <branch-name>

A signed commit (-S) lets copy-pr-bot trigger CI without manual /ok to test for the first push — but you’ll still post /ok to test on every subsequent SHA in this loop (Step 5).

Step 4 — Open the PR#

Title and labels per @CONTRIBUTING.md. Two bump-specific requirements:

Apply needs-more-tests — mandatory for a bump; expands the matrix from L0 to L0+L1.
For a high-blast-radius bump (TE, MCore submodule, anything that touches CUDA kernels), also apply full-test-suite to pull L2 into the PR run. L2 covers VL models, checkpoint conversion, and heavy quantization which otherwise only run on schedule.

The PR body template — this is the durable record of the bump:

<details><summary>Claude summary</summary>

## What
- Bump <package> to <ref>.
- Regenerate `uv.lock`.

## Lockfile delta

Updated ->

## Test plan
- [ ] L0 CI green
- [ ] L1 CI green (label `needs-more-tests` applied)

## Quarantined tests (this bump)
_None yet — will be appended as flakes are identified during CI iteration._

</details>

To update the PR title or body later, use gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md" — never gh pr edit.

Step 5 — Trigger CI on the exact SHA#

Trigger mechanics live in @skills/cicd/SKILL.md “How CI Is Triggered”. For this loop the rule is simple: on every new SHA you push, post /ok to test $(git rev-parse HEAD) as a PR comment, even if your commits are signed. This guarantees the run targets the SHA you actually want exercised and re-fires anything that got cancelled or cached.

Step 6 — Attach the watchdog (always; never a cronjob)#

For a bump PR you want a single live process that emits per-job state changes for the CICD NeMo workflow only. Other workflows (docs, wheel, copyright, install-test) are noise here — the gate that decides green-or-red for a bump is CICD NeMo.

Always attach a watchdog with the Monitor tool. Never schedule wakeups or cronjobs for this loop. A watchdog gives you:

Sub-minute reaction time on every job transition.
A single live process — no scattered scheduled-wakeup state to reason about.
Natural early termination via TaskStop once the run is green.

Watchdog script#

Save to /tmp/watchdog-<PR>.sh and chmod +x:

#!/usr/bin/env bash
# Watchdog: monitor "CICD NeMo" runs on pull-request/<PR> and emit
# per-job state changes. Stays alive across re-runs (new commits).
set -u
PR=<PR>
REPO=NVIDIA-NeMo/Megatron-Bridge
BRANCH="pull-request/$PR"

prev_run_id=""
declare -A prev_state

emit() { echo "[$(date -u +%H:%M:%SZ)] $*"; }

while true; do
  run_json=$(gh run list --repo "$REPO" --workflow "CICD NeMo" \
    --branch "$BRANCH" --limit 1 \
    --json databaseId,status,conclusion,headSha 2>/dev/null || echo "[]")
  run_id=$(echo "$run_json" | jq -r '.[0].databaseId // empty')
  run_status=$(echo "$run_json" | jq -r '.[0].status // empty')
  run_conclusion=$(echo "$run_json" | jq -r '.[0].conclusion // empty')
  run_sha=$(echo "$run_json" | jq -r '.[0].headSha // empty')

  if [[ -z "$run_id" ]]; then
    sleep 30; continue
  fi

  if [[ "$run_id" != "$prev_run_id" ]]; then
    emit "RUN ${run_id} STARTED sha=${run_sha:0:8} status=${run_status}"
    prev_run_id="$run_id"
    unset prev_state
    declare -A prev_state
  fi

  jobs_json=$(gh run view "$run_id" --repo "$REPO" --json jobs 2>/dev/null || echo "{}")
  while IFS=$'\t' read -r name status conclusion; do
    [[ -z "$name" ]] && continue
    cur="${status}/${conclusion}"
    if [[ "${prev_state[$name]:-}" != "$cur" ]]; then
      case "$status" in
        completed)
          emit "JOB ${name} -> ${conclusion}" ;;
        in_progress)
          if [[ -z "${prev_state[$name]:-}" || "${prev_state[$name]}" == "queued/" ]]; then
            emit "JOB ${name} -> in_progress"
          fi ;;
      esac
      prev_state[$name]="$cur"
    fi
  done < <(echo "$jobs_json" | jq -r '.jobs[]? | [.name, .status, (.conclusion // "")] | @tsv')

  if [[ "$run_status" == "completed" ]]; then
    emit "RUN ${run_id} COMPLETED conclusion=${run_conclusion}"
  fi

  sleep 60
done

Arming the watchdog#

Monitor(
  description="CICD NeMo run state changes on PR <N>",
  command="bash /tmp/watchdog-<N>.sh",
  persistent=true,
  timeout_ms=3600000
)

persistent: true keeps it alive across re-runs (you’ll push more commits when quarantining flakes). Stop it with TaskStop(<task-id>) once the run is green.

Why never a cronjob / scheduled wakeup#

Cronjobs run blind — they fire on a clock, not on an event. You’ll either over-poll (cache miss every wake-up) or miss long stalls.
Wakeups can’t easily fan out to “tell me whenever a job transitions” — they only resume the agent on a fixed interval.
A persistent Monitor surfaces every job edge in real time and exits cleanly when the work is done.

Step 7 — Quarantine on red, then iterate#

When a JOB <name> -> failure event fires:

Triage the failure — is it the bump or a flake? Skim the logs:
```
RUN_ID=<from "RUN ... STARTED" event>
gh run view "$RUN_ID" --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > /tmp/run.log
wc -l /tmp/run.log
tail -200 /tmp/run.log
```
This is the bump-specific judgement call: only quarantine if the failure reproduces on main or is clearly unrelated infrastructure. If the failure is caused by the bump (real regression), stop quarantining — fix the underlying issue or revert the bump. Quarantining a real regression hides the very signal the bump PR exists to surface.
Move the launch script to flaky/ per @skills/testing/SKILL.md “Moving a Test to Flaky”. Map a CI job name to its launch script via:
- prefix gb200_ → gb200/active/, otherwise h100/active/
- the rest is the script’s basename without .sh
Append to the PR body’s Quarantined tests section with a one-line reason and a follow-up tracking link if you have one. This is the durable record of what this bump deferred — the section exists precisely so a reviewer can see at a glance which flakes were side-stepped to land the bump.

Commit, push, retrigger:

git commit -S -s -m "[ci] chore: quarantine flaky <test> for <package> bump"
git push
gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge \
  --body "/ok to test $(git rev-parse HEAD)"

Update the PR body via gh api PATCH so the quarantine list stays current.

The watchdog is persistent — it picks up the new run automatically and emits RUN <id> STARTED for the new attempt. Loop back to step 1.

Step 8 — Stop when green#

RUN <id> COMPLETED conclusion=success is the exit condition. Then:

gh pr checks <N> --repo NVIDIA-NeMo/Megatron-Bridge | awk '{print $2}' | sort | uniq -c
TaskStop(<watchdog-task-id>)
gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"

Common pitfalls#

Symptom	Cause	Fix
Wrong TE branch ref (`release/v2.15`) silently resolves nothing	TE uses `release_vX.Y` with an underscore	Verify with `git ls-remote` before locking
Lockfile diff includes unrelated CVE-pinned packages	`override-dependencies` carries floors that float	Re-run lock and accept; don’t try to revert those
Signed first push triggers CI but later pushes don’t	`copy-pr-bot` re-trusts on each new SHA only via `/ok to test` once you’re past the first signed commit in this loop	Always re-post `/ok to test $(git rev-parse HEAD)` per Step 5
Watchdog goes silent for 30+ min	`gh` rate-limited or auth expired	Bump poll interval; `gh auth status`; restart Monitor
Job name doesn’t map to a script in `active/`	`gb200_` prefix is the hardware indicator, not part of the filename	Strip `gb200_` and look in `gb200/active/`

Anti-patterns#

Cron / scheduled wakeups for this loop. Always Monitor.
Polling all workflows. Filter to CICD NeMo — the rest are noise for a bump.
Quarantining a real regression to “make CI green.” That defeats the purpose of the bump PR. Only quarantine if the failure reproduces on main or is clearly unrelated infrastructure.
gh pr edit for title/body. Use gh api PATCH.
HEREDOC in gh pr create --body. Always go through a tmpfile + --body-file.
Bundling unrelated changes (feature work, refactors) into a bump PR. Bumps should stay surgical so CI failures attribute cleanly.