CI/CD
Commit and PR Workflow
- Never commit directly to `main`; always create a feature branch.
- Always sign off commits: `git commit -s -m "message"`. The `-s` flag adds the `Signed-off-by` trailer required by the DCO; GPG signing is separate and uses `-S`.
- PR title format: `[{areas}] {type}: {description}` (e.g., `[model] feat: Add Qwen3 model bridge`).
See @CONTRIBUTING.md for the full PR workflow, area/type labels, and DCO requirements.
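The title format above can be checked mechanically before opening a PR. The following is a minimal sketch, not part of the repository: `check_pr_title` is a hypothetical helper, and the pattern only checks the overall shape (bracketed areas, a lowercase type, a non-empty description). The accepted area and type vocabularies are defined in @CONTRIBUTING.md and are not validated here.

```shell
# Hypothetical helper: succeeds if the title looks like
# "[{areas}] {type}: {description}". Area/type names are NOT validated.
check_pr_title() {
  # \[[^]]+]  one or more non-']' characters inside brackets (the areas)
  # [a-z]+:   a lowercase type followed by a colon
  # .+        a non-empty description
  printf '%s\n' "$1" | grep -Eq '^\[[^]]+] [a-z]+: .+$'
}

# Usage:
#   check_pr_title "[model] feat: Add Qwen3 model bridge" && echo "title ok"
```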
How CI Is Triggered
The workflow is defined in @.github/workflows/cicd-main.yml and is triggered
on `push`, not on `pull_request`. This is intentional: a bot called
`copy-pr-bot` controls when CI runs.
Mechanism:
1. When a PR is opened, `copy-pr-bot` watches for a trust signal.
2. Trust is established in one of two ways:
   - All commits on the PR branch are GPG-signed by a verified NVIDIA contributor, in which case the bot triggers automatically.
   - An NVIDIAN posts `/ok to test <commit-sha>` as a PR comment, in which case the bot triggers manually for that SHA.
3. Once trusted, `copy-pr-bot` copies the PR's code into the remote branch `pull-request/<number>` and pushes it.
4. That push fires the workflow's `push` trigger on `refs/heads/pull-request/<number>`, launching CI.
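Because CI runs on the mirrored branch, one way to tell whether the bot has already copied a PR is to look for that branch on the remote. The sketch below assumes the remote is named `origin` and that you have network access; both helper names are hypothetical.

```shell
# Build the ref that copy-pr-bot pushes for a given PR number.
ci_ref_for_pr() {
  printf 'refs/heads/pull-request/%s\n' "$1"
}

# Succeeds if the mirrored branch exists on the remote (needs network access).
# The default remote name "origin" is an assumption.
ci_branch_exists() {
  local pr_number="$1"
  local remote="${2:-origin}"
  # --exit-code makes ls-remote return non-zero when no ref matches.
  git ls-remote --exit-code "$remote" "$(ci_ref_for_pr "$pr_number")" >/dev/null
}

# Usage:
#   ci_branch_exists 1234 && echo "PR has been copied; CI should have started"
```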
Consequences:
- CI never runs on untrusted pushes; external contributors always need `/ok to test`.
- The running workflow branch is `pull-request/<number>`, not the author's feature branch.
- Pushing a new commit to a PR does not automatically re-trigger CI unless the commit is GPG-signed or `/ok to test <new-sha>` is posted.
- Concurrent runs for the same PR are cancelled automatically (concurrency group per PR number).
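Since every unsigned push needs a fresh trust comment, it can help to generate the comment text from the commit itself. This is a sketch only: `ok_to_test_comment` is a hypothetical helper, and accepting 7 to 40 lowercase hex characters is an assumption about what counts as a usable SHA, not a documented rule of `copy-pr-bot`. Per the mechanism above, the comment only has an effect when an NVIDIAN posts it.

```shell
# Print the trust comment for a commit SHA.
# Assumption: a SHA is 7-40 lowercase hex characters; anything else is rejected.
ok_to_test_comment() {
  local sha="$1"
  if ! printf '%s\n' "$sha" | grep -Eq '^[0-9a-f]{7,40}$'; then
    echo "not a commit SHA: $sha" >&2
    return 1
  fi
  printf '/ok to test %s\n' "$sha"
}

# Usage (posts a PR comment, so it is left commented out):
#   gh pr comment "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge \
#     --body "$(ok_to_test_comment "$(git rev-parse HEAD)")"
```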
Pipeline Structure
pre-flight
├── lint-check
├── cicd-wait-in-queue # queues workflows to avoid runner interleaving across PRs
├── cicd-container-build
├── unit-tests-core
├── unit-tests-diffusion
└── functional-tests (L0 always; L1 with needs-more-tests label; L2 on schedule or full-test-suite label)
Slack notifications are sent on completion for scheduled and nightly runs.
For functional test tier semantics and job-to-directory mapping, see the testing skill.
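The tier rules for `functional-tests` can be read as a small decision function. The sketch below is illustrative and does not mirror the actual workflow code: `functional_tiers` is a hypothetical name, `schedule` is assumed to be the event name for scheduled runs, and each tier is evaluated independently (the real workflow may nest them differently).

```shell
# Print the functional test tiers selected for an event name and a
# space-separated list of PR labels.
# Assumption: tiers are independent; L2 does not imply L1.
functional_tiers() {
  local event="$1"
  local labels=" $2 "          # pad with spaces so only whole labels match
  local tiers="L0"             # L0 always runs
  case "$labels" in
    *" needs-more-tests "*) tiers="$tiers L1" ;;
  esac
  if [ "$event" = "schedule" ]; then
    tiers="$tiers L2"
  else
    case "$labels" in
      *" full-test-suite "*) tiers="$tiers L2" ;;
    esac
  fi
  printf '%s\n' "$tiers"
}

# Usage:
#   functional_tiers push "needs-more-tests"
```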
CI Failure Investigation
Locating the PR from a CI Branch
# Extract PR number from branch name (e.g. pull-request/1234); -P requires GNU grep
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
# Show the PR summary, the list of changed files, and the status of each check
gh pr view "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge --name-only
gh pr checks "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
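On systems without GNU grep (for example, stock macOS), the PCRE lookbehind above is unavailable. A possible alternative, sketched here with a hypothetical helper name, uses only POSIX parameter expansion and also fails loudly when the current branch is not a CI branch. Unlike the `grep` version, it rejects names with trailing non-digits such as `pull-request/12ab`.

```shell
# Print the PR number for a branch named pull-request/<number>; fail otherwise.
pr_number_from_branch() {
  local branch="$1"
  local number="${branch#pull-request/}"   # strip the prefix if present
  if [ "$number" = "$branch" ]; then       # nothing stripped: wrong prefix
    echo "not a CI branch: $branch" >&2
    return 1
  fi
  case "$number" in
    ''|*[!0-9]*)                           # empty, or contains a non-digit
      echo "not a CI branch: $branch" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$number"
}

# Usage:
#   PR_NUMBER=$(pr_number_from_branch "$(git rev-parse --abbrev-ref HEAD)")
```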
Investigating a Failing Job
1. Get the PR number from the branch name (see above).
2. Review the changeset:
       gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
3. Identify the failing job from `gh pr checks` output.
4. Fetch job logs:
       gh run list --repo NVIDIA-NeMo/Megatron-Bridge --branch "pull-request/$PR_NUMBER"
       gh run view <run_id> --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > run.log
5. Scan logs in chunks. Log files can exceed 10,000 lines, so never load them whole:
       wc -l run.log
       tail -200 run.log          # start from the end
       sed -n '1,200p' run.log    # or scan forward in 200-line chunks
6. Cross-reference the changeset against the failing step.
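The chunked scan in step 5 can be wrapped in a small function so that chunk boundaries do not have to be computed by hand. This is a sketch with a hypothetical helper name; the default chunk size of 200 follows the commands above.

```shell
# Print chunk N (1-based) of a log file, SIZE lines per chunk (default 200).
log_chunk() {
  local file="$1"
  local n="$2"
  local size="${3:-200}"
  local start=$(( (n - 1) * size + 1 ))
  local end=$(( n * size ))
  # Print lines start..end, then quit so the rest of the file is not read.
  sed -n "${start},${end}p;${end}q" "$file"
}

# Usage:
#   log_chunk run.log 1    # lines 1-200
#   log_chunk run.log 2    # lines 201-400
```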
Common Failure Patterns
| Symptom | Likely Cause | Action |
|---|---|---|
| CI never started on a PR | Commits not GPG-signed and no | Post |
| Lint job fails | | Run |
| Container build fails | Dependency conflict or stale | Re-run |
| Unit tests fail | Code regression or missing import | Run failing test locally; check the PR diff |
| Functional test (L0) fails | Integration breakage | Check GPU runner logs; reproduce with |
| | Many PRs queued; automation serializes runners to avoid interleaving | Wait; or check queue depth in the Actions tab |
| MCore submodule mismatch | Pinned commit out of sync | Update |
| Stale checkpoint auto-resume | | |
| Port collision on Slurm (EADDRINUSE) | | Drop torchrun; use |