CI/CD
NeMo RL uses GitHub Actions for continuous integration, testing, and release automation. The CI pipeline implements a tiered testing system that balances thoroughness with resource efficiency.
Test Levels
Tests are organized into levels of increasing scope and cost:
| Level | What runs | When |
|---|---|---|
| docs | Doctests only | |
| L0 | Doctests + unit tests (3 parallel suites: Generation, Policy, Other) | |
| L1 | Doctests + unit tests + functional tests (GPU) | |
| L2 | Full suite including convergence tests | |
| Lfast | Fast unit + functional tests, reuses pre-built main container (skips build) | |
Defaults:
PRs do not run tests unless a CI label is applied.
Pushes to main and merge-group events force L1.
Nightly scheduled runs (09:00 UTC) run the full suite.
Doc-only changes are auto-detected and skip unnecessary tests.
Triggering CI on Pull Requests
Apply a CI label to your PR: CI:docs, CI:L0, CI:L1, CI:L2, or CI:Lfast.
Comment /ok to test <commit-sha>; a bot will acknowledge with a thumbs-up and start CI.
If you are an external contributor, an internal NVIDIA developer must post this comment on your PR to trigger CI.
The Skip CICD label bypasses tests entirely (except on main and merge-group events).
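The label and event rules above can be read as a small decision function. The following is a minimal sketch, not the pipeline's actual pre-flight code. The event strings (`push_main`, `merge_group`, `schedule`, `pull_request`), the function name, the one-label-per-PR restriction, and the choice to downgrade doc-only PRs to the docs level are all assumptions made for illustration.

```python
from typing import Iterable, Optional

# Label names are taken from the section above.
LABEL_TO_LEVEL = {
    "CI:docs": "docs",
    "CI:L0": "L0",
    "CI:L1": "L1",
    "CI:L2": "L2",
    "CI:Lfast": "Lfast",
}


def resolve_test_level(
    event: str, labels: Iterable[str] = (), doc_only: bool = False
) -> Optional[str]:
    """Return the test level to run, or None when no tests should run.

    The event names are hypothetical placeholders, not GitHub event names.
    """
    label_set = set(labels)

    # Pushes to main and merge-group events force L1; Skip CICD is ignored here.
    if event in ("push_main", "merge_group"):
        return "L1"
    # Nightly scheduled runs execute the full suite (L2).
    if event == "schedule":
        return "L2"
    if event != "pull_request":
        return None

    # On PRs, Skip CICD bypasses tests entirely.
    if "Skip CICD" in label_set:
        return None

    requested = sorted(LABEL_TO_LEVEL[name] for name in label_set if name in LABEL_TO_LEVEL)
    if not requested:
        return None  # PRs do not run tests unless a CI label is applied.
    if len(requested) > 1:
        # Assumption: this sketch rejects ambiguous input instead of guessing.
        raise ValueError(f"multiple CI labels applied: {requested}")

    # Assumption: a doc-only change needs nothing beyond the docs level.
    return "docs" if doc_only else requested[0]
```

For example, `resolve_test_level("pull_request", ["CI:L0"])` yields `"L0"`, while the same call with no labels yields `None`.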
Required Checks
All PRs must pass these checks before merging:
Lint: ruff + pyrefly via pre-commit
Branch freshness: PR branch must be at most 10 commits behind the base branch
Semantic PR title: must follow conventional commit format
DCO sign-off: all commits must be signed with --signoff (see CONTRIBUTING.md)
Secrets detection: scans for accidentally committed secrets
Submodule validation: Automodel submodule must be fast-forwarded from the base branch
Megatron-Bridge dependency sync: pyproject.toml dependencies must match the Megatron-Bridge submodule metadata
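Three of the checks above (semantic PR title, DCO sign-off, branch freshness) can be approximated locally before pushing. The sketch below is illustrative only and is not the code the CI runs. The list of accepted commit types is an assumption (the common Conventional Commits set), and the helper names are hypothetical. The 10-commit limit comes from the list above.

```python
import re
import subprocess

MAX_COMMITS_BEHIND = 10  # limit stated in the branch-freshness check

# Assumption: the commonly used Conventional Commits types; the repository's
# actual allow-list may differ.
_TYPES = "build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test"
_TITLE_RE = re.compile(rf"^(?:{_TYPES})(?:\([^()\s]+\))?!?: \S.*$")
_SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+>$", re.MULTILINE)


def is_conventional_title(title: str) -> bool:
    """True for titles shaped like 'type(scope): summary' (scope optional)."""
    return _TITLE_RE.match(title) is not None


def has_signoff(commit_message: str) -> bool:
    """True if the message carries the trailer added by `git commit --signoff`."""
    return _SIGNOFF_RE.search(commit_message) is not None


def commits_behind(base_ref: str, head_ref: str = "HEAD") -> int:
    """Count commits reachable from base_ref but not from head_ref."""
    result = subprocess.run(
        ["git", "rev-list", "--count", f"{head_ref}..{base_ref}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def is_fresh(behind: int, limit: int = MAX_COMMITS_BEHIND) -> bool:
    """True when the branch is at most `limit` commits behind its base."""
    return behind <= limit
```

Inside a checkout, `is_fresh(commits_behind("origin/main"))` would report whether a rebase is needed, assuming `origin/main` is the base branch and has been fetched.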
CI Pipeline Architecture
The main pipeline (cicd-main.yml) runs through these stages:
Pre-flight: determines test level from PR labels, changed files, and event type
Container build: Docker image built on GPU runners (skipped for Lfast or when a pre-built image_tag is provided via workflow_dispatch)
Tests: run in containers on GPU runners using the custom test-template action
Coverage: aggregated from doc-tests, unit-tests, and e2e; uploaded to Codecov
QA Gate: aggregates all job results into a single pass/fail status
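Two decisions in the stages above are simple enough to sketch: whether the container build is skipped, and how the QA Gate might fold job results into one status. This is a hedged illustration, not the workflow's real logic. The function names are hypothetical, and treating `skipped` jobs as acceptable in the gate is an assumption. The result strings (`success`, `failure`, `cancelled`, `skipped`) are the values GitHub Actions reports for a job.

```python
from typing import Mapping, Optional


def should_build_container(level: str, image_tag: Optional[str] = None) -> bool:
    """The build is skipped for Lfast or when a pre-built image tag is supplied."""
    return level != "Lfast" and not image_tag


def qa_gate(job_results: Mapping[str, str]) -> bool:
    """Collapse per-job results into a single pass/fail.

    Assumption: a skipped job (for example, a skipped build) does not fail the gate.
    """
    acceptable = {"success", "skipped"}
    return all(result in acceptable for result in job_results.values())
```

With these definitions, `qa_gate({"build": "skipped", "unit": "success"})` passes, while any `failure` or `cancelled` entry fails the gate.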
Code Review
Commenting /claude-review on a PR triggers an AI-powered code review. This is restricted to org members.
Nightly Runs
The full test suite runs daily at 09:00 UTC on main. Failures send Slack alerts.
Nightly docs are published at 10:00 UTC to a separate “nightly” version (does not overwrite stable “latest” docs).
Release Process
All release workflows are manual (workflow_dispatch) with dry-run defaults:
| Workflow | Purpose |
|---|---|
| | Create release branch and version bump |
| | Build wheel, create GitHub release, generate changelog |
| | Publish docs to S3 + Akamai CDN (versioned and/or “latest”) |
| | Auto-publish to TestPyPI on main/release pushes (dry-run by default) |
Infrastructure
VM health checks (healthcheck_vms.yml): daily GPU health checks (07:00 UTC) on self-hosted runners. Auto-reboots degraded VMs and alerts via Slack on persistent failures.
Merge queue retry (merge-queue-retry.yml): auto-retries PRs dequeued due to CI timeout (max 3 retries before alerting).
Stale cleanup (close-inactive-issue-pr.yml): daily auto-close of inactive issues and PRs.
Cherry-pick (cherry-pick-release-commit.yml): auto-creates cherry-pick PRs from release branches back to main.
Community bot (community-bot.yml): syncs issues and comments to a GitHub Project board for tracking.
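The merge queue retry rule (retry on CI timeout, at most 3 times, then alert) amounts to a bounded-retry policy. A minimal sketch follows, assuming a hypothetical `ci_timeout` reason string and hypothetical action names; it does not reflect how merge-queue-retry.yml is actually written.

```python
MAX_RETRIES = 3  # "max 3 retries before alerting"


def next_action(
    dequeue_reason: str, retries_so_far: int, max_retries: int = MAX_RETRIES
) -> str:
    """Decide what to do with a PR that was removed from the merge queue.

    'ci_timeout', 'ignore', 'retry', and 'alert' are placeholder names.
    """
    if dequeue_reason != "ci_timeout":
        return "ignore"  # only timeouts are retried automatically
    if retries_so_far < max_retries:
        return "retry"  # put the PR back in the queue
    return "alert"  # retry budget exhausted; notify a human
```

Under this policy a PR that times out is re-queued on its first three dequeues and triggers an alert on the fourth.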
Workflow Reference
| Workflow | Trigger | Purpose |
|---|---|---|
| | push, PR, schedule, dispatch | Main CI pipeline |
| | push main/r** | Wheel build + TestPyPI |
| | dispatch | Full release |
| | dispatch | Code freeze |
| | dispatch, callable | Publish docs to S3/CDN |
| | schedule (10:00 UTC) | Nightly docs publish |
| | PR | Secrets scanning |
| | PR | PR title validation |
| | PR | Auto-label by file path |
| | | AI code review |
| | schedule (07:00 UTC), dispatch | GPU runner health |
| | PR | Submodule validation |
| | PR (specific paths) | Dependency sync check |
| | PR dequeued (timeout) | Auto-retry merge queue |
| | push main | Release cherry-picks |
| | schedule (01:30 UTC) | Stale issue/PR cleanup |
| | issues, comments | Project board sync |
| | PR | Post submodule check results |