Release and QA Process
This page describes how the NVIDIA Infra Controller (NICo) project is branched, versioned, tested, and released. It is intended for both contributors and operators who want to understand which version of NICo they should be running.
TL;DR
- Use the latest final vX.Y.Z tag for production-style deployments. Builds from main and from -pr or -rc tags are for prerelease testing.
- Every month, main branches to releases/vX.Y; after one month of QA, the final vX.Y.0 release is tagged on that branch.
- Patch releases stay on the same releases/vX.Y branch and ship only when fixes warrant them.
- NICo keeps three minor releases visible: Current, Maintenance, and EOL. Upgrades are supported from EOL to Maintenance or Current, from Maintenance to Current, and to newer patches in the same minor. Anything older than EOL has no supported upgrade path.
- Guaranteed public APIs stay backward-compatible within a major version. Breaking removals require a future major release and at least one full three-month roadmap window of notice.
Where Releases Live
- GitHub releases: https://github.com/NVIDIA/infra-controller/releases
- Issue tracker: https://github.com/NVIDIA/infra-controller/issues
- Source: https://github.com/NVIDIA/infra-controller
Every published minor and patch release is available on the GitHub releases page above, tagged with its semver version (see Tag Naming below).
Branches
NICo uses just two long-lived branch types — main and per-minor-version
release branches — together with semver tags that distinguish prereleases,
release candidates, and final releases. The -rc and -pr suffixes are
tag suffixes, not branch suffixes.
main — Ongoing Development
All changes land on main first. There is no expectation of stability on
main; it is not QA tested. The only tests that gate changes to main are the
automated tests that run in CI. Features may be incomplete and bugs may be
present at any commit.
Use main if you want early access to in-progress features and you accept that
things will sometimes be broken.
releases/vX.Y — Release Branches
When development for a minor version is feature-complete, a new long-lived
release branch (for example, releases/v2.1) is cut from main. This single
branch holds the entire life of that minor version:
- During the one-month QA window, the branch carries vX.Y.Z-rcN tags as fixes land; these are release candidates, not final releases.
- Once QA signs off, a final vX.Y.0 tag is cut on the same branch.
- After GA, the branch continues to host patch releases (vX.Y.1, vX.Y.2, …) as they are tagged.
The branch itself never carries an -rc suffix — only the tags on it do.
The latest tag on this branch without an -rc suffix is what most users should deploy.
See Tag Naming below.
Tag Naming
NICo uses semantic versioning of the form vX.Y.Z:
- X: major version
- Y: minor version
- Z: patch version
The following tag forms appear in the repository:
- vX.Y.0: A minor release. Published as a GitHub release from releases/vX.Y.
- vX.Y.Z (where Z > 0): A patch release on top of vX.Y.0. Also published as a GitHub release from releases/vX.Y.
- vX.Y.Z-rcN (e.g. v2.1.0-rc1, v2.1.0-rc2, v2.1.5-rc1): A release candidate. Applied to commits on releases/vX.Y during the QA window for whichever release is being prepared (the initial .0 or a later patch). All four elements (major, minor, patch, and RC number) are always present. Patch releases do not get their own branch. They live on the same releases/vX.Y branch and are distinguished only by the tag. So the first release candidate for v2.1.5 is tagged v2.1.5-rc1 on releases/v2.1, and the final tag (once QA signs off) is v2.1.5.
- vX.Y.Z-pr (always with Z = 0, e.g. v2.2.0-pr): Applied to main immediately after a release branch is cut, to indicate that main is now the prerelease for the next minor version. All three numeric elements are present for consistency with -rcN tags. For example, on the day releases/v2.1 is cut, main is tagged v2.2.0-pr, signaling that main is now pre-v2.2.0.
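The tag grammar above is regular enough to check mechanically, for example in a release script. The following is a minimal Python sketch and is not part of NICo's tooling. The names TAG_RE, TagInfo, and classify_tag are hypothetical. It accepts exactly the four forms listed and rejects -pr tags whose patch number is not 0.

```python
import re
from typing import NamedTuple, Optional

# vX.Y.Z, optionally followed by -rcN or -pr (hypothetical helper, not NICo tooling).
TAG_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+)(?:-(?:rc(\d+)|(pr)))?")


class TagInfo(NamedTuple):
    major: int
    minor: int
    patch: int
    kind: str            # "minor", "patch", "rc", or "pr"
    rc: Optional[int]    # RC number for -rcN tags, otherwise None


def classify_tag(tag: str) -> TagInfo:
    """Parse a tag and report which kind of release it marks."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a release tag: {tag!r}")
    major, minor, patch = (int(g) for g in m.group(1, 2, 3))
    if m.group(5) is not None:  # -pr suffix
        if patch != 0:
            raise ValueError(f"-pr tags always use Z = 0: {tag!r}")
        return TagInfo(major, minor, patch, "pr", None)
    if m.group(4) is not None:  # -rcN suffix
        return TagInfo(major, minor, patch, "rc", int(m.group(4)))
    # No suffix: a final release, either the .0 minor or a later patch.
    return TagInfo(major, minor, patch, "minor" if patch == 0 else "patch", None)
```

For example, classify_tag("v2.1.5-rc1") yields kind "rc" with rc=1. A branch name such as "releases/v2.1" raises ValueError, because -rc and -pr are tag suffixes and never part of a branch name.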
Release Cadence
NICo follows a fixed monthly cadence with a one-month QA window.
We also aim to avoid releases during major US and international holiday periods — including, but not limited to, the late-December/early-January end-of-year break, US Thanksgiving week, Lunar New Year, and Diwali — out of respect for the work/life balance of contributors and operators who observe them. When the published schedule would otherwise land a release inside one of these windows, the release is rescheduled to the next practical date.
Three-Month Rolling Roadmap
NICo maintains a three-month rolling roadmap alongside the monthly release cadence. The roadmap gives contributors, QA, and operators a current view of the next three planned minor-release cycles, including:
- planned feature themes or notable work targeted for each minor release;
- expected code-complete dates, QA windows, and final release targets;
- known schedule risks, dependency risks, or holiday-window adjustments; and
- items that have moved into or out of a cycle since the previous update.
The roadmap is refreshed at least once per month, typically after the monthly branch cut and prerelease tag, so it always rolls forward to keep three months visible. It is planning guidance rather than a release guarantee: features may move between cycles as priorities change, QA findings emerge, or release dates are adjusted.
Minor Releases (X.Y.0)
Every month:
- Code complete (last day of each month): a new release branch (e.g. releases/v2.1) is cut from main.
- Immediately after the cut, main is tagged with vX.(Y+1).0-pr to mark the start of the next prerelease cycle on main.
- The release branch is stabilized and QA tested for one month. During this window, release-candidate tags (e.g. v2.1.0-rc1, v2.1.0-rc2, …) are applied to commits on the branch as QA cycles through them.
- Final minor release (last day of the following month): when QA signs off, a vX.Y.0 tag is cut on the same releases/vX.Y branch and published as a GitHub release.
In short: minor releases ship one month after code complete.
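The monthly steps above reduce to a little date arithmetic. This Python helper is hypothetical: plan_minor_release is not an existing NICo tool. It computes the nominal dates and tag names for one cycle and assumes that both milestones fall exactly on the last day of the month. It does not model holiday-window rescheduling or QA slips, so treat its dates as a starting point for the published schedule, not the schedule itself.

```python
import calendar
from datetime import date


def last_day_of_month(year: int, month: int) -> date:
    """Return the last calendar day of the given month."""
    return date(year, month, calendar.monthrange(year, month)[1])


def plan_minor_release(year: int, month: int, major: int, minor: int) -> dict:
    """Nominal dates and tags for vMAJOR.MINOR, cut at the end of year/month.

    Holiday-window rescheduling is deliberately not modeled.
    """
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    return {
        "branch": f"releases/v{major}.{minor}",
        "code_complete": last_day_of_month(year, month),
        "main_tag": f"v{major}.{minor + 1}.0-pr",   # main becomes the next prerelease
        "first_rc": f"v{major}.{minor}.0-rc1",
        "final_tag": f"v{major}.{minor}.0",
        "final_release": last_day_of_month(next_year, next_month),
    }
```

For example, a v2.1 cut at the end of January 2025 nominally ships on 2025-02-28. The main branch is tagged v2.2.0-pr on the day of the cut.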
Patch Releases (X.Y.Z)
Patch releases happen on the releases/vX.Y branch after the corresponding
vX.Y.0 has shipped. They are cut as needed — primarily for critical bug
fixes (data loss, security, production-blocking regressions) or significant
issues that cannot wait for the next minor release. There is no fixed patch
cadence; patches ship when the fixes warrant them.
Patch releases go through their own QA window, scoped to the changes being shipped. The mechanics are the same as for a minor release but use a patch-versioned RC tag:
- Candidate commits are tagged on releases/vX.Y as vX.Y.Z-rcN (e.g. v2.1.5-rc1, v2.1.5-rc2).
- QA executes the relevant test plans against the RC tag.
- Once QA signs off, the final vX.Y.Z tag is cut on the same branch.
Note that patch releases do not get their own branch. All v2.1.* work
lives on releases/v2.1; only the tags distinguish a patch RC from the
final patch release. Each final patch release is published on GitHub with
a vX.Y.Z tag.
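Every patch of a minor lives on the same branch. That makes "what should I deploy?" and "what is the next RC tag?" both questions about that branch's tag list. Below is a minimal Python sketch with hypothetical helper names. It assumes the tags have already been listed (for example with git tag --list 'v2.1.*') and that the list covers a single minor. It ignores -pr tags and anything else that is not vX.Y.Z or vX.Y.Z-rcN. Versions are compared numerically, so v2.1.10 sorts after v2.1.9.

```python
import re
from typing import List, Optional, Tuple

# vX.Y.Z with an optional -rcN suffix; anything else (e.g. -pr) is skipped.
BRANCH_TAG_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+)(?:-rc(\d+))?")


def _parse(tag: str) -> Optional[Tuple[Tuple[int, int, int], Optional[int]]]:
    """Return ((major, minor, patch), rc_number_or_None), or None if unmatched."""
    m = BRANCH_TAG_RE.fullmatch(tag)
    if m is None:
        return None
    version = (int(m.group(1)), int(m.group(2)), int(m.group(3)))
    rc = int(m.group(4)) if m.group(4) is not None else None
    return version, rc


def latest_final_tag(tags: List[str]) -> Optional[str]:
    """Highest tag without an -rc suffix (compared numerically), or None."""
    finals = []
    for tag in tags:
        parsed = _parse(tag)
        if parsed is not None and parsed[1] is None:
            finals.append((parsed[0], tag))
    return max(finals)[1] if finals else None


def next_rc_tag(tags: List[str], version: str) -> str:
    """Next -rcN tag for a version such as "2.1.5", starting at rc1."""
    target = tuple(int(part) for part in version.split("."))
    used = []
    for tag in tags:
        parsed = _parse(tag)
        if parsed is not None and parsed[0] == target and parsed[1] is not None:
            used.append(parsed[1])
    return f"v{version}-rc{max(used, default=0) + 1}"
```

Suppose the tags are v2.1.0-pr, v2.1.0-rc1, v2.1.0, v2.1.1, and v2.1.2-rc1. Then latest_final_tag returns "v2.1.1", and next_rc_tag(tags, "2.1.2") returns "v2.1.2-rc2".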
Which Version Should I Use?
Bugs found on a tagged release (vX.Y.Z with no -rc or -pr suffix) are
treated with the highest priority and are tracked as QA test escapes —
defects that slipped past the QA window and require a follow-up fix, typically
in the next patch release.
Support Policy
At any point in time, exactly three minor releases are visible to users, each in a different support tier. The tiers shift forward by one slot each time a new minor release passes QA.
A note on terminology. The middle tier is called Maintenance in this document. The user-supplied draft of this policy used the word deprecated; we use Maintenance instead because it’s the more standardized industry term for “still supported, but on a higher bar for changes” (e.g. Kubernetes and PostgreSQL community releases). Deprecated in most ecosystems implies “scheduled for removal,” which is closer to what we mean by EOL.
Tier Transitions
When release vX.Y passes QA and becomes Current:
- The release that was Current (vX.(Y-1)) moves to Maintenance.
- The release that was Maintenance (vX.(Y-2)) moves to EOL and stops receiving fixes.
- The newly Current release (vX.Y) begins accepting patch releases under the normal bar.
Because the monthly cadence is fixed, each minor release spends roughly one month as Current, one month as Maintenance, and is then EOL.
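Each newly GA minor pushes the others down one slot. As a result, a release's tier depends only on how many newer minors have shipped. Below is a minimal Python sketch that assumes the input is the list of GA minor versions, ordered oldest first. The "Unsupported" label for releases older than EOL is a hypothetical name and not a term this policy defines.

```python
from typing import Dict, List

TIERS = ("Current", "Maintenance", "EOL")


def support_tiers(ga_minors: List[str]) -> Dict[str, str]:
    """Map GA minor releases (listed oldest first) to their support tier.

    The newest three releases are Current, Maintenance, and EOL; anything
    older gets the hypothetical label "Unsupported".
    """
    tiers = {}
    # age 0 is the newest release, age 1 the one before it, and so on.
    for age, minor in enumerate(reversed(ga_minors)):
        tiers[minor] = TIERS[age] if age < len(TIERS) else "Unsupported"
    return tiers
```

Appending a newly GA release to the list shifts every older release down by one tier, which matches the transition described above.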
Fix Backporting
NICo uses a four-level severity scheme (Critical, High, Medium, and Low) aligned with common industry practice (see, for example, the Kubernetes patch-release criteria and the CVSS v3.1 severity ratings for security issues).
A change is anything that is not just a bug fix: new APIs, new fields, new flags, new dependencies, version bumps of major dependencies, refactors, performance improvements that are not fixing a regression, etc.
The bars below apply on top of these definitions:
- Current (“normal bar”): accepts Critical, High, and Medium bug fixes, shipped via patch releases (vX.Y.Z). Low-severity fixes are accepted when they are low-risk; they may also be deferred to the next minor release. New feature work does not land on Current; features land in main and ship in the next minor release. Small, low-risk changes (for example, a one-line configuration option or a clearer error message) may occasionally land alongside fixes when their value clearly outweighs the risk of destabilizing a supported release. This is the exception, not the rule.
- Maintenance (“higher bar”): accepts Critical and High only. Medium- and Low-severity bug fixes are not backported, and no changes (in the sense above) are accepted. The intent is to keep Maintenance releases as stable and predictable as possible: only fixes that would otherwise compel a user to upgrade are backported.
- EOL receives no fixes regardless of severity. Users on EOL releases should plan an upgrade.
When in doubt about whether a fix clears the Maintenance bar, default to “no” and link the original fix PR in a comment so the decision is auditable.
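The per-tier bars can be summarized as a small decision function, for example to label backport requests consistently. This is a sketch with hypothetical names, and it encodes only the rules stated above. Deciding a bug's severity remains a human call. So does judging whether a Low fix or small change is low-risk enough, and the sketch reports that outcome as CASE_BY_CASE. For Maintenance, anything outside Critical and High is rejected, which matches the "default to no" guidance.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    CASE_BY_CASE = "case-by-case"   # only if low-risk and clearly worth it
    REJECT = "reject"


SEVERITIES = ("Critical", "High", "Medium", "Low")


def backport_decision(tier: str, kind: str) -> Decision:
    """kind is a bug-fix severity from SEVERITIES, or "change" for non-fixes."""
    if kind not in SEVERITIES + ("change",):
        raise ValueError(f"unknown kind: {kind!r}")
    if tier == "EOL":
        return Decision.REJECT                  # no fixes regardless of severity
    if tier == "Maintenance":
        # Critical and High fixes only; no Medium/Low fixes and no changes.
        return Decision.ACCEPT if kind in ("Critical", "High") else Decision.REJECT
    if tier == "Current":
        if kind in ("Critical", "High", "Medium"):
            return Decision.ACCEPT
        return Decision.CASE_BY_CASE            # Low fixes and small changes
    raise ValueError(f"unknown tier: {tier!r}")
```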
Upgrade and Downgrade Support
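Upgrades are supported from EOL to Maintenance or Current, from Maintenance to Current, and to newer patches within the same minor release. Anything older than EOL has no supported upgrade path. You may skip the Maintenance tier when upgrading from EOL straight to Current. You may not move backward to an older minor (or to an older patch within the same minor). If a Current release introduces a problem that blocks you, the supported recovery is a forward-fix in the next patch release, not a downgrade.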
Downgrade support is being tracked as a potential future capability in issue #2019 — feat: Need to be able to downgrade NICo versions; follow that issue for the latest state.
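The supported upgrade paths can be expressed as a single predicate over two versions and the current tier assignment. Those paths are EOL to Maintenance or Current, Maintenance to Current, and newer patches within the same minor. This is a hypothetical Python sketch. It assumes tiers are supplied as a mapping from (major, minor) to "Current", "Maintenance", or "EOL", and it treats any minor missing from the mapping as older than EOL. Downgrades always return False, including a move to an older patch of the same minor.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]   # (major, minor, patch)
RANK = {"EOL": 0, "Maintenance": 1, "Current": 2}


def upgrade_supported(src: Version, dst: Version,
                      tiers: Dict[Tuple[int, int], str]) -> bool:
    """True if moving from src to dst is a supported upgrade path."""
    src_tier = tiers.get(src[:2])
    dst_tier = tiers.get(dst[:2])
    if src_tier is None or dst_tier is None:
        return False                          # older than EOL: no supported path
    if src[:2] == dst[:2]:
        return dst[2] > src[2]                # same minor: newer patch only
    return RANK[dst_tier] > RANK[src_tier]    # forward across tiers, never backward
```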
QA Workflow
NICo’s QA process is tracked entirely in GitHub Issues, using the NVIDIA Infra Controller GitHub Project. Every issue carries two relevant fields:
- Status: the overall lifecycle of the issue (dev side).
- QA Test Status: the QA-side lifecycle.
Ground Rules
- Every issue is expected to have a QA Test Status. Even issues that turn out to need no testing must be marked QA Not Required: there is no “unset” outcome. Today this field is set manually; there is no automation that initializes it on issue creation.
- Every PR is expected to have at least one linked issue. Use GitHub’s Fixes #N / Closes #N / Resolves #N keywords, or attach the PR to the issue from the issue’s sidebar. Code changes without a linked issue should not merge.
- QA decides what testing is needed, not engineering. Engineers should not pre-set QA Not Required or otherwise short-circuit the QA triage process. The QA team owns triage and scoping; engineering owns the fix and the test-plan dev signoff.
- Merging a PR does not close its linked issue(s). When a PR merges, the linked issue should move to Status: Verify with Disposition: Item Completed, meaning the code is complete and the issue is now ready for QA to test. Closure happens only after QA passes. This is being automated via PR #2584 (ci: complete linked issues on merged PRs); the rest of this document assumes that automation is in place.
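A merge gate for the linked-issue rule could start by scanning the PR description for GitHub's closing keywords. This hypothetical Python sketch recognizes the close/fix/resolve keyword family (including the -s and -d forms) followed by a same-repository #N reference. It cannot see issues attached through the sidebar or cross-repository references. A real check would also need to query the GitHub API for those links.

```python
import re
from typing import Set

# close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved, then "#N".
CLOSING_RE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)\b",
    re.IGNORECASE,
)


def linked_issues(pr_body: str) -> Set[int]:
    """Issue numbers that a PR description links via closing keywords."""
    return {int(number) for number in CLOSING_RE.findall(pr_body)}


def may_merge(pr_body: str) -> bool:
    """Hypothetical pre-merge check: require at least one keyword-linked issue."""
    return bool(linked_issues(pr_body))
```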
QA Test Status Values and Transitions
The QA Test Status field walks roughly like this:
- QA to triage: Default starting state. QA reviews the issue and decides what (if any) testing is required.
- QA Need Info: QA needs clarification from the reporter or the engineer before they can scope the work. Returns to triage or test design once answered.
- QA Not Required: Terminal QA state. The issue still proceeds through the normal dev workflow (In Progress → Verify | Item Completed → Closed); QA simply does not gate it. Used for internal refactors, dev-only tooling, doc-only changes, etc.
- QA Test Design: QA owns the issue and is writing the test plan. The QA Engineer field is assigned at this point.
- Dev Signoff Required: QA has drafted a test plan and is asking the responsible engineer to confirm that it correctly covers the change.
- Test Plan Rework Required: Dev pushed back on the test plan. Returns to QA Test Design for revision.
- Test Plan Approved: Dev has signed off. The plan is ready to be executed once the fix lands.
- QA Execution: QA is actively running the approved test plan, typically after the linked PR(s) have merged and the issue has moved to Status: Verify | Item Completed.
- QA Passed: All tests passed. The issue can move to Status: Closed.
- QA Failed: Tests failed. The issue goes back to engineering for a fix. After the fix is merged, it returns directly to QA Execution; the test plan itself does not need to be redesigned unless the failure reveals a gap in the plan.
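The walk described above can be written down as a transition table, which helps spot issues whose QA Test Status history skips a step. This Python sketch is an interpretation of the prose, not something the GitHub Project enforces. The text says the field walks "roughly" like this, so treat a rejected move as something to look at, not a hard error. QA_TRANSITIONS and check_path are hypothetical names.

```python
from typing import List

# Allowed QA Test Status moves, as read from the descriptions above.
QA_TRANSITIONS = {
    "QA to triage": {"QA Need Info", "QA Not Required", "QA Test Design"},
    "QA Need Info": {"QA to triage", "QA Test Design"},
    "QA Not Required": set(),                               # terminal
    "QA Test Design": {"Dev Signoff Required"},
    "Dev Signoff Required": {"Test Plan Approved", "Test Plan Rework Required"},
    "Test Plan Rework Required": {"QA Test Design"},
    "Test Plan Approved": {"QA Execution"},
    "QA Execution": {"QA Passed", "QA Failed"},
    "QA Passed": set(),                                     # terminal
    # Back to execution after a fix; back to design only if the plan had a gap.
    "QA Failed": {"QA Execution", "QA Test Design"},
}


def check_path(path: List[str]) -> None:
    """Raise ValueError at the first move the table above does not allow."""
    for current, nxt in zip(path, path[1:]):
        if nxt not in QA_TRANSITIONS.get(current, set()):
            raise ValueError(f"unexpected QA Test Status move: {current} -> {nxt}")
```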
How QA Test Status Relates to Issue Status
The two fields move semi-independently:
- While dev is still working, Status is In Progress and QA Test Status is typically somewhere in the triage/test-design/signoff portion of its track.
- When the PR merges, the linked issue’s Status flips to Verify | Item Completed (via the automation in PR #2584), signaling to QA that the fix is code-complete and ready to be exercised. QA Test Status is expected to be Test Plan Approved (or already in QA Execution) by this point.
- Once QA Test Status becomes QA Passed, the issue’s Status is moved to Closed with Disposition: Item Completed.
- If QA Test Status becomes QA Failed, the issue typically moves back to Status: In Progress so engineering can address the failure.
Roles
- Engineer: writes the fix, links the PR to the issue, reviews the QA test plan when asked (Dev Signoff Required), and addresses any QA Failed outcomes.
- QA Engineer (set via the QA Engineer field, assigned at QA Test Design): owns triage, test plan authoring, execution, and the final pass/fail call.
Backward Compatibility
Within a major version, breaking changes are not allowed anywhere in the codebase for anything that falls under our API guarantees.
Deprecation and Breaking-Change Notice
Guaranteed public APIs may be deprecated before a future breaking change, but deprecation is a warning, not removal. A deprecated guaranteed API must remain functional for the rest of the current major version.
Removal of, or an incompatible behavior change to, a guaranteed public API is allowed only in a future major release. Any such change must be announced in the release notes and the three-month rolling roadmap, and should include a replacement or migration path when one exists.
When practical, deprecated public APIs should also produce an operator-visible warning, such as an API warning, CLI warning, log message, or release-note callout.
The minimum notice period for a breaking change to a guaranteed public API is one full three-month rolling-roadmap window before the first release that removes or changes it incompatibly. Emergency exceptions for security, data corruption, or similarly severe issues must be called out explicitly in the release notes.
This notice policy applies only to the guaranteed surfaces below. Internal APIs and storage formats listed under What Is Explicitly Not Guaranteed may change between releases.
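The two conditions for removing or incompatibly changing a guaranteed API can be checked together. The change must ship in a future major release, and it needs at least one full three-month roadmap window of notice. This Python sketch assumes "one full window" means the release date is at least three calendar months after the announcement. breaking_change_allowed and add_months are hypothetical names. The emergency exception for security or data-corruption issues is deliberately not modeled.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Same day `months` later, clamped to the end of shorter months."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    day = min(d.day, calendar.monthrange(year, month_index + 1)[1])
    return date(year, month_index + 1, day)


def breaking_change_allowed(current_major: int, release_major: int,
                            announced: date, release_date: date) -> bool:
    """May a release remove or incompatibly change a guaranteed public API?"""
    if release_major <= current_major:
        return False          # only a future major release may break the API
    # Assumption: a "full three-month window" means >= 3 calendar months of notice.
    return release_date >= add_months(announced, 3)
```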
What Is Guaranteed to Remain Backward Compatible
- The NICo REST API.
- The NICo CLI (nicocli): command names, arguments, flags, values, and exit codes.
- Configuration file structures: keys, values, filenames, and locations.
- Environment variable names and values consumed by NICo components.
If you depend on any of the above, you can rely on them not changing incompatibly across releases within the same major version.
What Is Explicitly Not Guaranteed
The following are considered internal and may change without notice between releases:
- The gRPC API and protobuf message contents.
- The admin CLI (also referred to as the debug CLI) — a lower-level tool intended for operators and developers, not end users.
- The admin UI (also referred to as the debug UI) — same audience as the admin CLI.
- The Vault data model — how secrets are laid out inside HashiCorp Vault.
- The PostgreSQL database schema used by NICo services. See issue #2019 for the current state of this guarantee (tracked alongside downgrade support, which depends on it).
- Any other internal API contract between NICo services, or persistent data formats used only by NICo itself.
If you build automation that depends on any of the unguaranteed items above, expect to update it across NICo releases.
Glossary
A few terms used on this page that may not be obvious:
- Code complete: the point in the cycle at which feature work for a minor version stops and stabilization begins. On this date, the release branch is cut from main.
- Release candidate (rc): a tagged build on a releases/vX.Y branch that is a candidate for release, pending QA sign-off. Identified by the -rcN suffix on the tag (e.g. v2.1.0-rc1, v2.1.5-rc2). Note: -rc is a tag suffix only; there is no releases/...-rc branch.
- Prerelease (pr): a build of main that is on its way to becoming the next minor release. Identified by the -pr suffix on a tag (e.g. v2.2.0-pr).
- QA sign-off: the formal acknowledgment from QA that a release candidate has passed its test plan and may be promoted to a final release.
- QA test escape: a defect discovered in a tagged, signed-off release that was not caught during the QA window. These are treated as high-priority and typically fixed in a subsequent patch release.
- Semver: semantic versioning, the vX.Y.Z scheme used by NICo, where X is major, Y is minor, and Z is patch.
- Test plan: the set of test cases QA writes for a given issue during QA Test Design. The plan is what gets dev-signed-off and then executed in QA Execution.
- Disposition: the GitHub project field that records why an issue was closed (e.g. Item Completed, Cannot reproduce, Will not fix, Behaves Correctly, Not a bug). Independent of QA Test Status.
- Current: the most recent GA minor release. Receives bug fixes under the normal bar via patch releases.
- Maintenance: the minor release one version behind Current. Still supported, but only for Critical- and High-severity fixes (for example, serious regressions, security fixes, and critical blockers).
- End-of-Life (EOL): the minor release two versions behind Current. No longer receives fixes. Users should upgrade to Maintenance or Current.
- Three-month rolling roadmap: a planning view of the next three planned minor-release cycles. It is refreshed monthly and used for coordination, not as a release guarantee.