Megatron Bridge CI Test System#
Directory Layout#
tests/functional_tests/launch_scripts/
  h100/
    active/   # H100 tests that run in CI automatically
    flaky/    # H100 tests quarantined from blocking CI
  gb200/
    active/   # GB200 tests that run in CI automatically
    flaky/    # GB200 tests quarantined from blocking CI
Scripts are named {Tier}_{Description}.sh (e.g., L0_Launch_training.sh).
Unit tests live under tests/unit_tests/ and are independent of this layout.
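The naming rule can be checked mechanically. The sketch below is illustrative only: the function name `is_valid_launch_name` is hypothetical, and it assumes that L0, L1, and L2 are the only tier prefixes, since those are the only ones this page describes.

```shell
# Hypothetical helper: succeed if a file name follows {Tier}_{Description}.sh.
# Assumption: the only valid tier prefixes are L0, L1 and L2.
is_valid_launch_name() {
    case "$1" in
        L[0-2]_?*.sh) return 0 ;;   # tier prefix, underscore, non-empty description
        *) return 1 ;;
    esac
}
```

For example, `is_valid_launch_name L0_Launch_training.sh` succeeds, while a name without a tier prefix is rejected.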
Tier Semantics#
| Tier | Trigger | Blocking |
|---|---|---|
| L0 | Every PR, every push to | Yes: PR cannot merge if L0 fails |
| L1 | Push to | Yes |
| L2 | Schedule and | Yes (when triggered) |
| flaky | | No: failures are informational |
H100 and GB200 each have their own L0/L1/L2/flaky jobs. Moving a test to flaky/ under one hardware directory removes it from blocking CI for that hardware target only; the other target is unaffected until the script is moved there as well.
Script Conventions#
Every launch script must start with:
# CI_TIMEOUT=<minutes>
This is parsed by the matrix generator to set the job timeout. If the header is absent, the default is 30 minutes.
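As a rough illustration of the header rule, the following sketch extracts the timeout from a script and falls back to 30 minutes when the header is missing. The function name `ci_timeout_for` is hypothetical, and looking only at the first line is an assumption; the actual matrix generator may parse the header differently.

```shell
# Hypothetical helper: print the CI timeout (in minutes) declared by a launch script.
# Assumption: the header sits on the first line as "# CI_TIMEOUT=<minutes>".
ci_timeout_for() {
    # sed prints the captured number only if line 1 matches the header pattern.
    _ci_timeout=$(sed -n '1s/^#[[:space:]]*CI_TIMEOUT=\([0-9][0-9]*\)[[:space:]]*$/\1/p' "$1" 2>/dev/null) || true
    # Fall back to the documented default of 30 minutes when nothing matched.
    echo "${_ci_timeout:-30}"
}
```

Calling `ci_timeout_for path/to/script.sh` prints the declared value, or the default when the header is absent.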
The tier prefix in the filename (L0_, L1_, L2_) controls which matrix the
script is included in. The matrix generator globs {tier}_*.sh from
{h100,gb200}/active/.
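The glob described above can be sketched in a few lines of shell. This is not the generator's actual code: `list_tier_scripts` is a hypothetical name, and the real generator presumably also emits matrix entries with timeouts rather than bare file names.

```shell
# Hypothetical helper: list the active scripts for one hardware target and tier.
# Usage: list_tier_scripts <launch_scripts_root> <hardware> <tier>
list_tier_scripts() {
    for script in "$1/$2/active/${3}"_*.sh; do
        # An unmatched glob stays literal, so skip anything that is not a real file.
        [ -f "$script" ] || continue
        basename "$script"
    done
}
```

Because only active/ is globbed, a script that sits in flaky/ never shows up in the tier's list.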
Adding a New Test#
1. Create the script under the appropriate active/ directory (or both if the test should run on both H100 and GB200).
2. Start the file with # CI_TIMEOUT=<minutes>.
3. Name the file {Tier}_{CamelDescription}.sh.
4. Make it executable: chmod +x <file>.
No workflow changes are needed; the matrix is generated dynamically by scanning the directory.
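The steps for adding a test can be combined into a small scaffold. The sketch below is a hedged example: `new_launch_script` and the test name `MyNewTest` are hypothetical, and the generated body is only a placeholder to be replaced with real test commands.

```shell
# Hypothetical helper: scaffold a launch script that follows the conventions above.
# Usage: new_launch_script <active_dir> <tier> <CamelDescription> <timeout_minutes>
new_launch_script() {
    _path="$1/${2}_${3}.sh"
    # The CI_TIMEOUT header goes on the first line, followed by a placeholder body.
    printf '# CI_TIMEOUT=%s\n\n# TODO: add the test commands here\n' "$4" > "$_path"
    # Mark the script executable, as required for launch scripts.
    chmod +x "$_path"
    echo "$_path"
}
```

A call such as `new_launch_script tests/functional_tests/launch_scripts/h100/active L0 MyNewTest 20` would create `L0_MyNewTest.sh` with a 20-minute timeout; repeat it with the gb200/active directory if the test should run on both targets.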
Moving a Test to Flaky#
Use git mv to relocate the script from active/ to flaky/:
# H100
git mv tests/functional_tests/launch_scripts/h100/active/L0_Foo.sh \
tests/functional_tests/launch_scripts/h100/flaky/L0_Foo.sh
# GB200 (if the test also exists there)
git mv tests/functional_tests/launch_scripts/gb200/active/L0_Foo.sh \
tests/functional_tests/launch_scripts/gb200/flaky/L0_Foo.sh
Flaky tests still run on manual dispatches (test_suite=all), so failures
remain visible; they just don't block PRs. Move the script back to active/ once the
underlying issue is fixed.
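Restoring a fixed test is the reverse move. The sketch below prints the corresponding git mv commands instead of running them, so they can be reviewed first; `restore_from_flaky_cmds` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: print the git mv commands that would restore a script
# from flaky/ to active/ on every hardware target where it is quarantined.
# Usage: restore_from_flaky_cmds <launch_scripts_root> <script_name>
restore_from_flaky_cmds() {
    for hw in h100 gb200; do
        # Only emit a command for targets where the script is currently in flaky/.
        if [ -f "$1/$hw/flaky/$2" ]; then
            echo "git mv $1/$hw/flaky/$2 $1/$hw/active/$2"
        fi
    done
}
```

Printing rather than executing keeps the helper side-effect free; the output can be pasted into a shell once it looks right.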
Removing a Test#
Delete the script file and commit. No other changes required.
CI Job Reference#
| GitHub Actions job | Hardware | Directory scanned |
|---|---|---|
| | H100 | |
| | H100 | |
| | H100 | |
| | H100 | |
| | GB200 | |
| | GB200 | |
| | GB200 | |
| | GB200 | |
Hardware runners: H100 uses nemo-ci-{azure,aws}-gpu-x2; GB200 uses
nemo-ci-gcp-gpu-x2.
Code Anchors#
| Component | Path |
|---|---|
| Matrix generation (H100) | |
| Matrix generation (GB200) | |
| Test runner action | |
| Launch scripts root | |