Testing#
Directory Layout#
tests/
  unit_tests/            # fast, isolated, no GPU required
  functional_tests/
    launch_scripts/
      h100/
        active/          # H100 tests that run in CI automatically
        flaky/           # H100 tests quarantined from blocking CI
      gb200/
        active/          # GB200 tests that run in CI automatically
        flaky/           # GB200 tests quarantined from blocking CI
Unit tests are independent of the launch script layout. Functional test
scripts are named {Tier}_{Description}.sh (e.g., L0_Launch_training.sh).
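As an illustration of the naming convention, the following Python sketch splits a script filename into its tier and description. The regular expression and the `parse_script_name` helper are assumptions made for this example, not part of the repository's tooling; in particular, it assumes a tier is `L` followed by digits.

```python
import re
from typing import NamedTuple

# Assumed pattern: "L" plus digits, an underscore, a description, then ".sh".
SCRIPT_NAME = re.compile(r"^(?P<tier>L\d+)_(?P<description>[A-Za-z0-9_]+)\.sh$")


class ScriptName(NamedTuple):
    tier: str
    description: str


def parse_script_name(filename: str) -> ScriptName:
    """Split '{Tier}_{Description}.sh' into its parts (hypothetical helper)."""
    match = SCRIPT_NAME.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow {{Tier}}_{{Description}}.sh")
    return ScriptName(match["tier"], match["description"])


print(parse_script_name("L0_Launch_training.sh"))
```

A check like this could run in a pre-commit hook so that a misnamed script is caught before it is silently left out of every tier.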
Tier Semantics#
| Tier | Trigger | Blocking |
|---|---|---|
| L0 | Every PR, every push to | Yes; PR cannot merge if L0 fails |
| L1 | Push to | Yes |
| L2 | Schedule, | Yes (when triggered) |
| flaky | | No; failures are informational |
H100 and GB200 each have independent L0/L1/L2/flaky jobs. Moving a script to
flaky/ removes it from blocking CI on that hardware target only.
Prefer unit tests over functional tests. CI GPU resources are limited; every functional test slot has a real cost.
Running Tests Locally#
Unit Tests#
No GPU required:
uv run pytest tests/unit_tests/ -x -v
Or inside Docker:
docker run --rm --gpus all -v $(pwd):/workdir/ -w /workdir/ megatron-bridge \
uv run pytest tests/unit_tests/
Functional Tests#
Run the corresponding launch script directly on a GPU node:
bash tests/functional_tests/launch_scripts/h100/active/L0_Launch_training.sh
Adding a Unit Test#
- Place the file under tests/unit_tests/<domain>/test_<name>.py.
- Mark it: @pytest.mark.unit.
- Keep configs tiny: small hidden dims, 1-2 layers, short sequences.
- Run locally: uv run python -m pytest tests/unit_tests/<your_test>.py
No foreign setattr on config dataclasses. When applying overrides via
setattr(config_obj, key, value), always guard first:
if not hasattr(config_obj, key):
    raise ValueError(f"Config has no field '{key}'")
setattr(config_obj, key, value)
Setting a non-existent attribute silently creates a phantom field — the test passes but the recipe fails for a real user.
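The guard can be wrapped in a small helper so every test applies overrides the same way. The sketch below is a minimal example under stated assumptions: `TinyModelConfig` is a made-up stand-in for a real config dataclass, and `apply_overrides` is a hypothetical name, not an existing function in the codebase.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class TinyModelConfig:
    """Hypothetical stand-in for a recipe config dataclass."""

    hidden_size: int = 64
    num_layers: int = 2
    seq_length: int = 128


def apply_overrides(config_obj: Any, overrides: Mapping[str, Any]) -> None:
    """Apply overrides, refusing any key that is not an existing field."""
    for key, value in overrides.items():
        if not hasattr(config_obj, key):
            raise ValueError(f"Config has no field '{key}'")
        setattr(config_obj, key, value)


config = TinyModelConfig()
apply_overrides(config, {"num_layers": 1})  # valid field: accepted

try:
    apply_overrides(config, {"num_layer": 1})  # typo: rejected, no phantom field
except ValueError as error:
    print(error)
```

Because the helper raises on the misspelled key, the typo surfaces in the test itself instead of producing a config that silently ignores the override.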
Adding a Functional Test#
- Create the script under tests/functional_tests/launch_scripts/{h100,gb200}/active/.
- Start the file with a timeout header: # CI_TIMEOUT=<minutes>
- Name it {Tier}_{CamelDescription}.sh; the tier prefix controls which CI matrix includes it.
- Make it executable: chmod +x <file>.
- Functional tests must use at most 2 GPUs.
No workflow file changes needed — the matrix is generated dynamically by scanning the directory.
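To make the directory-scanning idea concrete, here is a rough Python sketch of how a matrix could be derived from the launch script layout. It is not the logic used in the actual workflow; the `build_matrix` and `read_timeout` functions, the dictionary keys, and the 30-minute fallback are all assumptions made for illustration.

```python
import re
from pathlib import Path

TIMEOUT_HEADER = re.compile(r"^#\s*CI_TIMEOUT=(\d+)\s*$")
DEFAULT_TIMEOUT_MINUTES = 30  # assumed fallback, not taken from the real workflow


def read_timeout(script: Path) -> int:
    """Return the CI_TIMEOUT value from the leading comment lines, else the default."""
    for line in script.read_text().splitlines():
        if not line.startswith("#"):
            break  # the header comments are over
        match = TIMEOUT_HEADER.match(line)
        if match:
            return int(match.group(1))
    return DEFAULT_TIMEOUT_MINUTES


def build_matrix(directory: Path, tier: str) -> list[dict]:
    """List the scripts in `directory` whose names start with '{tier}_'."""
    return [
        {"script": script.name, "timeout": read_timeout(script)}
        for script in sorted(directory.glob(f"{tier}_*.sh"))
    ]


if __name__ == "__main__":
    active = Path("tests/functional_tests/launch_scripts/h100/active")
    print(build_matrix(active, "L0"))
```

Since the script list comes from a glob over the directory, adding or deleting a file changes the matrix without touching any workflow definition, which is the property the section describes.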
Moving a Test to Flaky#
# H100
git mv tests/functional_tests/launch_scripts/h100/active/L0_Foo.sh \
tests/functional_tests/launch_scripts/h100/flaky/L0_Foo.sh
# GB200 (if the test also exists there)
git mv tests/functional_tests/launch_scripts/gb200/active/L0_Foo.sh \
tests/functional_tests/launch_scripts/gb200/flaky/L0_Foo.sh
Flaky tests still run on manual dispatches (test_suite=all) so failures
remain visible. Move back to active/ once the underlying issue is fixed.
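When a script exists on more than one hardware target, the moves can be planned with a short helper. The sketch below only builds the `git mv` command strings and does not run them; `plan_flaky_moves` and its parameters are hypothetical, and the hardware list is assumed from the directory layout.

```python
from pathlib import Path

LAUNCH_ROOT = Path("tests/functional_tests/launch_scripts")
HARDWARE_TARGETS = ("h100", "gb200")  # assumed from the directory layout


def plan_flaky_moves(script_name: str, root: Path = LAUNCH_ROOT) -> list[str]:
    """Return one 'git mv' command per hardware target where the script is active."""
    commands = []
    for hardware in HARDWARE_TARGETS:
        source = root / hardware / "active" / script_name
        if source.is_file():
            target = root / hardware / "flaky" / script_name
            commands.append(f"git mv {source.as_posix()} {target.as_posix()}")
    return commands


for command in plan_flaky_moves("L0_Foo.sh"):
    print(command)
```

Targets where the script is not present under active/ are skipped, so the same call works whether the test runs on one hardware target or both.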
Removing a Test#
Delete the script file and commit. No other changes required.
Pytest Conventions#
- Use pytest fixtures for common setup.
- Available markers: unit, integration, system, acceptance, docs, skipduringci, pleasefixme.
- Functional tests are capped at 2 GPUs. Set CUDA_VISIBLE_DEVICES explicitly for multi-GPU tests.
- Use uv run python -m pytest, never bare pytest.
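A test harness could enforce the GPU cap before launching anything. The sketch below is one possible check, assuming CUDA_VISIBLE_DEVICES holds a comma-separated device list; `visible_gpu_count` and `MAX_FUNCTIONAL_GPUS` are names invented for this example.

```python
import os
from typing import Mapping

MAX_FUNCTIONAL_GPUS = 2  # cap stated for functional tests


def visible_gpu_count(env: Mapping[str, str] = os.environ) -> int:
    """Count devices in CUDA_VISIBLE_DEVICES; fail if it is unset or over the cap."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        raise RuntimeError("Set CUDA_VISIBLE_DEVICES explicitly for multi-GPU tests")
    devices = [item for item in raw.split(",") if item.strip()]
    if len(devices) > MAX_FUNCTIONAL_GPUS:
        raise RuntimeError(
            f"{len(devices)} GPUs requested; the cap is {MAX_FUNCTIONAL_GPUS}"
        )
    return len(devices)


print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1"}))
```

Passing the environment as a mapping keeps the check easy to exercise in a unit test without modifying the real process environment.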
CI Job Reference#
| GitHub Actions job | Hardware | Directory scanned |
|---|---|---|
| | H100 | |
| | H100 | |
| | H100 | |
| | H100 | |
| | GB200 | |
| | GB200 | |
| | GB200 | |
| | GB200 | |
Hardware runners: H100 uses nemo-ci-{azure,aws}-gpu-x2; GB200 uses nemo-ci-gcp-gpu-x2.
Code Anchors#
| Component | Path |
|---|---|
| Matrix generation (H100) | @.github/workflows/cicd-main.yml job |
| Matrix generation (GB200) | @.github/workflows/cicd-main.yml job |
| Test runner action | @.github/actions/test-template/action.yml |
| Launch scripts root | |