Developer Guide#
This guide covers the recommended development workflow for Megatron Bridge. Two core principles apply everywhere: build and develop inside containers, and always use uv for package management.
Why Containers#
Megatron Bridge depends on CUDA, NCCL, PyTorch with GPU support, Transformer Engine, and optional components like TRT-LLM, vLLM, and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships production-quality Dockerfiles that pin every dependency.
Use the container as your development environment. This guarantees:
- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- `uv.lock` resolves the same way locally and in CI (the lockfile is Linux-only; it cannot be regenerated on macOS).
- GPU-dependent operations (training, conversion, `uv lock`) work out of the box.
Option 1: Use the NeMo Framework Container#
The fastest way to get started is the pre-built NeMo Framework container, which ships with Megatron Bridge, Megatron-Core, and all GPU dependencies pre-installed. No build step required:
docker run --rm -it --gpus all --shm-size=24g \
nvcr.io/nvidia/nemo:latest \
bash
Option 2: Build the Megatron Bridge Container#
If you need to test against your local source tree, build the image from the repository root:
docker build \
-f docker/Dockerfile.ci \
--target megatron_bridge \
-t megatron-bridge:latest \
.
This builds the CI image with all dependencies installed via uv sync --locked.
See docker/README.md for the full NeMo Framework image stack
(fw-base -> megatron-bridge -> fw-final) and build argument reference.
Key build args:
- `BASE_IMAGE`: base PyTorch image (default: `nvcr.io/nvidia/pytorch:26.02-py3`)
- `MCORE_TRIGGERED_TESTING`: set to `true` when testing against a non-pinned MCore commit
- `UV_CACHE_PRUNE_ARGS`: optional args passed to `uv cache prune` during the image build
Running the Container#
Interactive development shell:
docker run --rm -it -w /opt/Megatron-Bridge \
-v $(pwd):/opt/Megatron-Bridge \
--gpus all \
--shm-size=24g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
megatron-bridge:latest \
bash
Containers on Slurm Clusters#
On Slurm clusters with Enroot/Pyxis, containers are passed to srun directly:
srun --mpi=pmix \
--container-image="$CONTAINER_IMAGE" \
--container-mounts="$CONTAINER_MOUNTS" \
--no-container-mount-home \
bash -c "cd /opt/Megatron-Bridge && uv run --no-sync python ..."
If you use the built container (or the NeMo Framework container) as-is,
dependencies are already installed and no uv sync is needed. If you
bind-mount a custom Megatron Bridge source tree into the container
(e.g., for development), you need to uv sync so dependencies match
your local pyproject.toml and uv.lock. In that case, only rank 0
should sync while other ranks wait:
if [ "$SLURM_LOCALID" -eq 0 ]; then uv sync; else sleep 10; fi
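The fixed sleep can race on slow nodes. A more robust sketch uses a sentinel file as a barrier; `run_synced` and the sentinel path are illustrative, not part of the repo:

```shell
# Hypothetical barrier: rank 0 runs the command and drops a sentinel file;
# the other local ranks poll for the sentinel instead of sleeping a fixed time.
run_synced() {
    local sentinel="${TMPDIR:-/tmp}/uv-sync-done-${SLURM_JOB_ID:-local}"
    if [ "${SLURM_LOCALID:-0}" -eq 0 ]; then
        "$@" && touch "$sentinel"
    else
        # Non-zero ranks wait until rank 0 finishes
        until [ -f "$sentinel" ]; do sleep 2; done
    fi
}

# On the cluster: run_synced uv sync
```

Since `SLURM_LOCALID` is per-node, a node-local path like `/tmp` is sufficient; only rank 0 on each node runs the sync.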
Other key points:
- `--no-container-mount-home` is an `srun` flag, not an `#SBATCH` directive.
- Set `UV_CACHE_DIR` to shared storage to avoid filling the container's `/root/.cache/`.
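For example, a job script might point the cache at shared storage before any uv command runs. This is a sketch; `SHARED_FS` is a stand-in for your cluster's shared filesystem root, not a real variable:

```shell
# SHARED_FS is a placeholder -- substitute e.g. /lustre/<project> on a real cluster
SHARED_FS="${SHARED_FS:-/tmp}"
export UV_CACHE_DIR="${SHARED_FS}/${USER:-$(id -un)}/uv-cache"
mkdir -p "$UV_CACHE_DIR"
```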
Always Use uv#
Megatron Bridge uses uv as its sole package
manager. The uv.lock file is checked into the repository for reproducible
builds. Never use `pip install`, `conda`, or bare `python`; always go through uv.
Never install or upgrade dependencies outside the CI container. All uv
commands must be run inside a megatron-bridge container, either one you
built locally or a pre-built image.
Why uv#
- Reproducibility: `uv.lock` pins every transitive dependency, ensuring identical environments across developers, CI, and production containers.
- Speed: uv resolves and installs dependencies 10-100x faster than pip.
- Single tool: uv handles virtual environments, dependency resolution, locking, syncing, and running scripts; no need for separate tools.
- CI integration: `Dockerfile.ci` installs everything via `uv sync --locked`. If you use pip to install something locally, it will diverge from what CI tests against.
- Cache-friendly: Set `UV_CACHE_DIR` to a persistent host directory and mount it into the container to avoid re-downloading wheels on every `docker run`. This is especially useful when you mount a frequently changing workdir that triggers re-syncs:

docker run --rm -it \
  -v $(pwd):/opt/Megatron-Bridge \
  -v $HOME/.cache/uv:/root/.cache/uv \
  --gpus all --shm-size=24g \
  megatron-bridge:latest bash
Essential uv Commands#
| Task | Command |
|---|---|
| Install all deps from lockfile | `uv sync --locked` |
| Install with all extras and dev groups | `uv sync --all-extras --all-groups` |
| Run a Python command | `uv run python <script.py>` |
| Run training | `uv run python -m torch.distributed.run --nproc_per_node=1 scripts/training/run_recipe.py ...` |
| Add a new dependency | `uv add <package>` |
| Add an optional dependency | `uv add --optional <extra> <package>` |
| Regenerate the lockfile | `uv lock` |
| Run linting | `uv run ruff check --fix && uv run ruff format` |
| Install pre-commit hooks | `uv run --group dev pre-commit install` |
uv run, Not bare python#
Always launch scripts with uv run:
# Correct
uv run python -m torch.distributed.run --nproc_per_node=1 scripts/training/run_recipe.py ...
# Wrong: bypasses the uv-managed environment
python -m torch.distributed.run --nproc_per_node=1 scripts/training/run_recipe.py ...
torchrun --nproc_per_node=1 scripts/training/run_recipe.py ...
After running uv sync inside a container, you can also use bare python
since the virtual environment is already activated. But uv run is always the
safer default.
Adding Dependencies#
uv add some-package
# For an optional extra group (e.g., trtllm-specific deps)
uv add --optional trtllm some-package
This updates pyproject.toml and uv.lock. Commit both files:
git add pyproject.toml uv.lock
git commit -s -m "build: add some-package dependency"
Regenerating uv.lock#
The lockfile is Linux-only (it resolves against CUDA wheels). You cannot
regenerate it on macOS. Run uv lock inside the Docker container or on a
Linux workstation:
docker run --gpus all --rm \
-v $(pwd):/opt/Megatron-Bridge \
megatron-bridge:latest \
bash -c 'cd /opt/Megatron-Bridge && uv lock'
uv sync After Switching MCore Branches#
The lockfile is generated against the main MCore commit. When switching to the dev branch:
./scripts/switch_mcore.sh dev
uv sync # without --locked
When switching back to main:
./scripts/switch_mcore.sh main
uv sync --locked # lockfile matches again
Pre-commit Hooks#
Install pre-commit hooks before your first commit:
uv run --group dev pre-commit install
The hooks run ruff for linting and formatting, plus end-of-file and trailing-whitespace fixers. If hooks auto-fix files, re-stage and re-run:
git add -u
pre-commit run
# If it auto-fixed files:
git add -u
pre-commit run
Repeat until all hooks pass.
Before committing, you can also run linting manually:
uv run ruff check --fix <changed_files>
uv run ruff format <changed_files>
uv run pre-commit run --all-files
Running Tests#
Tests live under tests/:
| Path | Description |
|---|---|
| `tests/unit_tests/` | Fast, isolated unit tests grouped by domain (models, core, data, etc.) |
| `tests/functional_tests/` | Integration tests with models/datasets, tiered L0/L1/L2 |
Pytest markers available: `unit`, `integration`, `system`, `acceptance`, `docs`, `skipduringci`, `pleasefixme`
Unit Tests#
uv run pytest tests/unit_tests/ -x -v
Unit tests run without GPUs and do not depend on large artifacts. Or inside Docker:
docker run --rm --gpus all -v $(pwd):/workdir/ -w /workdir/ megatron-bridge \
uv run pytest tests/unit_tests/
Functional Tests#
Functional tests require GPUs and are typically run inside the container:
uv run pytest tests/functional_tests/ -x -v
Longer functional tests use L2_Launch_*.sh launcher scripts in
tests/functional_tests/. Each launcher must be registered in
.github/workflows/cicd-main.yml under matrix.include to be picked up
by CI.
Adding a Unit Test#
1. Place it under `tests/unit_tests/<domain>/test_<name>.py`.
2. Use the appropriate pytest marker: `@pytest.mark.unit`.
3. Run locally:
   uv run --no-sync --active pytest tests/unit_tests/<your_test>.py
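Putting the steps together, a minimal marked test might look like this; the `utils` domain and file name are just examples, not prescribed by the repo:

```shell
# Scaffold a minimal unit test carrying the `unit` marker (paths illustrative)
mkdir -p tests/unit_tests/utils
cat > tests/unit_tests/utils/test_example.py <<'EOF'
import pytest


@pytest.mark.unit
def test_addition():
    assert 1 + 1 == 2
EOF
```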
Adding a Functional Test#
1. Create a launch script under `tests/functional_tests/launch_scripts/active/`.
2. Follow the naming convention: `L0_Launch_<area>_<desc>.sh`, `L1_Launch_...`, or `L2_Launch_...`.
3. Tier guidance:
   - L0: smoke tests that run on every PR; must be fast and stable.
   - L1: broader coverage; runs nightly.
   - L2: heavy tests (large models, checkpoint conversion); runs on schedule or manual trigger.
4. Apply the `needs-more-tests` PR label to trigger L0 + L1 for a PR.
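A new launcher might start from a skeleton like this; the file name and the pytest invocation inside are illustrative, not a prescribed template:

```shell
# Create an executable L0 launcher skeleton (name and contents are examples)
mkdir -p tests/functional_tests/launch_scripts/active
cat > tests/functional_tests/launch_scripts/active/L0_Launch_example_smoke.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
cd /opt/Megatron-Bridge
uv run --no-sync pytest tests/functional_tests/ -x -v
EOF
chmod +x tests/functional_tests/launch_scripts/active/L0_Launch_example_smoke.sh
```

Remember that the launcher must also be registered in `.github/workflows/cicd-main.yml` under `matrix.include` before CI will pick it up.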
Commit and PR Workflow#
- Never commit directly to `main`; always create a feature branch.
- Always sign commits: `git commit -s -m "message"`.
- PR title format: `[{areas}] {type}: {description}` (e.g., `[model] feat: Add Qwen3 model bridge`).
- Trigger CI: comment `/ok to test <commit-sha>` on the PR, or set up signed commits for automatic CI triggering.
See CONTRIBUTING.md for the full PR workflow, area/type labels, and DCO
requirements.
CI Pipeline#
The CI pipeline is defined in .github/workflows/cicd-main.yml. It is
triggered by schedule, pushes to main, deploy-release/*, and
pull-request/<number> branches, merge groups, and workflow_dispatch.
Pipeline Structure#
pre-flight
├── lint-check
├── cicd-wait-in-queue     # requires maintainer approval for untrusted PRs
├── cicd-container-build   # builds and caches the Docker image
├── unit-tests-core
├── unit-tests-diffusion
└── functional-tests       # L0 always; L1 with needs-more-tests label; L2 on schedule
- The CI branch `pull-request/<number>` is created automatically when a PR is opened against `main` or `deploy-release/*`.
- Concurrent runs for the same PR are cancelled automatically (concurrency group per PR number).
- Slack notifications are sent on completion for scheduled and nightly runs.
CI Failure Investigation#
For PR-scoped CI runs, branches follow the pattern pull-request/<number>.
This workflow can also be triggered by schedule, push to main/deploy-release/*, and workflow_dispatch.
Locating the PR from a CI Branch#
# Extract PR number from the CI branch name (e.g. pull-request/1234)
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
# Or, given a branch name string directly:
PR_NUMBER=$(echo "pull-request/1234" | grep -oP '(?<=pull-request/)\d+')
# Fetch PR metadata
gh pr view "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
# List files changed in the PR
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge --name-only
# View PR checks / CI status
gh pr checks "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
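Note that `grep -oP` requires GNU grep with PCRE support. Where that is unavailable (e.g., BusyBox or BSD grep), plain POSIX parameter expansion does the same job:

```shell
# Strip the known prefix with parameter expansion (no PCRE needed)
branch="pull-request/1234"
PR_NUMBER="${branch#pull-request/}"
echo "$PR_NUMBER"   # -> 1234
```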
Investigating a Failing CI Job#
Get the PR number from the branch name (see above).
Review the changeset to understand what changed:
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
Identify the failing job from `gh pr checks` output or from the GitHub Actions URL in the failure notification.
Fetch job logs for deeper inspection:

# List runs for the PR's head SHA
gh run list --repo NVIDIA-NeMo/Megatron-Bridge --branch "pull-request/$PR_NUMBER"
# Download logs for a specific run to a local file
gh run view <run_id> --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > run.log
Scan the log file in chunks. Log files can exceed 10,000 lines; never load them whole into context. Read them in chunks of ~200 lines and stop as soon as the root cause is found:

# Total line count
wc -l run.log
# Read chunk N (lines 1-200, 201-400, ...)
sed -n '1,200p' run.log
sed -n '201,400p' run.log
# ... continue until the failure is located
Scan from the end first if looking for the final error, then work backwards:
# Last 200 lines
tail -200 run.log
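The chunked reads can be wrapped in a small helper; `log_chunk` is a hypothetical name, not a repo utility:

```shell
# Print chunk N of a file in fixed-size line windows (default 200 lines)
log_chunk() {
    local file="$1" n="$2" size="${3:-200}"
    local start=$(( (n - 1) * size + 1 ))
    local end=$(( start + size - 1 ))
    sed -n "${start},${end}p" "$file"
}

# log_chunk run.log 2   # prints lines 201-400
```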
Cross-reference the changeset against the failing test or step to narrow down the root cause.
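As a concrete sketch of that cross-referencing, this hypothetical helper greps the log for each changed file's basename; the function name and the `changed.txt` workflow are assumptions, not repo tooling:

```shell
# For each changed file, show the first log lines mentioning its basename
match_changes_in_log() {
    local files_list="$1" log="$2" f
    while IFS= read -r f; do
        grep -n -- "$(basename "$f")" "$log" | head -5
    done < "$files_list"
}

# Usage:
#   gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge --name-only > changed.txt
#   match_changes_in_log changed.txt run.log
```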
Common Failure Patterns#
| Symptom | Likely Cause | Action |
|---|---|---|
| Lint job fails | ruff violations | Run `uv run pre-commit run --all-files` locally and commit the fixes |
| Container build fails | Dependency conflict or stale `uv.lock` | Re-run `uv lock` inside the container and commit the updated lockfile |
| Unit tests fail | Code regression or missing import | Run the failing test locally; check the PR diff for the relevant module |
| Functional test (L0) fails | Integration breakage | Check GPU runner logs; reproduce with the corresponding `L0_Launch_*.sh` script |
| CI stuck at `cicd-wait-in-queue` | PR not yet approved for CI | A maintainer must comment `/ok to test <commit-sha>` |
| MCore submodule mismatch | Pinned commit out of sync | Update the `3rdparty/Megatron-LM` submodule and regenerate `uv.lock` if needed |
Common Pitfalls#
| Problem | Cause | Fix |
|---|---|---|
| `uv lock` fails on macOS | Lockfile resolves CUDA wheels that don't exist on macOS | Run it inside Docker or on a Linux machine |
| Package works locally but is missing in CI | pip installed outside the uv-managed venv | Use `uv add` / `uv sync` instead of pip |
| `uv sync --locked` fails after switching MCore branches | Lockfile was generated against main MCore | Use `uv sync` (without `--locked`) until switching back to main |
| Stale checkpoint auto-resume in Bridge | Old checkpoints remain in the save directory | Clear the checkpoint directory or point the run at a fresh one |
| Port collision on Slurm (EADDRINUSE) | Multiple jobs reuse the same master port on a node | Drop torchrun; launch via `srun` with `uv run python` directly |
| `uv: command not found` | Container doesn't have uv | Use the `megatron-bridge` CI image (or the NeMo Framework container) |
| Container runs out of disk space | Cache fills container's `/root/.cache/` | Set `UV_CACHE_DIR` to a mounted host directory |
| Pre-commit fails with ruff errors | Code style violations | Run `ruff check --fix` and `ruff format`, then re-stage |
Quick Start Checklist#
Clone the repo and initialize submodules:
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge megatron-bridge
cd megatron-bridge
git submodule update --init 3rdparty/Megatron-LM
Build the container:
docker build -f docker/Dockerfile.ci --target megatron_bridge -t megatron-bridge:latest .
Start a dev shell:
docker run --rm -it -v $(pwd):/opt/Megatron-Bridge --gpus all --shm-size=24g megatron-bridge:latest bash
Install pre-commit hooks (inside container):
uv run --group dev pre-commit install
Run a quick training sanity check:
uv run python -m torch.distributed.run --nproc_per_node=1 \
    scripts/training/run_recipe.py \
    --recipe vanilla_gpt_pretrain_config \
    train.train_iters=5 train.global_batch_size=8 train.micro_batch_size=4 \
    scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=5 \
    logger.log_interval=1
Create a branch, make changes, and submit a PR:
git switch -c your-feature-name
# ... make changes ...
git add -u && git commit -s -m "[area] type: description"
git push origin your-feature-name