Developer Guide#

This guide covers the recommended development workflow for Megatron Bridge. Two core principles apply everywhere: build and develop inside containers, and always use uv for package management.


Why Containers#

Megatron Bridge depends on CUDA, NCCL, PyTorch with GPU support, Transformer Engine, and optional components like TRT-LLM, vLLM, and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships production-quality Dockerfiles that pin every dependency.

Use the container as your development environment. This guarantees:

  • Identical CUDA / NCCL / cuDNN versions across all developers and CI.

  • uv.lock resolves the same way locally and in CI (the lockfile is Linux-only; it cannot be regenerated on macOS).

  • GPU-dependent operations (training, conversion, uv lock) work out of the box.

Option 1: Use the NeMo Framework Container#

The fastest way to get started is the pre-built NeMo Framework container, which ships with Megatron Bridge, Megatron-Core, and all GPU dependencies pre-installed. No build step required:

docker run --rm -it --gpus all --shm-size=24g \
  nvcr.io/nvidia/nemo:latest \
  bash

Option 2: Build the Megatron Bridge Container#

If you need to test against your local source tree, build the image from the repository root:

docker build \
  -f docker/Dockerfile.ci \
  --target megatron_bridge \
  -t megatron-bridge:latest \
  .

This builds the CI image with all dependencies installed via uv sync --locked. See docker/README.md for the full NeMo Framework image stack (fw-base -> megatron-bridge -> fw-final) and build argument reference.

Running the Container#

Interactive development shell:

docker run --rm -it -w /opt/Megatron-Bridge \
  -v $(pwd):/opt/Megatron-Bridge \
  --gpus all \
  --shm-size=24g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  megatron-bridge:latest \
  bash

Containers on Slurm Clusters#

On Slurm clusters with Enroot/Pyxis, containers are passed to srun directly:

srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd /opt/Megatron-Bridge && uv run --no-sync python ..."

If you use the built container (or the NeMo Framework container) as-is, dependencies are already installed and no uv sync is needed. If you bind-mount a custom Megatron Bridge source tree into the container (e.g., for development), run uv sync so that dependencies match your local pyproject.toml and uv.lock. In that case, only local rank 0 on each node (the task whose SLURM_LOCALID is 0) should sync while the other ranks on that node wait:

if [ "$SLURM_LOCALID" -eq 0 ]; then uv sync; else sleep 10; fi
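The one-liner above uses a fixed 10-second delay, so the waiting ranks may continue before a slow sync has finished. A minimal sketch of a marker-file variant follows. It assumes that all ranks on a node see the same path for the marker (for example, a node-local /tmp shared by the tasks of the job step). The names run_once_per_node, SYNC_TIMEOUT, and the marker path are hypothetical and not part of Megatron Bridge.

```shell
# Sketch: run a command on local rank 0 only; the other ranks on the node
# wait for a marker file instead of sleeping for a fixed time.
# run_once_per_node and SYNC_TIMEOUT are hypothetical names for illustration.
run_once_per_node() (
  marker="$1"
  shift
  timeout="${SYNC_TIMEOUT:-600}"   # seconds to wait before giving up

  if [ "${SLURM_LOCALID:-0}" -eq 0 ]; then
    # Local rank 0 does the work and publishes the marker only on success.
    "$@" && touch "$marker"
  else
    waited=0
    until [ -e "$marker" ]; do
      if [ "$waited" -ge "$timeout" ]; then
        echo "timed out waiting for $marker" >&2
        exit 1
      fi
      sleep 1
      waited=$((waited + 1))
    done
  fi
)

# Possible usage inside the srun command (marker name includes the job ID
# so that a marker left over from an earlier job is not picked up):
#   run_once_per_node "/tmp/uv-sync.${SLURM_JOB_ID:-manual}.done" uv sync
```

If the sync fails, no marker is written, so the waiting ranks time out with an error rather than starting with a half-installed environment.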

Other key points:

  • --no-container-mount-home is an srun flag, not an #SBATCH directive.

  • Set UV_CACHE_DIR to shared storage to avoid filling the container’s /root/.cache/.


Always Use uv#

Megatron Bridge uses uv as its sole package manager. The uv.lock file is checked into the repository for reproducible builds. Never use pip install, conda, or bare python — always go through uv.

Why uv#

  • Reproducibility: uv.lock pins every transitive dependency, ensuring identical environments across developers, CI, and production containers.

  • Speed: uv typically resolves and installs dependencies 10-100x faster than pip, according to its published benchmarks.

  • Single tool: uv handles virtual environments, dependency resolution, locking, syncing, and running scripts — no need for separate tools.

  • CI integration: Dockerfile.ci installs everything via uv sync --locked. If you use pip to install something locally, it will diverge from what CI tests against.

  • Cache-friendly: Set UV_CACHE_DIR to a persistent host directory and mount it into the container to avoid re-downloading wheels on every docker run. This is especially useful when you mount a frequently changing workdir that triggers re-syncs:

    docker run --rm -it \
      -v $(pwd):/opt/Megatron-Bridge \
      -v $HOME/.cache/uv:/root/.cache/uv \
      --gpus all --shm-size=24g \
      megatron-bridge:latest bash
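The cache mount above can be wrapped in a small helper. This is a sketch only: dev_shell, UV_CACHE_HOST_DIR, and DRY_RUN are hypothetical names, and the image tag and mount paths are the ones used in this guide. The helper creates the host cache directory first, because Docker typically creates a missing bind-mount source as a root-owned directory.

```shell
# Sketch: start a dev shell with the repo and a persistent uv cache mounted.
# dev_shell, UV_CACHE_HOST_DIR and DRY_RUN are hypothetical names.
dev_shell() (
  repo="${1:-$PWD}"
  cache="${UV_CACHE_HOST_DIR:-$HOME/.cache/uv}"

  # Create the cache directory as the current user before mounting it.
  mkdir -p "$cache"

  # Build the argument list in the positional parameters so that paths
  # containing spaces stay intact when the command is executed.
  set -- docker run --rm -it \
    -w /opt/Megatron-Bridge \
    -v "$repo:/opt/Megatron-Bridge" \
    -v "$cache:/root/.cache/uv" \
    --gpus all --shm-size=24g \
    megatron-bridge:latest bash

  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"    # print the command for inspection only
  else
    "$@"
  fi
)
```

Running DRY_RUN=1 dev_shell prints the command without starting a container, which is convenient for checking the mounts before launching.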
    

Essential uv Commands#

| Task | Command |
| --- | --- |
| Install all deps from lockfile | uv sync --locked |
| Install with all extras and dev groups | uv sync --locked --all-extras --all-groups |
| Run a Python command | uv run python script.py |
| Run training | uv run python -m torch.distributed.run --nproc_per_node=N script.py |
| Add a new dependency | uv add <package> |
| Add an optional dependency | uv add --optional <group> <package> |
| Regenerate the lockfile | uv lock (must be done inside the container on Linux) |
| Run linting | uv run ruff check --fix . && uv run ruff format . |
| Install pre-commit hooks | uv run --group dev pre-commit install |

uv run, Not bare python#

Always launch scripts with uv run:

# Correct
uv run python -m torch.distributed.run --nproc_per_node=1 scripts/training/run_recipe.py ...

# Wrong — bypasses the uv-managed environment
python -m torch.distributed.run --nproc_per_node=1 scripts/training/run_recipe.py ...
torchrun --nproc_per_node=1 scripts/training/run_recipe.py ...

After running uv sync inside a container, bare python also works because the virtual environment is already activated, but uv run remains the safer default because it does not depend on that activation.

Adding Dependencies#

uv add some-package

# For an optional extra group (e.g., trtllm-specific deps); --optional takes the extra name
uv add --optional trtllm some-package

This updates pyproject.toml and uv.lock. Commit both files:

git add pyproject.toml uv.lock
git commit -s -m "build: add some-package dependency"

Regenerating uv.lock#

The lockfile is Linux-only (it resolves against CUDA wheels). You cannot regenerate it on macOS. Run uv lock inside the Docker container or on a Linux workstation:

docker run --gpus all --rm \
  -v $(pwd):/opt/Megatron-Bridge \
  megatron-bridge:latest \
  bash -c 'cd /opt/Megatron-Bridge && uv lock'

uv sync After Switching MCore Branches#

The lockfile is generated against the main MCore commit. When switching to the dev branch:

./scripts/switch_mcore.sh dev
uv sync            # without --locked

When switching back to main:

./scripts/switch_mcore.sh main
uv sync --locked   # lockfile matches again
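The rule above (use --locked only on main) can be captured in a small helper so the right flags are chosen automatically. This is a sketch under the assumption that main is the only branch the lockfile matches; mcore_sync_args is a hypothetical name.

```shell
# Sketch: pick the uv sync arguments that match the MCore branch.
# mcore_sync_args is a hypothetical helper, not part of the repository.
mcore_sync_args() {
  case "$1" in
    main) echo "sync --locked" ;;   # lockfile was generated against main
    *)    echo "sync" ;;            # any other branch: resolve without --locked
  esac
}

# Possible usage (the unquoted expansion is intentional, it splits the
# arguments into separate words):
#   branch=dev
#   ./scripts/switch_mcore.sh "$branch" && uv $(mcore_sync_args "$branch")
```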

Pre-commit Hooks#

Install pre-commit hooks before your first commit:

uv run --group dev pre-commit install

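The hooks run ruff for linting and formatting, plus end-of-file and trailing-whitespace fixers. A hook that auto-fixes files reports a failure on that run, so re-stage the fixed files and run the hooks again:

git add -u
uv run --group dev pre-commit run
# If any hook auto-fixed files, re-stage and run again:
git add -u
uv run --group dev pre-commit run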

Repeat until all hooks pass.
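The re-stage and re-run cycle can be automated with a bounded retry loop. The sketch below is generic: retry_until_pass is a hypothetical helper that reruns any command until it succeeds or the attempt limit is reached.

```shell
# Sketch: rerun a command until it succeeds, up to a maximum number of attempts.
# retry_until_pass is a hypothetical helper name.
retry_until_pass() (
  max="$1"
  shift
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      exit 0
    fi
    attempt=$((attempt + 1))
  done
  echo "still failing after $max attempts" >&2
  exit 1
)

# Possible usage: re-stage and rerun the hooks at most three times.
#   retry_until_pass 3 sh -c 'git add -u && uv run --group dev pre-commit run'
```

A limit matters here because a genuine lint error that ruff cannot auto-fix would otherwise loop forever; after the last attempt, read the hook output and fix the remaining issues by hand.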


Running Tests#

The recommended way to run the full test suite locally is scripts/run_ci_tests.sh, which mirrors the GitHub CI pipeline (lint, unit tests, functional tests, coverage):

bash scripts/run_ci_tests.sh                    # local mode, GPUs 0,1
bash scripts/run_ci_tests.sh --mode docker       # build container and run inside it
bash scripts/run_ci_tests.sh --gpus 0 --skip-functional  # unit + lint only on GPU 0
bash scripts/run_ci_tests.sh --skip-lint --skip-unit      # functional tests only

You can also run individual test suites directly:

Unit Tests#

uv run pytest tests/unit_tests/ -x -v

Unit tests run without GPUs and do not depend on large artifacts.

Functional Tests#

Functional tests require GPUs and are typically run inside the container:

uv run pytest tests/functional_tests/ -x -v

Longer functional tests use L2_Launch_*.sh launcher scripts in tests/functional_tests/. Each launcher must be registered in .github/workflows/cicd-main.yml under matrix.include to be picked up by CI.
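A launcher that is not registered in the workflow is silently skipped by CI, so a quick local check can help. The sketch below assumes that the workflow file mentions each launcher by its base name (file name without .sh); the exact matrix.include layout is not shown in this guide, so treat the matching rule as an assumption. unregistered_launchers is a hypothetical helper.

```shell
# Sketch: list L2_Launch_*.sh scripts whose base name does not appear
# in the CI workflow file. unregistered_launchers is a hypothetical name.
unregistered_launchers() (
  tests_dir="$1"
  workflow="$2"
  for f in "$tests_dir"/L2_Launch_*.sh; do
    [ -e "$f" ] || continue            # glob matched nothing
    name="$(basename "$f" .sh)"
    # -w matches whole words only, so L2_Launch_a does not match L2_Launch_ab.
    grep -qwF -- "$name" "$workflow" || echo "$name"
  done
)

# Possible usage from the repository root:
#   unregistered_launchers tests/functional_tests .github/workflows/cicd-main.yml
```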


Commit and PR Workflow#

  • Never commit directly to main — always create a feature branch.

  • Always sign commits: git commit -s -m "message".

  • PR title format: [{areas}] {type}: {description} (e.g., [model] feat: Add Qwen3 model bridge).

  • Trigger CI: Comment /ok to test <commit-sha> on the PR, or set up signed commits for automatic CI triggering.

See CONTRIBUTING.md for the full PR workflow, area/type labels, and DCO requirements.
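The title format can be checked locally before opening a PR. The sketch below validates only the overall shape, [{areas}] {type}: {description}; it does not know the permitted area or type labels, which are defined in CONTRIBUTING.md. is_valid_pr_title is a hypothetical helper.

```shell
# Sketch: check that a PR title has the shape "[areas] type: description".
# is_valid_pr_title is a hypothetical helper; it does not validate the
# specific area or type labels.
is_valid_pr_title() {
  printf '%s\n' "$1" | grep -Eq '^\[[^]]+\] [a-z]+: .+$'
}

# Possible usage:
#   is_valid_pr_title "[model] feat: Add Qwen3 model bridge" && echo ok
```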


Common Pitfalls#

| Problem | Cause | Fix |
| --- | --- | --- |
| uv sync --locked fails on macOS | Lockfile resolves CUDA wheels that don’t exist on macOS | Run inside Docker or on a Linux machine |
| ModuleNotFoundError after pip install | pip installed outside the uv-managed venv | Use uv add and uv sync, never bare pip install |
| uv sync --locked fails after MCore branch switch | Lockfile was generated against main MCore | Use uv sync (without --locked) on dev |
| Stale checkpoint auto-resume in Bridge | nemo_experiments/ from a previous run exists | rm -rf nemo_experiments before starting fresh |
| Port collision on Slurm (EADDRINUSE) | ntasks-per-node=8 with torchrun --nproc_per_node=8 | Drop torchrun; use ntasks-per-node=8 with uv run python script.py (srun-native) |
| uv: command not found inside container | Container doesn’t have uv | Use the megatron-bridge image built from Dockerfile.ci |
| No space left on device during uv ops | Cache fills container’s /root/.cache/ | Set UV_CACHE_DIR to shared/persistent storage |
| Pre-commit fails with ruff errors | Code style violations | Run uv run ruff check --fix . && uv run ruff format . |


Quick Start Checklist#

  1. Clone the repo and initialize submodules:

    git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge megatron-bridge
    cd megatron-bridge
    git submodule update --init 3rdparty/Megatron-LM
    
  2. Build the container:

    docker build -f docker/Dockerfile.ci --target megatron_bridge -t megatron-bridge:latest .
    
  3. Start a dev shell:

    docker run --rm -it -w /opt/Megatron-Bridge -v $(pwd):/opt/Megatron-Bridge --gpus all --shm-size=24g megatron-bridge:latest bash
    
  4. Install pre-commit hooks (inside container):

    uv run --group dev pre-commit install
    
  5. Run a quick training sanity check:

    uv run python -m torch.distributed.run --nproc_per_node=1 \
      scripts/training/run_recipe.py \
      --recipe vanilla_gpt_pretrain_config \
      train.train_iters=5 train.global_batch_size=8 train.micro_batch_size=4 \
      scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=5 \
      logger.log_interval=1
    
  6. Create a branch, make changes, and submit a PR:

    git switch -c your-feature-name
    # ... make changes ...
    git add -u && git commit -s -m "[area] type: description"
    git push origin your-feature-name