CLI Framework#
The CLI framework is built on Typer and nemo_runspec. Each command file contains its own execution logic, so you can see exactly how jobs are submitted.
$ uv run nemotron nano3 sft --help
Usage: nemotron nano3 sft [OPTIONS]
Run supervised fine-tuning with Megatron-Bridge (stage1).
╭─ Options ────────────────────────────────────────────────────────────────╮
│ --help -h Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Global Options ─────────────────────────────────────────────────────────╮
│ -c, --config NAME Config name or path │
│ -r, --run PROFILE Submit to cluster (attached) │
│ -b, --batch PROFILE Submit to cluster (detached) │
│ -d, --dry-run Preview config without execution │
│ --stage Stage files for interactive debugging │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Configs (-c/--config) ──────────────────────────────────────────────────╮
│ Built-in: default, tiny │
│ Custom: -c /path/to/your/config.yaml │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Artifact Overrides (W&B artifact references) ───────────────────────────╮
│ run.model Base model checkpoint artifact │
│ run.data SFT data artifact (Packed Parquet) │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Run Overrides (override env.toml settings) ─────────────────────────────╮
│ run.env.nodes Number of nodes │
│ run.env.nproc_per_node GPUs per node │
│ run.env.partition Slurm partition │
│ run.env.account Slurm account │
│ run.env.time Job time limit (e.g., 04:00:00) │
│ run.env.container_image Override container image │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ env.toml Profiles ──────────────────────────────────────────────────────╮
│ Available profiles: my-cluster, my-cluster-large │
│ Usage: --run PROFILE or --batch PROFILE │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Examples ───────────────────────────────────────────────────────────────╮
│ $ ... sft -c tiny Local execution │
│ $ ... sft -c tiny --dry-run Preview config │
│ $ ... sft -c tiny --run my-cluster Submit to cluster │
│ $ ... sft -c tiny -r cluster run.env.nodes=4 │
╰──────────────────────────────────────────────────────────────────────────╯
Overview#
The CLI framework supports:
Nested commands like
uv run nemotron nano3 data prep pretrainConfig integration with automatic YAML loading and dotlist overrides
Artifact resolution mapping W&B artifacts to config fields via
${art:...}Remote execution via NeMo-Run with
--run/--batchVisible execution with all submission logic in each command file
For artifacts, see Nemotron Kit. For execution profiles, see Execution through NeMo-Run. For the full nemo_runspec toolkit, see the nemo_runspec package readme.
Architecture#
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#f5f5f5', 'clusterBorder': '#666666'}}}%%
flowchart LR
subgraph cli["CLI Layer (Typer)"]
Root["nemotron"]
Recipe["nano3"]
Commands["pretrain/sft/rl"]
end
subgraph runspec["nemo_runspec Toolkit"]
Config["Config Loading"]
Env["env.toml Profiles"]
Exec["Execution Helpers"]
end
subgraph execution["Execution Modes"]
Local["Local (torchrun)"]
NemoRun["NeMo-Run (Slurm)"]
Ray["Ray Jobs"]
end
Root --> Recipe --> Commands
Commands --> runspec
runspec --> execution
Command Pattern#
Each CLI command follows the same explicit pattern. Recipe metadata comes from PEP 723 [tool.runspec] blocks in the script itself:
# src/nemotron/cli/commands/nano3/pretrain.py
from nemo_runspec import parse as parse_runspec
from nemo_runspec.config import parse_config, build_job_config, save_configs
from nemo_runspec.execution import create_executor, execute_local, build_env_vars
from nemo_runspec.recipe_config import RecipeConfig, parse_recipe_config
# Metadata read from [tool.runspec] in the training script
SCRIPT_PATH = "src/nemotron/recipes/nano3/stage0_pretrain/train.py"
SPEC = parse_runspec(SCRIPT_PATH)
def _execute_pretrain(cfg: RecipeConfig):
# 1. Parse configuration
train_config = parse_config(cfg.ctx, SPEC.config_dir, SPEC.config.default)
job_config = build_job_config(train_config, cfg.ctx, SPEC.name, ...)
# 2. Save configs
job_dir = generate_job_dir(SPEC.name)
job_path, train_path = save_configs(job_config, ..., job_dir)
# 3. Execute based on mode
if cfg.mode == "local":
execute_local(SCRIPT_PATH, train_path, cfg.passthrough, ...)
else:
# Remote execution - THIS IS WHAT YOU'D CHANGE FOR SKYPILOT
executor = create_executor(env=env, env_vars=env_vars, packager=packager, ...)
with run.Experiment(recipe_name) as exp:
exp.add(script_task, executor=executor)
exp.run(detach=not attached)
# CLI entry point
def pretrain(ctx: typer.Context) -> None:
"""Run pretraining with Megatron-Bridge (stage0)."""
cfg = parse_recipe_config(ctx)
_execute_pretrain(cfg)
Registering Commands#
Commands are registered via RecipeTyper from nemo_runspec, which provides standardized help panels:
# src/nemotron/cli/commands/nano3/_typer_group.py
from nemo_runspec.recipe_typer import RecipeTyper
nano3_app = RecipeTyper(name="nano3", help="Nano3 training recipes")
nano3_app.add_recipe_command(pretrain, meta=PRETRAIN_META, rich_help_panel="Training Stages")
Global Options#
All recipe commands automatically receive these global options:
Option |
Short |
Description |
|---|---|---|
|
|
Config name or path (from |
|
|
Attached NeMo-Run execution (waits, streams logs) |
|
|
Detached NeMo-Run execution (submits, exits) |
|
|
Preview config without executing |
|
Stage files to remote for debugging |
|
|
Dotlist overrides (any position) |
GlobalContext#
Global options are captured in a GlobalContext dataclass (in nemo_runspec.cli_context):
@dataclass
class GlobalContext:
config: str | None = None # -c/--config value
run: str | None = None # --run profile name
batch: str | None = None # --batch profile name
dry_run: bool = False # --dry-run flag
stage: bool = False # --stage flag
dotlist: list[str] # key=value overrides
passthrough: list[str] # Unknown args for script
Key properties:
mode->"run","batch", or"local"profile-> Environment profile name (from--runor--batch)
Configuration Pipeline#
Config loading is handled by nemo_runspec.config:
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
Default["default.yaml"] --> Merge
Config["--config"] --> Merge
Dotlist["key=value"] --> Merge
Merge --> JobConfig["job.yaml"]
Merge --> TrainConfig["train.yaml"]
Two-Config System#
The CLI generates two config files:
File |
Purpose |
|---|---|
|
Full provenance: config + CLI args + env profile |
|
Clean config for script (paths rewritten for remote) |
job.yaml structure:
recipe:
_target_: megatron.bridge.recipes...
per_split_data_args_path: /data/blend.json
train:
train_iters: 1000
run:
mode: "run"
profile: "my-cluster"
env:
executor: "slurm"
nodes: 4
gpus_per_node: 8
cli:
argv: ["nemotron", "nano3", "pretrain", "-c", "tiny", "--run", "my-cluster"]
dotlist: ["train.train_iters=1000"]
wandb:
entity: "nvidia"
project: "nemotron"
Dotlist Overrides#
Override any config value with key.path=value syntax:
# Override nested values
uv run nemotron nano3 pretrain train.train_iters=5000
# Multiple overrides
uv run nemotron nano3 pretrain \
train.train_iters=5000 \
train.micro_batch_size=2 \
run.data=PretrainBlendsArtifact-v2:latest
Execution Modes#
Local Execution#
Without --run or --batch, commands execute locally:
# Local execution (no NeMo-Run)
uv run nemotron nano3 pretrain -c tiny
# Equivalent to:
python -m torch.distributed.run \
--nproc_per_node=1 \
src/nemotron/recipes/nano3/stage0_pretrain/train.py \
--config train.yaml
NeMo-Run Attached (--run)#
Submit job and wait for completion, streaming logs:
uv run nemotron nano3 pretrain -c tiny --run MY-CLUSTER
NeMo-Run Detached (--batch)#
Submit job and exit immediately:
uv run nemotron nano3 pretrain -c tiny --batch MY-CLUSTER
Ray Jobs#
For recipes with launch = "ray" in their [tool.runspec] (data prep, RL), jobs are submitted via Ray:
# Data prep uses Ray for distributed processing
uv run nemotron nano3 data prep pretrain --run MY-CLUSTER
# RL uses Ray for actor orchestration
uv run nemotron nano3 rl -c tiny --run MY-CLUSTER
Artifact Resolution#
Config Resolver#
Use ${art:...} in YAML configs to resolve artifact paths:
run:
data: PretrainBlendsArtifact-default:latest
recipe:
per_split_data_args_path: ${art:data,path}/blend.json
CLI Override#
Override artifacts via dotlist:
uv run nemotron nano3 sft --run MY-CLUSTER \
run.data=PretrainBlendsArtifact-v2:latest \
run.model=ModelArtifact-pretrain:v3
Packager Types#
Control how code is synced to remote (configured in [tool.runspec.run]):
Packager |
Description |
Use Case |
|---|---|---|
|
Minimal sync ( |
Default |
|
Full codebase with exclusions |
Ray jobs needing imports |
|
Inline all |
Isolated scripts |
CLI Examples#
// Preview config without executing
$ uv run nemotron nano3 pretrain -c tiny --dry-run
// Submit to cluster (attached)
$ uv run nemotron nano3 pretrain -c tiny --run MY-CLUSTER
// Submit to cluster (detached)
$ uv run nemotron nano3 pretrain -c tiny --batch MY-CLUSTER
// Override training iterations
$ uv run nemotron nano3 pretrain -c tiny --run MY-CLUSTER train.train_iters=5000
// Stage files for interactive debugging
$ uv run nemotron nano3 pretrain -c tiny --run MY-CLUSTER --stage
// Data preparation (Ray job)
$ uv run nemotron nano3 data prep pretrain --run MY-CLUSTER
// RL training (Ray job)
$ uv run nemotron nano3 rl -c tiny --run MY-CLUSTER
Building a Recipe#
Step 1: Create Recipe Script with Runspec#
# src/nemotron/recipes/myrecipe/train.py
# /// script
# [tool.runspec]
# name = "myrecipe/train"
# image = "nvcr.io/nvidia/nemo:25.11"
#
# [tool.runspec.run]
# launch = "torchrun"
#
# [tool.runspec.config]
# dir = "./config"
# default = "default"
# ///
import argparse
from pathlib import Path
from omegaconf import OmegaConf
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--config", type=Path, required=True)
args, unknown = parser.parse_known_args()
cfg = OmegaConf.load(args.config)
if unknown:
cfg = OmegaConf.merge(cfg, OmegaConf.from_dotlist(unknown))
print(f"Training with {cfg.train.train_iters} iterations")
if __name__ == "__main__":
main()
Step 2: Create Config Directory#
src/nemotron/recipes/myrecipe/
├── config/
│ ├── default.yaml
│ └── tiny.yaml
└── train.py # Script with [tool.runspec] block
Step 3: Create CLI Command#
# src/nemotron/cli/commands/myrecipe/train.py
from nemo_runspec import parse as parse_runspec
from nemo_runspec.config import parse_config, build_job_config, save_configs, generate_job_dir
from nemo_runspec.execution import create_executor, execute_local, build_env_vars
from nemo_runspec.env import parse_env
from nemo_runspec.recipe_config import RecipeConfig, parse_recipe_config
from nemo_runspec.recipe_typer import RecipeMeta
import typer
SCRIPT_PATH = "src/nemotron/recipes/myrecipe/train.py"
SPEC = parse_runspec(SCRIPT_PATH)
META = RecipeMeta(
name=SPEC.name,
script_path=SCRIPT_PATH,
config_dir=str(SPEC.config_dir),
default_config=SPEC.config.default,
)
def _execute_train(cfg: RecipeConfig):
train_config = parse_config(cfg.ctx, SPEC.config_dir, SPEC.config.default)
env = parse_env(cfg.ctx)
job_config = build_job_config(train_config, cfg.ctx, SPEC.name, SCRIPT_PATH, cfg.argv, env_profile=env)
if cfg.dry_run:
return
job_dir = generate_job_dir(SPEC.name)
_, train_path = save_configs(job_config, ..., job_dir)
if cfg.mode == "local":
execute_local(SCRIPT_PATH, train_path, cfg.passthrough, torchrun=True)
else:
# Remote execution via nemo-run
...
def train(ctx: typer.Context) -> None:
"""Run training for my recipe."""
cfg = parse_recipe_config(ctx)
_execute_train(cfg)
Step 4: Register and Run#
# src/nemotron/cli/commands/myrecipe/_typer_group.py
from nemo_runspec.recipe_typer import RecipeTyper
from .train import META, train
app = RecipeTyper(name="myrecipe", help="My training recipe")
app.add_recipe_command(train, meta=META)
# Test locally
uv run nemotron myrecipe train -c tiny
# Run on cluster
uv run nemotron myrecipe train -c tiny --run MY-CLUSTER
API Reference#
nemo_runspec Toolkit#
Export |
Module |
Description |
|---|---|---|
|
|
Parse |
|
|
Load YAML config with dotlist overrides |
|
|
Build full job config with provenance |
|
|
Save job.yaml and train.yaml |
|
|
Load env.toml profile |
|
|
Parse CLI options into RecipeConfig |
|
|
Typer subclass with rich help panels |
|
|
Recipe metadata for help panels |
|
|
Shared CLI state |
|
|
Build NeMo-Run executor |
|
|
Run locally via torchrun |
|
|
Build env vars (HF, W&B tokens) |
Further Reading#
Nemotron Kit – artifacts and lineage tracking
nemo_runspecPackage – toolkit documentationExecution through NeMo-Run – execution profiles and env.toml
Data Preparation – data preparation module
Artifact Lineage – W&B artifact system and lineage tracking
W&B Integration – credentials and configuration
Nano3 Recipe – training recipe example