CLI Architecture#
This document explains how Nemotron’s CLI layer works and how to modify it for your needs.
Overview#
The CLI layer (src/nemotron/cli/) handles execution – how jobs are submitted and tracked. Each command file contains visible execution logic, making it easy to understand and modify.
The shared toolkit lives in the nemo_runspec package: config loading, env.toml profiles, execution helpers, packaging, and display utilities. CLI commands import from nemo_runspec and wire things together explicitly.
Design Principle: Visible Execution#
Nemotron makes execution explicit – all nemo-run setup lives directly in each command function:
# src/nemotron/cli/commands/nano3/pretrain.py
import nemo_run as run

from nemo_runspec import parse as parse_runspec
from nemo_runspec.config import parse_config, build_job_config, save_configs
from nemo_runspec.execution import create_executor, execute_local, build_env_vars
from nemo_runspec.recipe_config import RecipeConfig

# Metadata comes from PEP 723 [tool.runspec] in the script itself
SCRIPT_PATH = "src/nemotron/recipes/nano3/stage0_pretrain/train.py"
SPEC = parse_runspec(SCRIPT_PATH)

# Execution logic is VISIBLE in the function
def _execute_pretrain(cfg: RecipeConfig, *, experiment=None):
    # 1. Parse configuration
    train_config = parse_config(cfg.ctx, SPEC.config_dir, SPEC.config.default)
    job_config = build_job_config(train_config, cfg.ctx, SPEC.name, ...)

    # 2. Build executor - THIS IS WHAT YOU'D CHANGE FOR SKYPILOT
    executor = create_executor(env=env, env_vars=env_vars, packager=packager, ...)

    # 3. Run experiment
    with run.Experiment(recipe_name) as exp:
        exp.add(script_task, executor=executor)
        exp.run(detach=not attached)
Components#
Runspec (PEP 723 metadata)#
Recipe scripts are self-describing. Each script declares its identity, container image, launch method, and config location via a [tool.runspec] TOML block:
# /// script
# [tool.runspec]
# name = "nano3/pretrain"
# image = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
#
# [tool.runspec.run]
# launch = "torchrun"
#
# [tool.runspec.config]
# dir = "./config"
# default = "default"
# ///
The CLI reads this via nemo_runspec.parse(), returning a frozen Runspec dataclass. See nemo_runspec package for the full schema.
RecipeConfig#
Parsed CLI options. Handles late globals (--run after subcommand) and dotlist overrides:
from nemo_runspec.recipe_config import RecipeConfig, parse_recipe_config
cfg = parse_recipe_config(ctx)
# Now you have:
# cfg.mode - "run", "batch", or "local"
# cfg.attached - True if --run, False if --batch
# cfg.profile - The env profile name
# cfg.passthrough - Args to pass through to script
# cfg.dry_run - True if --dry-run
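The idea of "late globals" can be illustrated with a small standalone sketch: global flags such as `--run` are recognized wherever they appear after the subcommand, `key=value` tokens are collected as dotlist overrides, and everything else is passed through. The names `MiniRecipeConfig` and `split_args` are hypothetical, and the assumption that `--run` and `--batch` each take a profile name is mine; the real `parse_recipe_config` works on a Typer context and may behave differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MiniRecipeConfig:
    """Hypothetical, reduced stand-in for RecipeConfig."""

    mode: str = "local"  # "run", "batch", or "local"
    profile: Optional[str] = None
    dry_run: bool = False
    dotlist: List[str] = field(default_factory=list)
    passthrough: List[str] = field(default_factory=list)

    @property
    def attached(self) -> bool:
        # True for --run, False for --batch (local mode is treated as False here).
        return self.mode == "run"


def split_args(argv: List[str]) -> MiniRecipeConfig:
    """Sort tokens into late globals, dotlist overrides, and passthrough args."""
    cfg = MiniRecipeConfig()
    args = iter(argv)
    for arg in args:
        if arg in ("--run", "--batch"):
            cfg.mode = arg[2:]
            # Assumption: the next token names the env profile.
            cfg.profile = next(args, None)
        elif arg == "--dry-run":
            cfg.dry_run = True
        elif "=" in arg and not arg.startswith("-"):
            cfg.dotlist.append(arg)
        else:
            cfg.passthrough.append(arg)
    return cfg


cfg = split_args(["train.iters=100", "--run", "my-cluster", "--dry-run", "--foo"])
```

Here the globals are picked up even though they follow an override, which is the behavior the "late globals" label describes.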
RecipeTyper#
Standardizes command registration with proper context settings and rich help panels:
from nemo_runspec.recipe_typer import RecipeTyper, RecipeMeta
app = RecipeTyper(name="nano3", help="Nano3 training recipes")
app.add_recipe_command(
    pretrain,
    meta=PRETRAIN_META,  # RecipeMeta with config_dir, artifacts, etc.
    rich_help_panel="Training Stages",
)
Config Pipeline#
Config loading uses nemo_runspec.config:
from nemo_runspec.config import parse_config, build_job_config, extract_train_config, save_configs
# 1. Load YAML config with dotlist overrides
train_config = parse_config(ctx, config_dir, default_config)
# 2. Build full job config with provenance (env profile, CLI args, etc.)
job_config = build_job_config(train_config, ctx, recipe_name, script_path, argv, env_profile=env)
# 3. Extract clean train config for the script
train_config_for_script = extract_train_config(job_config, for_remote=True)
# 4. Save both configs to job directory
job_path, train_path = save_configs(job_config, train_config_for_script, job_dir)
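The four steps above can be sketched with plain dictionaries to show the data flow: overrides are merged into the loaded config, provenance is attached to form the job config, and the provenance is stripped again before the config is handed to the script. This is only an illustration. The helper names and the `"run"` provenance key are assumptions, and the real pipeline is built on OmegaConf rather than on dicts.

```python
import ast
import copy


def apply_dotlist(config: dict, overrides: list) -> dict:
    """Return a copy of config with 'a.b.c=value' overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            value = ast.literal_eval(raw)  # numbers, booleans, lists, ...
        except (ValueError, SyntaxError):
            value = raw  # anything else stays a plain string
        node = result
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


def build_job(train_config: dict, recipe_name: str, argv: list, env_profile: str) -> dict:
    """Attach provenance under a hypothetical 'run' key."""
    job = copy.deepcopy(train_config)
    job["run"] = {"recipe": recipe_name, "argv": list(argv), "env": env_profile}
    return job


def extract_train(job_config: dict) -> dict:
    """Drop provenance so the script only sees its own settings."""
    return {k: copy.deepcopy(v) for k, v in job_config.items() if k != "run"}


base = {"train": {"iters": 10, "lr": 0.1}}
argv = ["train.iters=100", "model.name=nano"]
train = apply_dotlist(base, argv)
job = build_job(train, "nano3/pretrain", argv, "my-cluster")
script_config = extract_train(job)
```

Keeping the job config and the script config as separate objects means the provenance record can grow without changing what the training script has to parse.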
Directory Structure#
src/nemotron/
├── cli/  # EXECUTION LAYER
│   ├── bin/
│   │   └── nemotron.py  # Main entry point (typer app)
│   ├── commands/
│   │   ├── evaluate.py  # Top-level evaluate command
│   │   ├── nano3/
│   │   │   ├── _typer_group.py  # Command registration (RecipeTyper)
│   │   │   ├── pretrain.py  # Pretrain execution logic
│   │   │   ├── sft.py  # SFT execution logic
│   │   │   ├── rl.py  # RL execution logic (Ray)
│   │   │   ├── eval.py  # Evaluation command
│   │   │   ├── pipe.py  # Pipeline: pretrain → sft composition
│   │   │   ├── data/
│   │   │   │   ├── prep/  # Data prep commands
│   │   │   │   │   ├── pretrain.py
│   │   │   │   │   ├── sft.py
│   │   │   │   │   └── rl.py
│   │   │   │   └── import_/  # Data import commands
│   │   │   │       ├── pretrain.py
│   │   │   │       ├── sft.py
│   │   │   │       └── rl.py
│   │   │   └── model/  # Model import/eval commands
│   │   │       ├── eval.py
│   │   │       └── import_/
│   │   │           ├── pretrain.py
│   │   │           ├── sft.py
│   │   │           └── rl.py
│   │   └── omni3/
│   │       ├── _typer_group.py  # Registers build, sft, eval, and RL sub-groups
│   │       ├── build.py  # Stage-local container build dispatcher
│   │       ├── sft.py  # Omni SFT execution logic
│   │       ├── eval.py  # Omni evaluation command
│   │       ├── data/  # Data prep commands
│   │       ├── model/  # Model import/export/eval commands
│   │       └── rl/  # RL MPO / text / vision sub-stages
│   └── kit/  # Kit CLI commands (squash, etc.)
│
├── recipes/  # RUNTIME LAYER
│   ├── data/
│   │   ├── curation/  # Data curation pipelines (for example nemotron-cc)
│   │   └── sdg/  # Synthetic data generation pipelines
│   ├── omni3/
│   │   ├── stage0_sft/
│   │   │   ├── Dockerfile  # Stage-owned container recipe
│   │   │   ├── build.py  # Exports the OCI archive used downstream
│   │   │   ├── train.py  # -> Megatron-Bridge
│   │   │   └── data_prep.py  # -> Valor32k staging / validation
│   │   ├── stage1_rl/
│   │   │   ├── Dockerfile  # Shared RL container recipe
│   │   │   ├── build.py  # Exports the RL OCI archive
│   │   │   ├── data_prep.py  # -> RL data prep dispatcher
│   │   │   ├── stage1_mpo/
│   │   │   ├── stage2_text_rl/
│   │   │   └── stage3_vision_rl/
│   │   # NOTE: omni3 eval stage is on the roadmap; not yet present
│   └── nano3/
│       ├── stage0_pretrain/
│       │   ├── train.py  # -> Megatron-Bridge
│       │   └── data_prep.py  # -> Data preparation
│       ├── stage1_sft/
│       │   ├── train.py  # -> Megatron-Bridge
│       │   └── data_prep.py  # -> Data preparation
│       └── stage2_rl/
│           ├── train.py  # -> NeMo-RL
│           └── data_prep.py  # -> Data preparation

src/nemo_runspec/  # SHARED TOOLKIT
├── _parser.py  # PEP 723 [tool.runspec] parsing
├── _models.py  # Runspec, RunspecRun, RunspecConfig, RunspecResources
├── config/  # Config loading and OmegaConf resolvers
│   ├── loader.py  # parse_config, build_job_config, save_configs
│   └── resolvers.py  # ${art:...}, ${auto_mount:...}
├── env.py  # env.toml profile loading with inheritance
├── cli_context.py  # GlobalContext (shared CLI state)
├── recipe_config.py  # RecipeConfig + parse_recipe_config
├── recipe_typer.py  # RecipeTyper + RecipeMeta
├── help.py  # Rich help panels
├── display.py  # Dry-run and job submission display
├── execution.py  # Startup commands, env vars, executor creation
├── run.py  # RunConfig, nemo-run patches
├── packaging/  # SelfContainedPackager, CodePackager
├── squash.py  # Container squash utilities
├── pipeline.py  # Pipeline orchestration
├── step.py  # Step definition for pipelines
├── evaluator.py  # NeMo Evaluator integration
├── artifact_registry.py  # ArtifactRegistry (fsspec/wandb)
└── exceptions.py  # ArtifactNotFoundError, etc.
Execution Patterns#
Training (Slurm + torchrun)#
Pretrain and SFT run on Slurm with the torchrun launcher:
# cli/commands/nano3/pretrain.py, cli/commands/nano3/sft.py
executor = create_executor(env=env, env_vars=env_vars, packager=packager, ...)
script_task = run.Script(path="main.py", args=[...], entrypoint="python")
with run.Experiment(recipe_name) as exp:
    exp.add(script_task, executor=executor)
    exp.run()
RL (Ray)#
RL uses Ray for distributed execution:
# cli/commands/nano3/rl.py
from nemo_run.run.ray.job import RayJob
executor = create_executor(env=env, ...) # Still Slurm for infrastructure
ray_job = RayJob(name=job_name, executor=executor)
ray_job.start(
    command=cmd,
    workdir=str(Path.cwd()) + "/",
    pre_ray_start_commands=setup_commands,
)
Data Prep (Ray + Code Packager)#
Data prep uses Ray with full codebase rsync. The [tool.runspec] in the data prep script declares launch = "ray":
# cli/commands/nano3/data/prep/pretrain.py
SPEC = parse_runspec(SCRIPT_PATH) # launch=ray, cmd template from [tool.runspec]
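Since each script declares its own launch method, a command file can branch on that value. The sketch below shows one simplified way a declared `launch` of "torchrun" or "ray" might be turned into a command line. The `build_command` helper is hypothetical and is not how nemo_runspec builds commands (the real flow goes through nemo-run executors and, for Ray, a command template from `[tool.runspec]`); only the two launch values come from the text above.

```python
def build_command(launch: str, script: str, config_path: str, nproc_per_node: int = 8) -> list:
    """Map a declared launch method to an argv list (illustrative only)."""
    script_args = [script, "--config", config_path]
    if launch == "torchrun":
        # torchrun spawns one worker process per GPU on each node.
        return ["torchrun", f"--nproc_per_node={nproc_per_node}", *script_args]
    if launch == "ray":
        # With Ray, a single Python driver runs and Ray distributes the work;
        # the Ray cluster itself is assumed to be started separately.
        return ["python", *script_args]
    raise ValueError(f"unsupported launch method: {launch!r}")


train_cmd = build_command("torchrun", "main.py", "config.yaml", nproc_per_node=4)
prep_cmd = build_command("ray", "data_prep.py", "config.yaml")
```

Raising on unknown values keeps a typo in the metadata from silently falling back to a default launcher.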
How to Fork for Different Backends#
Example: Replace nemo-run with SkyPilot#
Read the current execution logic in cli/commands/nano3/pretrain.py.
Replace _execute_remote() with SkyPilot equivalents:
# cli/commands/nano3/pretrain.py (forked for SkyPilot)
def _execute_skypilot(cfg: RecipeConfig):
    import sky

    # Config loading stays the same
    train_config = parse_config(cfg.ctx, SPEC.config_dir, SPEC.config.default)
    job_config = build_job_config(train_config, ...)
    job_dir = generate_job_dir(SPEC.name)
    _, train_path = save_configs(job_config, ..., job_dir)

    task = sky.Task(
        run="python main.py --config config.yaml",
        workdir=str(job_dir),
        num_nodes=env_config.get("nodes", 1),
    )
    task.set_resources(sky.Resources(
        cloud=sky.AWS(),
        accelerators=f"A100:{env_config.get('gpus_per_node', 8)}",
    ))
    sky.launch(task, cluster_name="nano3-pretrain")
Keep config loading – nemo_runspec.config works with any backend.
Keep the [tool.runspec] block – metadata stays with the script.
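One way to keep a fork small is to isolate the backend behind a narrow interface, so that config loading and metadata parsing stay untouched while only the submission step changes. The following is a sketch under that assumption. `Backend`, `RecordingBackend`, and `launch` are hypothetical names for illustration and do not exist in Nemotron or nemo_runspec.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Protocol, Tuple


class Backend(Protocol):
    """Anything that can submit a command for a prepared job directory."""

    def submit(self, job_dir: Path, command: List[str]) -> str: ...


@dataclass
class RecordingBackend:
    """Test double that records submissions instead of launching anything."""

    submitted: List[Tuple[Path, List[str]]] = field(default_factory=list)

    def submit(self, job_dir: Path, command: List[str]) -> str:
        self.submitted.append((job_dir, command))
        return f"dry-run-{len(self.submitted)}"


def launch(backend: Backend, job_dir: Path, config_name: str = "config.yaml") -> str:
    """Backend-independent part: build the command, then hand it off."""
    command = ["python", "main.py", "--config", config_name]
    return backend.submit(job_dir, command)


backend = RecordingBackend()
job_id = launch(backend, Path("/tmp/nano3-pretrain"))
```

A SkyPilot or nemo-run backend would implement the same `submit` method, and a recording backend like this one doubles as a simple dry-run mode.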
CLI Behavior Reference#
| Feature | How It Works |
|---|---|
| Late globals | |
| Dotlist overrides | Applied during |
| Packager selection | |
| Ray execution | Visible in |
| Build verb | Family-specific |
| Omni3 family layout | Top-level |
| Rich help panels | |
| env.toml profiles | Loaded via |