CLI Architecture#
This document explains how Nemotron's CLI layer works and how to modify it for your needs.
Overview#
The CLI layer (`src/nemotron/cli/`) handles execution: how jobs are submitted and tracked. Each command file contains visible execution logic, making it easy to understand and modify.
The shared toolkit lives in the `nemo_runspec` package: config loading, `env.toml` profiles, execution helpers, packaging, and display utilities. CLI commands import from `nemo_runspec` and wire things together explicitly.
Design Principle: Visible Execution#
Nemotron makes execution explicit: all nemo-run setup lives directly in each command function:
```python
# src/nemotron/cli/commands/nano3/pretrain.py
from nemo_runspec import parse as parse_runspec
from nemo_runspec.config import parse_config, build_job_config, save_configs
from nemo_runspec.execution import create_executor, execute_local, build_env_vars

# Metadata comes from PEP 723 [tool.runspec] in the script itself
SCRIPT_PATH = "src/nemotron/recipes/nano3/stage0_pretrain/train.py"
SPEC = parse_runspec(SCRIPT_PATH)

# Execution logic is VISIBLE in the function
def _execute_pretrain(cfg: RecipeConfig, *, experiment=None):
    # 1. Parse configuration
    train_config = parse_config(cfg.ctx, SPEC.config_dir, SPEC.config.default)
    job_config = build_job_config(train_config, cfg.ctx, SPEC.name, ...)

    # 2. Build executor - THIS IS WHAT YOU'D CHANGE FOR SKYPILOT
    executor = create_executor(env=env, env_vars=env_vars, packager=packager, ...)

    # 3. Run experiment
    with run.Experiment(recipe_name) as exp:
        exp.add(script_task, executor=executor)
        exp.run(detach=not attached)
```
Components#
Runspec (PEP 723 metadata)#
Recipe scripts are self-describing. Each script declares its identity, container image, launch method, and config location via a `[tool.runspec]` TOML block:
```toml
# /// script
# [tool.runspec]
# name = "nano3/pretrain"
# image = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
#
# [tool.runspec.run]
# launch = "torchrun"
#
# [tool.runspec.config]
# dir = "./config"
# default = "default"
# ///
```
The CLI reads this via `nemo_runspec.parse()`, which returns a frozen `Runspec` dataclass. See the `nemo_runspec` package for the full schema.
RecipeConfig#
Parsed CLI options. Handles late globals (`--run` after the subcommand) and dotlist overrides:
```python
from nemo_runspec.recipe_config import RecipeConfig, parse_recipe_config

cfg = parse_recipe_config(ctx)

# Now you have:
# cfg.mode        - "run", "batch", or "local"
# cfg.attached    - True if --run, False if --batch
# cfg.profile     - The env profile name
# cfg.passthrough - Args to pass through to script
# cfg.dry_run     - True if --dry-run
```
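For orientation, the shape of the parsed result can be pictured as a frozen dataclass along these lines. This is an illustrative stand-in using only the field names listed above, not the real `RecipeConfig` (which also carries the CLI context and overrides):

```python
from dataclasses import dataclass

# Illustrative stand-in for nemo_runspec.recipe_config.RecipeConfig.
@dataclass(frozen=True)
class RecipeConfigSketch:
    mode: str = "run"          # "run", "batch", or "local"
    attached: bool = True      # True if --run, False if --batch
    profile: str = "default"   # env profile name
    passthrough: tuple = ()    # args passed through to the script
    dry_run: bool = False

# A --batch invocation would produce something like:
cfg = RecipeConfigSketch(mode="batch", attached=False)
```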
RecipeTyper#
Standardizes command registration with proper context settings and rich help panels:
```python
from nemo_runspec.recipe_typer import RecipeTyper, RecipeMeta

app = RecipeTyper(name="nano3", help="Nano3 training recipes")
app.add_recipe_command(
    pretrain,
    meta=PRETRAIN_META,  # RecipeMeta with config_dir, artifacts, etc.
    rich_help_panel="Training Stages",
)
```
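The registration pattern itself is simple; a toy pure-Python sketch of the idea (a metadata dataclass plus a command registry) is below. The class names and layout are illustrative, not the real `nemo_runspec.recipe_typer` API, which wraps a Typer app:

```python
from dataclasses import dataclass

# Illustrative stand-ins for RecipeMeta and RecipeTyper.
@dataclass(frozen=True)
class RecipeMetaSketch:
    config_dir: str
    artifacts: tuple = ()

class RecipeRegistrySketch:
    def __init__(self, name: str, help: str):
        self.name, self.help = name, help
        self.commands: dict[str, tuple] = {}

    def add_recipe_command(self, fn, *, meta, rich_help_panel=None):
        # Record the command alongside its metadata and help panel.
        self.commands[fn.__name__] = (fn, meta, rich_help_panel)

def pretrain():
    return "pretrain ran"

app = RecipeRegistrySketch(name="nano3", help="Nano3 training recipes")
app.add_recipe_command(
    pretrain,
    meta=RecipeMetaSketch(config_dir="./config"),
    rich_help_panel="Training Stages",
)
```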
Config Pipeline#
Config loading uses `nemo_runspec.config`:
```python
from nemo_runspec.config import parse_config, build_job_config, extract_train_config, save_configs

# 1. Load YAML config with dotlist overrides
train_config = parse_config(ctx, config_dir, default_config)

# 2. Build full job config with provenance (env profile, CLI args, etc.)
job_config = build_job_config(train_config, ctx, recipe_name, script_path, argv, env_profile=env)

# 3. Extract clean train config for the script
train_config_for_script = extract_train_config(job_config, for_remote=True)

# 4. Save both configs to job directory
job_path, train_path = save_configs(job_config, train_config_for_script, job_dir)
```
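The dotlist-override step in (1) follows the usual OmegaConf-style convention of `a.b.c=value` keys merged on top of the loaded YAML. A toy pure-Python illustration of that idea (not the actual loader, which uses OmegaConf and coerces value types):

```python
import copy

# Toy stand-in for the dotlist-override step: apply "a.b.c=value"
# overrides on top of a nested dict loaded from YAML.
def apply_dotlist(config: dict, overrides: list[str]) -> dict:
    merged = copy.deepcopy(config)
    for item in overrides:
        key, _, value = item.partition("=")
        node = merged
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        # Values stay strings in this sketch; OmegaConf would coerce types.
        node[leaf] = value
    return merged

base = {"trainer": {"max_steps": 1000, "precision": "bf16"}}
cfg = apply_dotlist(base, ["trainer.max_steps=50", "data.seq_len=4096"])
```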
Directory Structure#
```text
src/nemotron/
├── cli/                          # EXECUTION LAYER
│   ├── bin/
│   │   └── nemotron.py           # Main entry point (typer app)
│   ├── commands/
│   │   ├── evaluate.py           # Top-level evaluate command
│   │   └── nano3/
│   │       ├── _typer_group.py   # Command registration (RecipeTyper)
│   │       ├── pretrain.py       # Pretrain execution logic
│   │       ├── sft.py            # SFT execution logic
│   │       ├── rl.py             # RL execution logic (Ray)
│   │       ├── eval.py           # Evaluation command
│   │       ├── pipe.py           # Pipeline: pretrain → sft composition
│   │       ├── data/
│   │       │   ├── prep/         # Data prep commands
│   │       │   │   ├── pretrain.py
│   │       │   │   ├── sft.py
│   │       │   │   └── rl.py
│   │       │   └── import_/      # Data import commands
│   │       │       ├── pretrain.py
│   │       │       ├── sft.py
│   │       │       └── rl.py
│   │       └── model/            # Model import/eval commands
│   │           ├── eval.py
│   │           └── import_/
│   │               ├── pretrain.py
│   │               ├── sft.py
│   │               └── rl.py
│   └── kit/                      # Kit CLI commands (squash, etc.)
│
└── recipes/                      # RUNTIME LAYER
    └── nano3/
        ├── stage0_pretrain/
        │   ├── train.py          # -> Megatron-Bridge
        │   └── data_prep.py      # -> Data preparation
        ├── stage1_sft/
        │   ├── train.py          # -> Megatron-Bridge
        │   └── data_prep.py      # -> Data preparation
        └── stage2_rl/
            ├── train.py          # -> NeMo-RL
            └── data_prep.py      # -> Data preparation

src/nemo_runspec/                 # SHARED TOOLKIT
├── _parser.py                    # PEP 723 [tool.runspec] parsing
├── _models.py                    # Runspec, RunspecRun, RunspecConfig, RunspecResources
├── config/                       # Config loading and OmegaConf resolvers
│   ├── loader.py                 # parse_config, build_job_config, save_configs
│   └── resolvers.py              # ${art:...}, ${auto_mount:...}
├── env.py                        # env.toml profile loading with inheritance
├── cli_context.py                # GlobalContext (shared CLI state)
├── recipe_config.py              # RecipeConfig + parse_recipe_config
├── recipe_typer.py               # RecipeTyper + RecipeMeta
├── help.py                       # Rich help panels
├── display.py                    # Dry-run and job submission display
├── execution.py                  # Startup commands, env vars, executor creation
├── run.py                        # RunConfig, nemo-run patches
├── packaging/                    # SelfContainedPackager, CodePackager
├── squash.py                     # Container squash utilities
├── pipeline.py                   # Pipeline orchestration
├── step.py                       # Step definition for pipelines
├── evaluator.py                  # NeMo Evaluator integration
├── artifact_registry.py          # ArtifactRegistry (fsspec/wandb)
└── exceptions.py                 # ArtifactNotFoundError, etc.
```
Execution Patterns#
Training (Slurm + torchrun)#
Pretrain and SFT use Slurm with the torchrun launcher:
```python
# cli/commands/nano3/pretrain.py, cli/commands/nano3/sft.py
executor = create_executor(env=env, env_vars=env_vars, packager=packager, ...)
script_task = run.Script(path="main.py", args=[...], entrypoint="python")

with run.Experiment(recipe_name) as exp:
    exp.add(script_task, executor=executor)
    exp.run()
```
RL (Ray)#
RL uses Ray for distributed execution:
```python
# cli/commands/nano3/rl.py
from nemo_run.run.ray.job import RayJob

executor = create_executor(env=env, ...)  # Still Slurm for infrastructure
ray_job = RayJob(name=job_name, executor=executor)
ray_job.start(
    command=cmd,
    workdir=str(Path.cwd()) + "/",
    pre_ray_start_commands=setup_commands,
)
```
Data Prep (Ray + Code Packager)#
Data prep uses Ray with full codebase rsync. The `[tool.runspec]` block in the data prep script declares `launch = "ray"`:
```python
# cli/commands/nano3/data/prep/pretrain.py
SPEC = parse_runspec(SCRIPT_PATH)  # launch=ray, cmd template from [tool.runspec]
```
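The `launch` field is what lets one CLI drive several execution styles. A hedged sketch of the kind of dispatch this enables is below; the function and command templates are illustrative (the real CLI wires execution up explicitly per command rather than through a generic dispatcher):

```python
# Illustrative dispatch on the [tool.runspec.run] launch field.
def build_launch_command(launch: str, script: str, nproc: int = 8) -> list[str]:
    if launch == "torchrun":
        return ["torchrun", f"--nproc-per-node={nproc}", script]
    if launch == "ray":
        return ["ray", "job", "submit", "--", "python", script]
    if launch == "python":
        return ["python", script]
    raise ValueError(f"unknown launch method: {launch}")

cmd = build_launch_command("torchrun", "train.py", nproc=4)
```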
How to Fork for Different Backends#
Example: Replace nemo-run with SkyPilot#
1. Read the current execution logic in `cli/commands/nano3/pretrain.py`.
2. Replace `_execute_remote()` with SkyPilot equivalents:
```python
# cli/commands/nano3/pretrain.py (forked for SkyPilot)
def _execute_skypilot(cfg: RecipeConfig):
    import sky

    # Config loading stays the same
    train_config = parse_config(cfg.ctx, SPEC.config_dir, SPEC.config.default)
    job_config = build_job_config(train_config, ...)
    job_dir = generate_job_dir(SPEC.name)
    _, train_path = save_configs(job_config, ..., job_dir)

    task = sky.Task(
        run="python main.py --config config.yaml",
        workdir=str(job_dir),
        num_nodes=env_config.get("nodes", 1),
    )
    task.set_resources(sky.Resources(
        cloud=sky.AWS(),
        accelerators=f"A100:{env_config.get('gpus_per_node', 8)}",
    ))
    sky.launch(task, cluster_name="nano3-pretrain")
```
- Keep config loading: `nemo_runspec.config` works with any backend.
- Keep the `[tool.runspec]` block: metadata stays with the script.
CLI Behavior Reference#
| Feature | How It Works |
|---|---|
| Late globals | Handled by `parse_recipe_config`, so `--run` works after the subcommand |
| Dotlist overrides | Applied during `parse_config` when the YAML config is loaded |
| Packager selection | Per command: `SelfContainedPackager` for training, `CodePackager` (full codebase rsync) for data prep |
| Ray execution | Visible in `rl.py` and the data prep commands |
| Rich help panels | Registered through `RecipeTyper` with `rich_help_panel` |
| env.toml profiles | Loaded via `nemo_runspec.env`, with profile inheritance |