# Architecture
**Audience:** Contributors adding new executors, and users who want to understand why something is failing or how to extend NeMo-Run.

**Prerequisite:** Read Execution, at least one executor guide, and Management first.
## `run.run()` vs `run.Experiment`
`run.run()` is a thin convenience wrapper. Internally it creates an `Experiment` with a single task and `detach=False`:

```python
# These two are equivalent
run.run(task, executor=executor)

with run.Experiment("untitled") as exp:
    exp.add(task, executor=executor)
    exp.run(detach=False)
```
All the mechanics described below apply to both.
## Call chain
```mermaid
flowchart TD
    A["exp.run()"] --> B["Experiment._prepare()"]
    B --> C["Job.prepare()"]
    C --> D["executor.assign(exp_id, exp_dir, task_id, task_dir)"]
    C --> E["executor.create_job_dir()"]
    C --> F["package(task, executor) → AppDef + Role(s)"]
    A --> G["Job.launch(runner)"]
    G --> H["runner.dryrun(AppDef, scheduler_name, cfg=executor)"]
    H --> I["scheduler.submit_dryrun(AppDef, executor)"]
    G --> J["runner.schedule(dryrun_info)"]
    J --> K["scheduler.schedule(dryrun_info) → AppHandle"]
```
`_prepare()` calls `Job.prepare()` for each task, which assigns experiment/job directories, syncs code, and builds the TorchX `AppDef`. `Job.launch(runner)` calls `runner.dryrun()` to validate the submission plan, then `runner.schedule()` to submit it. The `AppHandle` returned by `scheduler.schedule()` is stored in the experiment metadata so `Experiment.from_id()` can reconnect.
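The dryrun/schedule split can be sketched with a toy runner and scheduler. The names mirror the diagram above, but this is an illustrative stand-in, not NeMo-Run's actual implementation:

```python
# Toy sketch of two-phase submission: dryrun() validates and builds the
# scheduler-specific request; schedule() actually submits it.
class ToyScheduler:
    def submit_dryrun(self, app, cfg):
        # Validate the AppDef and build a scheduler-specific request.
        return {"app": app, "cfg": cfg, "validated": True}

    def schedule(self, dryrun_info):
        # Submit the validated plan; return an AppHandle-style string.
        return f"toy://session/{dryrun_info['app']}"


class ToyRunner:
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def dryrun(self, app, cfg):
        return self.scheduler.submit_dryrun(app, cfg)

    def schedule(self, dryrun_info):
        return self.scheduler.schedule(dryrun_info)


runner = ToyRunner(ToyScheduler())
info = runner.dryrun("my_app", cfg={"partition": "gpu"})
handle = runner.schedule(info)
```

Separating validation from submission is what makes a dry run possible: the same `dryrun_info` that would be submitted can be inspected first.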
## Executor → TorchX scheduler mapping
Each executor is backed by a TorchX scheduler registered as an entry point in `pyproject.toml` under `torchx.schedulers`:
| Executor | TorchX scheduler |
|---|---|
| `LocalExecutor` | `local_persistent` |
| `SlurmExecutor` | `slurm_tunnel` |
| `SkypilotExecutor` | `skypilot` |
| `DockerExecutor` | `docker_persistent` |
| `DGXCloudExecutor` | `dgx_cloud` |
| `LeptonExecutor` | `lepton` |
Schedulers are discovered at runtime via `torchx.schedulers.get_scheduler_factories()`.
## Key TorchX types
| Type | What it represents |
|---|---|
| `AppDef` | Full application: list of `Role`s plus a name |
| `Role` | One execution unit: entrypoint, args, env, image, num_replicas, resources |
| `AppDryRunInfo` | Validated submission request produced by `submit_dryrun()` |
| `AppHandle` | Running job ID: `{scheduler}://{session}/{app_id}` |
| `AppState` | Status enum: `PENDING`, `RUNNING`, `SUCCEEDED`, `FAILED`, ... |
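An `AppHandle` is just a string of the form `{scheduler}://{session}/{app_id}`, so it can be pulled apart with plain string operations (the handle value below is illustrative):

```python
# Split an AppHandle of the form {scheduler}://{session}/{app_id}.
handle = "slurm_tunnel://my_session/job_42"  # illustrative values
scheduler, _, rest = handle.partition("://")
session, _, app_id = rest.partition("/")
```

This is why a stored handle is enough to reattach to a job later: it names the scheduler backend, the session, and the backend-specific job ID in one string.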
## How `Executor` fields map to TorchX concepts
| Executor field | TorchX mapping |
|---|---|
| `env_vars` | `Role.env` |
| `retries` | `Role.max_retries` |
| `packager` | Pre-launch code sync strategy |
| `job_dir` | Sets path metadata consumed by the scheduler |
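The shape of the mapping can be sketched with plain dataclasses standing in for `torchx.specs.Role`/`AppDef` (field names and values below are illustrative, not NeMo-Run's packaging code):

```python
from dataclasses import dataclass, field


@dataclass
class Role:  # stand-in for torchx.specs.Role
    name: str
    image: str
    entrypoint: str
    args: list = field(default_factory=list)
    env: dict = field(default_factory=dict)
    num_replicas: int = 1


@dataclass
class AppDef:  # stand-in for torchx.specs.AppDef
    name: str
    roles: list


# Packaging a task against an executor fills Role fields from executor fields:
executor_env = {"NCCL_DEBUG": "INFO"}  # e.g. taken from Executor.env_vars
role = Role(name="train", image="some/image", entrypoint="python",
            args=["train.py"], env=executor_env)
app = AppDef(name="my_experiment", roles=[role])
```

The executor thus never talks to the cluster directly during packaging; it only parameterizes the `Role`(s) that the scheduler later submits.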
## Metadata storage layout
All experiment metadata is written under `NEMORUN_HOME` (default `~/.nemo_run`):

```text
~/.nemo_run/experiments/{title}/{title}_{exp_id}/
├── {task_id}/
│   ├── configs/
│   │   ├── {task_id}_executor.yaml  # serialised executor config
│   │   ├── {task_id}_fn_or_script   # zlib-JSON encoded task
│   │   └── {task_id}_packager       # zlib-JSON encoded packager
│   └── scripts/{task_id}.sh         # generated sbatch/shell script
└── .tasks                           # serialised Job metadata (JSON)
```
`Experiment.from_id()` reads `.tasks` to reconstruct the experiment and reattach to live jobs via the stored `AppHandle`.
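Given the layout above, an experiment's directory can be derived from its title and ID. A small sketch (the helper name and sample values are illustrative, not part of NeMo-Run's API):

```python
import os


def experiment_dir(title: str, exp_id: str) -> str:
    # NEMORUN_HOME defaults to ~/.nemo_run, matching the layout above.
    home = os.environ.get("NEMORUN_HOME", os.path.expanduser("~/.nemo_run"))
    return os.path.join(home, "experiments", title, f"{title}_{exp_id}")


path = experiment_dir("my_exp", "1700000000")
```

This is handy when inspecting a failed run by hand: the generated scripts and serialized configs sit directly under this directory.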
## Adding a new executor
1. Subclass `Executor` in `nemo_run/core/execution/`:

   ```python
   from dataclasses import dataclass

   from nemo_run.core.execution.base import Executor


   @dataclass
   class MyExecutor(Executor):
       my_param: str = "default"
       ...
   ```

2. Implement a TorchX `Scheduler` in `nemo_run/run/torchx_backend/schedulers/`:

   ```python
   from torchx.schedulers.api import Scheduler


   class MyScheduler(Scheduler):
       def submit_dryrun(self, app, cfg): ...
       def schedule(self, dryrun_info): ...
       def describe(self, app_id): ...
       def cancel(self, app_id): ...
   ```

3. Register the scheduler as an entry point in `pyproject.toml`:

   ```toml
   [project.entry-points."torchx.schedulers"]
   my_scheduler = "nemo_run.run.torchx_backend.schedulers.my:create_scheduler"
   ```

4. Add to `EXECUTOR_MAPPING` in `nemo_run/run/torchx_backend/schedulers/api.py`:

   ```python
   EXECUTOR_MAPPING = {
       # ...existing entries...
       MyExecutor: "my_scheduler",
   }
   ```