nemo_run#
This module serves as the main entrypoint for the NeMo-Run Python library, providing programmatic access to its core functionalities for configuring, packaging, and launching experiments across various execution environments.
Classes#
| Class | Description |
|---|---|
| LazyEntrypoint | A class for lazy initialization and configuration of entrypoints. |
| Config | Wrapper around fdl.Config with nemo_run specific functionality. |
| ConfigurableMixin | A mixin class that provides configuration and visualization functionality. |
| Partial | Wrapper around fdl.Partial with nemo_run specific functionality. |
| Script | Dataclass to configure raw scripts. |
| Executor | Base dataclass for configuration of an executor. |
| ExecutorMacros | Defines macros that are replaced with executor-specific values at execution time. |
| DGXCloudExecutor | Dataclass to configure a DGX Executor. |
| DockerExecutor | Dataclass to configure a Docker-based executor. |
| FaultTolerance | A launcher that adds fault tolerance to the launched job. |
| SlurmRay | Transforms a provided cmd into a Ray launcher bash script for SlurmExecutor. |
| SlurmTemplate | A generic launcher that uses Jinja2 templates to wrap commands. |
| Torchrun | A launcher that launches the task using torchrun. |
| LeptonExecutor | Dataclass to configure a Lepton Executor. |
| LocalExecutor | Dataclass to configure a local executor. |
| SkypilotExecutor | Dataclass to configure a Skypilot Executor. |
| SlurmExecutor | Dataclass to configure a Slurm cluster. |
| SkypilotJobsExecutor | Dataclass to configure a Skypilot Jobs Executor. |
| GitArchivePackager | Uses git archive for packaging your code. |
| HybridPackager | A packager that combines multiple other packagers into one final archive. |
| Packager | Base class for packaging your code. |
| PatternPackager | Packages all files matching the specified pattern. |
| LocalTunnel | Local tunnel for supported executors; executes all commands locally. |
| SSHTunnel | SSH tunnel for supported executors. |
| Experiment | A context manager to launch and manage multiple runs, all using pure Python. |
| Plugin | A base class for plugins that can be used to modify experiments, tasks, and executors. |
Functions#
| Function | Description |
|---|---|
| autoconvert | A decorator that converts a function's return value into a nested run.Config (or run.Partial when partial=True). |
| import_executor | Retrieves an executor instance by name from a specified or default Python file. |
| help | Outputs help for the passed Callable. |
| run | Runs a single configured function on the specified executor. |
Package Contents#
nemo_run.autoconvert(fn: Callable[P, nemo_run.config.Config[T]], *, partial: bool = False)
nemo_run.autoconvert(fn: Callable[P, nemo_run.config.Partial[T]], *, partial: bool = False)
nemo_run.autoconvert(fn: Callable[P, T], *, partial: bool = False)
nemo_run.autoconvert(*, partial: Literal[True] = ...)
nemo_run.autoconvert(*, partial: Literal[False] = False)
The autoconvert function is a powerful and flexible decorator for Python functions that can modify the behavior of the function it decorates by converting the returned object in a nested manner to: run.Config (when partial is False) or run.Partial (when partial is True). This conversion is done by a provided conversion function to_buildable_fn, which defaults to default_autoconfig_buildable. Under the hood, it uses fiddle’s autoconfig to parse the function’s AST and convert objects to their run.Config/run.Partial counterparts.
You can use it in two different ways:
Directly as a decorator for a function you define:
```python
@autoconvert
def my_func(param1: int, param2: str) -> MyType:
    return MyType(param1=param1, param2=param2)
```
This will return run.Config(MyType, param1=param1, param2=param2) when called, assuming that partial=False (otherwise, it would be a run.Partial instance).
Indirectly, as a way to convert an existing function:
```python
def my_func(param1: int, param2: str) -> MyType:
    return MyType(param1=param1, param2=param2)

my_new_func = autoconvert(partial=True)(my_func)
```
Now, calling my_new_func will actually return run.Partial(MyType, param1=param1, param2=param2) rather than a MyType instance.
Parameters:
- fn:
The function to be decorated. This parameter is optional, and if not provided, autoconvert acts as a decorator factory. Defaults to None.
- partial:
A boolean flag that indicates whether the return type of fn should be converted to Partial[T] (if True) or Config[T] (if False). Defaults to False.
- to_buildable_fn:
The conversion function to be used for the desired output type. This function takes another function and any positional and keyword arguments and returns an instance of either Config[T] or Partial[T]. By default, it uses default_autoconfig_buildable.
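For concreteness, here is a minimal end-to-end sketch; the Model dataclass and field names are illustrative:

```python
import nemo_run as run
from dataclasses import dataclass

@dataclass
class Model:
    hidden: int
    layers: int

@run.autoconvert
def base_model(hidden: int = 512) -> Model:
    # Returns run.Config(Model, ...) at call time, not a Model instance.
    return Model(hidden=hidden, layers=4)

cfg = base_model(hidden=1024)   # run.Config(Model, hidden=1024, layers=4)
cfg.layers = 8                  # configs stay editable until built
```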
class nemo_run.LazyEntrypoint(
    target: Callable | str,
    factory: Callable | str | None = None,
    yaml: str | omegaconf.DictConfig | pathlib.Path | None = None,
    overwrites: list[str] | None = None,
)
Bases: fiddle.Buildable
A class for lazy initialization and configuration of entrypoints.
This class allows for the creation of a configurable entrypoint that can be modified with overwrites, which are only applied when the resolve method is called.
- __getattr__(item: str) -> LazyEntrypoint [source]#
Handle attribute access by returning a new LazyEntrypoint with an updated path.
- Parameters:
item – The attribute name being accessed.
- Returns:
A new LazyEntrypoint instance with the updated path.
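A minimal sketch of the lazy workflow described above; the entrypoint string and overwrite syntax are illustrative assumptions:

```python
import nemo_run as run

# Hypothetical entrypoint path and overwrites; nothing is imported or
# validated until resolve() is called.
task = run.LazyEntrypoint(
    "my_pkg.train:main",
    overwrites=["trainer.max_steps=1000"],
)
# Attribute access returns a new lazy handle, so overwrites can be
# recorded without touching the real objects.
trainer = task.trainer
```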
class nemo_run.Config(
    fn_or_cls: fiddle.Buildable[_T] | fiddle._src.config.TypeOrCallableProducingT[_T],
    *args,
    bind_args: bool = True,
    **kwargs,
)
Bases: Generic[_T], fiddle.Config[_T], _CloneAndFNMixin, _VisualizeMixin
Wrapper around fdl.Config with nemo_run specific functionality. See fdl.Config for more.
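A short sketch of typical usage; fdl.build comes from fiddle, which nemo_run builds on:

```python
import fiddle as fdl
import nemo_run as run
from dataclasses import dataclass

@dataclass
class Optimizer:
    lr: float = 1e-3
    weight_decay: float = 0.0

# Config records the constructor call without executing it.
cfg = run.Config(Optimizer, lr=3e-4)
cfg.weight_decay = 0.01          # still editable before building
opt = fdl.build(cfg)             # Optimizer(lr=3e-4, weight_decay=0.01)
```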
- class nemo_run.ConfigurableMixin[source]#
Bases:
_VisualizeMixin
A mixin class that provides configuration and visualization functionality.
This mixin adds methods for converting objects to Config instances, visualizing configurations, and comparing configurations.
For classes that are not dataclasses, the to_config method needs to be overridden to provide custom conversion logic to Config instances.
- diff(old: typing_extensions.Self, trim=True, **kwargs)[source]#
Generate a visual difference between this configuration and an old one.
- Parameters:
old (Self) – The old configuration to compare against.
trim (bool, optional) – Whether to trim unchanged parts. Defaults to True.
**kwargs – Additional arguments to pass to render_diff.
- Returns:
A graph representing the differences between configurations.
- Return type:
graphviz.Digraph
- to_config() -> Config[typing_extensions.Self] [source]#
Convert the current object to a Config instance.
This method automatically converts dataclasses to Config instances. For classes that are not dataclasses, this method needs to be overridden to provide custom conversion logic.
- Returns:
A Config representation of the current object.
- Return type:
Config[Self]
- Raises:
NotImplementedError – If the object type cannot be converted to Config or if the method is not overridden for non-dataclass types.
Note
For classes that are not dataclasses, you must override this method to define how the object should be converted to a Config instance.
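For example, a dataclass that inherits the mixin gets to_config() for free (a minimal sketch):

```python
import nemo_run as run
from dataclasses import dataclass

@dataclass
class TrainingSettings(run.ConfigurableMixin):
    nodes: int = 1
    devices: int = 8

settings = TrainingSettings(nodes=2)
cfg = settings.to_config()   # a run.Config[TrainingSettings]
```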
class nemo_run.Partial(
    fn_or_cls: fiddle.Buildable[_T] | fiddle._src.config.TypeOrCallableProducingT[_T],
    *args,
    bind_args: bool = True,
    **kwargs,
)
Bases: Generic[_T], fiddle.Partial[_T], _CloneAndFNMixin, _VisualizeMixin
Wrapper around fdl.Partial with nemo_run specific functionality. See fdl.Partial for more.
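A short sketch: building a Partial yields a functools.partial-style callable rather than the call's result:

```python
import fiddle as fdl
import nemo_run as run

def train(lr: float = 1e-3, steps: int = 100) -> None:
    print(f"training with lr={lr} for {steps} steps")

partial = run.Partial(train, lr=3e-4)
fn = fdl.build(partial)   # a callable with lr bound to 3e-4
fn(steps=10)
```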
- class nemo_run.Script[source]#
Bases:
ConfigurableMixin
Dataclass to configure raw scripts.
Examples:
```python
file_based_script = run.Script("./scripts/echo.sh")

inline_script = run.Script(
    inline="""
env
echo "Hello 1"
echo "Hello 2"
"""
)
```
- class nemo_run.Executor[source]#
Bases:
nemo_run.config.ConfigurableMixin
Base dataclass for configuration of an executor. This cannot be used independently but you can use this as the base type to register executor factories.
See LocalExecutor and SlurmExecutor for examples.
- abstractmethod assign(exp_id: str, exp_dir: str, task_id: str, task_dir: str) -> None [source]#
This function will be called by run.Experiment to assign the executor for the specific experiment.
- abstractmethod nnodes() -> int [source]#
Helper function called by torchrun component to determine --nnodes.
- abstractmethod nproc_per_node() -> int [source]#
Helper function called by torchrun component to determine --nproc-per-node.
- macro_values() -> ExecutorMacros | None [source]#
Get macro values specific to the executor. This allows replacing common macros with executor specific vars for node ips, etc.
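A hypothetical subclass sketch showing the abstract surface; real executors implement much more (see LocalExecutor and SlurmExecutor):

```python
from dataclasses import dataclass
import nemo_run as run

@dataclass
class MyExecutor(run.Executor):
    nodes: int = 1
    devices: int = 8

    def assign(self, exp_id: str, exp_dir: str, task_id: str, task_dir: str) -> None:
        # Called by run.Experiment; record where this task should run.
        # The attribute name job_dir is an assumption for illustration.
        self.job_dir = task_dir

    def nnodes(self) -> int:
        return self.nodes

    def nproc_per_node(self) -> int:
        return self.devices
```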
- class nemo_run.ExecutorMacros[source]#
Bases:
nemo_run.config.ConfigurableMixin
Defines macros that are replaced with executor-specific values (for example, node IPs) at execution time.
nemo_run.import_executor(
    name: str,
    file_path: str | None = None,
    call: bool = True,
    **kwargs,
)
Retrieves an executor instance by name from a specified or default Python file. The file must contain either a function or an executor instance under the provided name.
This function dynamically imports the module at file_path, looks up the attribute matching name, returns the corresponding value, and optionally calls it if call is True.
This functionality allows you to define all your executors in a single file which lives separately from your codebase. It is similar to ~/.ssh/config and allows you to use executors across your projects without having to redefine them.
Example:
```python
executor = import_executor("local", file_path="path/to/executors.py")

# Uses the default location of os.path.join(get_nemorun_home(), "executors.py")
executor = import_executor("gpu")
```
- Parameters:
name (str) – The name of the executor to retrieve.
file_path (Optional[str]) –
The path to the Python file containing the executor definitions. Defaults to None, in which case the default location of os.path.join(get_nemorun_home(), "executors.py") is used.
The file_path is expected to be a path to a Python file (with a .py extension) that defines the named executors. The file can live anywhere in the file system; if not provided, it defaults to get_nemorun_home()/executors.py.
call (bool) – If True, the value from the module is called with the rest of the given kwargs.
- Returns:
The executor instance corresponding to the given name.
- Return type:
Executor
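For instance, a sketch of such a file; the names local and gpu are illustrative:

```python
# executors.py -- lives at get_nemorun_home()/executors.py by default
import nemo_run as run

# An executor instance: import_executor("local") returns it directly.
local = run.LocalExecutor()

# A factory function: with call=True (the default),
# import_executor("gpu", nodes=2) calls it with the remaining kwargs.
def gpu(nodes: int = 1) -> run.SlurmExecutor:
    return run.SlurmExecutor(account="my_account", nodes=nodes)
```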
- class nemo_run.DGXCloudExecutor[source]#
Bases:
nemo_run.core.execution.base.Executor
Dataclass to configure a DGX Executor.
This executor integrates with a DGX cloud endpoint for launching jobs via a REST API. It acquires an auth token, identifies the project/cluster, and launches jobs with a specified command. It can be adapted to meet user authentication and job-submission requirements on DGX.
- create_data_mover_workload(token: str, project_id: str, cluster_id: str)[source]#
Creates a CPU only workload to move job directory into PVC using the provided project/cluster IDs.
- move_data(token: str, project_id: str, cluster_id: str, sleep: float = 10) -> None [source]#
Moves the job directory into the PVC and deletes the workload after completion.
- create_training_job(token: str, project_id: str, cluster_id: str, name: str)
Creates a training job on DGX Cloud using the provided project/cluster IDs. For multi-node jobs, creates a distributed workload. Otherwise creates a single-node training.
- Parameters:
token – Authentication token for DGX Cloud API
project_id – ID of the project to create the job in
cluster_id – ID of the cluster to create the job on
name – Name for the job
- Returns:
Response object from the API request
- nproc_per_node() -> int [source]#
Helper function called by torchrun component to determine --nproc-per-node.
- class nemo_run.DockerExecutor[source]#
Bases:
nemo_run.core.execution.base.Executor
Dataclass to configure a docker based executor.
All configuration is passed to the Docker Python SDK (https://docker-py.readthedocs.io).
Example:
```python
DockerExecutor(
    container_image="python:3.12",
    num_gpus=-1,
    runtime="nvidia",
    ipc_mode="host",
    shm_size="30g",
    volumes=["/src/path:/dst/path"],
    env_vars={"PYTHONUNBUFFERED": "1"},
    packager=run.Packager(),
)
```
- class nemo_run.FaultTolerance[source]#
Bases:
Launcher
A launcher that adds fault tolerance to the launched job so that it can be automatically restarted on failure.
- class nemo_run.SlurmRay[source]#
Bases:
SlurmTemplate
Transforms a provided cmd into a Ray launcher bash script for SlurmExecutor. The Ray launcher script sets up a Ray cluster on Slurm nodes, with the head node starting Ray head and executing the provided command. Worker nodes start Ray and wait.
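A sketch of attaching this launcher to a Slurm executor; assigning a launcher instance this way mirrors the launcher field shown in the SlurmExecutor example below, though the exact fields SlurmRay accepts are not listed here:

```python
import nemo_run as run

executor = run.SlurmExecutor(account="my_account", nodes=2)
# Head node starts Ray and runs the command; workers start Ray and wait.
executor.launcher = run.SlurmRay()
```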
- class nemo_run.SlurmTemplate[source]#
Bases:
Launcher
A generic launcher that uses Jinja2 templates to wrap commands. The template can be provided either as inline content or as a path to a template file.
- get_template_content() -> str [source]#
Get the template content either from the file or inline content.
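A minimal sketch, assuming the inline template is passed through a field named template_inline and that the wrapped command is exposed to the template as command; both names are assumptions:

```python
import nemo_run as run

# Hypothetical field name; see the class definition for the real fields.
launcher = run.SlurmTemplate(
    template_inline="""#!/bin/bash
echo "Launching on $(hostname)"
{{ command }}
""",
)
print(launcher.get_template_content())
```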
- class nemo_run.Torchrun[source]#
Bases:
Launcher
A launcher that launches the task using torchrun. The executor's nnodes() and nproc_per_node() helpers supply torchrun's --nnodes and --nproc-per-node values.
- class nemo_run.LeptonExecutor[source]#
Bases:
nemo_run.core.execution.base.Executor
Dataclass to configure a Lepton Executor.
This executor integrates with a Lepton endpoint for launching jobs via a REST API. It acquires an auth token, identifies the project/cluster, and launches jobs with a specified command. It can be adapted to meet user authentication and job-submission requirements on Lepton.
- move_data(sleep: float = 10, timeout: int = 600, poll_interval: int = 5, unknowns_grace_period: int = 60)
Moves job directory into remote storage and deletes the workload after completion.
- create_lepton_job(name: str)[source]#
Creates a distributed PyTorch job using the provided project/cluster IDs.
- nproc_per_node() -> int [source]#
Helper function called by torchrun component to determine --nproc-per-node.
- class nemo_run.LocalExecutor[source]#
Bases:
nemo_run.core.execution.base.Executor
Dataclass to configure local executor.
Example:
run.LocalExecutor()
- class nemo_run.SkypilotExecutor[source]#
Bases:
nemo_run.core.execution.base.Executor
Dataclass to configure a Skypilot Executor.
Some familiarity with Skypilot is necessary. To use this executor, install NeMo Run with either the skypilot extra (Kubernetes only) or skypilot-all (all clouds).
Example:
```python
skypilot = SkypilotExecutor(
    gpus="A10G",
    gpus_per_node=devices,
    container_image="nvcr.io/nvidia/nemo:dev",
    infra="k8s/my-context",
    network_tier="best",
    cluster_name="nemo_tester",
    file_mounts={
        "nemo_run.whl": "nemo_run.whl",
        "/workspace/code": "/local/path/to/code",
    },
    storage_mounts={
        "/workspace/outputs": {
            "name": "my-training-outputs",
            "store": "gcs",  # or "s3", "azure", etc.
            "mode": "MOUNT",
            "persistent": True,
        },
        "/workspace/checkpoints": {
            "name": "model-checkpoints",
            "store": "s3",
            "mode": "MOUNT",
            "persistent": True,
        },
    },
    setup="""
conda deactivate
nvidia-smi
ls -al ./
pip install nemo_run.whl
cd /opt/NeMo && git pull origin main && pip install .
""",
)
```
- assign(exp_id: str, exp_dir: str, task_id: str, task_dir: str)[source]#
This function will be called by run.Experiment to assign the executor for the specific experiment.
- class nemo_run.SlurmExecutor[source]#
Bases:
nemo_run.core.execution.base.Executor
Dataclass to configure a Slurm Cluster. During execution, sbatch related parameters will automatically get parsed to their corresponding sbatch flags.
Note
We assume that the underlying Slurm cluster has Pyxis enabled. The Slurm executor will fail if the cluster does not support Pyxis.
Example:
```python
def your_slurm_executor() -> run.SlurmExecutor:
    ssh_tunnel = SSHTunnel(
        host=os.environ["SLURM_HOST"],
        user=os.environ["SLURM_USER"],
        job_dir=os.environ["SLURM_JOBDIR"],
    )
    packager = GitArchivePackager()
    launcher = "torchrun"
    executor = SlurmExecutor(
        account=os.environ["SLURM_ACCT"],
        partition=os.environ["SLURM_PARTITION"],
        nodes=1,
        ntasks_per_node=1,
        tunnel=ssh_tunnel,
        container_image=os.environ["BASE_IMAGE"],
        time="00:30:00",
        packager=packager,
        launcher=launcher,
    )
    return executor

...

your_executor = your_slurm_executor()
```
- assign(exp_id: str, exp_dir: str, task_id: str, task_dir: str)[source]#
This function will be called by run.Experiment to assign the executor for the specific experiment.
- parse_deps() -> list[str] [source]#
Helper function to parse a list of TorchX app handles and return a list of Slurm Job IDs to use as dependencies.
- class nemo_run.SkypilotJobsExecutor[source]#
Bases:
nemo_run.core.execution.base.Executor
Dataclass to configure a Skypilot Jobs Executor.
This executor launches managed jobs and requires the Skypilot API Server (https://docs.skypilot.co/en/latest/reference/api-server/api-server.html).
Some familiarity with Skypilot is necessary. To use this executor, install NeMo Run with either the skypilot extra (Kubernetes only) or skypilot-all (all clouds).
Example:
```python
skypilot = SkypilotJobsExecutor(
    gpus="A10G",
    gpus_per_node=devices,
    container_image="nvcr.io/nvidia/nemo:dev",
    infra="k8s/my-context",
    network_tier="best",
    cluster_name="nemo_tester",
    file_mounts={
        "nemo_run.whl": "nemo_run.whl",
        "/workspace/code": "/local/path/to/code",
    },
    storage_mounts={
        "/workspace/outputs": {
            "name": "my-training-outputs",
            "store": "gcs",  # or "s3", "azure", etc.
            "mode": "MOUNT",
            "persistent": True,
        },
        "/workspace/checkpoints": {
            "name": "model-checkpoints",
            "store": "s3",
            "mode": "MOUNT",
            "persistent": True,
        },
    },
    setup="""
conda deactivate
nvidia-smi
ls -al ./
pip install nemo_run.whl
cd /opt/NeMo && git pull origin main && pip install .
""",
)
```
- assign(exp_id: str, exp_dir: str, task_id: str, task_dir: str)[source]#
This function will be called by run.Experiment to assign the executor for the specific experiment.
- class nemo_run.GitArchivePackager[source]#
Bases:
nemo_run.core.packaging.base.Packager
Uses git archive for packaging your code.
At a high level, it works in the following way:

1. base_path = `git rev-parse --show-toplevel`.
2. Optionally define a subpath as `base_path/self.subpath` by setting the `subpath` attribute.
3. `cd base_path && git archive --format=tar.gz --output={output_file} {self.ref}:{subpath}`
This extracted tar file becomes the working directory for your job.
Note
git archive will only package code committed in the specified ref. Any uncommitted code will not be packaged. We are working on adding an option to package uncommitted code but it is not ready yet.
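For example, to package only a subdirectory of the repository at a given ref:

```python
import nemo_run as run

# Packages base_path/examples at HEAD; uncommitted changes are excluded,
# per the note above.
packager = run.GitArchivePackager(subpath="examples", ref="HEAD")
```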
- class nemo_run.HybridPackager[source]#
Bases:
nemo_run.core.packaging.base.Packager
A packager that combines multiple other packagers into one final archive. Each subpackager is mapped to a target directory name, which will become the top-level folder under which that packager’s content is placed.
If extract_at_root is True, the contents of each sub-packager are extracted directly at the root of the final archive (i.e. without being nested in a subfolder).
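A sketch combining two packagers under named top-level folders; the constructor field name sub_packagers is an assumption:

```python
import nemo_run as run

packager = run.HybridPackager(
    sub_packagers={
        # Each key becomes a top-level folder in the final archive.
        "repo": run.GitArchivePackager(ref="HEAD"),
        "scripts": run.GitArchivePackager(ref="HEAD", subpath="scripts"),
    },
)
```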
- class nemo_run.Packager[source]#
Bases:
nemo_run.config.ConfigurableMixin
Base class for packaging your code.
The packager is generally used as part of an Executor and provides the executor with information on how to package your code.
It can also include information on how to run your code. For example, a packager can determine whether to use torchrun or whether to use debug flags.
Note
This class can also be used independently as a passthrough packager. This is useful in cases where you do not need to package code. For example, a local executor which uses your current working directory or an executor that uses a docker image that has all the code included.
- class nemo_run.PatternPackager[source]#
Bases:
nemo_run.core.packaging.base.Packager
Packages all files matching the specified pattern.
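A minimal sketch; the field names include_pattern and relative_path are assumptions about this dataclass:

```python
import os
import nemo_run as run

# Hypothetical field names: package all YAML configs relative to the
# current working directory.
packager = run.PatternPackager(
    include_pattern="configs/**/*.yaml",
    relative_path=os.getcwd(),
)
```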
- class nemo_run.LocalTunnel[source]#
Bases:
Tunnel
Local Tunnel for supported executors. Executes all commands locally. Currently only supports SlurmExecutor. Use this tunnel if you are launching from a login node (or another node) inside the cluster.
- class nemo_run.SSHTunnel[source]#
Bases:
Tunnel
SSH Tunnel for supported executors. Currently only supports SlurmExecutor.
Uses key-based authentication if identity is provided, otherwise password-based authentication.
Examples
```python
ssh_tunnel = SSHTunnel(
    host=os.environ["SSH_HOST"],
    user=os.environ["SSH_USER"],
    job_dir=os.environ["REMOTE_JOBDIR"],
)

another_ssh_tunnel = SSHTunnel(
    host=os.environ["ANOTHER_SSH_HOST"],
    user=os.environ["ANOTHER_SSH_USER"],
    job_dir=os.environ["ANOTHER_REMOTE_JOBDIR"],
    identity="path_to_private_key",
)
```
nemo_run.help(
    entity: Callable,
    with_docs: bool = True,
    console=None,
    namespace: str | None = None,
)
Outputs help for the passed Callable along with all factories registered for the Callable’s args. Optionally outputs docstrings as well.
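For example:

```python
import nemo_run as run

def train(lr: float = 1e-3, steps: int = 100) -> None:
    """Train a model with the given learning rate and step count."""

# Prints the callable's signature and docstring, plus any registered factories.
run.help(train)
```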
nemo_run.run(
    fn_or_script: fiddle.Buildable | nemo_run.config.Script,
    executor: nemo_run.core.execution.base.Executor | None = None,
    plugins: nemo_run.run.plugin.ExperimentPlugin | List[nemo_run.run.plugin.ExperimentPlugin] | None = None,
    name: str = '',
    dryrun: bool = False,
    direct: bool = False,
    detach: bool = False,
    tail_logs: bool = True,
    log_level: str = 'INFO',
)
Runs a single configured function on the specified executor. If no executor is specified, the run.Partial function is run directly, equivalent to calling the Python function itself.
Examples
```python
import nemo_run as run

# Run it directly in the same process
run.run(configured_fn)

# Do a dryrun
run.run(configured_fn, dryrun=True)

# Specify a custom executor
local_executor = LocalExecutor()
run.run(configured_fn, executor=local_executor)

slurm_executor = run.SlurmExecutor(...)
run.run(configured_fn, executor=slurm_executor)
```
class nemo_run.Experiment(
    title: str,
    executor: nemo_run.core.execution.base.Executor | None = None,
    id: str | None = None,
    log_level: str = 'INFO',
    _reconstruct: bool = False,
    jobs: list[nemo_run.run.job.Job | nemo_run.run.job.JobGroup] | None = None,
    base_dir: str | None = None,
    clean_mode: bool = False,
    enable_goodbye_message: bool = True,
    threadpool_workers: int = 16,
    skip_status_at_exit: bool = False,
    serialize_metadata_for_scripts: bool = True,
)
Bases: nemo_run.config.ConfigurableMixin
A context manager to launch and manage multiple runs, all using pure Python.
run.Experiment provides researchers with a simple and flexible way to create and manage their ML experiments.
Building on the core blocks of nemo_run, the Experiment can be used as an umbrella under which a user can launch different configured functions on multiple remote clusters.
The Experiment takes care of storing the run metadata, launching it on the specified cluster, and syncing the logs and artifacts.
Additionally, the Experiment also provides management tools to easily inspect and reproduce past experiments. Some of the use cases that it enables are listed below:

- Check the status and logs of a past experiment
- Reconstruct a past experiment and relaunch it after some changes
- Compare different runs of the same experiment
This API allows users to programmatically define their experiments. To get a glance of the flexibility provided, here are some use cases which can be supported by the Experiment in just a few lines of code:

- Launch a benchmarking run on different GPUs at the same time in parallel
- Launch a sequential data processing pipeline on a CPU-heavy cluster
- Launch hyperparameter grid search runs on a single cluster in parallel
- Launch hyperparameter search runs distributed across all available clusters
The design is heavily inspired by XManager.
Under the hood, the Experiment metadata is stored in the local filesystem inside a user-specified directory given by get_nemorun_home(). We will explore making the metadata more persistent in the future.
Note
Experiment.add and Experiment.run methods inside Experiment can currently only be used within its context manager.
Examples
```python
# An experiment that runs a pre-configured training example
# on multiple GPU specific clusters (A100 and H100 shown here) in parallel using torchrun
# Assumes that example_to_run is pre-configured using run.Partial
with run.Experiment("example-multiple-gpus", executor="h100_cluster") as exp:
    # Set up the run on H100
    # Setting up a single task is identical to setting up a single run outside the experiment
    h100_cluster: run.SlurmExecutor = exp.executor.clone()
    h100_cluster.nodes = 2
    # torchrun manages the processes on a single node
    h100_cluster.ntasks_per_node = 1
    h100_cluster.gpus_per_task = 8
    h100_cluster.packager.subpath = "subpath/to/your/code/repo"
    h100_cluster.launcher = "torchrun"

    exp.add(
        "example_h100",
        fn=example_to_run,
        tail_logs=True,
        executor=h100_cluster,
    )

    # Set up the run on A100
    a100_cluster: run.Config[SlurmExecutor] = h100_cluster.clone()
    a100_cluster.tunnel = run.Config(
        SSHTunnel,
        host=os.environ["A100_HOST"],
        user="your_user_in_cluster",
        identity="path_to_your_ssh_key",
    )

    exp.add(
        "example_a100",
        fn=example_to_run,
        tail_logs=True,
        executor=a100_cluster,
    )

    # Runs all the tasks in the experiment.
    # By default, all tasks will be run in parallel if all the different executors support parallel execution.
    # You can set sequential=True to run the tasks sequentially.
    exp.run()

# Upon exiting the context manager, the Experiment will automatically wait for all tasks to complete,
# and optionally tail logs for tasks that have tail_logs=True.
# A detach mode (if the executors support it) will be available soon.
# Once all tasks have completed, the Experiment will display a status table
# and clean up resources like ssh tunnels.

# You can also manage the experiment at a later point in time
exp = run.Experiment.from_title("example-multiple-gpus")
exp.status()
exp.logs(task_id="example_a100")
```
- classmethod catalog(title: str = '') -> list[str] [source]#
List all experiments inside get_nemorun_home(), optionally with the provided title.
- classmethod from_id(id: str) -> Experiment [source]#
Reconstruct an experiment with the specified id.
- classmethod from_title(title: str) -> Experiment [source]#
Reconstruct an experiment with the specified title.
- to_config() -> nemo_run.config.Config [source]#
Convert the current object to a Config instance.
This method automatically converts dataclasses to Config instances. For classes that are not dataclasses, this method needs to be overridden to provide custom conversion logic.
- Returns:
A Config representation of the current object.
- Return type:
nemo_run.config.Config
- Raises:
NotImplementedError – If the object type cannot be converted to Config or if the method is not overridden for non-dataclass types.
Note
For classes that are not dataclasses, you must override this method to define how the object should be converted to a Config instance.
add(
    task: nemo_run.config.Partial | nemo_run.config.Script | list[nemo_run.config.Partial | nemo_run.config.Script],
    executor: nemo_run.core.execution.base.Executor | list[nemo_run.core.execution.base.Executor] | None = None,
    name: str = '',
    plugins: list[nemo_run.run.plugin.ExperimentPlugin] | None = None,
    tail_logs: bool = False,
    dependencies: list[str] | None = None,
)
Add a configured function along with its executor config to the experiment.
- dryrun(log: bool = True, exist_ok: bool = False, delete_exp_dir: bool = True)[source]#
Logs the raw scripts that will be executed for each task.
run(
    sequential: bool = False,
    detach: bool = False,
    tail_logs: bool = False,
    direct: bool = False,
)
Runs all the tasks in the experiment.
By default, all tasks are run in parallel.
If sequential=True, all tasks will be run one after the other. The order is based on the order in which they were added.
Parallel mode only works if all executors in the experiment support it. Currently, all executors support parallel mode.
In sequential mode, if all executors support dependencies, then all tasks will be scheduled at once by specifying the correct dependencies for each task. Otherwise, the experiment.run call will block and each scheduled task will be executed sequentially. In this particular case, we cannot guarantee the state of the experiment if the process exits in the middle.
Currently, only the Slurm executor supports dependencies.
- Parameters:
sequential – If True, runs all tasks sequentially in the order they were added. Defaults to False.
detach – If True, detaches from the process after launching the tasks. Only supported for Slurm and Skypilot. Defaults to False.
tail_logs – If True, tails logs from all tasks in the experiment. If False, relies on task specific setting. Defaults to False.
direct – If True, runs all tasks in the experiment sequentially in the same process. Note that if direct=True, then sequential also will be True. Defaults to False.
- status(return_dict: bool = False) -> dict[str, dict[str, str]] | None [source]#
Prints a table specifying the status of all tasks.
Note
status is not supported for the local executor, and the status of a task using the local executor will be listed as UNKNOWN in most cases.
- logs(job_id: str, regex: str | None = None)[source]#
Prints the logs of the specified job_id, optionally filtered by regex.
- reset() -> Experiment [source]#
Resets an experiment to make it ready for a relaunch. Only works if the current experiment run has already been launched.
- class nemo_run.Plugin#
Bases:
nemo_run.config.ConfigurableMixin
A base class for plugins that can be used to modify experiments, tasks, and executors.
setup(
    task: nemo_run.config.Partial | nemo_run.config.Script,
    executor: nemo_run.core.execution.base.Executor,
)
A hook method for setting up tasks and executors together.
This method is intended to be overridden by subclasses to perform custom setup.
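A hypothetical plugin sketch that injects an environment variable into every executor before launch; env_vars appears in the DockerExecutor example above, and treating it as a dict here is an assumption:

```python
from dataclasses import dataclass
import nemo_run as run

@dataclass
class ProjectEnvPlugin(run.Plugin):
    project: str = "my-project"

    def setup(self, task, executor):
        # Mutate the executor before the task is launched.
        executor.env_vars["MY_PROJECT"] = self.project
```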