Management#
Experiment#
- class nemo_run.run.experiment.Experiment(
- title: str,
- executor: Executor | None = None,
- id: str | None = None,
- log_level: str = 'INFO',
- _reconstruct: bool = False,
- )#
Bases:
object
A context manager to launch and manage multiple runs, all using pure Python.
run.Experiment provides researchers with a simple and flexible way to create and manage their ML experiments.
Building on the core blocks of nemo_run, the Experiment can be used as an umbrella under which a user can launch different configured functions on multiple remote clusters.
The Experiment takes care of storing the run metadata, launching it on the specified cluster, and syncing the logs and artifacts.
Additionally, the Experiment provides management tools to easily inspect and reproduce past experiments. Some of the use cases it enables are listed below:
Check the status and logs of a past experiment
Reconstruct a past experiment and relaunch it after some changes
Compare different runs of the same experiment.
This API allows users to programmatically define their experiments. To give a sense of the flexibility it provides, here are some use cases that the Experiment can support in just a few lines of code (see the sketch after this list for the grid-search case):
Launch a benchmarking run on different GPUs at the same time in parallel
Launch a sequential data processing pipeline on a CPU heavy cluster
Launch hyperparameter grid search runs on a single cluster in parallel
Launch hyperparameter search runs distributed across all available clusters
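As a rough illustration of the grid-search case, the sketch below configures several variants of a training function with run.Partial and adds them all to a single experiment. The train function, its lr parameter, and the choice of run.LocalExecutor are assumptions made for illustration, not part of the API documented on this page.

import nemo_run as run

def train(lr: float, epochs: int = 10):
    # Hypothetical training entrypoint; replace with your own function.
    ...

with run.Experiment("lr-grid-search") as exp:
    for lr in (1e-4, 3e-4, 1e-3):
        # Each grid point becomes one task; run.Partial captures its configuration.
        exp.add(
            run.Partial(train, lr=lr),
            executor=run.LocalExecutor(),  # assumed executor; swap in your cluster's executor
            name=f"train_lr_{lr}",
            tail_logs=True,
        )
    # Tasks launch in parallel by default; see run() below for sequential execution.
    exp.run()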
The design is heavily inspired by XManager.
Under the hood, the Experiment metadata is stored on the local filesystem inside a user-specified directory controlled by the NEMORUN_HOME environment variable. We will explore making the metadata more persistent in the future.
Note
The Experiment.add and Experiment.run methods can currently only be used within the Experiment's context manager.
Examples
# An experiment that runs a pre-configured training example
# on multiple GPU-specific clusters (A100 and H100 shown here) in parallel using torchrun
# Assumes that example_to_run is pre-configured using run.Partial

with run.Experiment("example-multiple-gpus", executor="h100_cluster") as exp:

    # Set up the run on H100
    # Setting up a single task is identical to setting up a single run outside the experiment
    h100_cluster: run.SlurmExecutor = exp.executor.clone()
    h100_cluster.nodes = 2

    # torchrun manages the processes on a single node
    h100_cluster.ntasks_per_node = 1
    h100_cluster.gpus_per_task = 8

    h100_cluster.packager.subpath = "subpath/to/your/code/repo"
    h100_cluster.launcher = "torchrun"

    exp.add(
        example_to_run,
        executor=h100_cluster,
        name="example_h100",
        tail_logs=True,
    )

    # Set up the run on A100
    a100_cluster: run.Config[SlurmExecutor] = h100_cluster.clone()
    a100_cluster.tunnel = run.Config(
        SSHTunnel,
        host=os.environ["A100_HOST"],
        user="your_user_in_cluster",
        identity="path_to_your_ssh_key",
    )

    exp.add(
        example_to_run,
        executor=a100_cluster,
        name="example_a100",
        tail_logs=True,
    )

    # Runs all the tasks in the experiment.
    # By default, all tasks will be run in parallel if all the different executors support parallel execution.
    # You can set sequential=True to run the tasks sequentially.
    exp.run()

# Upon exiting the context manager, the Experiment will automatically wait for all tasks to complete,
# and optionally tail logs for tasks that have tail_logs=True.
# A detach mode (if the executors support it) will be available soon.
# Once all tasks have completed,
# the Experiment will display a status table and clean up resources like ssh tunnels.

# You can also manage the experiment at a later point in time
exp = run.Experiment.from_title("example-multiple-gpus")
exp.status()
exp.logs(task_id="example_a100")
- GOODBYE_MESSAGE_BASH = '\n# You can inspect this experiment at a later point in time using the CLI as well:\nnemorun experiment status {exp_id}\nnemorun experiment logs {exp_id} 0\nnemorun experiment cancel {exp_id} 0\n'#
- GOODBYE_MESSAGE_PYTHON = '\n# The experiment was run with the following tasks: {tasks}\n# You can inspect and reconstruct this experiment at a later point in time using:\nexperiment = run.Experiment.from_id("{exp_id}")\nexperiment.status() # Gets the overall status\nexperiment.logs("{tasks[0]}") # Gets the log for the provided task\nexperiment.cancel("{tasks[0]}") # Cancels the provided task if still running\n'#
- add(
- fn_or_script: Partial | Script | list[Partial | Script],
- executor: Executor | list[Executor] | None = None,
- name: str = '',
- plugins: list[ExperimentPlugin] | None = None,
- tail_logs: bool = False,
- )#
Add a configured function along with its executor config to the experiment.
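A minimal sketch of add() inside the experiment's context manager, assuming example_to_run is a function configured with run.Partial (as in the Examples above) and that preprocess.sh is a hypothetical shell script wrapped in run.Script:

# Inside `with run.Experiment(...) as exp:` — add() only works within the context manager.

# Add a configured function under an explicit name; without an executor it falls back
# to the experiment's default executor.
exp.add(example_to_run, name="baseline", tail_logs=True)

# Add a shell script as a separate task; the path is a hypothetical example.
exp.add(run.Script("./preprocess.sh"), name="preprocess")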
- cancel(task_id: str)#
Cancels an existing task if still running.
- classmethod catalog(title: str = '') → list[str]#
List all experiments inside NEMORUN_HOME, optionally with the provided title.
- dryrun()#
Logs the raw scripts that will be executed for each task.
- classmethod from_id(id: str) → Experiment#
Reconstruct an experiment with the specified id.
- classmethod from_title(
- title: str,
- )#
Reconstruct an experiment with the specified title.
- logs(task_id: str, regex: str | None = None)#
Prints the logs of the specified task_id, optionally filtered by regex.
- reset()#
Resets an experiment to make it ready for a relaunch. Only works if the current experiment run has already been launched.
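A hedged sketch of reconstructing a past experiment and relaunching it with reset(); the title matches the Examples above, and the exact relaunch flow may differ depending on your executors:

# Reconstruct the latest run of a past experiment by title,
# reset it so it can be launched again, and relaunch it.
with run.Experiment.from_title("example-multiple-gpus") as exp:
    exp.reset()
    exp.run(sequential=True)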
- run(
- sequential: bool = False,
- detach: bool = False,
- tail_logs: bool = False,
- direct: bool = False,
- )#
Runs all the tasks in the experiment.
By default, all tasks are run in parallel.
If sequential=True, all tasks will be run one after the other. The order is based on the order in which they were added.
Parallel mode only works if all executors in the experiment support it. Currently, all executors support parallel mode.
In sequential mode, if all executors support dependencies, all tasks will be scheduled at once by specifying the correct dependencies for each task. Otherwise, the experiment.run call will block, and each scheduled task will be executed sequentially. In this particular case, we cannot guarantee the state of the experiment if the process exits in the middle.
Currently, only the Slurm executor supports dependencies.
- Parameters:
sequential – If True, runs all tasks sequentially in the order they were added. Defaults to False.
detach – If True, detaches from the process after launching the tasks. Only supported for Slurm and Skypilot. Defaults to False.
tail_logs – If True, tails logs from all tasks in the experiment. If False, relies on task specific setting. Defaults to False.
direct – If True, runs all tasks in the experiment sequentially in the same process. Note that if direct=True, then sequential will also be True. Defaults to False.
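A brief illustration of these options, assuming it is called inside the experiment's context manager:

# Run tasks one after another in the order they were added, tailing logs from every task.
exp.run(sequential=True, tail_logs=True)

# Alternatively, run every task in the current process; direct=True implies sequential execution.
# exp.run(direct=True)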
- status()#
Prints a table specifying the status of all tasks.
Note
status is not supported for the local executor; the status of a task using the local executor will be listed as UNKNOWN in most cases.