Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Launch Experiment with NeMo Launcher
Configure and Run a Launcher Pipeline
main.py is the primary file to execute for running the various stages of your pipeline, including data preparation, training, conversion, fine-tuning, and evaluation.
To run the specified pipelines with the provided configurations, simply execute:
python3 main.py
Modifying Configurations: NeMo launcher uses Hydra and OmegaConf to manage job configurations with YAML files. There are two ways to modify the configurations:
Edit the configuration file directly: You can directly edit the corresponding configuration file to make the necessary changes.
Override configurations through the command line: You can also override some configurations directly through the command line when calling main.py. For example, to override the stages list inside conf/config.yaml, use:
python3 main.py stages=[training]
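To make the override syntax more concrete, the following is a minimal Python sketch of how dotted key=value overrides could be folded into a nested configuration. It is a simplified stand-in written with the standard library only, not the actual Hydra or OmegaConf implementation. The parse_value and apply_overrides helpers and the stage names in base_config are hypothetical.

```python
import copy


def parse_value(raw):
    """Parse a tiny subset of override values: lists, ints, and plain strings."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        inner = raw[1:-1].strip()
        return [parse_value(item) for item in inner.split(",")] if inner else []
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_overrides(config, overrides):
    """Return a copy of config with each 'a.b.c=value' override applied."""
    result = copy.deepcopy(config)
    for override in overrides:
        dotted_key, raw_value = override.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            # Walk down the nested dictionaries, creating levels as needed.
            node = node.setdefault(key, {})
        node[leaf] = parse_value(raw_value)
    return result


# Hypothetical stand-in for the stages list in conf/config.yaml.
base_config = {"stages": ["data_preparation", "training", "evaluation"]}
updated = apply_overrides(base_config, ["stages=[training]"])
print(updated["stages"])
```

The original dictionary is left untouched, which mirrors the point of command-line overrides: the configuration files on disk stay as they are, and only the configuration used for the current run changes.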
Pipeline Execution Details: After calling main.py, the NeMo launcher scripts perform several tasks to set up and run your customized pipeline. Here’s an overview of these tasks:
Updating Hydra Configurations: The launcher first updates the default Hydra configurations. These preloaded configuration files are tailored based on the user’s input arguments.
Configuration File Storage: After updating the configurations, the launcher saves these modified settings into a YAML file. This file is then used for subsequent calls to the NeMo Framework.
Launch Script Creation: The launcher then generates submission scripts or launch scripts that incorporate the necessary calls to NeMo and other components required for the pipeline. If multiple scripts are involved, they are linked through dependencies so that the workflow stays coherent for the selected model type, learning stage, saved configurations, and target platform. The launcher relies on these scripts to execute the workflow correctly.
Script Execution on Target Platform: Finally, the launcher executes the script on the chosen platform. The execution method varies depending on the environment: it might use a bash run for interactive (local) setups, an sbatch submission for Slurm setups, etc.
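The script creation and execution tasks can be pictured with a short Python sketch that turns an ordered list of stage scripts into launch commands. This is an illustration under stated assumptions, not the launcher's actual code: the build_launch_commands function, the platform labels, and the script file names are hypothetical, and the <job_id_of_...> placeholder stands for the job ID that Slurm would report for the previous submission. The --dependency=afterok option is a standard sbatch option.

```python
def build_launch_commands(scripts, platform):
    """Build one launch command per script, in pipeline order.

    platform is a hypothetical label: "interactive" runs scripts with bash,
    "slurm" submits them with sbatch and chains them with dependencies.
    """
    commands = []
    previous = None
    for script in scripts:
        if platform == "interactive":
            command = ["bash", script]
        elif platform == "slurm":
            command = ["sbatch"]
            if previous is not None:
                # Start this job only after the previous one succeeds.
                command.append(f"--dependency=afterok:<job_id_of_{previous}>")
            command.append(script)
        else:
            raise ValueError(f"unsupported platform: {platform}")
        commands.append(command)
        previous = script
    return commands


# Hypothetical script names for a two-stage pipeline.
for command in build_launch_commands(["training.sh", "evaluation.sh"], "slurm"):
    print(" ".join(command))
```

The sketch only builds the command lines and does not run them, so it can be tried without access to a cluster.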
Dry Run: To perform a dry run of your pipeline without actually executing it, you can set the environment variable NEMO_LAUNCHER_DEBUG=1. With this environment variable set, running main.py will generate the interpolated configuration files, save the YAML file, and create the launch script as usual, but it will not submit or launch the scripts. This allows you to inspect and verify the generated files and scripts before proceeding with the actual execution of the pipeline.
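As a sketch of how a dry run might be triggered from a Python wrapper, the example below sets NEMO_LAUNCHER_DEBUG=1 only for the child process that runs main.py. The dry_run_invocation and run_dry_run helpers and the launcher_dir parameter are hypothetical, and the sketch uses the current Python interpreter (sys.executable) in place of the python3 command, which is an assumption about your environment.

```python
import os
import subprocess
import sys


def dry_run_invocation(overrides):
    """Return the command and environment for a dry run of main.py."""
    command = [sys.executable, "main.py", *overrides]
    # Copy the current environment and add the dry-run switch.
    env = dict(os.environ, NEMO_LAUNCHER_DEBUG="1")
    return command, env


def run_dry_run(overrides, launcher_dir="."):
    """Run main.py in dry-run mode from launcher_dir (hypothetical helper)."""
    command, env = dry_run_invocation(overrides)
    return subprocess.run(command, env=env, cwd=launcher_dir, check=True)


command, env = dry_run_invocation(["stages=[training]"])
print(" ".join(command[1:]), "with NEMO_LAUNCHER_DEBUG =", env["NEMO_LAUNCHER_DEBUG"])
```

Calling run_dry_run(["stages=[training]"]) from the directory that contains main.py would then generate the configuration files and launch scripts without submitting them. Because the variable is set only in the child environment, later runs from the same shell are not affected.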
By automating these tasks, the NeMo launcher scripts streamline the process of setting up and running the pipeline, making it easier for you to focus on your experiments and analyses.
Note for Base Command Platform Users: When using the Base Command Platform, directly modifying the configuration file is not recommended. This is because different nodes on the Base Command Platform do not share the same file system, so changes made on one node will not be reflected on other nodes. To ensure consistency across nodes, always use command line overrides for configuration changes on Base Command Platform.
Example: Pre-train Stable Diffusion 860M Model for 10 Epochs with Resolution 256
In this example, we will demonstrate how to customize the pipeline according to the following instructions:
Run only the training stage.
Train the stable_diffusion model with the 860m_res_256_pretrain configuration.
Change the number of training epochs from 1 to 10.
Step-by-Step Guide
Include only the training stage: To run only the training stage, update the stages list in conf/config.yaml:
stages:
  - training
Select the stable_diffusion model with the 860m_res_256_pretrain configuration: Update the training field in the defaults section of conf/config.yaml:
training: stable_diffusion/860m_res_256_pretrain
Change the training epochs: Navigate to the conf/training/stable_diffusion/860m_res_256_pretrain.yaml file and update the max_epochs field under the trainer section:
trainer:
  max_epochs: 10
Pipeline Execution: With these customizations in place, the pipeline will now execute only the training stage, using the stable_diffusion model with the 860m_res_256_pretrain configuration, and train for a total of 10 epochs. To run the customized pipeline, simply execute:
python3 main.py
Instead of manually editing the configuration files, you can also use Hydra’s override feature to achieve the same customizations in a single command. This allows you to quickly test different configurations without modifying the original files. To run the customized pipeline according to the instructions provided earlier, use the following command:
python3 main.py stages=[training] training=stable_diffusion/860m_res_256_pretrain training.trainer.max_epochs=10
Note: When using Hydra’s override feature, make sure to include the stage name (training in this example) for overriding a stage configuration found in conf/(stage_name)/(model_type)/(model_name).yaml. This ensures that the correct stage and configuration file are targeted for the override.
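The naming rule in this note can be illustrated with a small Python sketch that derives both the configuration file path and the matching override string from the same stage name. The stage_config_path and stage_override helpers are hypothetical and exist only to show the pattern; they are not part of the launcher.

```python
def stage_config_path(stage_name, model_type, model_name):
    """Path of a stage configuration: conf/(stage_name)/(model_type)/(model_name).yaml."""
    return f"conf/{stage_name}/{model_type}/{model_name}.yaml"


def stage_override(stage_name, dotted_key, value):
    """Prefix a key from a stage configuration file with the stage name."""
    return f"{stage_name}.{dotted_key}={value}"


# Values taken from the Stable Diffusion example in this section.
print(stage_config_path("training", "stable_diffusion", "860m_res_256_pretrain"))
print(stage_override("training", "trainer.max_epochs", 10))
```

The second line printed is the training.trainer.max_epochs=10 override used in the command above: the key trainer.max_epochs lives in the training stage file, so the override is prefixed with training.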