Launch Experiment with NeMo Launcher

main.py is the primary file to execute for running various stages in your pipeline, including data preparation, training, conversion, fine-tuning, and evaluation.

To run the specified pipelines with the provided configurations, simply execute:


python3 main.py

Modifying Configurations: NeMo launcher uses Hydra and OmegaConf to manage job configurations with YAML files. There are two ways to modify the configurations:

  1. Edit the configuration file directly: You can directly edit the corresponding configuration file to make the necessary changes.

  2. Override configurations through the command line: You can also override some configurations directly through the command line when calling main.py. For example, to override the stages list inside conf/config.yaml, use:


    python3 main.py stages=[training]

Pipeline Execution Details: After calling main.py, the NeMo launcher scripts perform several tasks to set up and run your customized pipeline. Here’s an overview of these tasks:

  1. Updating Hydra Configurations: The launcher first updates the default Hydra configurations. These preloaded configuration files are tailored based on the user’s input arguments.

  2. Configuration File Storage: After updating the configurations, the launcher saves these modified settings into a YAML file. This file is then used for subsequent calls to the NeMo Framework.

  3. Launch Script Creation: The launcher then generates the submission or launch scripts, which contain the calls to NeMo and any other components the pipeline requires. The scripts are generated according to the selected model type, learning stage, saved configurations, and target platform. If multiple scripts are involved, they are chained with dependencies so that the stages run in a coherent order.

  4. Script Execution on Target Platform: Finally, the launcher executes the script on the chosen platform. The execution method varies depending on the environment: it might use a bash run for interactive (local) setups, an sbatch submission for Slurm setups, etc.
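The last two steps can be pictured with a small sketch. This is not the launcher's own code: the function name `submit_command`, the cluster labels `"interactive"` and `"slurm"`, and the choice of the `afterok` dependency type are assumptions made for illustration. Only `bash`, `sbatch`, and the `sbatch --dependency` option are real tools and flags.

```python
def submit_command(script_path, cluster="interactive", depends_on=None):
    """Build the command that would run one generated launch script.

    Hypothetical helper: the cluster labels are illustrative and are not
    taken from the launcher's configuration.
    """
    if cluster == "interactive":
        # Local setups: run the script directly with bash.
        return ["bash", script_path]
    if cluster == "slurm":
        cmd = ["sbatch"]
        if depends_on is not None:
            # Assumed dependency type: start only after the earlier job succeeds.
            cmd.append(f"--dependency=afterok:{depends_on}")
        return cmd + [script_path]
    raise ValueError(f"unsupported cluster type: {cluster!r}")


print(submit_command("train.sh"))
print(submit_command("eval.sh", cluster="slurm", depends_on=1234))
```

The second call shows how a later stage could be made to wait for an earlier Slurm job, which is one plausible way to chain scripts with dependencies.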

Dry Run: To perform a dry run of your pipeline without actually executing it, you can set the environment variable NEMO_LAUNCHER_DEBUG=1. With this environment variable set, running main.py will generate the interpolated configuration files, save the YAML file, and create the launch script as usual, but it will not submit or launch the scripts. This allows you to inspect and verify the generated files and scripts before proceeding with the actual execution of the pipeline.
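If you drive the launcher from another Python script, a dry run could be triggered along these lines. This is a minimal sketch: the helper names `dry_run_env` and `dry_run` are hypothetical, and the sketch assumes it is called with `launcher_dir` pointing at the directory that contains main.py. The only launcher-specific detail it relies on is the NEMO_LAUNCHER_DEBUG=1 variable described above.

```python
import os
import subprocess
import sys


def dry_run_env(base_env=None):
    """Return a copy of the environment with the dry-run flag set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NEMO_LAUNCHER_DEBUG"] = "1"
    return env


def dry_run(overrides=(), launcher_dir="."):
    """Call main.py in dry-run mode and return the process exit code."""
    cmd = [sys.executable, "main.py", *overrides]
    return subprocess.run(cmd, cwd=launcher_dir, env=dry_run_env()).returncode


# Example call (not executed here), from the directory containing main.py:
#   dry_run(["stages=[training]"])
```

Passing the arguments as a list avoids shell parsing, so the brackets in `stages=[training]` reach main.py unchanged.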

By automating these tasks, the NeMo launcher scripts streamline the process of setting up and running the pipeline, making it easier for you to focus on your experiments and analyses.

Note for Base Command Platform Users: When using the Base Command Platform, directly modifying the configuration file is not recommended. This is because different nodes on the Base Command Platform do not share the same file system, so changes made on one node will not be reflected on other nodes. To ensure consistency across nodes, always use command line overrides for configuration changes on Base Command Platform.

In this example, we will demonstrate how to customize the pipeline according to the following instructions:

  1. Run only the training stage.

  2. Train the stable_diffusion model with the 860m_res_256_pretrain configuration.

  3. Change the number of training epochs from 1 to 10.

Step-by-Step Guide

  1. Include only the training stage: To run only the training stage, update the stages list in conf/config.yaml:


    stages:
      - training

  2. Select the stable_diffusion model with the 860m_res_256_pretrain configuration: Update the training field in the defaults section of conf/config.yaml:


    training: stable_diffusion/860m_res_256_pretrain

  3. Change the training epochs: Navigate to the conf/training/stable_diffusion/860m_res_256_pretrain.yaml file and update the max_epochs field under the trainer section:


    trainer:
      max_epochs: 10

  4. Pipeline Execution: With these customizations in place, the pipeline will now execute only the training stage, using the stable_diffusion model with the 860m_res_256_pretrain configuration, and train for a total of 10 epochs. To run the customized pipeline, simply execute:


    python3 main.py

Instead of manually editing the configuration files, you can also use Hydra’s override feature to achieve the same customizations in a single command. This allows you to quickly test different configurations without modifying the original files. To run the customized pipeline according to the instructions provided earlier, use the following command:


python3 main.py stages=[training] training=stable_diffusion/860m_res_256_pretrain training.trainer.max_epochs=10
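Some shells treat square brackets as glob characters, so an unquoted `stages=[training]` can be rejected or rewritten before it reaches main.py. The sketch below shows one way to assemble the same command with safe quoting using Python's standard `shlex` module. The helper name `build_command` is hypothetical.

```python
import shlex


def build_command(overrides):
    """Join main.py and its Hydra overrides into one shell-safe string."""
    return shlex.join(["python3", "main.py", *overrides])


cmd = build_command([
    "stages=[training]",
    "training=stable_diffusion/860m_res_256_pretrain",
    "training.trainer.max_epochs=10",
])
print(cmd)
# → python3 main.py 'stages=[training]' training=stable_diffusion/860m_res_256_pretrain training.trainer.max_epochs=10
```

Only the argument containing brackets is quoted. The other overrides contain no characters that the shell would interpret.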

Note: When using Hydra’s override feature, make sure to include the stage name (training in this example) for overriding a stage configuration found in conf/(stage_name)/(model_type)/(model_name).yaml. This ensures that the correct stage and configuration file are targeted for the override.
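To see why the stage prefix matters, it helps to picture the composed configuration as a nested mapping in which each stage's YAML file sits under a key named after the stage. The sketch below is a simplified stand-in written with plain dictionaries, not Hydra's or OmegaConf's implementation. The function `apply_override` and the sample `composed` mapping are illustrative assumptions, and values are kept as strings for simplicity.

```python
def apply_override(cfg, override):
    """Set a dotted key=value override on a nested dict (toy model only)."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        # Walk down the nesting, creating levels that do not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = value
    return cfg


# Assumed shape: the training stage's file is nested under the "training" key.
composed = {"stages": ["training"], "training": {"trainer": {"max_epochs": "1"}}}

apply_override(composed, "training.trainer.max_epochs=10")
print(composed["training"]["trainer"]["max_epochs"])
```

In this toy model, an override written without the prefix (`trainer.max_epochs=10`) would create a separate top-level `trainer` entry and leave the training stage untouched. Hydra itself typically rejects an override for a key that does not exist in the composed configuration.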

© Copyright 2023-2024, NVIDIA. Last updated on Apr 25, 2024.