Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Quickstart Guide for NeMo Launcher

Installation Steps

  1. Clone the NeMo Launcher:

    Start by cloning the repository from GitHub:

    git clone https://github.com/NVIDIA/NeMo-Framework-Launcher
    

    Locate the scripts in NeMo-Framework-Launcher/launcher_scripts.

  2. Set Up Your Python Environment:

    Create a virtual environment (recommended), then install the required packages from the repository root:

    cd NeMo-Framework-Launcher
    python -m venv my_project_env
    source my_project_env/bin/activate
    pip install -r requirements.txt
    
  3. Additional Setup Requirements:

    Ensure you have the following ready:

    1. A NeMo Framework container.

    2. A dataset for training, tuning, or evaluation.

    3. Your Weights & Biases (wandb) API key, for logging to a wandb server.

    4. The NeMo Framework source code, if you are using a custom version of NeMo.
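For item 3, the training command later in this guide reads the key from a file passed as wandb_api_key_file. The following is a minimal sketch for creating such a file. It assumes the key is exported in a WANDB_API_KEY environment variable and that the launcher reads the key from the first line of the file; write_key_file and the KEY_FILE default are hypothetical names used only for illustration.

```shell
# write_key_file DEST KEY: store KEY on the first line of DEST.
# The umask makes a newly created file readable and writable by its owner only;
# it does not change the permissions of a file that already exists.
write_key_file() {
    ( umask 077 && printf '%s\n' "$2" > "$1" )
}

# Assumption: the key is exported as WANDB_API_KEY; KEY_FILE is a placeholder path.
if [ -n "${WANDB_API_KEY:-}" ]; then
    write_key_file "${KEY_FILE:-$HOME/.wandb_key}" "$WANDB_API_KEY"
fi
```

Pass the same path to the launcher as wandb_api_key_file.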

Starting the NeMo Framework Container

Start the container with the command appropriate for your environment (for example, srun or docker run), and make sure the launcher and data folders are mounted inside it.
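As an illustration for the docker run case, the sketch below assembles a command that mounts both folders and prints it for review. Everything in it is an assumption to adapt: the host paths, the mount points inside the container, the shared-memory size, and the image tag (the dev tag mentioned at the top of this page is used only as a placeholder; substitute the NeMo Framework container you actually use). build_docker_cmd is a hypothetical helper, not part of the launcher.

```shell
# Host paths and image: placeholders, override them in your environment.
LAUNCHER_DIR="${LAUNCHER_DIR:-/path/to/NeMo-Framework-Launcher}"
DATA_DIR="${DATA_DIR:-/path/to/dataset}"
IMAGE="${IMAGE:-nvcr.io/nvidia/nemo:dev}"

# Mount points inside the container (assumed; any paths work if used consistently).
LAUNCHER_MOUNT="${LAUNCHER_MOUNT:-/opt/NeMo-Framework-Launcher}"
DATA_MOUNT="${DATA_MOUNT:-/data}"

# Print the docker command instead of running it, so it can be checked first.
build_docker_cmd() {
    printf 'docker run --gpus all --rm -it --shm-size=16g -v %s:%s -v %s:%s %s bash\n' \
        "$LAUNCHER_DIR" "$LAUNCHER_MOUNT" "$DATA_DIR" "$DATA_MOUNT" "$IMAGE"
}

build_docker_cmd
```

Once the printed command looks right, run it directly. Paths passed to the launcher from inside the container (launcher_scripts_path, data_dir) then refer to the mount points, not the host paths. On a Slurm cluster, the equivalent is an srun invocation using your site's container integration.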

Example: GPT3 5B Pretraining

Configuration:

  • The launcher uses hierarchical configurations, with the main file at conf/config.yaml.

  • This example uses the GPT3 5B training configuration, conf/training/gpt3/5b.yaml.

  • To change settings, edit the config files or pass overrides as command-line arguments. For more information, see the Hydra tutorial.
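To make the hierarchy concrete: in Hydra, an argument of the form group=option selects a YAML file under the config directory, while a dotted key such as training.trainer.num_nodes=1 overrides a single value inside the selected config. The sketch below shows only the first mapping; group_override_to_path is a hypothetical helper written for this illustration and is not part of Hydra or the launcher.

```shell
# Map a Hydra config-group override (group=option) to the YAML file it selects,
# relative to the launcher_scripts directory.
group_override_to_path() {
    group="${1%%=*}"    # text before the first "=" is the config group
    option="${1#*=}"    # text after it is the option within that group
    printf 'conf/%s/%s.yaml\n' "$group" "$option"
}

group_override_to_path training=gpt3/5b   # prints conf/training/gpt3/5b.yaml
```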

Execution Script: Go to the launcher scripts directory and run the following commands for training:

cd /path/to/NeMo-Framework-Launcher/launcher_scripts
python3 main.py \
    'stages=[training]' \
    launcher_scripts_path=/path/to/launcher_scripts \
    data_dir=/path/to/dataset/the_pile_gpt3 \
    wandb_api_key_file=/path/to/wandb_key \
    cluster_type=interactive \
    training=gpt3/5b \
    training.trainer.max_time=00:03:50:00 \
    training.trainer.num_nodes=1

Your training logs and results will be located in /path/to/launcher_scripts/results/gpt3_5b.
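In the command above, training.trainer.max_time=00:03:50:00 is a time limit in days:hours:minutes:seconds, so this run stops after at most 3 hours 50 minutes. As a quick sanity check when choosing a limit (for example, to stay under a scheduler allocation), the value can be converted to seconds. max_time_to_seconds is a hypothetical helper for illustration only.

```shell
# Convert a DD:HH:MM:SS string to seconds.
# awk treats each field as a decimal number, so leading zeros are harmless.
max_time_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

max_time_to_seconds 00:03:50:00   # prints 13800 (3 h 50 min)
```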