Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Prepare Environment

NeMo Framework uses a set of Docker containers executed locally on a single node, on a Slurm cluster (using the pyxis plug-in), or on a Base Command Platform cluster. The container also includes conversion scripts. For inference, the container includes the NVIDIA Triton Inference Server with the TensorRT-LLM backend installed.

Note

Ensure that the high-speed file system is mounted on the job submission node(s) at the same path as on the compute nodes.

Slurm

The NeMo Framework codebase is included as part of the training container. To copy it to a local directory in the cluster, you must extract it from the container. Execute the following command to copy the code to a directory of your choice, /path/to/local/dir in this example. The NeMo Framework repository for Slurm has been verified on Base Command Manager as well as Slurm-based DeepOps clusters.

srun -p <partition> -N 1 --container-mounts=/path/to/local/dir:/workspace/mount_dir --container-image=<container_tag> bash -c "cp -r /opt/NeMo-Framework-Launcher /workspace/mount_dir/"

Install the NeMo Framework scripts’ dependencies on the head node of the cluster:

pip install -r requirements.txt

You can use venv to keep your head node environment in its original state for other Python projects. If your configuration lacks pip, you can install it from get-pip.py using just python3. To install the dependencies into a virtual environment, run the following commands before the pip install above:

python3 -m venv venv
source ./venv/bin/activate
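If pip itself is missing from the head node, one way to bootstrap it before creating the virtual environment is the standard get-pip.py script. The sketch below is illustrative only; it assumes outbound HTTPS access from the head node, and the URL is the upstream pip bootstrap location rather than something shipped with the launcher:

# Bootstrap pip on a head node that ships without it (assumes outbound network access).
curl -sSLO https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py --user   # installs pip into the user site-packages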

Base Command Platform

The nemo_megatron_launcher codebase is included as part of the training container. Before you start, set up the NGC CLI and configure it as described in the Base Command Platform User Guide.

This guide’s examples mainly use two Base Command Platform workspaces: one for storing the training dataset, and another for storing the results, checkpoints, and logs. Therefore, start by creating these workspaces. NVIDIA suggests that you name them nemo_megatron_data_ws and nemo_megatron_results_ws. The Base Command Platform User Guide explains how to create and work with Base Command Platform workspaces.
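As a sketch, the two workspaces could be created with the NGC CLI along these lines (the ACE name is a placeholder for your own ACE; verify the exact flags against the NGC CLI version you have installed):

# Create the two suggested workspaces (ACE name is a placeholder).
ngc workspace create --name nemo_megatron_data_ws --ace <ace_name>
ngc workspace create --name nemo_megatron_results_ws --ace <ace_name>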

Kubernetes

Data preparation, base model training, evaluation, and conversion of GPT models are currently supported on vanilla Kubernetes (k8s) clusters. The launcher scripts generate a Helm chart for each task based on the config files and launch the job using the chart.

The following is required for running jobs on Kubernetes:
  • One or more DGX A100s/H100s as worker nodes

  • An NFS filesystem, accessible from all worker and controller nodes, where the data and launcher scripts will be stored

  • A head/controller node that has access to the worker nodes, can run kubectl and helm to launch jobs, and can install Python dependencies

  • Recent versions of the GPU, Network, and KubeFlow Operators installed

A secret key must be configured to allow Kubernetes to pull from the private registry. For example, if pulling the container directly from NGC, create a secret to authenticate with the private NGC registry, such as the following:

kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>

The launcher scripts need to be downloaded to the NFS filesystem that is connected to the worker nodes. They can either be copied from /opt/NeMo-Framework-Launcher inside the training container or obtained by cloning this repository.
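As a sketch of both options (the container tag, NFS path, and the availability of docker on the node are placeholders and assumptions; the clone URL in the second option is an assumption as well, so use whatever source you normally pull the launcher from):

# Option 1: copy the launcher out of the training container onto the NFS mount.
docker run --rm -v /path/to/nfs/dir:/workspace/mount_dir <container_tag> \
    bash -c "cp -r /opt/NeMo-Framework-Launcher /workspace/mount_dir/"

# Option 2: clone the repository directly onto the NFS mount (URL is an assumption).
git clone https://github.com/NVIDIA/NeMo-Framework-Launcher.git /path/to/nfs/dir/NeMo-Framework-Launcher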

Install the NeMo Framework scripts’ dependencies on the head/controller node of the cluster from which jobs will be launched:

pip install -r requirements.txt

General Configuration

All config files referenced in this section and throughout are in launcher_scripts (except for the AutoConfigurator’s config files which are in auto_configurator).

The first configuration you must set is launcher_scripts_path in the file conf/config.yaml. This configuration must point to the absolute path where the nemo_megatron_launcher repository is stored in the file system.
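For orientation, the relevant keys in conf/config.yaml look roughly like the excerpt below (the paths are placeholders; data_dir and base_results_dir are described later in this section):

# conf/config.yaml (excerpt; paths are placeholders)
launcher_scripts_path: /path/to/NeMo-Framework-Launcher/launcher_scripts  # must be an absolute path
data_dir: /path/to/data             # where datasets are loaded from or saved to
base_results_dir: /path/to/results  # where results, checkpoints, and logs are stored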

Additionally, if you are using a Slurm-based cluster, the file conf/cluster/bcm.yaml contains the configurations needed to set generic cluster-related information, such as the partition or account.
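A minimal sketch of that file, showing only the fields named above (the values are placeholders for your cluster; the file shipped with the launcher contains additional scheduler options):

# conf/cluster/bcm.yaml (excerpt)
partition: <partition_name>  # Slurm partition to submit jobs to
account: <account_name>      # Slurm account used for job accounting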

For Kubernetes clusters, the k8s configuration file (conf/cluster/k8s.yaml) contains the parameters needed to configure the scripts to run on the cluster, such as ib_resource_name, nfs_server, and pull_secret.
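A sketch showing only the parameters named above (values are placeholders; pull_secret should match the name of the registry secret created with kubectl in the previous section):

# conf/cluster/k8s.yaml (excerpt)
pull_secret: ngc-registry              # name of the Docker registry secret created earlier
ib_resource_name: <ib_resource_name>   # InfiniBand resource name advertised by the network operator
nfs_server: <nfs_server_address>       # NFS server hosting the data and launcher scripts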

The NUMA mapping can also be configured in the file conf/config.yaml. The mapping should be automatic: the code reads the number of CPU cores available in your cluster and provides the best possible mapping to maximize performance. The mapping is enabled by default, but it can be disabled by setting enable: False in the numa_mapping section of conf/config.yaml. The type of mapping can also be configured in the same file. See the full configurations below:

numa_mapping:
  enable: True  # Set to False to disable all mapping (performance will suffer).
  mode: unique_contiguous  # One of: all, single, single_unique, unique_interleaved or unique_contiguous.
  scope: node  # Either node or socket.
  cores: all_logical  # Either all_logical or single_logical.
  balanced: True  # Whether to assign an equal number of physical cores to each process.
  min_cores: 1  # Minimum number of physical cores per process.
  max_cores: 8  # Maximum number of physical cores per process. Can be null to use all available cores.
  • Slurm: The launcher_scripts_path will automatically be mounted to the container at the same path as in the local file system. Any additional directories that should be mounted must be specified using the container_mounts configuration. If a path contains the colon character (:), the code assumes that both the source and destination paths are provided. Otherwise, the given path is mounted to the same path inside the container. The data_dir configuration can also be modified to point to where the dataset will be loaded from or saved. The base_results_dir can also be modified to point to where the results, checkpoints, and logs will be stored. These last two directories will be automatically mounted into the container. The configurations cluster and cluster_type must be set to bcm for all of the tasks.

  • Base Command Platform: Set launcher_scripts_path to /opt/NeMo-Framework-Launcher/launcher_scripts, the default location of the scripts in the container.

  • Kubernetes: The launcher_scripts_path parameter needs to be set to the NFS path where the NeMo-Framework-Launcher code is located.

    Modify data_dir to point to the location from which the dataset is loaded or to which it is saved, and base_results_dir to the directory in which results, checkpoints, and logs are to be stored. In the case of Base Command Platform, NVIDIA recommends that you point data_dir to one of the workspaces and base_results_dir to the other. Both must be mounted in read-write mode. The configuration cluster_type must be set to bcp for all of the tasks.

    The file main.py is the top-level file that is executed to run the data preparation, training, conversion, fine-tuning, and evaluation pipelines. The file conf/config.yaml contains the configurations that determine which of these pipelines are run. On Slurm-based clusters, all of them can be enabled at the same time, and they will be executed in order. On Base Command Platform, though, only one of them may be enabled at a given time.

  • Settings for GPT Models: To use the default settings for GPT models, set conf/config.yaml as below:

    stages:
      - data_preparation
      - training
      - conversion
      - evaluation
      - export
    
  • Settings for T5 Models: To use the default settings for T5 models, set conf/config.yaml as below:

    # default values:
    cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
    data_preparation: t5/download_t5_pile
    training: t5/220m
    conversion: t5/convert_t5
    fine_tuning: t5/squad
    evaluation: t5/squad
    export: t5/export_t5
    
    stages:
      - data_preparation
      - training
      - conversion
      - fine_tuning
      - prompt_learning
      - evaluation
      - export
    
  • Settings for mT5 Models: To use the default settings for mT5 models, set conf/config.yaml as below:

    # default values:
    cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
    data_preparation: mt5/download_mc4
    training: mt5/390m
    conversion: mt5/convert_mt5
    fine_tuning: mt5/xquad
    evaluation: mt5/xquad
    export: mt5/export_mt5
    
    stages:
      - data_preparation
      - training
      - conversion
      - fine_tuning
      - prompt_learning
      - evaluation
      - export
    
  • Settings for BERT Models: To use the default settings for BERT models, set conf/config.yaml as below:

    # default values:
    cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
    data_preparation: bert/download_bert_pile
    training: bert/4b
    
    stages:
      - data_preparation
      - training
    

For other models, see the respective sections in the documentation.

To run the pipelines, execute:

python3 main.py

The entire repository uses hydra/omegaconf to handle job configuration with YAML files; refer to the documentation for those projects to learn more.
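Because the configuration is Hydra-based, individual values and config groups can also be overridden on the command line without editing the YAML files. A sketch (the stage and config names below are illustrative; use the ones defined under your conf/ directory):

# Run only the training stage and select the T5 220M training config (illustrative values).
python3 main.py 'stages=[training]' training=t5/220m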