Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Prepare Environment
NeMo Framework uses a set of Docker containers that can be executed locally on a single node, on a Slurm cluster (using the pyxis plug-in), or on a Base Command Platform cluster. The container also includes conversion scripts. For inference, the container includes the NVIDIA Triton Inference Server with the TensorRT-LLM backend installed.
Note
Ensure that the high-speed file system is mounted on the job submission node(s) at the same path as on the compute nodes.
Slurm
The NeMo Framework codebase is included as part of the training container. To copy it to a local directory on the cluster, you must extract it from the container. You can execute the following command to copy the code to a directory named /path/to/local/dir. The NeMo Framework repository for Slurm has been verified on Base Command Manager as well as on Slurm-based DeepOps clusters.
srun -p <partition> -N 1 --container-mounts=/path/to/local/dir:/workspace/mount_dir --container-image=<container_tag> bash -c "cp -r /opt/NeMo-Framework-Launcher /workspace/mount_dir/"
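For example, a concrete invocation might look like the sketch below; the partition name, mount path, and container tag are placeholders, not values prescribed by this guide, so substitute the ones for your cluster and release:

# Placeholder values: partition "defq", mount path, and container tag are examples only
srun -p defq -N 1 --container-mounts=/lustre/fsw/nemo:/workspace/mount_dir --container-image=nvcr.io/nvidia/nemo:24.07 bash -c "cp -r /opt/NeMo-Framework-Launcher /workspace/mount_dir/"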
Install the NeMo Framework scripts’ dependencies on the head node of the cluster:
pip install -r requirements.txt
You can use venv to keep your head node environment in its original state for other Python projects. If your configuration lacks pip, you can install pip from get_pip.py with just python3. To install the dependencies in a virtual environment, run the following commands before the ‘pip install’ above:
python3 -m venv venv
source ./venv/bin/activate
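If you do need to bootstrap pip first, a minimal sketch is shown below; it assumes the head node has internet access and uses the standard PyPA bootstrap script (get-pip.py), which is not part of the launcher itself:

# Download the standard PyPA bootstrap script and install pip using python3 only
curl -sSLO https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py --user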
Base Command Platform
The nemo_megatron_launcher codebase is included as part of the training container. Before you start, set up the NGC CLI and configure it as described in the Base Command Platform User Guide.
This guide’s examples mainly use two Base Command Platform workspaces: one for storing the training dataset, and another for storing the results, checkpoints, and logs. Therefore, start by creating these workspaces. NVIDIA suggests that you give them names like nemo_megatron_data_ws and nemo_megatron_results_ws. The Base Command Platform User Guide explains how to create and work with Base Command Platform workspaces.
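For illustration only, creating the workspaces with the NGC CLI typically looks like the sketch below; the exact flags and the ACE name are assumptions, so verify the syntax with ngc workspace create --help and the Base Command Platform User Guide:

# Flags and ACE name below are assumptions; verify with "ngc workspace create --help"
ngc workspace create --name nemo_megatron_data_ws --ace <ace_name>
ngc workspace create --name nemo_megatron_results_ws --ace <ace_name>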
Kubernetes
Data preparation, base model training, evaluation, and conversion of GPT models are currently supported on vanilla Kubernetes (k8s) clusters. The launcher scripts generate a Helm chart for each task based on the config files and launch the job using that chart.
The following is required for running jobs on Kubernetes:
- One or more DGX A100s/H100s as worker nodes
- An NFS filesystem, accessible on all worker and controller nodes, where the data and launcher scripts will be stored
- A head/controller node that has access to the worker nodes, can run kubectl and helm to launch jobs, and can install Python dependencies
- Recent versions of the GPU, Network, and KubeFlow Operators installed
A secret key needs to be configured to allow Kubernetes to pull from the private registry. For example, if pulling the container directly from NGC, create a secret to authenticate with the private NGC registry, such as the following:
kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
The launcher scripts need to be downloaded to the NFS filesystem that is connected to the worker nodes. They can either be copied from /opt/NeMo-Framework-Launcher inside the training container, or obtained by cloning this repository.
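If you clone, a typical invocation looks like the sketch below; the GitHub URL assumes the public NVIDIA/NeMo-Framework-Launcher repository, and the destination path is a placeholder for a directory on your NFS mount:

# Destination path is a placeholder for a directory on the NFS filesystem
git clone https://github.com/NVIDIA/NeMo-Framework-Launcher.git /path/to/nfs/NeMo-Framework-Launcher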
Install the NeMo Framework scripts’ dependencies on the head/controller node of the cluster where jobs will be launched:
pip install -r requirements.txt
General Configuration
All config files referenced in this section and throughout are in launcher_scripts (except for the AutoConfigurator’s config files, which are in auto_configurator).
The first configuration you must set is launcher_scripts_path in the file conf/config.yaml. This configuration must point to the absolute path where the nemo_megatron_launcher repository is stored in the file system.
Additionally, if you are using a Slurm-based cluster, the file conf/cluster/bcm.yaml has the configurations needed to set generic cluster-related information, such as the partition or account.
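As a rough sketch, the relevant fields in conf/cluster/bcm.yaml look like the following; only partition and account are named in this guide, so treat the remaining keys and all values as illustrative assumptions and check the file shipped with the launcher:

partition: null        # Slurm partition to submit jobs to
account: null          # Slurm account to charge jobs against
exclusive: True        # Illustrative assumption: request exclusive use of allocated nodes
job_name_prefix: "nemo-megatron-"  # Illustrative assumption: prefix prepended to Slurm job names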
For Kubernetes clusters, the k8s configuration file (conf/cluster/k8s.yaml) has the parameters necessary to configure the scripts to run on the cluster, such as ib_resource_name, nfs_server, and pull_secret.
The NUMA mapping can also be configured from the file conf/config.yaml. The mapping should be automatic; the code will read the number of CPU cores available in your cluster and provide the best possible mapping to maximize performance. The mapping is enabled by default, but it can be disabled by setting enable: False in the numa_mapping section of conf/config.yaml. The type of mapping can also be configured using the same file. See the full configurations below:
numa_mapping:
  enable: True  # Set to False to disable all mapping (performance will suffer).
  mode: unique_contiguous  # One of: all, single, single_unique, unique_interleaved or unique_contiguous.
  scope: node  # Either node or socket.
  cores: all_logical  # Either all_logical or single_logical.
  balanced: True  # Whether to assign an equal number of physical cores to each process.
  min_cores: 1  # Minimum number of physical cores per process.
  max_cores: 8  # Maximum number of physical cores per process. Can be null to use all available cores.
- Slurm: The launcher_scripts_path will automatically be mounted to the container at the same path as in the local file system. Any additional directories that should be mounted must be specified using the container_mounts configuration. If the paths contain the colon character (:), the code will assume both the source and destination paths are provided. Otherwise, the given paths will be mounted to the same path inside the container. The data_dir configuration can also be modified to point to where the dataset will be loaded from or saved. The base_results_dir can also be modified to point to where the results, checkpoints, and logs will be stored. These last two directories will be automatically mounted into the container. The configurations cluster and cluster_type must be set to bcm for all of the tasks.
- Base Command Platform: Set launcher_scripts_path to /opt/NeMo-Framework-Launcher/launcher_scripts, the default location of the scripts in the container.
- Kubernetes: The launcher_scripts_path parameter needs to be set to the NFS path where the NeMo-Framework-Launcher code is located.

Modify data_dir to point to the location from which the dataset is loaded or to which it is saved, and base_results_dir to the directory in which results, checkpoints, and logs are to be stored. In the case of Base Command Platform, NVIDIA recommends that you point data_dir to one of the workspaces and base_results_dir to the other. Both must be mounted in read-write mode. The configuration cluster_type must be set to bcp for all of the tasks.

The file main.py is the top-level file that is executed to run the data preparation, training, conversion, fine-tuning, and evaluation pipelines. The file conf/config.yaml contains configurations that determine which of these pipelines is run. In Slurm-based clusters all of them can be set to True at the same time, and they will be executed in order. In Base Command Platform, though, only one of them may be set to True at a given time.

- Settings for GPT Models: To use the default settings for GPT models, set config/config.yaml as below:

stages:
  - data_preparation
  - training
  - conversion
  - evaluation
  - export
- Settings for T5 Models: To use the default settings for T5 models, set config/config.yaml as below:

# default values:
cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: t5/download_t5_pile
training: t5/220m
conversion: t5/convert_t5
fine_tuning: t5/squad
evaluation: t5/squad
export: t5/export_t5

stages:
  - data_preparation
  - training
  - conversion
  - fine_tuning
  - prompt_learning
  - evaluation
  - export
- Settings for mT5 Models: To use the default settings for mT5 models, set config/config.yaml as below:

# default values:
cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: mt5/download_mc4
training: mt5/390m
conversion: mt5/convert_mt5
fine_tuning: mt5/xquad
evaluation: mt5/xquad
export: mt5/export_mt5

stages:
  - data_preparation
  - training
  - conversion
  - fine_tuning
  - prompt_learning
  - evaluation
  - export
- Settings for BERT Models: To use the default settings for BERT models, set config/config.yaml as below:

# default values:
cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: bert/download_bert_pile
training: bert/4b

stages:
  - data_preparation
  - training
For other models, see the respective sections in the documentation.
To run the pipelines, execute:
python3 main.py
The entire repository uses hydra/omegaconf to handle job configuration using YAML files, so look at the documentation for those projects to learn more.
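Because configuration is handled by Hydra, any value in conf/config.yaml can also be overridden on the command line. A hedged example is shown below; the stage and training values reuse the defaults listed above, and the results path is a placeholder:

# Override the stage list, a config group, and a path from the command line (Hydra syntax)
python3 main.py stages=[training] training=t5/220m base_results_dir=/path/to/results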