Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Model Training#
Define the configuration used for the training pipeline by setting the training configuration in conf/config.yaml. For example, setting the configuration to gpt3/5b selects the configuration file conf/training/gpt3/5b.yaml.
Modify that configuration file to adjust the hyperparameters of the training runs.
All supported model types and sizes are stored in the directory conf/training.
The value training must be included in stages to run the training pipeline.
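For example, selecting the 5B GPT-3 model and enabling the training stage might look like the following sketch (the exact layout of conf/config.yaml follows the launcher's Hydra defaults list and may differ):

```yaml
# conf/config.yaml (illustrative excerpt) -- selects conf/training/gpt3/5b.yaml
defaults:
  - training: gpt3/5b

stages:
  - training
```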
Slurm#
Define the configuration for a Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
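Assuming standard Slurm semantics, these values map roughly onto sbatch directives. A hypothetical job header generated from the settings above (with partition and account filled in) might look like:

```bash
#!/bin/bash
# Hypothetical sbatch header; the flag names are standard Slurm,
# but the exact script the launcher generates may differ.
#SBATCH --partition=<partition>           # from cluster.partition
#SBATCH --account=<account>               # from cluster.account
#SBATCH --exclusive                       # from cluster.exclusive: True
#SBATCH --gpus-per-node=8                 # from cluster.gpus_per_node
#SBATCH --mem=0                           # from cluster.mem (0 = all node memory)
#SBATCH --job-name=nemo-megatron-gpt3_5b  # job_name_prefix + run.name
```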
Set the job-specific training configurations in
the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: gpt3_5b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "1-12:00:00"
  dependency: "singleton"
To run only the training pipeline and not the data preparation,
evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- training
Then enter:
python3 main.py
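Because the launcher is Hydra-based, individual values can also be overridden on the command line instead of editing the YAML files. For example (a sketch using the run-section keys shown above):

```shell
# Select the 5B GPT-3 training config and override two run values at launch time.
python3 main.py \
    training=gpt3/5b \
    training.run.time_limit="0-06:00:00" \
    training.run.dependency="singleton"
```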
Base Command Platform#
Select the cluster-related configuration following the NGC
documentation. Then enter python3 main.py to launch the
job, overriding any training configuration values that you need to change.
Kubernetes#
Define the configuration for a Kubernetes cluster in conf/cluster/k8s.yaml:
pull_secret: null # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null # Hostname or IP address for the NFS server where data is stored.
nfs_path: null # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev" # Kubernetes resource name for IB devices, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8" # Specify the number of IB devices to include per node in each pod.
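A few of these fields have constraints worth checking before launching (for example, shm_size should end in "Gi", and ib_count is a numeric string). The helper below is a hypothetical sanity check, not part of the launcher; the field names follow conf/cluster/k8s.yaml:

```python
# Hypothetical sanity check for the k8s cluster values above.
def validate_k8s_cluster(cfg: dict) -> list[str]:
    """Return a list of human-readable problems found in the config."""
    problems = []
    if not str(cfg.get("shm_size", "")).endswith("Gi"):
        problems.append("shm_size should end in 'Gi' (e.g. 512Gi)")
    if cfg.get("nfs_server") is None or cfg.get("nfs_path") is None:
        problems.append("nfs_server and nfs_path must both be set")
    if not str(cfg.get("ib_count", "")).isdigit():
        problems.append('ib_count should be a numeric string, e.g. "8"')
    return problems

cfg = {
    "pull_secret": None,
    "shm_size": "512Gi",
    "nfs_server": "10.0.0.5",    # example value
    "nfs_path": "/export/nemo",  # example value
    "ib_resource_name": "nvidia.com/hostdev",
    "ib_count": "8",
}
print(validate_k8s_cluster(cfg))  # prints [] when the config is consistent
```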
Set the job-specific training parameters in
the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: gpt3_5b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "1-12:00:00"
  dependency: "singleton"
Set the cluster and cluster_type settings to k8s in conf/config.yaml.
To run only the training pipeline and not the data preparation,
evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- training
Then enter:
python3 main.py
This launches a Helm chart based on the training configuration, which spawns one pod per node and attaches any networking fabrics specified in the cluster settings for distributed training.
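Once the chart is deployed, standard Helm and kubectl commands can be used to inspect the job (the pod name below is a placeholder):

```shell
helm list                   # confirm the training release was installed
kubectl get pods -o wide    # one pod per node should be scheduled
kubectl logs <pod-name> -f  # follow training output from a pod
```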