Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Understand the Configurations
Understanding how the preloaded configurations are structured, and how to modify them, is essential to using the NeMo Launcher effectively.
Overall Hierarchy:
The launcher uses hierarchical configurations, with the primary file located at conf/config.yaml. All preloaded configurations, optimized and rigorously tested by NVIDIA, are found in the conf folder, organized by stage. The configuration files follow the structure conf/(stage_name)/(model_type)/(model_name).yaml.
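The path convention above can be sketched in Python. The helper name `stage_config_path` is hypothetical and is not part of the launcher; it only illustrates how a stage name, model type, and model name map to a file under the conf folder.

```python
from pathlib import PurePosixPath


def stage_config_path(stage_name, model_type, model_name, conf_dir="conf"):
    """Build conf/(stage_name)/(model_type)/(model_name).yaml.

    Illustrative helper only; the launcher itself resolves these files
    through its hierarchical configuration, not through this function.
    """
    return PurePosixPath(conf_dir) / stage_name / model_type / f"{model_name}.yaml"


# The default training entry "gpt3/5b" corresponds to this file:
print(stage_config_path("training", "gpt3", "5b"))  # conf/training/gpt3/5b.yaml
```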
Pipeline Configurations
The conf/config.yaml file contains default configuration settings for the stages of your pipeline, including data preparation, training, fine-tuning, evaluation, and more. The stages field specifies which stages are executed during the pipeline run.
defaults:
- _self_
- cluster: bcm # Leave it as bcm even if using bcp. It will be ignored for bcp.
- data_preparation: gpt3/download_gpt3_pile
- training: gpt3/5b
- conversion: null
- fine_tuning: null
- evaluation: gpt3/evaluate_all
- fw_inference: null
- export: null
- override hydra/job_logging: stdout
stages:
- data_preparation
- training
- fw_inference
Customize the Pipeline for Your Needs:
- Include or exclude a stage: To include or exclude a stage in the pipeline, add or remove the stage name from the stages list.
- Modify stage configuration settings: To modify the configuration settings for a specific stage, navigate to the appropriate folder in the conf directory (e.g., conf/training for training options) and edit the relevant fields.
- Use a different configuration file: To use a different configuration file for a stage, update the corresponding field in the defaults section (e.g., change training: gpt3/5b to training: (model_type)/(model_name)).
- Update specific model configurations: Modify the YAML files in conf/(stage_name)/(model_type)/(model_name).yaml to update specific stage configurations, such as the number of nodes, precision, and model configurations.
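As a rough illustration of the first and third customizations, the sketch below applies them to a plain Python dictionary that mirrors a subset of conf/config.yaml. The function `customize` and the value `my_model_type/my_model_name` are hypothetical placeholders; in practice you would edit the YAML file itself.

```python
import copy

# Subset of the defaults and stages shown in conf/config.yaml.
config = {
    "defaults": [
        "_self_",
        {"cluster": "bcm"},
        {"data_preparation": "gpt3/download_gpt3_pile"},
        {"training": "gpt3/5b"},
        {"evaluation": "gpt3/evaluate_all"},
    ],
    "stages": ["data_preparation", "training", "fw_inference"],
}


def customize(config, stages, stage_overrides):
    """Return a copy with a new stages list and swapped defaults entries."""
    new_config = copy.deepcopy(config)
    new_config["stages"] = list(stages)
    for entry in new_config["defaults"]:
        # Entries such as "_self_" are plain strings and are left untouched.
        if isinstance(entry, dict):
            for stage in entry:
                if stage in stage_overrides:
                    entry[stage] = stage_overrides[stage]
    return new_config


# Run only the training stage, using a different (placeholder) config file.
custom = customize(
    config,
    stages=["training"],
    stage_overrides={"training": "my_model_type/my_model_name"},
)
print(custom["stages"])
print(custom["defaults"])
```

The original dictionary is left unchanged, which makes it easy to compare the default and customized pipelines side by side.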
Cluster Configurations
The first parameter that must be set is launcher_scripts_path in the conf/config.yaml file. This parameter must point to the absolute path where the launcher_scripts folder (pulled from the container) is stored in the file system. Additionally, if you are using a Slurm-based cluster, the conf/cluster/bcm.yaml file contains the parameters for generic cluster-related information, such as partition or account. Tailor the cluster configuration below to match your cluster setup.
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-multimodal-"
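The requirement that launcher_scripts_path be an absolute path can be checked before a run. The following is a minimal sketch; the function name and the example paths are assumptions for illustration, and the launcher may perform its own validation differently.

```python
from pathlib import PurePosixPath


def check_launcher_scripts_path(path):
    """Raise ValueError unless `path` is an absolute POSIX path."""
    candidate = PurePosixPath(path)
    if not candidate.is_absolute():
        raise ValueError(f"launcher_scripts_path must be absolute, got: {path}")
    return candidate


# Hypothetical location of the launcher_scripts folder on the file system.
print(check_launcher_scripts_path("/opt/launcher/launcher_scripts"))

try:
    check_launcher_scripts_path("launcher_scripts")  # relative path is rejected
except ValueError as err:
    print(err)
```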
Environment Variables Configurations
To configure or add additional environment variables when running pipelines, you can modify or include new fields under the env_vars section in the conf/config.yaml file. If a variable is set to null, it will be ignored.
env_vars:
  NCCL_TOPO_FILE: null # Should be a path to an XML file describing the topology
  UCX_IB_PCI_RELAXED_ORDERING: null # Needed to improve Azure performance
  ...
  TRANSFORMER_OFFLINE: 1
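The rule that null variables are ignored can be sketched as follows. This is an illustration only: the function `render_env_exports` is hypothetical, and the way the launcher actually passes variables to jobs is not shown in this section. YAML null corresponds to Python `None`.

```python
def render_env_exports(env_vars):
    """Return shell export lines, skipping variables set to None (YAML null)."""
    return [
        f"export {name}={value}"
        for name, value in env_vars.items()
        if value is not None
    ]


env_vars = {
    "NCCL_TOPO_FILE": None,               # null: ignored
    "UCX_IB_PCI_RELAXED_ORDERING": None,  # null: ignored
    "TRANSFORMER_OFFLINE": 1,
}

for line in render_env_exports(env_vars):
    print(line)  # export TRANSFORMER_OFFLINE=1
```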
NUMA Mapping Configurations
NUMA (non-uniform memory access) mapping is a technique used on systems with multiple processors, where memory access times vary depending on which processor is accessing the memory. The goal of NUMA mapping is to assign memory and cores to processes so that each process reaches the memory it needs with minimal delay. This is important for maximizing performance in high-performance computing environments.
NUMA mapping can also be configured from the conf/config.yaml file. The mapping is automatic: the code reads the number of CPU cores available in your cluster and provides the best possible mapping to maximize performance. The mapping is enabled by default, but it can be disabled by setting enable: False in the numa_mapping section of the conf/config.yaml file. The type of mapping can be configured in the same section. See the full configuration parameters below:
numa_mapping:
  enable: True # Set to False to disable all mapping (performance will suffer).
  mode: unique_contiguous # One of: all, single, single_unique, unique_interleaved or unique_contiguous.
  scope: node # Either node or socket.
  cores: all_logical # Either all_logical or single_logical.
  balanced: True # Whether to assign an equal number of physical cores to each process.
  min_cores: 1 # Minimum number of physical cores per process.
  max_cores: 8 # Maximum number of physical cores per process. Can be null to use all available cores.
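To give a feel for how two of the modes might differ, here is a simplified sketch. It is not the launcher's implementation: the interpretation of unique_contiguous (each process gets its own consecutive block of cores) and unique_interleaved (each process gets every n-th core) is an assumption based on the mode names, and the function `assign_cores` is hypothetical. It gives every process the same number of cores, in the spirit of balanced: True, and applies the min_cores and max_cores bounds.

```python
def assign_cores(cores, num_processes, mode="unique_contiguous",
                 min_cores=1, max_cores=None):
    """Split `cores` into one disjoint, equally sized group per process."""
    share = len(cores) // num_processes  # equal share for every process
    per_process = share if max_cores is None else min(share, max_cores)
    if per_process < min_cores:
        raise ValueError("not enough cores to give each process min_cores")
    if mode == "unique_contiguous":
        # Process i takes a consecutive block starting at i * share.
        return [cores[i * share:i * share + per_process]
                for i in range(num_processes)]
    if mode == "unique_interleaved":
        # Process i takes every num_processes-th core, starting at core i.
        return [cores[i::num_processes][:per_process]
                for i in range(num_processes)]
    raise ValueError(f"mode not covered by this sketch: {mode}")


cores = list(range(8))
print(assign_cores(cores, 2))                             # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(assign_cores(cores, 2, mode="unique_interleaved"))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(assign_cores(cores, 2, max_cores=2))                # [[0, 1], [4, 5]]
```

The other modes (all, single, single_unique) and the scope and cores options are not modeled here.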