Using AutoConfigurator to Find the Optimal Configuration

AutoConfigurator searches for the hyperparameters (HPs) that achieve the highest training and inference throughput for Large Language Models (LLMs) with the NeMo Framework.

Note

The inference HP search is only available for GPT models.

AutoConfigurator Capabilities

AutoConfigurator is intended to iterate over different model configurations quickly and find the best configuration, i.e. the one that costs the least in time and money. To achieve this, AutoConfigurator provides several different capabilities, as shown in the table below.

Feature	GPT	T5	mT5	BERT
Model Size Recommendation	Yes	Yes	Yes	Yes
Base Configuration Generation	Yes	Yes	Yes	Yes
Training HP Search	Yes	Yes	Yes	Yes
Parallel Training HP Search	BCM Only	BCM Only	BCM Only	BCM Only
Inference HP Search	BCM Only	No	No	No
Parallel Inference HP Search	BCM Only	No	No	No
Slurm-Based Clusters	Yes	Yes	Yes	Yes
Base Command Platform-Based Clusters	Yes	Yes	Yes	Yes
Kubernetes Clusters	No	No	No	No

Model Size Recommendation

If you have not decided what model size you want to train, AutoConfigurator can recommend a model size for your use case. If you know the number of GPUs, TFLOPS per GPU, the maximum time to train, and number of tokens to train for, it can recommend a model size that can be trained with the specified hardware and time constraints.

For example, if you had 20 NVIDIA DGX nodes available (with 80 GB of memory per GPU) and wanted to train a GPT model for a maximum of 5 days, AutoConfigurator would recommend a 5B parameter GPT model.
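As a rough illustration of how such a recommendation can be derived, the sketch below uses the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token. This is an assumption made for illustration and is not necessarily the formula AutoConfigurator uses. The function name, the 8 GPUs per node, the 140 TFLOPS per GPU, and the 300B tokens are also assumed values; the last two match the sample configuration later in this document.

```python
SECONDS_PER_DAY = 24 * 3600


def estimate_model_size_in_b(num_nodes, gpus_per_node, tflops_per_gpu,
                             max_training_days, num_tokens_in_b,
                             flops_per_param_per_token=6.0):
    """Return the model size (in billions of parameters) that fits the compute budget.

    Assumes total training FLOPs ~= flops_per_param_per_token * params * tokens.
    """
    # Total FLOPs the cluster can deliver within the time budget.
    total_flops = (num_nodes * gpus_per_node * tflops_per_gpu * 1e12
                   * max_training_days * SECONDS_PER_DAY)
    # Solve the FLOPs approximation for the parameter count.
    params = total_flops / (flops_per_param_per_token * num_tokens_in_b * 1e9)
    return params / 1e9


size = estimate_model_size_in_b(num_nodes=20, gpus_per_node=8, tflops_per_gpu=140,
                                max_training_days=5, num_tokens_in_b=300)
print(f"~{size:.1f}B parameters")
```

Under these assumptions the estimate comes out at roughly 5.4B parameters, which is in line with the 5B recommendation in the example above.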

Base Configuration Generation

When you provide the model size, or AutoConfigurator has suggested one, it generates a base configuration for the target model. The base configuration is a valid configuration in YAML format, which can be trained using the NeMo Framework. However, the throughput optimization will happen at the next step, Training AutoConfigurator HP Search.

Training AutoConfigurator HP Search

After AutoConfigurator generates the base configuration, it searches over four critical hyperparameters that have a great impact on training throughput but do not affect model convergence: Tensor Parallelism (TP), Pipeline Parallelism (PP), Micro Batch Size (MBS), and Activation Checkpointing Layers (ActCkpt).

AutoConfigurator first uses heuristics to choose good candidates for those four parameters and generate a grid of candidate configurations. It saves all of the candidate configurations to the results directory. Each configuration includes a YAML file that specifies it.
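The grid-building step can be pictured roughly as follows. This is a minimal sketch, not AutoConfigurator's actual heuristics: the function name, the dictionary keys, and the divisibility filter (TP x PP must divide the total GPU count) are assumptions made for illustration.

```python
from itertools import product


def build_candidate_grid(tp_sizes, pp_sizes, micro_batch_sizes, act_ckpt_layers, num_gpus):
    """Enumerate (TP, PP, MBS, ActCkpt) combinations that fit evenly on num_gpus."""
    candidates = []
    for tp, pp, mbs, act in product(tp_sizes, pp_sizes, micro_batch_sizes, act_ckpt_layers):
        # Assumed filter: each model replica spans tp * pp GPUs, so that
        # product must divide the total GPU count evenly.
        if num_gpus % (tp * pp) != 0:
            continue
        candidates.append({"tp": tp, "pp": pp, "mbs": mbs, "act_ckpt_layers": act})
    return candidates


grid = build_candidate_grid([1, 2, 4, 8], [1, 2, 3], [1, 2], [0, 1], num_gpus=128)
print(len(grid), "candidate configurations")
```

In this toy example every combination with PP = 3 is dropped, because no multiple of 3 divides 128 GPUs evenly.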

Note

Some of the candidate configurations may not work, due to high memory usage or other issues. The next step eliminates most such configurations.

Once the candidate configurations are generated, AutoConfigurator uses heuristics to identify the most promising candidates. It then uses the NeMo Framework to launch those candidates in parallel. You can set the number of candidates to be launched with the limit_search_runs parameter.

The NeMo Framework trains each configuration on the target cluster for a maximum of max_minutes_per_run minutes or max_steps_per_run training steps, whichever is reached first. During this search, the jobs run with the number of nodes specified by the num_nodes parameter in the configuration files. Once all of the jobs have finished running, the final result is summarized in a CSV file.
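The stopping rule for each search run amounts to the check below. This is only a sketch of the logic described above; the function and its argument names other than max_steps_per_run and max_minutes_per_run are hypothetical.

```python
def should_stop(steps_done, minutes_elapsed, max_steps_per_run, max_minutes_per_run):
    """Return True as soon as either the step limit or the time limit is reached."""
    return steps_done >= max_steps_per_run or minutes_elapsed >= max_minutes_per_run


# Step limit reached before the time limit.
print(should_stop(50, 3.5, max_steps_per_run=50, max_minutes_per_run=10))
# Time limit reached before the step limit.
print(should_stop(12, 10.0, max_steps_per_run=50, max_minutes_per_run=10))
# Neither limit reached yet, so the run continues.
print(should_stop(12, 3.5, max_steps_per_run=50, max_minutes_per_run=10))
```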

Inference AutoConfigurator HP Search

AutoConfigurator can also search for the best HPs for inference. It empirically measures the throughput and latency of each configuration in the grid search space and returns a comprehensive table with all of the numbers. It searches over three critical HPs that have a great impact on inference throughput and latency: Tensor Parallelism (TP), Pipeline Parallelism (PP), and Batch Size (BS).

Technically, AutoConfigurator is also capable of searching over different input/output sequence lengths. NVIDIA does not recommend using multiple sequence lengths in a single search, though, because the runs that use the shortest sequence lengths would always achieve the highest throughput and lowest latency. NVIDIA recommends instead that you perform several inference searches, each with a different sequence length. When all of the jobs have finished running, AutoConfigurator generates a CSV file that summarizes the final result.

Usage

This section explains how to run each of the stages described above.

General Configuration

Slurm

Before you try to run a configuration, you must copy the following directories from the container to the local file system:

/opt/NeMo-Framework-Launcher/auto_configurator
/opt/NeMo-Framework-Launcher/launcher_scripts

Set the auto_configurator_path parameter in conf/config.yaml to the absolute pathname of the auto_configurator directory.

You must also set parameters that specify generic cluster-related information, such as partition and account. These parameters are in the configuration file conf/cluster/bcm.yaml.

The path specified by the auto_configurator_path parameter is automatically mounted to the container at the same path as in the local file system. Any additional directories that are mounted must be specified using the container_mounts parameter. If a parameter value contains a colon (':'), the code assumes that both source and destination paths are provided. Otherwise, each given path is mounted to the same path inside the container.
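The mount rule described above can be sketched as follows. The function name and the example paths are hypothetical, and the parsing is a simplified illustration of the described behavior rather than the launcher's actual code.

```python
def parse_container_mount(entry):
    """Return (source, destination) for one container_mounts entry."""
    if ":" in entry:
        # "src:dst" form: both paths are given explicitly.
        source, destination = entry.split(":", 1)
    else:
        # Single path: mounted at the same location inside the container.
        source = destination = entry
    return source, destination


print(parse_container_mount("/lustre/data:/data"))
print(parse_container_mount("/lustre/checkpoints"))
```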

launcher_scripts_path must point to the path where the launcher_scripts directory is located in the local file system. The location specified in the default configuration should be valid if /opt was extracted correctly.

The data_dir value must point to the path where the training dataset is located. Note that the datasets for GPT, T5, and mT5 are different, so modify this parameter accordingly. Follow the data preparation steps to learn how to download and preprocess the datasets for each model. The dataset in this path need not be the full-size dataset; a small representative sample is sufficient, since AutoConfigurator does not train the models to convergence.

You can modify the base_results_dir parameter to point to the location where the results will be stored.

Following is a list of all of the parameters in the conf/config.yaml file:

defaults:
  - _self_
  - cluster: bcm
  - search_config: gpt3/5b
  - override hydra/job_logging: stdout

run_training_hp_search: True
run_inference_hp_search: True

cluster_type: bcm  # bcm or bcp
auto_configurator_path: ???  # Path to the location of auto_configurator codebase.
launcher_scripts_path: ${auto_configurator_path}/../launcher_scripts
base_results_dir: ${auto_configurator_path}/results
data_dir: ${launcher_scripts_path}/data
training_container: nvcr.io/ea-bignlp/bignlp-training:23.03-py3
container_mounts:
    - null
wandb:  # Weights and Biases (W&B) logging.
  enable: False  # Whether to save logs to W&B.
  api_key_file: null # Path to the file where the W&B API key is stored. Key must be on the first line.
  project: nemo-megatron-autoconfig # Name of the W&B project to store the logs in. The name of the run will be populated automatically.

Base Command Platform

In Base Command Platform, the dataset, vocabulary, and merge files used for the training HP search must be available as a dataset and mounted accordingly. This guide assumes the dataset is mounted to /mount/data. AutoConfigurator stores its results in /mount/results/auto_configurator, so NVIDIA recommends that you mount a workspace to /mount/results.

The main configuration file is in conf/config.yaml. All of the parameters can be overridden from the command line, as the next section shows.

Predefined Configurations

NVIDIA provides predefined configurations that have been thoroughly tested, and the outputs produced by AutoConfigurator have been verified manually.

Running one of these configurations first generates a base configuration file for the specified model size, then launches the training and inference grid search jobs. When all of the jobs have finished, it produces a final recommendation for both training and inference, and shows the optimal hyperparameters for the given model.

The predefined configurations are stored in the directory conf/search_config. Each YAML file shows one model type (GPT, T5, or mT5) and one model size (up to 175B parameters for GPT and up to 42B parameters for T5 and mT5).

To run a configuration, you must modify the search_config parameter in conf/config.yaml. For example, to run a 5B GPT model, you would set this value to gpt3/5b. (Do not specify the .yaml extension.)

Model Configuration

To run the gpt3/5b configuration, you must set up conf/search_config/gpt3/5b.yaml.

train_settings:
  model_size_in_b: 5 # unit in billion parameters
  num_nodes: 16
  gpus_per_node: 8
  gpu_memory_gb: 80  # Memory per GPU, in GB. Currently 40GB and 80GB A100s supported.
  max_training_days: 5 # unit in days
  limit_search_runs: 100 # Max number of runs to be launched in parallel for grid search.
  output_top_n: 10  # The result will print the top N fastest training configs.
  max_steps_per_run: 50 # Max steps per run for the grid search.
  max_minutes_per_run: 10 # minutes per run for the grid search.
  tflops_per_gpu: 140  # Estimated tflops per GPU.
  num_tokens_in_b: 300  # Unit in billions, typically 300B for GPT3 models.
  vocab_size: 51200
  logs: ${base_results_dir}/${search_config_value}_${.gpu_memory_gb}gb  # Example base_results_dir/gpt3/126m
  tensor_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8]
  pipeline_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 10]
  min_model_parallel_size: auto  # auto to use our recommendation, or a value for the minimum desired parallelism
  max_model_parallel_size: auto  # auto to use our recommendation, or a value for the maximum desired parallelism
  micro_batch_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 16]
  act_ckpt_layers: auto  # auto to use our recommendation, or a list, such as [0, 1, 2, 3]

inference_settings:
  run:
    model_type: gpt3
    model_train_name: gpt3_5b
    gpus_per_node: 8
    data_type: "fp16" # fp32|fp16|bf16
    time_limit: 0:30:00
    results_dir: ${base_results_dir}/${search_config_value}_${search_config.train_settings.gpu_memory_gb}gb
    tensor_parallel_sizes: [1,2,4]
    pipeline_parallel_sizes: [1,2]
  benchmark:
    input_len: 60
    output_len: 20
    batch_sizes: [4,8,16,32,64,128,256]
    beam_width: 1
    topk: 4
    topp: 0.0

For the training configurations, the model_size_in_b parameter indicates how many billions of parameters the model is to contain. AutoConfigurator provides a configuration and HPs for a model of that size.

The num_nodes parameter indicates how many nodes AutoConfigurator is to use to run each training job. gpus_per_node indicates how many GPUs are available in each node. To adapt the heuristics to 40GB or 80GB A100 GPUs, set gpu_memory_gb to 40 or 80; AutoConfigurator then recommends candidate configurations that are suitable for that setting.

AutoConfigurator writes the value of the max_training_days parameter to the final YAML configuration files; it specifies how many days the model will be trained for when training to full convergence. This parameter is also used when model_size_in_b is set to null, in which case it constrains the recommended model size.

The limit_search_runs parameter can be used to limit the number of configurations that will be searched during the HP search stage. AutoConfigurator typically must search at least 30 configurations to find the optimal one, so limit_search_runs must be set to a value at least that great. If your cluster provides sufficient computing resources, though, NVIDIA recommends increasing this parameter to a value close to 100.

You can use the output_top_n parameter to specify how much detail the output summary file is to include. Its default value is 10, which outputs the top 10 configurations.

The max_steps_per_run parameter indicates how many steps to train each configuration for. The max_minutes_per_run parameter indicates how many minutes to run each configuration. NVIDIA recommends allowing at least 20 minutes per run for the smaller models, and over 60 minutes per run for the larger models. The training run is stopped when either max_steps_per_run or max_minutes_per_run is reached.

The tflops_per_gpu parameter provides an estimate of the TFLOPS each GPU can achieve when training large language models with the NeMo Framework. This value is only used to estimate how long the model will take to train to full convergence, so that you know the approximate training time before you begin training your model.

The num_tokens_in_b parameter indicates how many billions of tokens your model is to be trained for, when training to full convergence. It is used to estimate how much time the model will take to train to the desired number of tokens.

The vocab_size parameter must specify the vocabulary size to be used during training.

Set the logs parameter to specify where the result logs are to be saved. By default, AutoConfigurator creates this directory inside the directory specified by the base_results_dir parameter in conf/config.yaml.

If you leave the tensor_parallel_sizes, pipeline_parallel_sizes, min_model_parallel_size, max_model_parallel_size, micro_batch_sizes, and act_ckpt_layers parameters set to their default value of auto, AutoConfigurator selects appropriate values. However, you can change these parameters to override the heuristics that choose the grid search space and the maximum and minimum parallelism allowed for each model. For example, if you only want to search for configurations with Tensor Parallelism (TP) values of 1 or 2, you can set tensor_parallel_sizes: [1, 2]. (In this case you would leave the other configurations as auto.)

In the inference parameters:

Set gpus_per_node to specify the number of GPUs available in each node.
Set tensor_parallel_sizes to specify the TP values to include in the HP search.
Set pipeline_parallel_sizes to specify the PP values to include in the HP search.
Set batch_sizes to specify all of the possible batch sizes for the HP search.
You may set input_len to the sequence length of the input to be passed to the model.
You may set output_len to the output length to be produced by the model.
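To get a feel for the size of the inference search space, the sketch below enumerates every (TP, PP, BS) combination from the sample inference_settings shown earlier. It only counts combinations; whether AutoConfigurator prunes any of them is not covered here, and the variable name inference_grid is hypothetical.

```python
from itertools import product

# Values copied from the sample inference_settings section.
tensor_parallel_sizes = [1, 2, 4]
pipeline_parallel_sizes = [1, 2]
batch_sizes = [4, 8, 16, 32, 64, 128, 256]

# One entry per (TP, PP, BS) triple in the full grid.
inference_grid = list(product(tensor_parallel_sizes, pipeline_parallel_sizes, batch_sizes))
print(len(inference_grid), "combinations")
```

With three TP values, two PP values, and seven batch sizes, the full grid contains 42 combinations.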

Base Configuration Generation

Every time you call python3 main.py, AutoConfigurator generates a base configuration for the given model and saves it to the directory specified by the logs parameter in the corresponding search configuration file.

The base configuration consists of a YAML file that you can run using the NeMo Framework’s training container. This base configuration is not yet optimized to achieve the highest possible throughput; the optimization takes place in the next step, “Training AutoConfigurator HP Search.”

Training AutoConfigurator HP Search

Slurm

To run the training HP search pipeline, you must set the run_training_hp_search parameter to True in conf/config.yaml. Use the search_config parameter in conf/config.yaml to select the model for which to search for the best training HPs. For example, if search_config is set to gpt3/5b (its default value), AutoConfigurator searches for the optimal training HPs for a 5B parameter GPT model. The configuration for this model can be found in conf/search_config/gpt3/5b.yaml.

You can modify the train_settings parameters in the corresponding YAML file to configure the behavior of the HP search.

After you have set all of the parameters, call python3 main.py to run the training AutoConfigurator HP search.

Base Command Platform

To run AutoConfigurator in BCP, set the cluster_type parameter to bcp. You can configure all of the parameters through CLI overrides. For example, to launch a training HP search for the 126M GPT model, enter this command:

python3 /opt/NeMo-Framework-Launcher/auto_configurator/main.py search_config=gpt3/0.126b run_inference_hp_search=False auto_configurator_path=/opt/NeMo-Framework-Launcher/auto_configurator data_dir=/mount/data/the_pile_gpt3 base_results_dir=/mount/results/auto_configurator search_config.train_settings.num_nodes=$NGC_ARRAY_SIZE cluster_type=bcp

This command assumes that the dataset directory and the results directory are mounted correctly as a dataset and a workspace, respectively. You can also override any training setting by passing search_config.train_settings.<parameter>=<value> as a Hydra override. The values that can be overridden are shown below:

train_settings:
  model_size_in_b: 5 # unit in billion parameters
  num_nodes: 16
  gpus_per_node: 8
  gpu_memory_gb: 80  # Memory per GPU, in GB. Currently 40GB and 80GB A100s supported.
  max_training_days: 5 # unit in days
  limit_search_runs: 100 # Max number of runs to be launched in parallel for grid search.
  output_top_n: 10  # The result will print the top N fastest training configs.
  max_steps_per_run: 50 # Max steps per run for the grid search.
  max_minutes_per_run: 10 # minutes per run for the grid search.
  tflops_per_gpu: 140  # Estimated tflops per GPU.
  num_tokens_in_b: 300  # Unit in billions, typically 300B for GPT3 models.
  vocab_size: 51200
  logs: ${base_results_dir}/${search_config_value}_${.gpu_memory_gb}gb  # Example base_results_dir/gpt3/126m
  tensor_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8]
  pipeline_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 10]
  min_model_parallel_size: auto  # auto to use our recommendation, or a value for the minimum desired parallelism
  max_model_parallel_size: auto  # auto to use our recommendation, or a value for the maximum desired parallelism
  micro_batch_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 16]
  act_ckpt_layers: auto  # auto to use our recommendation, or a list, such as [0, 1, 2, 3]
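If you script BCP launches, the overrides can be assembled programmatically. The helper below is a hypothetical convenience, not part of AutoConfigurator; it only formats key=value strings under the search_config.train_settings prefix used in the command above.

```python
def train_setting_overrides(settings):
    """Format a dict of train_settings values as Hydra command-line overrides."""
    overrides = []
    for name, value in settings.items():
        if isinstance(value, (list, tuple)):
            # Hydra accepts list values in bracket form, e.g. key=[1,2].
            value = "[" + ",".join(str(v) for v in value) + "]"
        overrides.append(f"search_config.train_settings.{name}={value}")
    return overrides


args = train_setting_overrides({"num_nodes": 4, "tensor_parallel_sizes": [1, 2]})
print(" ".join(args))
```

When pasting bracketed values into a shell command, quote them so that the shell does not treat the brackets as a glob pattern.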

Inference AutoConfigurator HP Search

To run the inference HP search pipeline, you must set the run_inference_hp_search parameter to True in the conf/config.yaml file. Use the search_config parameter to select the model for which to search for the best inference HPs. For example, if this parameter is set to gpt3/5b (the default), AutoConfigurator searches for the optimal inference HPs for a 5B parameter GPT model. The configuration for this model is in conf/search_config/gpt3/5b.yaml.

To configure the behavior of the HP search, you can modify the inference_settings parameters in the corresponding YAML file.

Running Custom Model Size Configurations

AutoConfigurator can recommend a model size based on your hardware and training time constraints. For instance, if you want to train a GPT model but don’t know what model size is appropriate, you can enter the number of nodes and GPUs per node available in your cluster and the amount of time you want to spend training the model, and AutoConfigurator will recommend a model size that can be trained in that time with your hardware.

For an example of this, see the file conf/search_config/gpt3/unknown_size.yaml. The model_size_in_b parameter in this file is set to null, which tells AutoConfigurator to recommend a model size.

For the recommendation to work correctly, you must specify:

The number of available nodes (the num_nodes parameter)
The number of available GPUs per node (gpus_per_node)
How long to train the model (max_training_days)
Vocabulary size (vocab_size)
Number of billions of tokens to train the model for (num_tokens_in_b)
The estimated TFLOPS per GPU your hardware can achieve (tflops_per_gpu)
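A quick pre-flight check for these fields might look like the sketch below. The function is hypothetical; the field names come from the train_settings section shown earlier, and treating None as "not set" is an assumption.

```python
REQUIRED_FOR_RECOMMENDATION = (
    "num_nodes", "gpus_per_node", "max_training_days",
    "vocab_size", "num_tokens_in_b", "tflops_per_gpu",
)


def missing_recommendation_fields(train_settings):
    """List required fields that are absent or None when model_size_in_b is null."""
    if train_settings.get("model_size_in_b") is not None:
        return []  # Model size is given, so no recommendation is needed.
    return [name for name in REQUIRED_FOR_RECOMMENDATION
            if train_settings.get(name) is None]


settings = {"model_size_in_b": None, "num_nodes": 16, "gpus_per_node": 8,
            "max_training_days": 5, "vocab_size": 51200}
print(missing_recommendation_fields(settings))
```

In this example the check reports num_tokens_in_b and tflops_per_gpu as missing.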

Once all of these parameters are set correctly, set the search_config parameter in conf/config.yaml to gpt3/unknown_size to select this configuration. You can then run the pipeline by entering the command python3 main.py. This produces a base configuration for the suggested model size. If run_training_hp_search or run_inference_hp_search is set to True, AutoConfigurator also searches for the HPs for training or inference, using the rest of the configuration file as input, just as it does with a predefined configuration.

Interpreting the Results

After AutoConfigurator generates the base configuration for a model, it saves that base configuration in the directory specified in the logs parameter in the corresponding configuration file. By default, this is:

.../results/<model_name>/<model_size>_<gpu_mem_size>/

<model_name>, <model_size>, and <gpu_mem_size> are taken from the configuration values. The default value of search_config is gpt3/5b and the default value of gpu_memory_gb is 80, so by default AutoConfigurator stores the results in the directory .../results/gpt3/5b_80gb/. The base configuration is stored in that directory with the name base_cfg_<model_size>.yaml.
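The default location follows from the logs setting in the sample configuration, ${base_results_dir}/${search_config_value}_${.gpu_memory_gb}gb. The sketch below reproduces that pattern in plain Python; the function name and the example base directory are hypothetical.

```python
import posixpath


def default_logs_dir(base_results_dir, search_config, gpu_memory_gb):
    """Mirror the default logs pattern: <base>/<model_name>/<model_size>_<mem>gb."""
    return posixpath.join(base_results_dir, f"{search_config}_{gpu_memory_gb}gb")


print(default_logs_dir("/path/to/results", "gpt3/5b", 80))
```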

When you run the training HP search pipeline, it stores results in three subdirectories in the logs directory.

candidate_configs contains all of the YAML files with all of the configurations generated by the HP search.
training_logs contains all of the logs of training each of the individual configurations AutoConfigurator generated. If limit_search_runs was set to 30, then there are 30 subdirectories with the logs for each of the 30 models.
final_results contains:
- A log file that lists the output_top_n fastest configurations, sorted from fastest to slowest, created after all of the training runs have completed and the final run has analyzed the throughput of each configuration.
- A CSV file that contains all of the results from every configuration that AutoConfigurator ran for a given model size, sorted from fastest to slowest. This file contains information such as the samples per second achieved by each configuration, the time per global step, the TFLOPS per GPU achieved, and so on.
- A YAML file that corresponds to the configuration with the lowest training time. This is the recommended model for training.
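Once the search has finished, the CSV file can be inspected with a few lines of Python. The sketch below is illustrative only: the column names and the numbers are made-up placeholders, so check the header of your own CSV file before adapting it.

```python
import csv
import io

# Placeholder data standing in for the real CSV; columns and values are invented.
sample_csv = io.StringIO(
    "tp,pp,mbs,act_ckpt_layers,samples_per_second\n"
    "2,1,4,0,31.5\n"
    "1,1,2,1,27.0\n"
    "4,2,8,0,22.3\n"
)

rows = list(csv.DictReader(sample_csv))
# Re-sort defensively; higher samples per second means a faster configuration.
rows.sort(key=lambda row: float(row["samples_per_second"]), reverse=True)
best = rows[0]
print(best["tp"], best["pp"], best["mbs"], best["act_ckpt_layers"])
```

To read a real results file, replace the io.StringIO object with an open file handle and adjust the column names to match its header.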

For the inference HP search, AutoConfigurator stores the results in the directory specified in the results_dir parameter of the YAML configuration file. The results are stored in that directory at the relative pathname inference/final_summary/final_output.csv. This CSV file contains the results of every model that was run by the AutoConfigurator HP search.

Note

The result of the Training HP Search varies when it is run with different numbers of nodes. This is mainly caused by the new distributed optimizer, which provides higher memory savings when using more nodes (i.e. a higher data parallel value).

Logging Runs with Weights and Biases

You can use Weights and Biases (W&B) to log all of the training search runs by changing the values of the wandb parameters in the conf/config.yaml file.

Set enable to True.
Set api_key_file to the pathname of the file that contains the W&B API key. The API key must be in the first line of the file.
Set project to specify the name of the W&B project where the metrics are to be stored. You need not provide the name of each run; AutoConfigurator generates the names automatically, using the model name, model size, and hyperparameters specified for the run.

wandb:  # Weights and Biases (W&B) logging.
    enable: True
    api_key_file: null
    project: nemo-megatron-autoconfig
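Because the key must be on the first line of the key file, a quick sanity check of the file can help before you launch a search. The sketch below is a hypothetical helper, not part of AutoConfigurator.

```python
from pathlib import Path


def read_wandb_api_key(api_key_file):
    """Return the key from the first line of the file, stripped of whitespace."""
    lines = Path(api_key_file).read_text().splitlines()
    if not lines or not lines[0].strip():
        raise ValueError(f"{api_key_file}: the first line must contain the W&B API key")
    return lines[0].strip()
```

Calling read_wandb_api_key with the path you plan to use for api_key_file raises an error if the file is empty or its first line is blank.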