AutoConfigurator searches for the hyperparameters (HPs) that achieve the highest throughput for training and inference for Large Language Models (LLMs) using the NeMo framework.
The inference HP search is only available for GPT models.
AutoConfigurator is intended to iterate over different model configurations quickly and find the best configuration, i.e. the one that costs the least in time and money. To achieve this, AutoConfigurator provides several different capabilities, as shown in the table below.
| Feature | GPT | T5 | mT5 | BERT |
|---|---|---|---|---|
| Model Size Recommendation | Yes | Yes | Yes | Yes |
| Base Configuration Generation | Yes | Yes | Yes | Yes |
| Training HP Search | Yes | Yes | Yes | Yes |
| Parallel Training HP Search | BCM Only | BCM Only | BCM Only | BCM Only |
| Inference HP Search | BCM Only | No | No | No |
| Parallel Inference HP Search | BCM Only | No | No | No |
| Slurm-Based Clusters | Yes | Yes | Yes | Yes |
| Base Command Platform-Based Clusters | Yes | Yes | Yes | Yes |
| Kubernetes Clusters | No | No | No | No |
Model Size Recommendation
If you have not decided what model size you want to train, AutoConfigurator can recommend a model size for your use case. If you know the number of GPUs, TFLOPS per GPU, the maximum time to train, and number of tokens to train for, it can recommend a model size that can be trained with the specified hardware and time constraints.
For example, if you had 20 NVIDIA DGX nodes available (with 80 GB GPU memory), and wanted to train a GPT model for a maximum of 5 days, AutoConfigurator would recommend using a 5B parameter GPT model.
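The inputs that drive this recommendation map directly to fields in the training settings shown later in this document. A minimal sketch with illustrative values for the example above (setting model_size_in_b to null asks AutoConfigurator to pick the size):

train_settings:
  model_size_in_b: null  # null asks AutoConfigurator to recommend a size
  num_nodes: 20          # 20 DGX nodes, as in the example above
  gpus_per_node: 8
  gpu_memory_gb: 80
  max_training_days: 5
  tflops_per_gpu: 140    # estimated achievable TFLOPS per GPU
  num_tokens_in_b: 300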
Base Configuration Generation
When you provide the model size, or AutoConfigurator has suggested one, it generates a base configuration for the target model. The base configuration is a valid configuration in YAML format, which can be trained using the NeMo framework. However, the throughput optimization will happen at the next step, Training AutoConfigurator HP Search.
Training AutoConfigurator HP Search
After AutoConfigurator generates the base configuration it searches over four critical hyperparameters that have great impact on training throughput, but do not affect model convergence: Tensor Parallelism (TP), Pipeline Parallelism (PP), Micro Batch Size (MBS), and Activation Checkpointing Layers (ActCkpt).
AutoConfigurator first uses heuristics to choose good candidates for those four parameters and generate a grid of candidate configurations. It saves all of the candidate configurations to the results directory. Each configuration includes a YAML file that specifies it.
Some of the candidate configurations may not work, due to high memory usage or other issues. The next step eliminates most such configurations.
Once the candidate configurations are generated, AutoConfigurator uses
heuristics to identify the most promising candidates, then uses the NeMo
framework to launch them in parallel. You can set the number of
candidates to be launched with the limit_search_runs
parameter.
The NeMo framework trains each configuration
for a maximum of max_minutes_per_run
minutes and
max_steps_per_run
training steps, whichever is reached first on the
target cluster. During this search, the jobs will run with the number of
nodes specified in the configuration files, using the num_nodes
parameter. Once all of the jobs have finished running, the final result
will be summarized in a CSV file.
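As an illustration, these limits can also be set from the command line with Hydra overrides rather than by editing the YAML file (values here are hypothetical; the override pattern for search_config.train_settings.* is described later in this document):

python3 main.py search_config=gpt3/5b \
    search_config.train_settings.limit_search_runs=30 \
    search_config.train_settings.max_minutes_per_run=20 \
    search_config.train_settings.max_steps_per_run=50 \
    search_config.train_settings.num_nodes=16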
Inference AutoConfigurator HP Search
AutoConfigurator can also search for the best HPs for inference purposes. It empirically measures the throughput and latency for each given configuration in the grid search space, and returns a comprehensive table with all of the numbers. It searches over three different critical HPs that have great impact on the inference throughput and latency: Tensor Parallelism (TP), Pipeline Parallelism (PP), and Batch Size (BS).
Technically, AutoConfigurator is also capable of searching over different input/output sequence lengths. NVIDIA does not recommend using multiple sequence lengths in a search, though, because the model that used the shortest sequence lengths would always achieve the highest throughput and lowest latency. NVIDIA recommends instead that you perform several inference searches with different sequence lengths. When all of the jobs have finished running, it generates a CSV file that summarizes the final result.
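For example, comparing a short and a long sequence-length regime would then mean running two separate inference searches, changing only the benchmark lengths between them. A sketch, assuming the same Hydra override pattern documented later for train_settings also applies to inference_settings (lengths are illustrative):

# Search 1: short sequences
python3 main.py run_training_hp_search=False run_inference_hp_search=True \
    search_config.inference_settings.benchmark.input_len=60 \
    search_config.inference_settings.benchmark.output_len=20

# Search 2: longer sequences
python3 main.py run_training_hp_search=False run_inference_hp_search=True \
    search_config.inference_settings.benchmark.input_len=200 \
    search_config.inference_settings.benchmark.output_len=200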
This section explains how to run each of the stages described above.
General Configuration
Slurm
Before you try to run a configuration, you must copy the following directories from the container to the local file system:
/opt/NeMo-Megatron-Launcher/auto_configurator
/opt/NeMo-Megatron-Launcher/launcher_scripts
/opt/FasterTransformer
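A minimal sketch of that copy step, assuming you use Docker with the training container listed in conf/config.yaml (adapt the runtime, image tag, and destination path to your environment):

CONTAINER=nvcr.io/ea-bignlp/bignlp-training:23.03-py3
docker run --rm -v "$PWD":/host "$CONTAINER" bash -c \
    "cp -r /opt/NeMo-Megatron-Launcher/auto_configurator \
           /opt/NeMo-Megatron-Launcher/launcher_scripts \
           /opt/FasterTransformer /host/"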
Set the auto_configurator_path
parameter in conf/config.yaml
to the absolute pathname of the auto_configurator
directory.
You must also set parameters that specify generic cluster-related information, such as partition
and account
. These parameters are in the configuration file conf/cluster/bcm.yaml
.
The path specified by the auto_configurator_path
parameter is automatically mounted
to the container at the same path as in the local file system. Any
additional directories that are mounted must be specified using
the container_mounts
parameter. If a parameter value contains a colon
(':'), the code assumes that both source and destination
paths are provided. Otherwise, each given path is mounted to the
same path inside the container.
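For example, a hypothetical container_mounts entry in conf/config.yaml might look like this, with one path mounted to the same location inside the container and one mapped to a different destination:

container_mounts:
  - /lustre/datasets/my_corpus                # mounted to the same path inside the container
  - /lustre/results/autoconf:/workspace/out   # source:destination form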
launcher_scripts_path
and fastertransformer_path
must point
to the path where the launcher_scripts
and FasterTransformer
directories are located in the local file system. The locations
specified in the default configuration should be valid if the
directories under /opt were copied as described above.
The data_dir
value must point to the path
where the training dataset is located. Note that the datasets for GPT,
T5, and mT5 are different, so modify this parameter
accordingly. Follow the data preparation steps to learn how to download
and preprocess the datasets for each model. The dataset in this path
need not be the full-size dataset; a small representative
sample of the dataset is sufficient, since AutoConfigurator does not train
the models to convergence.
You can modify the base_results_dir
parameter
to point to the location where the results will be
stored.
Following is a list of all of the parameters in the conf/config.yaml
file:
defaults:
  - _self_
  - cluster: bcm
  - search_config: gpt3/5b
  - override hydra/job_logging: stdout
run_training_hp_search: True
run_inference_hp_search: True
cluster_type: bcm # bcm or bcp
auto_configurator_path: ??? # Path to the location of auto_configurator codebase.
launcher_scripts_path: ${auto_configurator_path}/../launcher_scripts
fastertransformer_path: ${auto_configurator_path}/../FasterTransformer
base_results_dir: ${auto_configurator_path}/results
data_dir: ${launcher_scripts_path}/data
training_container: nvcr.io/ea-bignlp/bignlp-training:23.03-py3
container_mounts:
  - null
wandb: # Weights and Biases (W&B) logging.
  enable: False # Whether to save logs to W&B.
  api_key_file: null # Path to the file where the W&B api key is stored. Key must be on the first line.
  project: nemo-megatron-autoconfig # Name of the W&B project to store the logs in. The name of the run will be populated automatically.
Base Command Platform
In Base Command Platform, the dataset, vocabulary, and merge files used
for the training HP search must be available as a dataset and mounted
accordingly. This guide assumes the dataset is mounted to
/mount/data
. The results of running the AutoConfigurator are
stored in /mount/results/auto_configurator
, so NVIDIA recommends that you mount
a workspace to /mount/results
.
The main configuration file is in conf/config.yaml
. All of the parameters can be overridden from the command line, as the next section shows.
Predefined Configurations
NVIDIA provides predefined configurations that have been thoroughly tested, and the outputs produced by AutoConfigurator have been verified manually.
Running one of these configurations first generates a base configuration file for the specified model size, then launches the training and inference grid search jobs. When all of the jobs have finished, it produces a final recommendation for both training and inference, and shows the optimal hyperparameters for the given model.
The predefined configurations are stored in the directory conf/search_config
.
Each YAML file corresponds to one model type (GPT, T5, or mT5) and
one model size (up to 175B parameters for GPT and up to 42B parameters
for T5 and mT5).
To run a configuration, you must modify the
search_config
parameter in conf/config.yaml
. For
example, to run a 5B GPT model, you would set this value to
gpt3/5b
. (Do not specify the .yaml
extension.)
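That corresponds to this entry in the defaults list of conf/config.yaml:

defaults:
  - search_config: gpt3/5b   # no .yaml extension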
Model Configuration
To run the gpt3/5b
configuration, you must set up
conf/search_config/gpt3/5b.yaml
.
train_settings:
  model_size_in_b: 5 # unit in billion parameters
  num_nodes: 16
  gpus_per_node: 8
  gpu_memory_gb: 80  # Memory per GPU, in GB. Currently 40GB and 80GB A100s supported.
  max_training_days: 5 # unit in days
  limit_search_runs: 100 # Max number of runs to be launched in parallel for grid search.
  output_top_n: 10  # The result will print the top N fastest training configs.
  max_steps_per_run: 50 # Max steps per run for the grid search.
  max_minutes_per_run: 10 # minutes per run for the grid search.
  tflops_per_gpu: 140  # Estimated tflops per GPU.
  num_tokens_in_b: 300  # Unit in billions, typically 300B for GPT3 models.
  vocab_size: 51200
  logs: ${base_results_dir}/${search_config_value}_${.gpu_memory_gb}gb # Example base_results_dir/gpt3/126m
  tensor_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8]
  pipeline_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 10]
  min_model_parallel_size: auto  # auto to use our recommendation, or a value for the minimum desired parallelism
  max_model_parallel_size: auto  # auto to use our recommendation, or a value for the maximum desired parallelism
  micro_batch_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 16]
  act_ckpt_layers: auto  # auto to use our recommendation, or a list, such as [0, 1, 2, 3]
inference_settings:
  run:
    model_type: gpt3
    model_train_name: gpt3_5b
    gpus_per_node: 8
    data_type: "fp16" # fp32|fp16|bf16
    time_limit: 0:30:00
    results_dir: ${base_results_dir}/${search_config_value}_${search_config.train_settings.gpu_memory_gb}gb
    tensor_parallel_sizes: [1,2,4]
    pipeline_parallel_sizes: [1,2]
  benchmark:
    input_len: 60
    output_len: 20
    batch_sizes: [4,8,16,32,64,128,256]
    beam_width: 1
    topk: 4
    topp: 0.0
For the training configurations, the model_size_in_b
parameter indicates
how many billions of parameters the model is to contain.
AutoConfigurator provides a configuration and HPs for a model of that size.
The num_nodes
parameter indicates how many nodes AutoConfigurator
is to use to run each training job. gpus_per_node
indicates how many GPUs are available in each node. To modify the
behavior of the heuristics depending on whether 40GB or 80GB A100 GPUs
are available, you can set gpu_memory_gb
to 40 or 80,
causing AutoConfigurator to recommend candidate configurations that are
suitable for that setting.
AutoConfigurator writes a setting for the max_training_days
parameter to the final YAML configuration files, which specifies how
many days this model will be trained for when training to full
convergence. This parameter is also used as an input to the model size
recommendation when model_size_in_b is set to null.
The limit_search_runs
parameter can be used to limit the number of
configurations that will be searched during the HP search stage. AutoConfigurator typically must search at least 30
configurations to find the optimal one, so limit_search_runs
must be set to a value at least that great. If your cluster provides sufficient computing resources, though, NVIDIA recommends increasing this parameter to a
value close to 100.
You can use the output_top_n
parameter to
specify how much detail the output summary file is to include. Its
default value is 10, which outputs the top 10 configurations.
The max_steps_per_run
parameter indicates how many steps to train each
configuration for. The max_minutes_per_run
parameter indicates how
many minutes to run each configuration. NVIDIA recommends allowing at
least 20 minutes per run for the smaller models, and
over 60 minutes per run for the larger models. The training run is stopped
when either max_steps_per_run
or max_minutes_per_run
is reached.
The tflops_per_gpu
parameter provides an estimate of the TFLOPs each
GPU can achieve when training large language models with the NeMo framework.
This value is only used to provide an estimate of how long the model
will take to train to full convergence, so you can know the approximate time to
train before you begin training your model.
The num_tokens_in_b
parameter indicates how many billions of tokens your model is to be trained for, when training to full convergence. It is used to
estimate how much time the model will take to train to the desired
number of tokens.
The vocab_size
parameter must specify the vocabulary
size to be used during training.
Set the logs
parameter
to specify where the result logs are to be saved. By default, AutoConfigurator creates this
directory inside the directory specified by the base_results_dir parameter in
conf/config.yaml
.
If you leave the tensor_parallel_sizes
,
pipeline_parallel_sizes
, min_model_parallel_size
,
max_model_parallel_size
, micro_batch_sizes
, and
act_ckpt_layers
parameters set to their default value of auto
,
AutoConfigurator selects appropriate values.
However, you can change these parameters
to override the heuristics
that choose the grid search space and the maximum and minimum
parallelism allowed for each model.
For example, if you only
want to search for configurations with Tensor Parallelism (TP) values of
1 or 2, you can set tensor_parallel_sizes: [1, 2]
. (In this case you would leave the
other configurations as auto
.)
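A sketch of that restriction in conf/search_config/gpt3/5b.yaml, showing only the lines that change:

train_settings:
  # ... other settings unchanged ...
  tensor_parallel_sizes: [1, 2]   # search only TP=1 and TP=2
  pipeline_parallel_sizes: auto
  micro_batch_sizes: auto
  act_ckpt_layers: auto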
In the inference parameters:

- Set gpus_per_node to specify the number of GPUs available in each node.
- Set tensor_parallel_sizes to specify the TP values for the HP search.
- Set pipeline_parallel_sizes to specify the PP values for the HP search.
- Set batch_sizes to specify all of the possible batch sizes for the HP search.
- You may set input_len to the sequence length of the input to be passed to the model.
- You may set output_len to the output length to be produced by the model.
Base Configuration Generation
Every time you call python3 main.py, AutoConfigurator
generates a base configuration for the given model and saves it to the
logs directory specified in the corresponding model configuration file.
The base configuration consists of a YAML file that you can run using the NeMo framework’s training container. This base configuration is not yet optimized to achieve the highest possible throughput; the optimization takes place in the next step, “Training AutoConfigurator HP Search.”
Training AutoConfigurator HP Search
Slurm
To run the training HP search pipeline, you must set the
run_training_hp_search
parameter to True
in
conf/config.yaml
. You must select the model to be used to search for the best training
HPs using the search_config
parameter in
conf/config.yaml
. For example, if search_config is set
to gpt3/5b
(its default value), AutoConfigurator searches for the optimal training
HPs for a 5B parameter GPT model. The configuration for this model can
be found in conf/search_config/gpt3/5b.yaml
.
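For a training-only search on a Slurm cluster, the relevant lines in conf/config.yaml would then look roughly like this (other entries unchanged):

defaults:
  - search_config: gpt3/5b       # model whose training HPs will be searched

run_training_hp_search: True
run_inference_hp_search: False
cluster_type: bcm                # Slurm-based (BCM) cluster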
You can modify the following parameters in the corresponding YAML file to configure the behavior of the HP search.
After you have set all of the parameters, call python3 main.py
to run the training AutoConfigurator HP search.
Base Command Platform
To run the HP Tool in BCP, set the cluster_type
configuration to
bcp
. You can configure all of the parameters through CLI overrides. For
example, to launch a training HP search for the 126M GPT model, enter
this command:
python3 /opt/NeMo-Megatron-Launcher/auto_configurator/main.py search_config=gpt3/0.126b run_inference_hp_search=False auto_configurator_path=/opt/NeMo-Megatron-Launcher/auto_configurator data_dir=/mount/data/the_pile_gpt3 base_results_dir=/mount/results/auto_configurator search_config.train_settings.num_nodes=$NGC_ARRAY_SIZE cluster_type=bcp
This command assumes that the dataset directory and the results
directory are mounted correctly as a dataset and a workspace. You can
also override any training configuration by overriding any parameter in
the search_config dictionary with the corresponding
search_config.train_settings.* parameter, using Hydra overrides. The
values that can be overridden are shown below:
train_settings:
  model_size_in_b: 5 # unit in billion parameters
  num_nodes: 16
  gpus_per_node: 8
  gpu_memory_gb: 80  # Memory per GPU, in GB. Currently 40GB and 80GB A100s supported.
  max_training_days: 5 # unit in days
  limit_search_runs: 100 # Max number of runs to be launched in parallel for grid search.
  output_top_n: 10  # The result will print the top N fastest training configs.
  max_steps_per_run: 50 # Max steps per run for the grid search.
  max_minutes_per_run: 10 # minutes per run for the grid search.
  tflops_per_gpu: 140  # Estimated tflops per GPU.
  num_tokens_in_b: 300  # Unit in billions, typically 300B for GPT3 models.
  vocab_size: 51200
  logs: ${base_results_dir}/${search_config_value}_${.gpu_memory_gb}gb # Example base_results_dir/gpt3/126m
  tensor_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8]
  pipeline_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 10]
  min_model_parallel_size: auto  # auto to use our recommendation, or a value for the minimum desired parallelism
  max_model_parallel_size: auto  # auto to use our recommendation, or a value for the maximum desired parallelism
  micro_batch_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 16]
  act_ckpt_layers: auto  # auto to use our recommendation, or a list, such as [0, 1, 2, 3]
Inference AutoConfigurator HP Search
To run the inference HP search pipeline, you must set the run_inference_hp_search
parameter to True
in the
conf/config.yaml
file. Select the model to be used to search for the best inference
HPs with the search_config
parameter. For example, if this parameter is
set to gpt3/5b
(the default), AutoConfigurator searches for the optimal
inference HPs for a 5B parameter GPT model. The configuration for this
model is in conf/search_config/gpt3/5b.yaml
.
To configure the behavior of the HP search, you can modify the following parameters in the corresponding YAML file.
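For example, an inference-only search for the default gpt3/5b configuration could be launched from the auto_configurator directory with Hydra overrides instead of editing the file (a sketch):

python3 main.py run_training_hp_search=False run_inference_hp_search=True search_config=gpt3/5b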
Running Custom Model Size Configurations
The HP Tool can recommend a model size based on your hardware and training time constraints. For instance, if you want to train a GPT model but don’t know what model size is appropriate, you can enter the number of nodes and GPUs per node available in your cluster and the amount of time you want to spend training the model, and AutoConfigurator will recommend a model size that can be trained in that time with your hardware.
For an example of this, see the file
conf/search_config/gpt3/unknown_size.yaml
. The
model_size_in_b
parameter in this file is set to null, which tells AutoConfigurator to
recommend a model size.
For the recommendation to work correctly, you must specify:

- The number of available nodes (the num_nodes parameter)
- The number of available GPUs per node (gpus_per_node)
- How long to train the model (max_training_days)
- The vocabulary size (vocab_size)
- The number of billions of tokens to train the model for (num_tokens_in_b)
- The estimated TFLOPS per GPU your hardware can achieve (tflops_per_gpu)
Once all of these parameters are set correctly, set the
search_config
parameter in conf/config.yaml
to gpt3/unknown_size
to specify the configuration to run. The
training pipeline can then be executed by entering the command python3 main.py. This
produces a base configuration for the suggested model size. If
run_training_hp_search
or run_inference_hp_search
is set to
True
, it also searches for the HPs for training or inference,
using the rest of the configuration file as input. AutoConfigurator behaves the same way as when it is using a predefined configuration.
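A hypothetical end-to-end invocation for this case, overriding the constraints from the command line instead of editing unknown_size.yaml (values are illustrative):

python3 main.py search_config=gpt3/unknown_size \
    run_training_hp_search=True run_inference_hp_search=False \
    search_config.train_settings.num_nodes=20 \
    search_config.train_settings.gpus_per_node=8 \
    search_config.train_settings.max_training_days=5 \
    search_config.train_settings.vocab_size=51200 \
    search_config.train_settings.num_tokens_in_b=300 \
    search_config.train_settings.tflops_per_gpu=140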
Interpreting the Results
After AutoConfigurator generates the base configuration for a model, it
saves that base configuration in the directory specified in the logs
parameter
in the corresponding configuration file. By default, this is:
.../results/<model_name>/<model_size>_<gpu_mem_size>/
<model_name>, <model_size>, and <gpu_mem_size> represent the values of
the corresponding configuration parameters. The default value of
search_config is gpt3/5b and the default value of gpu_memory_gb is 80,
so by default AutoConfigurator stores the results in the directory
.../results/gpt3/5b_80gb/
. The base configuration is stored in that directory with the name
base_cfg_<model_size>.yaml
.
When you run the training HP search pipeline, it stores results in three
subdirectories in the logs
directory.
- candidate_configs contains all of the YAML files with all of the configurations generated by the HP search.
- training_logs contains all of the logs of training each of the individual configurations AutoConfigurator generated. If limit_search_runs was set to 30, then there are 30 subdirectories with the logs for each of the 30 models.
- final_results contains:
  - A log file that lists the output_top_n fastest configurations, sorted from fastest to slowest, created after all of the training runs have completed and the final run has analyzed the throughput of each configuration.
  - A CSV file that contains all of the results from every configuration that AutoConfigurator ran for a given model size, sorted from fastest to slowest. This file contains information such as the samples per second achieved by each configuration, the time per global step, the TFLOPS per GPU achieved, and so on.
  - A YAML file that corresponds to the configuration with the lowest training time. This is the recommended model for training.
For the inference HP search, AutoConfigurator stores the results in the
directory specified in the results_dir
parameter of the YAML configuration
file. The results are stored in that directory at the relative pathname inference/final_summary/final_output.csv
. This CSV file contains the
results of every model that was run by the AutoConfigurator HP search.
The result of the Training HP Search varies when it is run with different numbers of nodes. This is mainly caused by the new distributed optimizer, which provides higher memory savings when using more nodes (i.e. a higher data parallel value).
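Putting the training and inference outputs together, the default gpt3/5b results tree would look roughly like this (a sketch assembled from the paths described above):

results/gpt3/5b_80gb/
    base_cfg_5b.yaml                             # base configuration for the model
    candidate_configs/                           # YAML files for all generated candidates
    training_logs/                               # one subdirectory per launched candidate
    final_results/                               # top-N log, CSV summary, recommended YAML
    inference/final_summary/final_output.csv     # inference HP search summary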
Logging Runs with Weights and Biases
You can use Weights and Biases (W&B) to log all of the training search
runs by changing the values of the wandb
parameters in the
conf/config.yaml
file.
- Set enable to True.
- Set api_key_file to the pathname of the file that contains the W&B API key. The API key must be on the first line of the file.
- Set project to specify the name of the W&B project where the metrics are to be stored. You need not provide the name of each run; AutoConfigurator generates the names automatically, using the model name, model size, and hyperparameters specified for the run.
wandb: # Weights and Biases (W&B) logging.
  enable: True
  api_key_file: null
  project: nemo-megatron-autoconfig