Hyperparameter Usage and Tuning
This section discusses recommended practices for choosing and tuning hyperparameters for BioNeMo models.
Configuration files and command-line arguments can be used to define hyperparameters. Sets of configuration parameters are based on YAML files and constructed using Hydra. Refer to the Command Line Configuration section for more information.
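To illustrate how Hydra-style dotted command-line overrides (for example, `trainer.precision=bf16`) map onto a nested YAML configuration, here is a minimal sketch. The override mechanics below are a simplified stand-in for what Hydra does, and the config keys are illustrative, not a complete BioNeMo config.

```python
# Simplified sketch of Hydra-style dotted overrides applied to a nested config.
# The keys below (trainer.precision, trainer.max_steps) mirror the examples in
# this section; the helper itself is illustrative, not Hydra's actual API.

def apply_override(config: dict, override: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = raw_value

config = {"trainer": {"precision": 32, "max_steps": 100000}}
apply_override(config, "trainer.precision=bf16")
print(config["trainer"]["precision"])  # bf16
```

Hydra additionally validates overrides against the config schema and parses typed values; this sketch only shows the dotted-key addressing.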
Hyperparameter Tuning Tips#
There is abundant general guidance on how to tune hyperparameters (refer to a comprehensive guide on the topic). Here we provide a few tips that are specific to BioNeMo-based models.
- Start small: tune hyperparameters initially with a reduced dataset size, model parameter count, number of epochs, and number of GPUs.
- Scale up the experiment size gradually with the best-performing hyperparameters, for example increasing model size from 10M to 100M, 1B, 5B, and 15B parameters.
- Use Weights & Biases to track experiment results. Group experiments by project per set of hyperparameters, use meaningful experiment names that include the parameters being varied, and stop experiments early if they perform poorly.
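The naming convention above can be sketched as a small helper that encodes the varied hyperparameters into a readable run name. The base name and parameter abbreviations here are hypothetical examples, not a BioNeMo convention.

```python
# Hedged sketch: build descriptive experiment names that encode the
# hyperparameters being varied, so runs are easy to compare in W&B.
# The base name "esm1nv" and the abbreviations lr/bs/wd are examples.

def run_name(base: str, **hparams) -> str:
    """Encode varied hyperparameters into a readable, sorted run name."""
    parts = [f"{key}{value}" for key, value in sorted(hparams.items())]
    return "_".join([base] + parts)

name = run_name("esm1nv", lr=2e-4, bs=256, wd=0.01)
print(name)  # esm1nv_bs256_lr0.0002_wd0.01
```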
Recommended Hyperparameter Search Method
Below, hyperparameters and recommendations for their adjustment are provided. The proposed values are generally applicable to large language models built upon NeMo Megatron, such as BioNeMo models.
Use `trainer.precision=bf16` if available; fall back to `trainer.precision=32` if training is unstable with bf16 or 16-bit precision.
Recommended alternative values: 0.5, 0.1. Reduce the value if training is unstable.
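Assuming this entry refers to gradient clipping (the parameter name is missing in the original, but the 0.5 and 0.1 values and the stability advice match a clip-by-global-norm setting), a minimal sketch of the mechanism being tuned:

```python
# Illustrative sketch of gradient clipping by global L2 norm -- the mechanism
# a clipping value such as 0.5 or 0.1 controls. Assumption: this section's
# unnamed parameter is the trainer's gradient-clipping value.
import math

def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Scale gradients down so their global L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=2.5)  # norm was 5.0
print(clipped)  # [1.5, 2.0]
```

Lowering `max_norm` shrinks the largest updates, which damps instability at the cost of slower progress on well-behaved batches.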
Optimizer and Weight Decay
Recommended alternative values:
Increase the weight decay value to mitigate over-fitting and stabilize training; values that are too large may degrade performance.
Recommended alternative values: 2e-4, 5e-5
Instability in the training or validation loss may indicate that the learning rate is too high; slow convergence and poor performance of the converged model may indicate that it is too low.
`model.micro_batch_size=N` (per-GPU batch size). Recommended value: the largest `N` that results in 85-90% GPU memory utilization. Set `model.global_batch_size=null` to compute the global batch size at run-time.
Further increase the effective global batch size by using gradient accumulation.
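The arithmetic behind the micro/global batch size relationship can be made explicit. In NeMo Megatron-style training, the effective global batch size is the per-GPU micro batch size times the number of gradient-accumulation steps times the number of data-parallel replicas; the function below is an illustrative sketch of that computation, not BioNeMo's internal code.

```python
# Sketch of how the effective global batch size is derived (assumption:
# standard NeMo Megatron-style accounting; helper name is illustrative).

def global_batch_size(micro_batch_size: int,
                      accumulate_grad_batches: int,
                      data_parallel_size: int) -> int:
    """Global batch = per-GPU batch x accumulation steps x data-parallel replicas."""
    return micro_batch_size * accumulate_grad_batches * data_parallel_size

# For example: 8 sequences per GPU, 4 accumulation steps, 16 data-parallel GPUs.
print(global_batch_size(8, 4, 16))  # 512
```

This is why `model.global_batch_size=null` can be resolved at run-time: the other three quantities are already known once the job launches.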
For large models (that is, more than ~1B parameters), use model tensor parallelism. For larger models (that is, more than ~5B parameters), also add model pipeline parallelism.
The various parallelism options are independent and can be combined as needed.
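How the parallelism options combine can be sketched as a divisibility check: GPUs not consumed by tensor and pipeline parallelism form the data-parallel replicas. This is a simplified illustration of the usual Megatron-style accounting, not BioNeMo's actual launcher code.

```python
# Hedged sketch: how tensor, pipeline, and data parallelism share a GPU pool
# (assumption: standard Megatron-style layout; function name is illustrative).

def data_parallel_size(world_size: int,
                       tensor_parallel: int,
                       pipeline_parallel: int) -> int:
    """GPUs left over after model parallelism become data-parallel replicas."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by "
                         "tensor_parallel * pipeline_parallel")
    return world_size // model_parallel

# 64 GPUs with 8-way tensor and 2-way pipeline parallelism
# leave 4 data-parallel replicas.
print(data_parallel_size(64, 8, 2))  # 4
```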
Increase the value to mitigate over-fitting; values that are too large may degrade performance.
General Training Tips
To mitigate spikes in the gradient norm:

- Reduce the gradient clipping value.
- Lower the learning rate, although this can also reduce performance and slow training.
- Increase the number of warmup steps.
- Replace layer normalization with a more stable variant (see the `model.transformer_block_type` options below).
- Increase the global batch size (for example, using gradient accumulation).
- Skip updates with a large gradient norm (for example, the top 0.005% of batches); this leads to a smoother loss.
For debugging, try `trainer.precision=32` to differentiate numerical-precision problems from problems with the data batches.
Ensure data has been deduplicated to reduce memorization. Filter out irrelevant or low-quality data (for example, invalid SMILES strings).
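The "skip updates with a large gradient norm" tip above can be sketched as a small tracker that compares each step's norm against the upper tail of recent history. The window size and percentile below are illustrative (the text suggests something like the top 0.005% of batches); this is a sketch of the idea, not a BioNeMo feature.

```python
# Hedged sketch: skip optimizer steps whose gradient norm lies in the extreme
# upper tail of recent history. Window/percentile values are illustrative.
from collections import deque

class GradNormSkipper:
    def __init__(self, window: int = 1000, percentile: float = 99.5):
        self.history = deque(maxlen=window)
        self.percentile = percentile

    def should_skip(self, grad_norm: float) -> bool:
        """Return True if this step's norm exceeds the recent-history cutoff."""
        self.history.append(grad_norm)
        if len(self.history) < 100:   # collect some history before skipping
            return False
        ranked = sorted(self.history)
        cutoff = ranked[int(len(ranked) * self.percentile / 100) - 1]
        return grad_norm > cutoff

skipper = GradNormSkipper(window=200, percentile=99.0)
for step in range(150):
    skipper.should_skip(1.0)          # typical, well-behaved norms
print(skipper.should_skip(50.0))      # a huge spike is flagged -> True
```

In a training loop, a flagged step would simply zero the gradients and move to the next batch instead of calling the optimizer.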
Pre-LN gives better performance but is less stable than post-LN; Normformer is the most stable. Configure with `model.transformer_block_type`, with options ['pre_ln', 'post_ln', 'normformer'].
The SwiGLU model activation provides better performance at a ~2% slowdown in training speed. Configure with the model's activation option.
Remove dropouts (set the dropout probabilities to 0); configure with the model's dropout options.
Remove the bias term from linear layers (increases stability and speed at almost no performance cost); configure with the model's bias option.
Use 1-2k warmup steps, for example as a fraction of `trainer.max_steps=100000`, or alternatively define the number of warmup steps directly in the scheduler configuration.
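The two ways of specifying warmup can be sketched as simple arithmetic: derive warmup steps from a ratio of the total steps, or give them directly; the learning rate then ramps linearly from 0 to its base value. The helper names and the 2% ratio below are illustrative, not BioNeMo option names.

```python
# Sketch of linear LR warmup (assumption: linear ramp, as is typical for
# NeMo Megatron-style schedulers; helper names are illustrative).

def warmup_steps(max_steps: int, warmup_ratio: float) -> int:
    """Derive the number of warmup steps as a fraction of total steps."""
    return int(max_steps * warmup_ratio)

def warmup_lr(step: int, base_lr: float, n_warmup: int) -> float:
    """Linearly ramp the learning rate from 0 to base_lr over n_warmup steps."""
    if step >= n_warmup:
        return base_lr
    return base_lr * step / n_warmup

n = warmup_steps(100000, 0.02)   # 2% of 100k steps -> 2000 warmup steps
print(n)                         # 2000
print(warmup_lr(1000, 2e-4, n))  # halfway through warmup -> 1e-4
```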
Batch-size ramp-up can be done via consecutive training runs with increasing batch size, where the previous model is used to initialize the weights of the next one. This can be done with the configuration `restore_from_path=<PATH TO .nemo FILE>`.
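The ramp-up procedure above can be sketched as a staged plan in which each stage doubles the global batch size and restores the previous stage's `.nemo` checkpoint. The doubling schedule and checkpoint file names below are placeholders for illustration.

```python
# Hedged sketch of a batch-size ramp-up plan: each stage doubles the global
# batch size and restores the previous stage's checkpoint via
# restore_from_path. Checkpoint names here are placeholders.

def rampup_stages(start: int, final: int) -> list[dict]:
    """Plan consecutive training stages with a doubling global batch size."""
    stages, batch = [], start
    prev_ckpt = None
    while batch <= final:
        stages.append({"global_batch_size": batch,
                       "restore_from_path": prev_ckpt})
        prev_ckpt = f"stage_gbs{batch}.nemo"   # placeholder checkpoint name
        batch *= 2
    return stages

plan = rampup_stages(start=64, final=256)
for stage in plan:
    print(stage)
# The first stage trains from scratch; each later stage restores the
# previous stage's weights.
```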