Hyperparameter Usage and Tuning#

This section discusses recommended practices for choosing and tuning hyperparameters for BioNeMo models.

General Information#

Hyperparameters are defined through configuration files and command-line arguments. Sets of configuration parameters are stored as YAML files and composed with Hydra, so any value can also be overridden from the command line. Refer to the Command Line Configuration section for more information.
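The snippet below is a minimal sketch (using OmegaConf, the library Hydra builds on) of how the dotted parameter names used throughout this section map onto the YAML configuration tree. The YAML content is illustrative only, not a complete BioNeMo configuration.

```python
# Minimal sketch: dotted overrides applied to a YAML configuration tree.
# The YAML below is illustrative, not an actual BioNeMo config file.
from omegaconf import OmegaConf

base = OmegaConf.create("""
trainer:
  precision: 16
  max_steps: 100000
model:
  activation: gelu
  hidden_dropout: 0.1
""")

# The same overrides can be passed on the command line, for example:
#   python <training script> model.activation=swiglu model.hidden_dropout=0.0
overrides = OmegaConf.from_dotlist(["model.activation=swiglu", "model.hidden_dropout=0.0"])
cfg = OmegaConf.merge(base, overrides)

print(cfg.model.activation)       # swiglu
print(cfg.model.hidden_dropout)   # 0.0
```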

Hyperparameter Tuning Tips#

There is extensive general guidance available on how to tune hyperparameters (refer to the comprehensive guide). Here we provide a few tips that are specific to BioNeMo-based models.

  1. Start small: tune hyperparameters with a reduced dataset size, model size, number of epochs, and number of GPUs.

  2. Scale up the experiments gradually using the best-performing hyperparameters, for example, increasing model size from 10M to 100M, 1B, 5B, and 15B parameters.

  3. Use Weights & Biases to track experiment results. Group experiments into one project per set of hyperparameters, and give each experiment a meaningful name that includes the parameters being varied (see the sketch after this list). Stop experiments early if they are performing poorly.
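The following is a minimal sketch of this tracking convention using the Weights & Biases Python API directly; the project, group, and run names are illustrative. In NeMo-based training these settings are typically supplied through the experiment manager configuration rather than by calling wandb directly.

```python
# Minimal sketch: one W&B project per hyperparameter search, runs grouped by
# sweep, and run names that encode the varied hyperparameters.
import wandb

hparams = {"lr": 1e-4, "global_batch_size": 256, "activation": "swiglu"}  # illustrative values

run = wandb.init(
    project="bionemo-hparam-search",   # one project per set of hyperparameters (illustrative name)
    group="lr-and-batch-size-sweep",   # group related runs so they can be compared together
    name="lr1e-4_gbs256_swiglu",       # meaningful name that encodes the varied parameters
    config=hparams,                    # logged alongside metrics for later filtering
)

# ... training loop calls wandb.log({"train_loss": loss, "grad_norm": norm}) ...

run.finish()
```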

General Training Tips#

Gradient Norm#

  • To mitigate spikes in gradient norm, reduce the gradient clipping value

  • Lower the learning rate, although this can also reduce performance and slow down training

  • Increase the number of warmup steps

  • Switch to NormFormer-style normalization with the configuration model.normalization=normformer

  • Increase global batch size (for example, using gradient accumulation)

  • Skip updates with a large gradient norm (for example, the top 0.005% of batches); this leads to a smoother loss curve. See the sketch after this list.

  • For debugging: try trainer.precision=32 to distinguish problems in the numerical calculation from problems with a particular data batch
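As a concrete illustration of the "skip updates with large norm" mitigation above, the sketch below implements it as a plain PyTorch training-step helper outside the NeMo trainer; the clipping value and skip threshold are illustrative and would normally be tuned per model.

```python
# Minimal sketch: clip gradients as usual, but skip the optimizer step entirely
# when the pre-clip gradient norm exceeds a threshold. Values are illustrative.
import torch

def step_with_norm_skip(model, loss, optimizer, clip_val=1.0, skip_threshold=10.0):
    optimizer.zero_grad()
    loss.backward()
    # clip_grad_norm_ clips in place and returns the total norm computed before clipping
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)
    if total_norm > skip_threshold:
        # Drop this update; skipping rare outlier batches gives a smoother loss curve
        optimizer.zero_grad()
        return float(total_norm), False
    optimizer.step()
    return float(total_norm), True
```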

Data Cleaning#

  • Ensure data has been deduplicated to reduce memorization

  • Filter out irrelevant or low-quality data (for example, invalid SMILES strings), as shown in the sketch below
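The sketch below shows one way to apply both data-cleaning steps to a list of SMILES strings, using RDKit (an assumption; any cheminformatics toolkit with SMILES parsing would do). Invalid strings are dropped, and deduplication is performed on the canonical SMILES so that different spellings of the same molecule are caught.

```python
# Minimal sketch: drop invalid SMILES and deduplicate on the canonical form.
# Requires RDKit.
from rdkit import Chem

def clean_smiles(smiles_list):
    seen, cleaned = set(), []
    for s in smiles_list:
        mol = Chem.MolFromSmiles(s)        # returns None for invalid SMILES
        if mol is None:
            continue
        canonical = Chem.MolToSmiles(mol)  # canonical form catches duplicate molecules
        if canonical not in seen:
            seen.add(canonical)
            cleaned.append(canonical)
    return cleaned

# "OCC" is the same molecule as "CCO" (ethanol), so only one copy survives
print(clean_smiles(["CCO", "OCC", "not_a_smiles"]))  # ['CCO']
```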

Model Architecture#

  • Pre-norm (pre_ln) gives better performance but is less stable than post-norm (post_ln); NormFormer is the most stable. Configure with model.transformer_block_type, which accepts ['pre_ln', 'post_ln', 'normformer']. A sketch contrasting the pre-LN and post-LN block layouts follows this list.

  • The SwiGLU activation provides better performance at a ~2% slowdown in training speed. Configure with model.activation=swiglu.

  • Remove dropout; configure with model.hidden_dropout=0.0 and model.attention_dropout=0.0.

  • Remove the bias term from linear layers (increases stability and speed at almost no performance cost). Configure with model.bias=false.
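To make the pre-norm vs. post-norm distinction concrete, the sketch below contrasts where layer normalization sits in the two block layouts. This is a generic PyTorch illustration, not the BioNeMo/Megatron implementation, and it omits NormFormer's additional normalization layers.

```python
# Minimal sketch: post-LN normalizes after each residual addition; pre-LN
# normalizes before each sub-layer and leaves the residual path unnormalized.
import torch.nn as nn

class PostLNBlock(nn.Module):
    """'post_ln': LayerNorm applied after each residual connection."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.ln2(x + self.mlp(x))

class PreLNBlock(nn.Module):
    """'pre_ln': LayerNorm applied before each sub-layer."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))
```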

Optimization#

  • Use 1-2k warmup steps. For example, set model.optim.sched.warmup_ratio=0.01 with trainer.max_steps=100000, or define the number of warmup steps directly with model.optim.sched.warmup_steps=1000.

  • Batch size ramp-up can be done through consecutive training runs with increasing batch size, where each run initializes its weights from the previous run's model. Restore the previous checkpoint with the configuration restore_from_path=<PATH TO .nemo FILE>; a sketch of such a schedule follows.
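The batch-size ramp-up can be sketched as a small driver script that launches consecutive training runs, each restoring from the previous run's checkpoint. The entry-point name (pretrain.py), the checkpoint paths, and the model.global_batch_size key below are placeholders and assumptions for illustration; only restore_from_path is taken from the text above.

```python
# Minimal sketch: consecutive runs with increasing batch size, each initialized
# from the previous run's .nemo checkpoint. Script name, paths, and the
# global-batch-size key are placeholders.
import subprocess

stages = [
    {"global_batch_size": 256,  "restore_from": None},
    {"global_batch_size": 512,  "restore_from": "results/stage_0/checkpoints/model.nemo"},
    {"global_batch_size": 1024, "restore_from": "results/stage_1/checkpoints/model.nemo"},
]

for stage in stages:
    overrides = [f"model.global_batch_size={stage['global_batch_size']}"]
    if stage["restore_from"] is not None:
        # Initialize weights from the previous stage's checkpoint
        overrides.append(f"restore_from_path={stage['restore_from']}")
    subprocess.run(["python", "pretrain.py", *overrides], check=True)
```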