Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Training with Predefined Configurations

Currently, NVIDIA offers one configuration for a 300M-parameter RETRO model. This configuration includes carefully selected hyperparameters, which you can use as a guideline for any custom model configurations.

The training configurations are defined by YAML files in the conf/training/retro directory. You can choose a configuration by setting the value of the training variable in conf/config.yaml to the corresponding training YAML file. For example, to train the 300M-parameter RETRO model, set the training variable to retro/300m, which points to the conf/training/retro/300m.yaml file.

300M Model Configuration

The 300M model uses the bf16 data type. It consists of 12 transformer layers for the decoder and 2 transformer layers for the encoder. The model has a hidden size of 768, a feedforward network size of 3072, and 12 attention heads with a GeLU activation function. The sample sequence length is 2048, the chunk length is 64, and the optimizer used is Distributed Adam. This model does not employ any model parallelism. For more parameter details, refer to the configuration file retro/300m.yaml.
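
For reference, the hyperparameters above map onto the training configuration roughly as follows. This is a minimal, illustrative sketch only; the exact key names and structure in conf/training/retro/300m.yaml may differ, so treat the field names below as assumptions and consult the actual file.

model:
  precision: bf16                  # training data type
  num_layers: 12                   # decoder transformer layers
  enc_num_layers: 2                # encoder transformer layers
  hidden_size: 768
  ffn_hidden_size: 3072
  num_attention_heads: 12
  activation: gelu
  seq_length: 2048                 # sample sequence length
  chunk_size: 64                   # RETRO chunk length
  tensor_model_parallel_size: 1    # no model parallelism
  pipeline_model_parallel_size: 1
  optim:
    name: distributed_fused_adam   # Distributed Adam optimizer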

The following section provides instructions for running training on a Slurm cluster.

Slurm

To train the 300M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: retro/300m
stages:
  - training
launcher_scripts_path: /path/to/launcher_scripts
data_dir: /path/to/retro_workdir
base_results_dir: /path/to/results

Here, /path/to/retro_workdir points to the directory containing the preprocessed pretraining data, and /path/to/results points to the directory where checkpoints are saved.
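
Depending on your Slurm setup, you may also need to provide scheduler details such as the account and partition in the launcher's cluster configuration. The snippet below is only an illustrative sketch; the file name (e.g., conf/cluster/bcm.yaml) and key names are assumptions and may differ in your launcher version.

# conf/cluster/bcm.yaml (illustrative; verify against your launcher)
partition: your_partition    # Slurm partition to submit to
account: your_account        # Slurm account for billing
exclusive: True              # request exclusive node access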

Then, run the following command. All the configurations will be read from conf/config.yaml and conf/training/retro/300m.yaml:

python3 main.py
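
If the launcher's main.py is Hydra-based, as the conf/config.yaml layout suggests, the same settings can also be passed as command-line overrides instead of editing the file. The override form below is an assumption and may need adjustment for your launcher version:

python3 main.py training=retro/300m stages='[training]' \
    launcher_scripts_path=/path/to/launcher_scripts \
    data_dir=/path/to/retro_workdir \
    base_results_dir=/path/to/results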