Training with Predefined Configurations

Currently, NVIDIA offers one configuration for a RETRO model size of 300M. This configuration includes carefully selected hyperparameters, which you can use as a guideline for any custom model configurations.

The training configurations are defined by YAML files in the conf/training/retro directory. You can choose a configuration by setting the value of the training variable in conf/config.yaml to the corresponding training YAML file. For example, to train the 300M-parameter RETRO model, set the training variable to retro/300m, which points to the conf/training/retro/300m.yaml file.
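For example, the relevant line in conf/config.yaml would read as follows (the full set of related values is shown in the Slurm section below):

training: retro/300m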

The 300M model uses the bf16 data type. It consists of 12 transformer layers for the decoder and 2 transformer layers for the encoder. The model has a hidden size of 768, a feedforward network size of 3072, and 12 attention heads with a GeLU activation function. The sample sequence length is 2048, the chunk length is 64, and the optimizer used is Distributed Adam. This model does not employ any model parallelism. For more parameter details, refer to the configuration file retro/300m.yaml.
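As an illustrative sketch, these hyperparameters correspond to entries along the following lines in the configuration file. The key names below are approximations; conf/training/retro/300m.yaml remains the authoritative reference.

# Illustrative excerpt only; key names are approximate and the
# authoritative values live in conf/training/retro/300m.yaml.
trainer:
  precision: bf16                  # bf16 data type
model:
  dec_num_layers: 12               # decoder transformer layers
  enc_num_layers: 2                # encoder transformer layers
  hidden_size: 768
  ffn_hidden_size: 3072            # feedforward network size
  num_attention_heads: 12
  activation: gelu
  seq_length: 2048                 # sample sequence length
  chunk_size: 64                   # RETRO chunk length
  tensor_model_parallel_size: 1    # no model parallelism
  pipeline_model_parallel_size: 1
  optim:
    name: distributed_fused_adam   # Distributed Adam optimizer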

The following section provides instructions for running training on a Slurm cluster.

Slurm

To train the 300M model on a Slurm cluster, modify the conf/config.yaml file to set:


training: retro/300m
stages:
  - training
launcher_scripts_path: /path/to/launcher_scripts
data_dir: /path/to/retro_workdir
base_results_dir: /path/to/results

Here, /path/to/retro_workdir points to the directory containing the preprocessed pretraining data, and /path/to/results points to the directory where checkpoints are saved.

Then, run the following command. All configuration values are read from conf/config.yaml and conf/training/retro/300m.yaml:


python3 main.py
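Because the launcher composes its configuration with Hydra, the same values can, as a sketch (assuming standard Hydra command-line override syntax, with placeholder paths), be overridden directly on the command line instead of editing conf/config.yaml:

python3 main.py training=retro/300m 'stages=[training]' \
    launcher_scripts_path=/path/to/launcher_scripts \
    data_dir=/path/to/retro_workdir \
    base_results_dir=/path/to/results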
