NVIDIA provides three configurations with suggested hyperparameters, tuned specifically for the NVIDIA DGX SuperPOD, each node of which is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the `conf/training/neva` directory. You can access and modify these parameters to adjust the hyperparameters for your specific training runs (see the override sketch after the table below). By customizing these settings, you can tailor the model's performance and training efficiency to your needs and requirements.
| Language Model | Vision Encoder | Multimodal Connector Type | Tensor Model Parallel Size | Pipeline Model Parallel Size | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2-7B-Chat (frozen) | CLIP-L-336px (frozen) | MLP Layers (trainable) | 4 | 1 | 32 | 256 | BF16 | O2 | 550K |
| LLaMA-2-13B-Chat (frozen) | CLIP-L-336px (frozen) | MLP Layers (trainable) | 8 | 1 | 32 | 256 | BF16 | O2 | 550K |
| LLaMA-2-70B-Chat (frozen) | CLIP-L-336px (frozen) | MLP Layers (trainable) | 8 | 1 | 8 | 256 | BF16 | O2 | 550K |
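To deviate from the suggested hyperparameters, you can either edit the corresponding file under `conf/training/neva` or pass overrides to the launcher command (`python3 main.py`, described in the steps below). The sketch below assumes Hydra-style overrides and the usual field names (`micro_batch_size`, `global_batch_size`, `tensor_model_parallel_size`, `pipeline_model_parallel_size`); verify them against your configuration file before use:

```bash
# Sketch only: adjust a few hyperparameters of the 7B configuration at launch time.
# Field names are assumptions based on the standard training config layout and may
# differ between launcher releases.
python3 main.py \
    training=neva/llama2_7b_chat \
    training.model.micro_batch_size=16 \
    training.model.global_batch_size=256 \
    training.model.tensor_model_parallel_size=4 \
    training.model.pipeline_model_parallel_size=1
```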
To enable the training stage using a NeVA model, follow these configuration steps:
1. Navigate to the `defaults` section in `conf/config.yaml`. Update the `training` field to point to the desired NeVA configuration file. For instance, if you wish to use the LLaMA-2-7B-Chat (i.e., `llama2_7b_chat`) configuration, change the `training` field to `neva/llama2_7b_chat`.

   ```yaml
   defaults:
     - _self_
     - cluster: bcm
     - data_preparation: null
     - training: neva/llama2_7b_chat
     ...
   ```
2. Within the `stages` field of `conf/config.yaml`, ensure the training stage is listed.

   ```yaml
   stages:
     - training
     ...
   ```
3. Execute the launcher pipeline:

   ```bash
   python3 main.py
   ```
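If you prefer not to edit `conf/config.yaml`, the same selections can typically be passed as command-line overrides. The syntax below is a sketch that assumes the launcher accepts Hydra-style overrides, mirroring the `training.model.*` overrides used later in this section:

```bash
# Sketch only: select the NeVA config and run just the training stage
# via Hydra-style overrides instead of editing conf/config.yaml.
python3 main.py \
    training=neva/llama2_7b_chat \
    stages=[training]
```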
Remarks:
Prior to initiating your training, ensure that all necessary datasets and checkpoints have been prepared.

Before starting the training, set the correct dataset and checkpoint paths in `neva/llama2_{model_size}_chat.yaml`.
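As an illustration only, such paths can also be supplied as overrides at launch time. The key names below (`data_path`, `image_folder`, `from_pretrained`) are assumptions based on common NeVA configurations, and all paths are placeholders:

```bash
# Sketch: point the run at your dataset and checkpoints via overrides.
# Field names are assumed; check them against neva/llama2_{model_size}_chat.yaml.
python3 main.py \
    training=neva/llama2_7b_chat \
    training.model.data.data_path=/path/to/train_data.json \
    training.model.data.image_folder=/path/to/images \
    training.model.mm_cfg.llm.from_pretrained=/path/to/llama-2-7b-chat.nemo \
    training.model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14-336
```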
If you are training with Vicuna v1.5 language model checkpoints, you can use the same model-size configuration as for Llama-2 Chat, since the two are structurally identical. For instance, when using the Vicuna v1.5 7B model, simply select the `llama2_7b_chat` configuration. You only need to set the following: `training.model.mm_cfg.llm.model_type=v1`.
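For example, a Vicuna v1.5 7B run could reuse the 7B configuration roughly as follows; the checkpoint path is a placeholder and `mm_cfg.llm.from_pretrained` is an assumed field name:

```bash
# Sketch: reuse the llama2_7b_chat configuration with Vicuna v1.5 7B checkpoints.
python3 main.py \
    training=neva/llama2_7b_chat \
    training.model.mm_cfg.llm.from_pretrained=/path/to/vicuna-7b-v1.5.nemo \
    training.model.mm_cfg.llm.model_type=v1
```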
For sequence packing, refer to the documentation at NeVA Sequence Packing.
When employing pipeline parallelism, a vision encoder loaded from Hugging Face is duplicated in full on every GPU whose pipeline-parallel rank is 0, whereas a vision encoder loaded from a `.nemo` checkpoint is sharded across the tensor-parallel ranks of pipeline-parallel rank 0, as illustrated below:
```
Loading ViT from HF
  DP0
    PP rank 0
      TP rank 0 (if HF, ViT)
      TP rank 1 (if HF, ViT)
    PP rank 1
      TP rank 0
      TP rank 1
  DP1
    PP rank 0
      TP rank 0 (if HF, ViT)
      TP rank 1 (if HF, ViT)
    PP rank 1
      TP rank 0
      TP rank 1

Loading ViT from .nemo
  DP0
    PP rank 0
      TP rank 0 (if NeMo, ViT TP rank 0)
      TP rank 1 (if NeMo, ViT TP rank 1)
    PP rank 1
      TP rank 0
      TP rank 1
  DP1
    PP rank 0
      TP rank 0 (if NeMo, ViT TP rank 0)
      TP rank 1 (if NeMo, ViT TP rank 1)
    PP rank 1
      TP rank 0
      TP rank 1
```
Recommended FP8 recipe:
```bash
training.model.fp8=True \
training.model.fp8_e4m3=False \
training.model.fp8_hybrid=True \
training.model.fp8_margin=0 \
training.model.fp8_interval=1 \
training.model.fp8_amax_history_len=1024 \
training.model.fp8_amax_compute_algo=max
```
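These flags select the hybrid FP8 format (typically E4M3 for forward-pass tensors and E5M2 for gradients) with delayed scaling, where the scaling factor is derived from the maximum amax observed over a 1024-step history. They are appended to the launcher command in the same way as the other `training.model.*` overrides sketched above.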