NVIDIA provides three configurations with suggested hyperparameters, tuned specifically for the NVIDIA DGX SuperPOD, each node of which is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the `conf/training/neva` directory. You can access and modify these parameters to adjust the hyperparameters for your specific training runs (see the override sketch after the table below). By customizing these settings, you can tailor the model's performance and training efficiency to your needs and requirements.
| Language Model | Vision Encoder | Multimodal Connector Type | Tensor Model Parallel Size | Pipeline Model Parallel Size | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2-7B-Chat (frozen) | CLIP-L-336px (frozen) | MLP Layers (trainable) | 4 | 1 | 32 | 256 | BF16 | O2 | 550K |
| LLaMA-2-13B-Chat (frozen) | CLIP-L-336px (frozen) | MLP Layers (trainable) | 8 | 1 | 32 | 256 | BF16 | O2 | 550K |
| LLaMA-2-70B-Chat (frozen) | CLIP-L-336px (frozen) | MLP Layers (trainable) | 8 | 1 | 8 | 256 | BF16 | O2 | 550K |
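To deviate from the suggested hyperparameters, you can either edit the corresponding file under `conf/training/neva` or pass overrides to the launcher command (`python3 main.py`, described in the steps below). The sketch below assumes Hydra-style overrides and the usual field names (`micro_batch_size`, `global_batch_size`, `tensor_model_parallel_size`, `pipeline_model_parallel_size`); verify them against your configuration file before use:

```bash
# Sketch only: adjust a few hyperparameters of the 7B configuration at launch time.
# Field names are assumptions based on the standard training config layout and may
# differ between launcher releases.
python3 main.py \
    training=neva/llama2_7b_chat \
    training.model.micro_batch_size=16 \
    training.model.global_batch_size=256 \
    training.model.tensor_model_parallel_size=4 \
    training.model.pipeline_model_parallel_size=1
```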
To enable the training stage using a NeVA model, follow these configuration steps:
1. Navigate to the `defaults` section in `conf/config.yaml`. Update the `training` field to point to the desired NeVA configuration file. For instance, if you wish to use the LLaMA-2-7B-Chat (i.e., `llama2_7b_chat`) configuration, change the `training` field to `neva/llama2_7b_chat`.

   ```yaml
   defaults:
     - _self_
     - cluster: bcm
     - data_preparation: null
     - training: neva/llama2_7b_chat
     ...
   ```
2. Within the `stages` field of `conf/config.yaml`, ensure the training stage is listed.

   ```yaml
   stages:
     - training
     ...
   ```
3. Execute the launcher pipeline:

   ```bash
   python3 main.py
   ```
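If you prefer not to edit `conf/config.yaml`, the same selections can typically be passed as command-line overrides. The syntax below is a sketch that assumes the launcher accepts Hydra-style overrides, mirroring the `training.model.*` overrides used later in this section:

```bash
# Sketch only: select the NeVA config and run just the training stage
# via Hydra-style overrides instead of editing conf/config.yaml.
python3 main.py \
    training=neva/llama2_7b_chat \
    stages=[training]
```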
Remarks:
Prior to initiating your training, ensure that all necessary datasets and checkpoints have been prepared.

Before starting the training, set the correct dataset and checkpoint paths in `neva/llama2_{model_size}_chat.yaml`.
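As an illustration only, such paths can also be supplied as overrides at launch time. The key names below (`data_path`, `image_folder`, `from_pretrained`) are assumptions based on common NeVA configurations, and all paths are placeholders:

```bash
# Sketch: point the run at your dataset and checkpoints via overrides.
# Field names are assumed; check them against neva/llama2_{model_size}_chat.yaml.
python3 main.py \
    training=neva/llama2_7b_chat \
    training.model.data.data_path=/path/to/train_data.json \
    training.model.data.image_folder=/path/to/images \
    training.model.mm_cfg.llm.from_pretrained=/path/to/llama-2-7b-chat.nemo \
    training.model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14-336
```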
If you are training with Vicuna v1.5 language model checkpoints, you can use the same model-size configuration as for Llama-2 Chat, since the two are structurally identical. For instance, when using the Vicuna v1.5 7B model, simply select the `llama2_7b_chat` configuration. You only need to set the following: `training.model.mm_cfg.llm.model_type=v1`.
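For example, a Vicuna v1.5 7B run could reuse the 7B configuration roughly as follows; the checkpoint path is a placeholder and `mm_cfg.llm.from_pretrained` is an assumed field name:

```bash
# Sketch: reuse the llama2_7b_chat configuration with Vicuna v1.5 7B checkpoints.
python3 main.py \
    training=neva/llama2_7b_chat \
    training.model.mm_cfg.llm.from_pretrained=/path/to/vicuna-7b-v1.5.nemo \
    training.model.mm_cfg.llm.model_type=v1
```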
For sequence packing, refer to the documentation at NeVA Sequence Packing.
When employing pipeline parallelism, a vision encoder loaded from Hugging Face is duplicated in full on every GPU whose pipeline-parallel rank is 0, whereas a vision encoder loaded from a `.nemo` checkpoint is sharded across the tensor-parallel ranks of pipeline-parallel rank 0, as illustrated below:
```
Loading ViT from HF
  DP0
    PP rank 0
      TP rank 0 (if HF, ViT)
      TP rank 1 (if HF, ViT)
    PP rank 1
      TP rank 0
      TP rank 1
  DP1
    PP rank 0
      TP rank 0 (if HF, ViT)
      TP rank 1 (if HF, ViT)
    PP rank 1
      TP rank 0
      TP rank 1

Loading ViT from .nemo
  DP0
    PP rank 0
      TP rank 0 (if NeMo, ViT TP rank 0)
      TP rank 1 (if NeMo, ViT TP rank 1)
    PP rank 1
      TP rank 0
      TP rank 1
  DP1
    PP rank 0
      TP rank 0 (if NeMo, ViT TP rank 0)
      TP rank 1 (if NeMo, ViT TP rank 1)
    PP rank 1
      TP rank 0
      TP rank 1
```
Recommended FP8 recipe:
```bash
training.model.fp8=True \
training.model.fp8_e4m3=False \
training.model.fp8_hybrid=True \
training.model.fp8_margin=0 \
training.model.fp8_interval=1 \
training.model.fp8_amax_history_len=1024 \
training.model.fp8_amax_compute_algo=max
```
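These flags select the hybrid FP8 format (typically E4M3 for forward-pass tensors and E5M2 for gradients) with delayed scaling, where the scaling factor is derived from the maximum amax observed over a 1024-step history. They are appended to the launcher command in the same way as the other `training.model.*` overrides sketched above.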