Training with Predefined Configurations


NVIDIA provides configurations for the Nemotron 8B, 15B, 22B, and 340B models. Each configuration includes carefully selected hyperparameters, which you can use as guidelines for any custom model configuration.

To run Nemotron training, update conf/config.yaml:


defaults:
  - training: nemotron/nemotron_340b

stages:
  - training

Execute the launcher pipeline:

python3 main.py

Configuration

Default training configurations for each model size can be found in the conf/training/nemotron folder. Each configuration is divided into four sections: run, trainer, exp_manager, and model.


run:
  name: nemotron_340b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "0-04:00:00"
  dependency: "singleton"

Set the number of nodes and devices for training:


trainer:
  num_nodes: 768
  devices: 8
  max_steps: 300000 # consumed_samples = global_step * global_batch_size
  max_time: "05:23:30:00" # days:hours:minutes:seconds
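The trainer values above imply a few derived quantities: the total GPU count (num_nodes times devices), the wall-clock budget encoded in max_time, and the number of samples consumed after a given number of steps. The following is a minimal sketch of that arithmetic, not part of the launcher. The helper names are hypothetical, and the global batch size of 1024 is an illustrative assumption because this section does not state the real value.

```python
from datetime import timedelta


def total_gpus(num_nodes: int, devices: int) -> int:
    """Total GPUs requested: nodes multiplied by devices per node."""
    return num_nodes * devices


def parse_max_time(value: str) -> timedelta:
    """Parse a 'days:hours:minutes:seconds' string such as "05:23:30:00"."""
    days, hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def consumed_samples(global_step: int, global_batch_size: int) -> int:
    """consumed_samples = global_step * global_batch_size (from the config comment)."""
    return global_step * global_batch_size


if __name__ == "__main__":
    print(total_gpus(768, 8))             # 6144
    print(parse_max_time("05:23:30:00"))  # 5 days, 23:30:00
    # 1024 is a hypothetical global batch size used only for illustration.
    print(consumed_samples(300000, 1024))
```

Whichever limit is reached first, max_steps or max_time, is the one that ends the run, so it can help to check both against the cluster allocation before submitting.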

Set configurations for creating a checkpoint:


exp_manager:
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 10
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits
    filename: 'megatron_nemotron--{val_loss:.2f}-{step}-{consumed_samples}'
    model_parallel_size: ${multiply:${training.model.tensor_model_parallel_size}, ${training.model.pipeline_model_parallel_size}}
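To make the checkpoint settings concrete, the sketch below illustrates two ideas from the block above: keeping only the save_top_k best checkpoints according to the monitored metric and mode, and deriving model_parallel_size as the product of the tensor and pipeline parallel sizes. It is a plain-Python illustration of the selection logic, not the callback's actual implementation. The function names and the validation history are hypothetical.

```python
import heapq
from typing import List, Tuple

Checkpoint = Tuple[int, float]  # (step, monitored value such as val_loss)


def keep_top_k(checkpoints: List[Checkpoint], k: int, mode: str = "min") -> List[Checkpoint]:
    """Return the k checkpoints with the best monitored value, best first."""
    if mode == "min":
        return heapq.nsmallest(k, checkpoints, key=lambda ckpt: ckpt[1])
    if mode == "max":
        return heapq.nlargest(k, checkpoints, key=lambda ckpt: ckpt[1])
    raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")


def model_parallel_size(tensor_parallel: int, pipeline_parallel: int) -> int:
    """Mirror of the ${multiply:...} expression in the config."""
    return tensor_parallel * pipeline_parallel


if __name__ == "__main__":
    # Hypothetical validation history, for illustration only.
    history = [(100, 2.9), (200, 2.5), (300, 2.7), (400, 2.4)]
    print(keep_top_k(history, k=2, mode="min"))  # [(400, 2.4), (200, 2.5)]
    print(model_parallel_size(8, 12))            # 96
```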

Set wandb configurations:


exp_manager:
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: nemo_nemotron
    name: ${training.run.name}

Set tensor parallel and pipeline parallel size:


model:
  tensor_model_parallel_size: 8
  pipeline_model_parallel_size: 12
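The parallel sizes interact with the node and device counts set in the trainer section. As a rough check, and assuming tensor and pipeline parallelism are the only ways the model is partitioned, the number of data-parallel replicas is the total GPU count divided by the product of the two sizes. The helper below is a hypothetical sketch of that check, not a launcher function.

```python
def data_parallel_size(num_nodes: int, devices: int,
                       tensor_parallel: int, pipeline_parallel: int) -> int:
    """Number of model replicas, assuming only tensor and pipeline parallelism."""
    world_size = num_nodes * devices
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"tensor * pipeline parallel size {model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # Values from the trainer and model sections of the 340B configuration.
    print(data_parallel_size(768, 8, 8, 12))  # 64
```

With the values shown in this guide, 768 nodes of 8 devices give 6144 GPUs, each model replica spans 96 GPUs, and 64 replicas train in parallel.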

Set data distribution configuration:


model:
  data:
    data_prefix:
      - .0333
      - ${data_dir}/my-nemotron_00_text_document
      - .0333
      - ${data_dir}/my-nemotron_00_text_document
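The data_prefix list alternates a weight with a dataset path. The sketch below shows one way to read such a list into (proportion, path) pairs. It assumes the weights are relative and are normalized so they sum to 1, which is how Megatron-style dataset blending commonly treats them. The function name is hypothetical, and the paths are shortened placeholders rather than real dataset locations.

```python
from typing import List, Sequence, Tuple, Union


def parse_data_prefix(entries: Sequence[Union[float, str]]) -> List[Tuple[float, str]]:
    """Turn [weight, path, weight, path, ...] into normalized (proportion, path) pairs."""
    if len(entries) % 2 != 0:
        raise ValueError("data_prefix must alternate weight and path entries")
    weights = [float(weight) for weight in entries[0::2]]
    paths = [str(path) for path in entries[1::2]]
    total = sum(weights)
    if total <= 0:
        raise ValueError("data_prefix weights must sum to a positive value")
    # Assumption: weights are relative and are normalized to proportions.
    return [(weight / total, path) for weight, path in zip(weights, paths)]


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    blend = parse_data_prefix([0.0333, "dataset_a", 0.0333, "dataset_b"])
    for proportion, path in blend:
        print(f"{proportion:.3f} {path}")
```

Under this assumption, two entries with equal weights each contribute half of the training samples, regardless of the absolute weight values.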

Last updated on Jun 19, 2024.