Example Scripts for Pretraining and Fine-tuning

These scripts run a recommended configuration for GPT3, LLAMA2, and NeMo pretraining and fine-tuning across various model sizes on NVIDIA A100 and H100 GPUs. For example, for GPT3 pretraining, the following folders provide sample scripts:

  • A100 : Scripts to run GPT pretraining on NVIDIA A100, using the bf16 data type

  • H100 : Scripts to run GPT pretraining on NVIDIA H100, using the fp8 data type

Setup

  1. To run these scripts, you must have access to the NeMo Framework Container. Sign in at NGC (user = ea-bignlp/ga-participants) to access the catalog.

  2. Update the following bash variables in the example run scripts:

    • NEMO_MEGATRON_LAUNCHER_DIR : the directory where this repository is located

    • DATA_DIR : the directory of the dataset used for pretraining; by default, this is NEMO_MEGATRON_LAUNCHER_DIR/data

  3. Enter your cluster environment settings in config.yaml.

    For bcm type clusters, update the job name, partition, and account in bcm.yaml.

  4. To test performance with synthetic data on an interactive node, add the following options to your bash script:

    cluster_type=interactive \
    ++training.cluster_type=BCP \
    training.model.data.data_impl="mock" \
    training.model.data.data_prefix=[]

For further details, see General Configuration.
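
The variable and option setup from steps 2 and 4 can be sketched as a short bash fragment. This is only an illustration: the path /path/to/launcher and the array name SYNTHETIC_OVERRIDES are placeholders that do not appear in the example scripts, and the fragment prints the values instead of launching a job. The four option strings are the ones listed in step 4.

```shell
#!/usr/bin/env bash
# Placeholder path (assumption): point this at your checkout of the repository.
NEMO_MEGATRON_LAUNCHER_DIR="/path/to/launcher"
# Default dataset location described in step 2.
DATA_DIR="${NEMO_MEGATRON_LAUNCHER_DIR}/data"

# Options from step 4, held in an array (hypothetical name) so each one
# stays a single argument when it is later appended to a command line.
# The shell removes the double quotes around mock, as it would in a script.
SYNTHETIC_OVERRIDES=(
  cluster_type=interactive
  ++training.cluster_type=BCP
  training.model.data.data_impl="mock"
  'training.model.data.data_prefix=[]'
)

# Print what would be passed along; nothing is launched here.
printf '%s\n' "DATA_DIR=${DATA_DIR}" "${SYNTHETIC_OVERRIDES[@]}"
```

In a real run script, "${SYNTHETIC_OVERRIDES[@]}" would be appended to the existing launch command in place of the backslash-continued lines.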

Collect Results

The “step_time_per_sec” variable in the console output provides a quick way to read the performance of a workload.

For more details and graphics, you can use TensorBoard or Weights and Biases. Both rely on the results stored at NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>, which has the following structure:

  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<experiment_name>.yaml : The config of the pretrained model

  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<jobname>_<experiment_name>.sh : The autogenerated .sh file that was run

  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/results/ : The directory containing per-rank logs and TensorBoard data

For further details, see Interpreting the Results.
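
As a rough sketch of the layout above, the following bash helper builds the three result locations for one experiment and reports which of them exist. The function name results_paths and the sample arguments (/path/to/launcher, my_experiment, my_job) are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Print the three result locations for one experiment, one per line.
# Arguments: launcher directory, experiment name, job name.
results_paths() {
  local launcher_dir="$1" experiment="$2" jobname="$3"
  local base="${launcher_dir}/results/${experiment}"
  printf '%s\n' \
    "${base}/${experiment}.yaml" \
    "${base}/${jobname}_${experiment}.sh" \
    "${base}/results/"
}

# Example with placeholder names: report which paths exist on this machine.
results_paths "/path/to/launcher" "my_experiment" "my_job" |
  while IFS= read -r path; do
    if [ -e "$path" ]; then
      echo "found:   $path"
    else
      echo "missing: $path"
    fi
  done
```

Once the last path exists, it can be passed to TensorBoard with tensorboard --logdir to view the logged data.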

Last updated on Jun 19, 2024.