Playbooks

Overview

The NeMo Framework playbooks demonstrate how to use the NeMo Framework training container to fine-tune Large Language Models (LLMs) with different datasets. The playbooks show how to:

  • Set up your infrastructure to use the playbooks with DGX Cloud and Kubernetes.

  • Run the data preprocessing, training, validation, testing, and fine-tuning scripts included in NeMo Framework with Llama 2, Mixtral-8x7B, and Mistral-7B LLMs.

  • Apply supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) techniques to the databricks-dolly-15k and PubMedQA datasets.

  • Set up and launch foundation model pre-training in your infrastructure.

Infrastructure Setup

  • The Run NeMo Framework on DGX Cloud playbook focuses on preparing a dataset and pre-training a foundation model with NeMo Framework on DGX Cloud. The playbook covers essential aspects of DGX Cloud, such as uploading containers, creating workspaces, mounting workspaces, launching jobs, and pre-training a model. A minimal job-launch sketch follows this list.

  • The Run NeMo Framework on Kubernetes playbook demonstrates deploying and managing NeMo using Kubernetes. The playbook covers cluster setup, NeMo Framework installation, data preparation, and model training.
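
The sketch below illustrates the kind of DGX Cloud job launch the first playbook walks through, driven from Python via the NGC CLI. The workspace name, container tag, instance type, and CLI flags are assumptions for illustration only; the playbook and the ngc --help output are the authoritative references for your environment.

    # Hypothetical sketch: launching a NeMo Framework job on DGX Cloud by
    # driving the NGC CLI from Python. Workspace name, container tag, instance
    # type, and flags are illustrative assumptions; check the playbook and
    # "ngc --help" for the options available to your organization.
    import subprocess

    WORKSPACE = "nemo-playbook-ws"           # assumed workspace name
    IMAGE = "nvcr.io/nvidia/nemo:24.07"      # assumed training container tag

    def run(cmd):
        # Echo and run an NGC CLI command, failing loudly on errors.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Create a workspace to hold datasets and checkpoints.
    run(["ngc", "workspace", "create", "--name", WORKSPACE])

    # 2. Launch a job that mounts the workspace and runs a command in the
    #    NeMo training container.
    run([
        "ngc", "batch", "run",
        "--name", "nemo-playbook-demo",
        "--image", IMAGE,
        "--instance", "dgxa100.80g.8.norm",            # assumed instance type
        "--workspace", WORKSPACE + ":/mount/workspace",
        "--result", "/results",
        "--commandline", "nvidia-smi",                 # replace with a real training command
    ])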

Model Alignment

  • The NeMo Framework SFT with Llama 2 playbook shows how to fine-tune Llama 2 models of various sizes using SFT against the databricks-dolly-15k dataset. It demonstrates data preprocessing, training, validation, testing, and running the fine-tuning scripts included in NeMo Framework. It also shows how to run inference against the fine-tuned model. A sketch of the dolly-style data preprocessing appears after this list.

  • The NeMo Framework SFT with Mistral-7B playbook shows how to fine-tune the Mistral-7B model using SFT against the databricks-dolly-15k dataset. It demonstrates data preprocessing, training, validation, testing, and running the fine-tuning scripts included in NeMo Framework.

  • The NeMo Framework SFT playbook shows how to fine-tune Mixtral 8x7B and Nemotron 340B using SFT against the databricks-dolly-15k dataset. It demonstrates data preprocessing, training, validation, testing, and running the fine-tuning scripts included in NeMo Framework. It also shows how to run inference against the fine-tuned model.

  • The NeMo Framework PEFT with Mistral-7B playbook shows how to fine-tune the Mistral-7B model using PEFT against the PubMedQA dataset. It demonstrates data preprocessing, training, validation, testing, and running the fine-tuning scripts included in NeMo Framework. It also shows how to run inference against the fine-tuned model.

  • The NeMo Framework PEFT playbook shows how to fine-tune Mixtral 8x7B and Llama 2 models of various sizes using PEFT against the PubMedQA dataset. It demonstrates data preprocessing, training, validation, testing, and running the fine-tuning scripts included in NeMo Framework.
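
As referenced above, the SFT and PEFT playbooks begin by converting the raw dataset into prompt/response records. The sketch below shows that step for databricks-dolly-15k; the output field names ("input"/"output") and file paths are assumptions for illustration, and the playbooks document the exact format the NeMo fine-tuning scripts expect.

    # Minimal sketch, assuming the fine-tuning scripts consume JSONL records
    # with "input" and "output" fields: convert databricks-dolly-15k entries
    # (instruction / context / response) into that shape. Paths and output
    # field names are illustrative assumptions; follow the playbook for specifics.
    import json

    def dolly_to_sft_jsonl(src_path, dst_path):
        with open(src_path, encoding="utf-8") as src, \
             open(dst_path, "w", encoding="utf-8") as dst:
            for line in src:
                record = json.loads(line)
                prompt = record["instruction"]
                if record.get("context"):        # optional grounding passage
                    prompt += "\n\nContext: " + record["context"]
                dst.write(json.dumps({"input": prompt,
                                      "output": record["response"]}) + "\n")

    if __name__ == "__main__":
        dolly_to_sft_jsonl("databricks-dolly-15k.jsonl", "dolly_sft_train.jsonl")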

Pre-training

  • The NeMo Framework Foundation Model Pre-training playbook focuses on successfully launching a foundation model pre-training job on your infrastructure and getting the necessary training artifacts as the output of the successful runs. It demonstrates how to execute the workflow of pre-training foundation models using NeMo Framework and the Pile dataset, as well as producing checkpoints, logs, and event files.

  • The NeMo Framework AutoConfigurator playbook demonstrates how to use NeMo Framework AutoConfigurator to determine the optimal model size for a given compute and training budget. It then shows how to produce foundation model pre-training and inference configurations that achieve the highest-throughput runs, automating configuration search, parameter tuning, and optimization to streamline setup. A back-of-envelope model-sizing sketch appears after this list.

  • The NeMo Framework Single Node Pre-training playbook shows how to pre-train a simple GPT-style model using consumer hardware.
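
The sizing question the AutoConfigurator playbook automates can be roughed out by hand with the common approximation that training compute is about 6 × parameters × tokens. The sketch below applies that rule of thumb for a fixed GPU budget; it is not the AutoConfigurator algorithm, and the GPU count, peak throughput, and utilization figures are illustrative assumptions.

    # Back-of-envelope sizing, assuming the common estimate
    # training FLOPs ~= 6 * parameters * tokens. This is not the
    # AutoConfigurator method; all budget numbers are illustrative assumptions.

    def affordable_tokens(num_gpus, days, peak_tflops, utilization, params_billion):
        # Total sustained FLOPs available over the run...
        total_flops = num_gpus * days * 24 * 3600 * peak_tflops * 1e12 * utilization
        # ...divided by the ~6 * N FLOPs needed per training token.
        return total_flops / (6 * params_billion * 1e9)

    if __name__ == "__main__":
        # Example budget: 64 GPUs for 7 days at 40% of a nominal 989 TFLOPS peak.
        for size_b in (1, 7, 22, 43):
            tokens = affordable_tokens(64, 7, 989, 0.40, size_b)
            print("%3dB params -> ~%dB trainable tokens" % (size_b, tokens / 1e9))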

Data Curation

  • The Distributed Data Classification notebook showcases how to use NeMo Curator with two distinct classifiers: one for evaluating data quality and another for identifying data domains. Together, these classifiers streamline the annotation process and help combine the diverse datasets essential for training foundation models.

  • The PEFT Curation tutorial demonstrates how to use the NeMo Curator Python API to curate a dataset for PEFT. Specifically, it uses the Enron dataset, which contains emails along with classification labels. Each email entry includes a subject, body, and category (class label). The tutorial demonstrates different filtering and processing operations that can be applied to each record. A short example of this API appears after this list.

  • The Single Node notebook provides a typical data curation pipeline using NeMo Curator, with the Thai Wikipedia dataset as an example. It includes demonstrations of how to download Wikipedia data using NeMo Curator, perform language separation using FastText, apply GPU-based exact deduplication and fuzzy deduplication, and utilize CPU-based heuristic filtering.

  • The Tinystories tutorial shows how to use the NeMo Curator Python API to curate the TinyStories dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4, featuring words that a typical 3- to 4-year-old can understand. The small size of this dataset makes it ideal for creating and validating data curation pipelines on a local machine.
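
As mentioned above, several of these tutorials drive NeMo Curator through its Python API. The short sketch below chains two heuristic filters over a JSONL dataset; the module paths, class names, and parameters reflect the author's understanding of the API and should be treated as assumptions to verify against the tutorials for your NeMo Curator version.

    # Minimal sketch of a NeMo Curator filtering pipeline, assuming the class
    # and module names below match the installed version; input and output
    # directories are illustrative assumptions.
    from nemo_curator import ScoreFilter, Sequential
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import RepeatingTopNGramsFilter, WordCountFilter

    # Load a directory of JSONL files into a DocumentDataset.
    dataset = DocumentDataset.read_json("raw_jsonl_dir")

    # Chain simple heuristic filters: drop very short records and records
    # dominated by repeated top n-grams.
    pipeline = Sequential([
        ScoreFilter(WordCountFilter(min_words=50)),
        ScoreFilter(RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18)),
    ])

    curated = pipeline(dataset)
    curated.to_json("curated_jsonl_dir")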
