4. NeMo End-to-End Workflow Example#

Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo-Run. For legacy documentation on NeMo 1.0 please refer to the (Deprecated) NeMo 1.0 End to End Workflow Example.

This workflow provides a full end-to-end example of preparing a dataset, training a foundation model based on Llama-3.1-8B, and deploying the model for non-production inference using the redesigned NeMo 2.0 API of NeMo Framework. The guide is split into sub-sections that describe each part in detail.

NeMo 2.0 now uses a Pythonic API which allows it to be integrated with IDEs such as Visual Studio Code (VS Code) and supports type checking.

While this guide demonstrates pre-training a Llama-3.1-8B model from scratch, it can be modified to train any supported model with NeMo 2.0. For more information about NeMo 2.0, including the latest list of supported models, see the NVIDIA NeMo Framework User Guide.

4.1. Requirements#

The following is a list of requirements to follow this complete workflow:

  • An NVIDIA Run:ai cluster with at least 4x A100 or newer GPU nodes.

  • An NVIDIA Run:ai project identified for use along with the corresponding Kubernetes namespace for that project (by default, the project name with a runai- prefix).

  • An NVIDIA Run:ai user account with at least L1 Researcher and Application Administrator privileges.

  • VS Code installed on a local machine on an allowlisted IP address which can connect to the DGX Cloud cluster. Download instructions here.

  • A GitHub account (Personal or Enterprise).

  • An NVIDIA Run:ai Application for REST API access (setup steps in the following section).

  • A PVC with large storage capacity for fast checkpoint writes (setup steps in the following section).

  • A Hugging Face account with an API token (setup steps in the following section).

  • A Weights & Biases account with an API token (setup steps in the following section).

4.2. Initial Setup#

This guide uses two external services to simplify the LLM development process: Hugging Face and Weights & Biases.

Hugging Face contains resources for many of the most popular language models and datasets in the community. We can leverage these resources while training the model to minimize deployment steps and be consistent with community model assumptions.

This workflow walks through training a Llama-3.1-8B model from scratch. The dataset we use needs to be tokenized using a custom tokenizer. Luckily, Meta, the company behind the Llama models, has published its Llama tokenizer on Hugging Face. In order to use the tokenizer, we need to create a Hugging Face account and accept the Llama-3.1-8B license on the model repository page. The following steps walk you through that process.

4.2.1. Create an Application#

DGX Cloud supports “Applications” which allow developers to send REST API requests to the cluster for viewing job states, checking cluster resources, and launching and interacting with jobs. NeMo 2.0 leverages Applications for launching distributed training jobs directly on the cluster.

To allow NeMo to integrate with the DGX Cloud cluster, we need to create an Application. To begin, open the Applications page in the NVIDIA Run:ai web UI under the “Access” menu.

  1. Click the + New Application button in the Applications page.

  2. Enter a name for the Application, such as nemo-run and click Create.

  3. The form will refresh and show the Application details including the name, client ID, and the client secret. To view the client secret, click the Show secret button.

    Note

    Both the Client ID and the Client Secret fields will be required for authenticating with the cluster when launching jobs. Make a copy of these values and keep them in a safe location. The secret can be regenerated if lost by clicking the Regenerate Client Secret button on the Applications page.

  4. Close out of the form to return to the “Applications” page in the web UI. Select the newly created Application and click the Access Rules button to add additional permissions to the Application.

  5. Click + Access Rule and select the L1 Researcher role in the drop-down. Set the scope for the desired project or department. To restrict the Application to be used by a specific team/user, set the scope to a single project. For broader, shared usage, set the scope to a department or cluster level.

  6. Repeat the previous step by adding another Access Rule, this time selecting Application administrator. Use the same scope as before.

The Application is now ready.

4.2.2. Create a Persistent Volume Claim (PVC)#

Training an LLM requires a lot of data, including pre-training datasets, multiple checkpoints, long log files, configs, and scripts. These files typically need to be read from all nodes, so we need shared storage that all pods can access concurrently. For this, we can use a PVC that will store all of our training resources.

Note

It is strongly recommended to allocate as much storage as practical for a PVC. With DGX Cloud clusters, the larger the PVC, the faster reads and writes will be, enabling large checkpoints to be saved more quickly during training and reducing the overall training time. If possible, allocate the entire cluster storage capacity in a single PVC and share that PVC amongst all workloads, using a unique subdirectory for each workload, to take advantage of the highest possible storage performance.

To create a PVC, go to the Data Sources tab in the NVIDIA Run:ai web UI and follow these steps:

  1. Click the New Data Source button at the top of the page followed by PVC.

  2. Select the scope that corresponds with the project you will be training the model in. For example, if your project is named default, then select the default project scope.

  3. Enter a memorable name for the PVC, such as nemo-workspace, and optionally, give it a description.

  4. For the data mount, select New PVC.

  5. Select the appropriate storage class type:

    • Storage Type: dgxc-enterprise-file

    • Access mode: Read-write by many nodes

  6. For the claim size, enter at least 10 TB. If training a larger model and using a larger dataset, it might be necessary to request more storage capacity.

  7. Enter /nemo-workspace for the container path. This will mount the PVC at /nemo-workspace inside all pods that attach this PVC.

  8. Click Create Data Source once you have finished filling out the form to create the PVC.

Once the PVC has been created, get the Kubernetes name for the PVC in the UI. To do so, open the Data Sources page again, click the Columns button in the top right of the page, and select Kubernetes name in the list. This adds a new column to the table showing the Kubernetes name, which will look similar to nemo-workspace-<project-name>-xxxxx, depending on the name you created.

Make a note of the Kubernetes name for the PVC as this will be required in a later step.

4.2.3. Create a Hugging Face Account#

If you don’t have a Hugging Face account already, create one by going to https://huggingface.co/join and signing up with your corporate email account.

Once your account is set up, go to https://huggingface.co/settings/tokens while logged in to create a personal access token. Create a new token with Read access and give it a memorable name. Save the generated token in a safe place, as it won’t be viewable again for security reasons.

4.2.4. Accept the Llama-3.1-8B License#

As mentioned earlier, this example uses the official Llama-3.1-8B tokenizer available on Hugging Face, which requires agreeing to their license on their model page. To do so, navigate to https://huggingface.co/meta-llama/Llama-3.1-8B while logged in. Read the privacy policy at the top of the model card page, then click the Agree and access repository button towards the top of the page to accept the license. Now, you can download resources from this repository using your personal access token.
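Before launching any cluster jobs, you can optionally confirm that your token works and that the gated repository is now accessible from any Python environment with huggingface_hub installed (a minimal sketch; the token value is a placeholder):

from huggingface_hub import HfApi

# Placeholder token; substitute the personal access token created earlier.
api = HfApi(token="hf_xxxxxxxxxxxxxxxxxx")

print(api.whoami()["name"])                          # Confirms the token is valid
print(api.model_info("meta-llama/Llama-3.1-8B").id)  # Succeeds only once the license has been accepted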

4.2.5. Create a Weights & Biases Account#

Weights & Biases is a tool that allows developers to easily track experiments for AI applications. NeMo Framework natively supports logging many values such as training loss, learning rate, and gradient norm as well as resource utilization with Weights & Biases. Weights & Biases is highly recommended for tracking NeMo Framework jobs.

To get started with Weights & Biases, navigate to https://wandb.ai in a web browser and click the Sign Up button in the top right to create a free account. Once logged in, go to https://wandb.ai/settings and scroll to the bottom to create a new API key. This API key will be used when launching workflows to automatically log metrics to Weights & Biases.
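To verify the key before using it in a training job, you can log in from any Python environment with the wandb package installed (a minimal sketch; the key value is a placeholder):

import wandb

# Placeholder key; substitute the API key created in your W&B settings.
wandb.login(key="xxxxxxxxxxxxxxxxxx")  # Returns True and caches the credentials locally if the key is valid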

4.3. Create an Interactive Visual Studio Code Job#

We will create an interactive workspace workload that runs VS Code, which we will use to save and modify helper scripts, launch training jobs, and monitor the data preparation and training processes.

  1. Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.

  2. Select the desired project to run your job in.

  3. Leave the Template pane of the form set to Start from scratch, or select an existing template if created.

  4. Enter a descriptive name for your Workspace, such as nemo-vs-code. Click Continue. After a few seconds, the Environment pane of the creation process will appear.

  5. Click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.

  6. In the Environment creation form, enter a name for the environment, such as “nemo-2502-vscode”, and optionally add a description.

  7. Under Image URL, put nvcr.io/nvidia/nemo:25.02. This pulls the latest NeMo container from NGC as of the time of writing.

  8. Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already.

  9. Click the Runtime settings pane, then click to expand the commands and arguments pane. Enter bash -c as the command, and 'curl -Lk "https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64" --output vscode_cli.tar.gz && tar -xf vscode_cli.tar.gz && ./code tunnel --random-name' for the arguments (note the single-quotes at the beginning and end of the line). This will install the VS Code CLI in the NeMo Framework container and spawn a Code Tunnel for remote development when the container runs.

  10. Click Create Environment in the bottom right of the page. The environment you just created should now be selected.

  11. Go to the Compute resource pane and select a CPU-only compute resource for your environment.

  12. Select the nemo-workspace PVC created in the previous section under the Data sources form.

  13. Go to the bottom of the page and click Create Workspace.

  14. After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your interactive workload is ready once the status reads “Running”.

  15. Once the status is “Running”, select the checkbox next to your workload and open the Logs tab in the details page to view the pod logs. At the bottom, you should see a line similar to the following:

    [2025-02-04 17:02:45] info Using GitHub for authentication, run `code tunnel user login --provider <provider>` option to change this.
    To grant access to the server, please log into https://github.com/login/device and use code XXXX-XXXX
    
  16. Navigate to https://github.com/login/device and enter the 8-digit code seen in the logs in the previous step. This creates the connection for the remote development tunnel inside of the pod.

Once authenticated, there are two options for connecting to the VS Code session—via a local VS Code installation or in a browser. Either path can be used for the remainder of this tutorial.

4.3.1. (Option 1) Connect to the VS Code Tunnel Locally#

After creating the VS Code tunnel on DGX Cloud, you can connect to the remote session via a local installation of VS Code.

  1. From your local machine, open VS Code and install the official “Remote - Tunnels” extension published by Microsoft.

  2. Once installed, open the Command Palette in VS Code, or click the blue “Open a remote window” button in the bottom left corner of the window. Then, select “Connect to Tunnel…” in the Command Palette at the top of the window.

  3. Select “GitHub” for the account to use.

  4. Log in with your GitHub account if prompted.

  5. Select the Tunnel session that was created in the previous section. The name should be similar to the workspace name.

Once you select the session, the window will refresh and you will be connected to the remote pod. This gives you local development access to the remote system's resources and allows you to write data directly to the shared storage in your PVC.

This VS Code session will be referenced throughout the guide and should be kept open. This will be referred to as the “local VS Code session”.

4.3.2. (Option 2) Connect to the VS Code Tunnel Through a Browser#

The VS Code tunnel can also be accessed via a web browser. To connect to VS Code via the browser, view the logs for the VS Code job in the Logs tab as in the previous section, after authenticating with GitHub. Once authenticated, the log output will include a line similar to the following: “Open this link in your browser https://vscode.dev/tunnel/<pod-name>/workspace”. Copy the link and open it in a browser tab. This opens a connection to VS Code in the browser and prompts you to log in with GitHub to connect to the remote session.

4.4. Prepare the Data#

NeMo Framework supports processing custom text-based datasets for pre-training new models. The data preprocessor requires datasets to be cleansed, excluding any sensitive or improperly formatted data that is unsuitable for use during pre-training. Each file in the dataset must be in .json or, ideally, .jsonl format. Datasets can be downloaded from external sources or uploaded directly to the PVC.
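For reference, each line of a .jsonl file is a standalone JSON object, and the preprocessing script used later in this workflow reads the text field of each record by default. A minimal sketch of a record in that shape (the path and text are illustrative; SlimPajama shards follow this layout with additional metadata fields):

import json

# Illustrative record; the preprocessor tokenizes the "text" field of each line.
record = {"text": "The quick brown fox jumps over the lazy dog."}

with open("/nemo-workspace/data/example.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")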

The following example walks through downloading, extracting, concatenating, and preprocessing the SlimPajama dataset which includes a large corpus of text from several domains and has been deduplicated and cleaned to make it a great candidate for pre-training LLMs. While the remainder of the document will be based on the SlimPajama dataset, this general process can be followed for most custom datasets and will provide guidance on how to adapt as needed.

4.4.1. Set Up the Script#

First, pull NeMo-Run, which is used to launch jobs on DGX Cloud remotely. To do so, open a terminal in your local VS Code session and run the following. This will clone the NeMo-Run repository to your PVC where it can be used for all jobs:

cd /nemo-workspace
git clone https://github.com/nvidia/nemo-run
cd nemo-run

We will leverage four different scripts to prepare the SlimPajama dataset for pre-training a Llama-3.1-8B-based LLM. These scripts will be saved in the PVC that was created during the initial setup step. The scripts are as follows:

Download

The first script downloads the entire SlimPajama-627B training dataset from Hugging Face to the mounted PVC. The dataset is spread across nearly 60,000 individual shards, all needing to be downloaded independently. To make the process faster, the job leverages PyTorch distributed communication to spread the download equally amongst all workers in the cluster. Using the local VS Code session created previously, save the following file in the PVC at /nemo-workspace/nemo-run/download.py.

Note

The dataset is evenly divided amongst ten chunks on Hugging Face, each being its own subdirectory of files. The download.py script below has a CHUNKS = 10 variable at the top of the file to download all ten chunks. If desired, this value can be reduced to only download the first N chunks of the dataset. This is useful for quick workload validations that don’t rely on a complete dataset. The remainder of this document will assume all ten chunks are pulled from the dataset, but the steps will still work if using fewer chunks.

import os
import requests
import time

CHUNKS = 10
SHARDS = 6000

def download_shard(url, filename, retry=False):
    if os.path.exists(filename):
        return

    response = requests.get(url)

    # In case of getting rate-limited, wait 3 seconds and retry the
    # download once.
    if response.status_code == 429 and not retry:
        time.sleep(3)
        download_shard(url, filename, retry=True)

    if response.status_code != 200:
        return

    with open(filename, 'wb') as fn:
        fn.write(response.content)

def split_shards(wsize):
    shards = []
    shards_to_download = list(range(SHARDS))

    for shard in range(wsize):
        idx_start = (shard * SHARDS) // wsize
        idx_end = ((shard + 1) * SHARDS) // wsize
        shards.append(shards_to_download[idx_start:idx_end])
    return shards

def download(directory=""):
    wrank = int(os.environ.get('RANK', 0))
    wsize = int(os.environ.get('WORLD_SIZE', 1))  # Default to a single worker when not run under torchrun

    for chunk in range(1, CHUNKS + 1):
        shards_to_download = split_shards(wsize)

        for shard in shards_to_download[wrank]:
            filename = f'example_train_chunk{chunk}_shard{shard}.jsonl.zst'
            filename = os.path.join(directory, filename)
            url = f'https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/train/chunk{chunk}/example_train_{shard}.jsonl.zst'
            download_shard(url, filename)
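Optionally, the download helper can be smoke-tested from the VS Code terminal before launching the distributed job. The following sketch fetches a single shard directly to confirm network connectivity and PVC write access (run it from /nemo-workspace/nemo-run so download.py is importable):

import os
from download import download_shard

os.makedirs("/nemo-workspace/data", exist_ok=True)

# Fetch one shard from chunk 1 using the same URL pattern as download.py.
url = ("https://huggingface.co/datasets/cerebras/SlimPajama-627B/"
       "resolve/main/train/chunk1/example_train_0.jsonl.zst")
filename = "/nemo-workspace/data/example_train_chunk1_shard0.jsonl.zst"
download_shard(url, filename)
print(os.path.getsize(filename), "bytes downloaded")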

Extract

The individual dataset shards are compressed in the Zstandard or .zst format and must be decompressed. The following script distributes the downloaded files across all ranks, decompresses the shards, and then removes the compressed downloads to keep the PVC clean. Using the local VS Code session, save the script in the PVC as /nemo-workspace/nemo-run/extract.py.

import os
import torch
import zstandard as zstd
from glob import glob

def split_shards(wsize, dataset):
    shards = []

    for shard in range(wsize):
        idx_start = (shard * len(dataset)) // wsize
        idx_end = ((shard + 1) * len(dataset)) // wsize
        shards.append(dataset[idx_start:idx_end])
    return shards

def extract_shard(shard):
    extracted_filename = shard.replace(".zst", "")

    # Very rare scenario where another rank has already processed a shard
    if not os.path.exists(shard):
        return

    with open(shard, "rb") as in_file, open(extracted_filename, "wb") as out_file:
        dctx = zstd.ZstdDecompressor(max_window_size=2**27)
        reader = dctx.stream_reader(in_file)

        while True:
            chunk = reader.read(4096)
            if not chunk:
                break
            out_file.write(chunk)

    os.remove(shard)

def extract(directory=""):
    wrank = int(os.environ.get("RANK", 0))
    wsize = int(os.environ.get("WORLD_SIZE", 0))

    # Wait for all processes to be ready before starting
    torch.distributed.init_process_group()
    torch.distributed.barrier()

    dataset = sorted(glob(os.path.join(directory, "example_train*zst")))
    shards_to_extract = split_shards(wsize, dataset)

    for shard in shards_to_extract[wrank]:
        extract_shard(shard)

Concatenate

Given the SlimPajama dataset contains nearly 60,000 files, it is helpful to concatenate them into fewer, larger files. Processing a smaller number of large files is much faster than handling a large number of small files, which eliminates potential data bottlenecks during the pre-training stage.

The following script takes 1,200 individual shards at a time and combines them into one large file, repeating for the entire dataset. Each rank concatenates a unique subsection of the dataset and deletes the individual shards in the end. Using the local VS Code session, save the script in the PVC as /nemo-workspace/nemo-run/concat.sh.

Note

The script combines 1,200 individual shards by default into a single file. For the complete dataset, this will yield 50 larger combined files representing the data, each being approximately 51 GB in size. To change how many shards are used in each file, increase or decrease the shards_per_file variable below. A larger number will result in fewer files that are larger in size. A smaller number will result in more files that are smaller in size.

#!/bin/bash
directory=$1
shards_per_file=1200
num_files=`find ${directory} -name 'example_train_chunk*.jsonl' | wc -l`
files=(${directory}/example_train_chunk*.jsonl)
rank=$RANK
world_size=$WORLD_SIZE

# Find the ceiling of the result
shards=$(((num_files+shards_per_file-1)/shards_per_file ))

echo "Creating ${shards} combined chunk(s) comprising ${shards_per_file} files each"

for ((i=0; i<$shards; i++)); do
  if (( (( $i - $rank )) % $world_size )) ; then
    continue
  fi
  file_start=$((i*shards_per_file))

  if [[ $(((i+1)*shards_per_file)) -ge ${#files[@]} ]]; then
    file_stop=$((${#files[@]}-1))
  else
    file_stop=$(((i+1)*shards_per_file))
  fi

  echo "  Building chunk $i with files $file_start to $file_stop"
  for file in "${files[@]:$file_start:$shards_per_file}"; do
    cat "$file" >> "${directory}/slim_pajama_${i}.jsonl"
  done
  rm ${files[@]:$file_start:$shards_per_file}
done

Preprocess

Once all of the files have been concatenated, it is time to preprocess the dataset. The preprocessing phase tokenizes each dataset file using the Llama-3.1-8B tokenizer, which is downloaded from Hugging Face, and creates .bin and .idx files for each concatenated file. As with the other scripts, this one divides the work amongst all available workers to speed up preprocessing. Using the local VS Code session, save the following script in the PVC as /nemo-workspace/nemo-run/preprocess.py.

Note

As mentioned, this script uses the Llama-3.1-8B tokenizer because the intent is to use this data for pre-training a Llama-3.1-8B model. However, the tokenizer can be swapped out for a different one available on Hugging Face if pre-training a different model is desired.

For example, the Mixtral-8x7B tokenizer from MistralAI can be used instead. To do this, replace both references to meta-llama/Meta-Llama-3.1-8B in the script with the repo ID of the Mixtral-8x7B model, mistralai/Mixtral-8x7B-v0.1. Additionally, update the filename and path to the tokenizer for that model repo; for Mixtral-8x7B the tokenizer sits at the repository root, so use filename=tokenizer.model. A sketch of this change is shown after the script below.

Be sure to accept any applicable licenses on the model repository page.

import os
import subprocess
from glob import glob

from huggingface_hub import hf_hub_download


def split_shards(wsize, dataset):
    shards = []

    for shard in range(wsize):
        idx_start = (shard * len(dataset)) // wsize
        idx_end = ((shard + 1) * len(dataset)) // wsize
        shards.append(dataset[idx_start:idx_end])
    return shards

def preprocess(directory=""):
    wrank = int(os.environ.get("RANK", 0))
    wsize = int(os.environ.get("WORLD_SIZE", 1))

    dataset = sorted(glob(os.path.join(directory, "slim_pajama*jsonl")))
    shards_to_extract = split_shards(wsize, dataset)

    if wrank == 0:
        # Download a specific file from a repository
        hf_hub_download(
            repo_id="meta-llama/Meta-Llama-3.1-8B",
            filename="original/tokenizer.model",
            local_dir="/nemo-workspace/tokenizers/llama-3.1-8b"
        )

    for num, shard in enumerate(shards_to_extract[wrank]):
        shard_num = wrank + (num * wsize)  # Counter for which file is processed
        output_path = os.path.join(directory, f"llama-slim-pajama-{shard_num}")
        command = (
            "python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py "
            f"--input {shard} "
            f"--output-prefix {output_path} "
            f"--dataset-impl mmap "
            f"--tokenizer-type meta-llama/Meta-Llama-3.1-8B "
            f"--tokenizer-library huggingface "
            f"--tokenizer-model /nemo-workspace/tokenizers/llama-3.1-8b/tokenizer.model "
            f"--workers 80"
        )
        subprocess.run([command], shell=True)
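For reference, if swapping in the Mixtral-8x7B tokenizer as described in the note above, the download portion of preprocess.py would change along these lines (the target directory is illustrative):

from huggingface_hub import hf_hub_download

# Mixtral-8x7B keeps tokenizer.model at the repository root rather than under original/.
hf_hub_download(
    repo_id="mistralai/Mixtral-8x7B-v0.1",
    filename="tokenizer.model",
    local_dir="/nemo-workspace/tokenizers/mixtral-8x7b",
)

# The preprocessing command would then pass:
#   --tokenizer-type mistralai/Mixtral-8x7B-v0.1
#   --tokenizer-model /nemo-workspace/tokenizers/mixtral-8x7b/tokenizer.model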

Data Prep

A final script needs to be written to launch all of the data preparation jobs on the cluster. This uses NeMo-Run to authenticate with the DGX Cloud cluster and run distributed PyTorch jobs directly on the cluster. The jobs will be launched sequentially in the order they are called. Using the local VS Code session, save the following script in the PVC as /nemo-workspace/nemo-run/data-prep.py.

Several lines in the script below will need to be modified to reflect your cluster. The lines are as follows:

  • “claimName”: “<k8s PVC name here>”: Replace <k8s PVC name here> with the Kubernetes PVC name captured at the end of the PVC creation step in the beginning.

  • base_url=”https://<base-url>/api/v1”: Replace <base-url> with the URL of your DGX Cloud cluster. For example, if you access the web UI at “dgxcloud.nvidia.com”, replace <base-url> with dgxcloud.nvidia.com and keep the /api/v1/ endpoint at the end. The line should then be base_url=”https://dgxcloud.nvidia.com/api/v1”.

  • app_id=”nemo-run”: If you used a different name than what was shown in the Application creation step above, enter that name here. Otherwise, keep it as nemo-run per the guide.

  • app_secret=”XXXXXXXXXXXXXXXXXXXXXXX”: Specify the Application secret token shown in the web UI during Application creation between the quotation marks.

  • project_name=”nemo-fw”: Replace nemo-fw with the name of the Project to run the data prep job in. This can be found in the Projects section of the web UI.

  • “HF_TOKEN”: “xxxxxxxxxxxxxxxxxx”: Add your Hugging Face authentication token between the quotation marks.

  • executor = dgxc_executor(nodes=2, devices=8): The example runs on 2 nodes with 8 processes per node. If more nodes/processes are required, specify the amount here.

import nemo_run as run

from download import download
from extract import extract
from preprocess import preprocess

def dgxc_executor(nodes: int = 1, devices: int = 1) -> run.DGXCloudExecutor:
    pvcs = [
        {
            "name": "workspace",  # Default name to identify the PVC
            "path": "/nemo-workspace",  # Directory where PVC will be mounted in pods
            "existingPvc": True,  # The PVC already exists
            "claimName": "<k8s PVC name here>",  # Replace with the name of the PVC to use
        }
    ]

    return run.DGXCloudExecutor(
        base_url="https://<base-url>/api/v1",  # Base URL to send API requests
        app_id="nemo-run",  # Name of the Application
        app_secret="XXXXXXXXXXXXXXXXXXXXXXX",  # Application secret token
        project_name="nemo-fw",  # Name of the project within NVIDIA Run:ai
        nodes=nodes,  # Number of nodes to run on
        gpus_per_node=devices,  # Number of processes per node to use
        container_image="nvcr.io/nvidia/nemo:25.02",  # Which container to deploy
        pvcs=pvcs,  # Attach the PVC(s) to the pod
        env_vars={
            "PYTHONPATH": "/nemo-workspace/nemo-run:$PYTHONPATH",  # Add the NeMo-Run directory to the PYTHONPATH
            "HF_TOKEN": "xxxxxxxxxxxxxxxxxx"  # Add your Hugging Face API token here
        },
        launcher="torchrun"  # Use torchrun to launch the processes
    )

def prepare_slim_pajama():
    executor = dgxc_executor(nodes=2, devices=8)

    # Create a NeMo-Run experiment which runs all sub-steps sequentially
    with run.Experiment("slim-pajama-data-prep") as exp:
        exp.add(run.Partial(download, "/nemo-workspace/data"), name="download", executor=executor)
        exp.add(run.Partial(extract, "/nemo-workspace/data"), name="extract", executor=executor)
        exp.add(run.Script("/nemo-workspace/nemo-run/concat.sh", args=["/nemo-workspace/data"]), executor=executor)
        exp.add(run.Partial(preprocess, "/nemo-workspace/data"), name="preprocess", executor=executor)

        # Launch the experiment on the cluster
        exp.run(sequential=True)

if __name__ == "__main__":
    prepare_slim_pajama()

4.4.2. Launch Data Preparation#

Once all the scripts are saved in the specified location, it is time to launch the preprocessing job. NeMo-Run will launch the job automatically on the cluster, so starting data preparation is as simple as running a Python command. Launch data preparation with the following command in the terminal of your local VS Code session:

cd /nemo-workspace/nemo-run
mkdir -p /nemo-workspace/data
chmod +x concat.sh
export NEMORUN_HOME=/nemo-workspace/nemo-run
python3 data-prep.py

After creating the data preparation job, a pod for the primary and each worker will be scheduled and started once resources become available on the cluster. The process can be monitored by viewing the logs in the NVIDIA Run:ai UI and by opening a terminal in the VS Code session to inspect the data in the PVC. The /nemo-workspace directory will evolve throughout the process with the following changes at the end of each stage (a quick verification sketch follows the list):

  • After downloading, there will be 59,166 compressed data shards named example_train_chunkX_shardY.jsonl.zst where X is the chunk number from 1-10 and Y is the individual shard number within that chunk. Each file is approximately 15 MB in size.

  • After extraction, there will be 59,166 unzipped data shards named example_train_chunkX_shardY.jsonl and all of the compressed .zst files will be removed. Each file is approximately 44 MB in size.

  • After concatenation, there will be 50 large, combined files named slim_pajama_N.jsonl where N ranges from 0-49. Each file will be approximately 51 GB in size. It is normal for the last file to be smaller in size as it doesn’t contain an even 1,200 shards. All of the individual example_train* files will be removed.

  • After preprocessing, there will be 50 .bin files and 50 .idx files named llama-slim-pajama-N_text_document, where N corresponds to the combined data file number. Each .bin file should be approximately 26 GB in size and .idx files should be 229 MB.
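The state of the data directory can be checked at any point from the VS Code terminal with a short script such as the following (a minimal sketch, assuming all ten chunks and the default of 1,200 shards per combined file):

from glob import glob

data_dir = "/nemo-workspace/data"

print("compressed shards:", len(glob(f"{data_dir}/example_train_chunk*.jsonl.zst")))  # 59,166 after download, 0 after extraction
print("extracted shards: ", len(glob(f"{data_dir}/example_train_chunk*.jsonl")))      # 59,166 after extraction, 0 after concatenation
print("combined files:   ", len(glob(f"{data_dir}/slim_pajama_*.jsonl")))             # 50 after concatenation
print("tokenized files:  ", len(glob(f"{data_dir}/llama-slim-pajama-*.bin")))         # 50 after preprocessing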

Once all 50 files have been preprocessed, it is time to begin pre-training the model.

4.5. Pre-Train the Model#

NeMo Framework contains many predefined configurations for various models, including the Llama 3.1-8B model. This section will demonstrate how to initiate training a Llama 3.1-8B model on NVIDIA Run:ai using the preprocessed SlimPajama dataset.

Pre-training is the most compute-intensive phase of the LLM training process as the model is typically trained for hundreds of billions to trillions of tokens while it learns the vocabulary and word pairings of the underlying dataset. Depending on the size of the dataset and model as well as the amount of compute resources available to train the model, this process can take anywhere from several days to a few months to finish. Therefore it is strongly recommended to leverage as much of your available compute power as possible for pre-training the model.

4.5.1. Set Up the Environment#

Some minor setup is required prior to launching the job. If you skipped data preparation earlier and have not yet cloned NeMo-Run, run the following commands in the terminal of your local VS Code session:

cd /nemo-workspace
git clone https://github.com/nvidia/nemo-run
cd nemo-run

Now the training job can be defined. The following script is used to launch pre-training of a Llama 3.1-8B model for 629B tokens using the SlimPajama dataset that was prepared. Copy the script to /nemo-workspace/nemo-run/llama-pretrain.py using your local VS Code session. Note, as with data preparation earlier, several lines will need to be modified to reflect your cluster. These lines are as follows:

  • “claimName”: “<k8s PVC name here>”: Replace <k8s PVC name here> with the Kubernetes PVC name captured at the end of the PVC creation step in the beginning.

  • base_url=”https://<base-url>/api/v1”: Replace <base-url> with the URL of your DGX Cloud cluster. For example, if you access the web UI at “dgxcloud.nvidia.com”, replace <base-url> with dgxcloud.nvidia.com and keep the /api/v1/ endpoint at the end. The line should then be base_url=”https://dgxcloud.nvidia.com/api/v1”.

  • app_id=”nemo-run”: If you used a different name than what was shown in the Application creation step above, enter that name here. Otherwise, keep it as nemo-run per the guide.

  • app_secret=”XXXXXXXXXXXXXXXXXXXXXXX”: Specify the Application secret token shown in the web UI during Application creation between the quotation marks.

  • project_name=”nemo-fw”: Replace nemo-fw with the name of the Project to run the data prep job in. This can be found in the Projects section of the web UI.

  • “HF_TOKEN”: “xxxxxxxxxxxxxxxxxx”: Add your Hugging Face authentication token between the quotation marks.

  • “WANDB_API_KEY”: “xxxxxxxxxxxxxxxxxx”: Add your Weights & Biases authentication token between the quotation marks.

import os
import nemo_run as run

from nemo.collections import llm
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
from nemo.collections.llm.gpt.data.pre_training import PreTrainingDataModule
from nemo.collections.llm.recipes.log.default import default_log, wandb_logger
from nemo.collections.llm.recipes.optim.adam import distributed_fused_adam_with_cosine_annealing
from nemo.utils.exp_manager import TimingCallback

from glob import glob


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2, dir=None, name="nemo"):
    paths = glob("/nemo-workspace/data/llama*bin")
    paths = [path.replace(".bin", "") for path in paths]
    tokenizer = run.Config(AutoTokenizer, pretrained_model_name="meta-llama/Llama-3.1-8B")

    data=run.Config(
        PreTrainingDataModule,
        paths=paths,
        seq_length=8192,  # Use a sequence length or context window of 8K tokens
        global_batch_size=512,  # Batch size of 512
        micro_batch_size=1,
        tokenizer=tokenizer
    )

    wandb = wandb_logger(
        project="llama-3.1",  # Specify the Weights & Biases project name
        name="llama-3.1-8b"  # Specify the name of the training run to be displayed on Weights & Biases
    )

    recipe = run.Partial(
        llm.pretrain,  # Specify that we want to use the Pre-train method
        model=llm.llama31_8b.model(),  # Use the existing Llama 3.1-8B model config default settings
        trainer=llm.llama31_8b.trainer(
            num_nodes=nodes,
            num_gpus_per_node=gpus_per_node,
            max_steps=150000,  # Train for 150,000 steps - equal to 150,000 * batch size (512) * sequence length (8192) = 629B tokens
            callbacks=[run.Config(TimingCallback)],
        ),
        data=data,
        optim=distributed_fused_adam_with_cosine_annealing(max_lr=3e-4),
        log=default_log(dir=dir, name=name, wandb_logger=wandb),
    )

    recipe.trainer.val_check_interval = 2000  # Run evaluation and save a checkpoint every 2,000 steps
    recipe.trainer.strategy.tensor_model_parallel_size = 4  # Set the Tensor Parallelism size to 4
    return recipe

def dgxc_executor(nodes: int = 1, devices: int = 1) -> run.DGXCloudExecutor:
    pvcs = [
        {
            "name": "workspace",  # Default name to identify the PVC
            "path": "/nemo-workspace",  # Directory where PVC will be mounted in pods
            "existingPvc": True,  # The PVC already exists
            "claimName": "<k8s PVC name here>",  # Replace with the name of the PVC to use
        }
    ]

    return run.DGXCloudExecutor(
        base_url="https://<base-url>/api/v1",  # Base URL to send API requests
        app_id="nemo-run",  # Name of the Application
        app_secret="XXXXXXXXXXXXXXXXXXXXXXX",  # Application secret token
        project_name="nemo-fw",  # Name of the project within NVIDIA Run:ai
        nodes=nodes,  # Number of nodes to run on
        gpus_per_node=devices,  # Number of processes per node to use
        container_image="nvcr.io/nvidia/nemo:25.02",  # Which container to deploy
        pvcs=pvcs,  # Attach the PVC(s) to the pod
        env_vars={
            "PYTHONPATH": "/nemo-workspace/nemo-run:$PYTHONPATH",  # Add the NeMo-Run directory to the PYTHONPATH
            "HF_TOKEN": "xxxxxxxxxxxxxxxxxx",  # Add your Hugging Face API token here
            "WANDB_API_KEY": "xxxxxxxxxxxxxxxxxx"  # Add your Weights & Biases API token here
        },
        launcher="torchrun"  # Use torchrun to launch the processes
    )

def last_checkpoint(directory=""):
    for root, dirs, files in os.walk(directory):
        for dir in dirs:
            if dir.endswith("-last"):
                return os.path.join(root, dir)

def convert_checkpoint(dir=""):
    checkpoint = last_checkpoint(dir)

    export_ckpt = run.Partial(
        llm.export_ckpt,
        path=checkpoint,
        target="hf",
        overwrite=True,
        output_path=f"{dir}/huggingface"
    )
    return export_ckpt

def run_pretraining():
    recipe = configure_recipe(nodes=8, gpus_per_node=8, dir="/nemo-workspace/llama-3.1-8b", name="llama-3.1-8b")
    executor = dgxc_executor(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor)

    export_ckpt = convert_checkpoint(dir="/nemo-workspace/llama-3.1-8b")
    # Re-initialize the executor as only a single GPU is needed for conversion
    executor = dgxc_executor(nodes=1, devices=1)

    run.run(export_ckpt, executor=executor)

if __name__ == "__main__":
    run_pretraining()

Depending on how many resources you have available, you can also change the number of nodes used for pre-training by modifying this line:

recipe = configure_recipe(nodes=8, gpus_per_node=8, dir="/nemo-workspace/llama-3.1-8b", name="llama-3.1-8b")

Update the nodes=8 argument to the desired number of nodes to train with. Keep gpus_per_node at 8, as this allows optimal multi-node communication over NCCL.
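As a sanity check on the token budget implied by the recipe, the figure in the max_steps comment can be reproduced directly:

# Token budget of the recipe: steps x global batch size x sequence length.
max_steps = 150_000
global_batch_size = 512
seq_length = 8_192

tokens_per_step = global_batch_size * seq_length  # 4,194,304 tokens per optimizer step
total_tokens = max_steps * tokens_per_step        # 629,145,600,000 tokens, i.e. ~629B
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")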

4.5.2. Launch the Pre-Training Job#

After modifying and saving the script to /nemo-workspace/nemo-run/llama-pretrain.py, launch the pre-training job from the terminal in your local VS Code session with:

cd /nemo-workspace/nemo-run
export NEMORUN_HOME=/nemo-workspace/nemo-run
python3 llama-pretrain.py

The job will now be scheduled with NVIDIA Run:ai and launched once resources become available. The job will appear in the NVIDIA Run:ai Workloads page after submission. The following images show the workload details after it has been running for a few days.

[Figures: Llama workload event history and resource usage in the NVIDIA Run:ai UI]

NeMo Framework is fully integrated with Weights & Biases and logs multiple metrics that can be viewed on the W&B website. If the W&B API key was provided in the launch script, a new W&B project will be created automatically and metrics will be uploaded there. Viewing logs on W&B is the recommended way to monitor training progress.

4.5.3. View the Project Dashboard on Weights & Biases#

To view your charts, navigate to https://wandb.ai. You should see a link to the newly created project on your home page. Clicking the link will take you to your project dashboard which should look similar to the following. Note that the figure below includes training results for two different runs where the second run is a continuation of the first.

[Figure: W&B pre-training charts]

Two of the most important charts to monitor during pre-training are the reduced_train_loss and val_loss charts which show how the model is learning over time. In general, these charts should have an exponential decay shape.

The job will take around four weeks to complete on 8 nodes. Since NeMo Framework pre-training scales linearly, doubling the number of nodes should halve the amount of time required to pre-train the model.

While the model trains, a checkpoint will be saved every 2,000 steps in the PVC. Per the configuration above, the checkpoints will be saved in the /nemo-workspace/llama-3.1-8b/llama-3.1-8b/<date>/checkpoints directory, where <date> is a timestamp of when the job was launched. Only the 10 checkpoints with the best val_loss values, as well as the latest checkpoint, will be kept. These checkpoints will be used for future fine-tuning runs.

After pre-training finishes, another task will begin to convert the final pre-trained model checkpoint to the Hugging Face format. This spins up another pod with a single GPU which is required for conversion. The final Hugging Face model will be saved at /nemo-workspace/llama-3.1-8b/huggingface. The converted Hugging Face model can be deployed as a NIM for inference.
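Once the conversion finishes, the exported checkpoint can be loaded with the transformers library for a quick smoke test (a minimal sketch, assuming the export directory contains a standard Hugging Face checkpoint including tokenizer files; if the tokenizer is missing, load it from meta-llama/Llama-3.1-8B instead):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "/nemo-workspace/llama-3.1-8b/huggingface"

tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))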

4.6. Deploy the Model for Inference#

Now that we have finished pre-training a base model and converted it to the Hugging Face format, we can deploy it for inference as a NIM and send requests to the deployed model to do quick human evaluations.

Warning

This section is NOT intended for production inference deployments. The purpose of this section is to provide a quick way for engineers, QA teams, and other internal stakeholders to evaluate the model with user-generated prompts and inform decisions on the model’s readiness. A production deployment would include load balancing, auto-scaling, user tracking, and more.

To deploy the model for inference, navigate to the Workloads page, click the + New Workload > Inference button, and follow these steps:

  1. In the new form that opens, select the desired project to run the job in.

  2. Enter a name for the inference deployment, such as llama-31-8b-base-model-deploy and click the Continue button.

  3. Create a new environment by clicking the + New Environment button.

  4. In the environment creation page, enter a name for the environment, such as llama-31-8b-nim and optionally add a description.

  5. For the image URL, enter nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.0-RTX, which is the latest NIM for Llama-3.1-8B at the time of writing. As newer containers are released, the tag can be updated to reflect the latest version. If using a different base model, find the corresponding NIM image URL at ngc.nvidia.com.

  6. In the Endpoint section, ensure HTTP is selected for the protocol. Enter 8000 for the container port. This is the default port where NIM containers listen to requests.

  7. Once finished setting up the environment, click the Create Environment button at the bottom of the page which will take you back to the worker setup form.

  8. Ensure the newly created llama-31-8b-nim environment is selected in the Environment section.

  9. In the Runtime settings section for the environment, create the following new environment variables:

    • Name: NIM_MODEL_NAME, Source: Custom, Value: /nemo-workspace/llama-3.1-8b/huggingface

    • Name: NIM_SERVED_MODEL_NAME, Source: Custom, Value: llama-3.1-8b

    Note, if deploying a different model, the values above can be changed to fit your model. These values are as follows:

    • NIM_MODEL_NAME: The path to your converted Hugging Face model to deploy. If using a different model or path, update that here.

    • NIM_SERVED_MODEL_NAME: The name under which the model will be deployed. This name will be used when sending requests.

  10. In the Compute resource section select the compute type that includes one GPU as the model needs just one GPU to fit in GPU memory. If using a different model, the number of GPUs should match the tensor parallelism size.

  11. In the Data sources section select the PVC that was created earlier in this example. This will mount the PVC at /nemo-workspace inside the pod. Click the Create Inference button at the bottom of the page to create the deployment.

Back in the Workloads page you will see the newly created inference workload. It will take some time for the model to be converted to TRT-LLM engines before it transitions to the Running state. Once the deployment is running, it can start to handle requests.

4.6.1. Send Requests to the Deployed Model#

The easiest way to send a request to the deployed model is via curl in a terminal from an IP address in the cluster’s allowlist. The basic request structure is as follows:

curl -X POST https://X.X.X.X/v1/chat/completions/ \
  -H 'content-type: application/json' \
  -H 'accept: application/json' \
  -d '{"model": "<NIM_MODEL_NAME>", "prompt": "Write me a short story about a baby dragon that learns to fly", "top_p": 1, "n": 1, "max_tokens": 4096, "stream": false, "frequency_penalty": 1.0}'

To find the URL, run kubectl get ksvc in a terminal with the kubeconfig for the cluster configured. Find the knative service that corresponds to the deployment. The URL will be in the second column. In the following example, the URL would be https://llama-31-8b-base-model-deploy-runai-demo-project.inference.<cluster>.ai:

$ kubectl get ksvc

NAME                            URL                                                                               LATESTCREATED                         LATESTREADY                           READY   REASON
llama-31-8b-base-model-deploy   https://llama-31-8b-base-model-deploy-runai-demo-project.inference.<cluster>.ai   llama-31-8b-base-model-deploy-00001   llama-31-8b-base-model-deploy-00001   True

In the above curl command, replace https://X.X.X.X with the service URL captured in the previous step. Additionally, replace Write me a short story about a baby dragon that learns to fly with your prompt of choice, and <NIM_SERVED_MODEL_NAME> with the name of your served model specified during deployment (llama-3.1-8b in this example). This command will generate up to 4096 tokens, but this can be changed as needed depending on the prompt.

After submitting the command, it will be passed to the deployed model, which will generate a response to the prompt.

The response should look similar to the following (response truncated - actual responses will vary):

{"output":"and having adventures.\nAsked by: Dayanida (6 years, 4 months ago)\nEdit: I am drawing it with Paint Tool SAI and Photoshop CS3.\nUpdated to try and get better.\nAnswered by: Rebecca (12 years, 5 months ago)\nWrite me a story about an adventure in the land of Wandreon where you can choose your own adventure..."}

The model’s response will be in the output key and will follow directly after the last token in the prompt. For example, combining the end of the input prompt and the start of the response would be “…that learns to fly and having adventures…”

4.6.2. Clean Up#

When the deployment is no longer needed, it can be stopped to free up additional compute resources.

To stop the job, go to the Workloads page on NVIDIA Run:ai, select the llama-31-8b-base-model-deploy job, and click the Delete button towards the top-left of the panel.