NeMo Framework Post-Training Quantization (PTQ) with Nemotron4 and Llama3

Project Description

Learning Goals

Post-training quantization (PTQ) is a machine learning technique that reduces a trained model's memory and computational footprint. In this playbook, you'll learn how to apply PTQ in FP8 precision to two large language models (LLMs), Nemotron4-340B and Llama3-70B, export them to TensorRT-LLM, and deploy them via PyTriton for efficient serving.
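
As a rough illustration of the savings, the weights alone shrink by about half when moving from 16-bit to 8-bit storage. The sketch below is a back-of-envelope estimate for weights only, ignoring activations and KV cache:

# Back-of-envelope estimate of weight memory for a 340B-parameter model
params = 340e9
bf16_gib = params * 2 / 2**30  # 2 bytes per parameter in BF16 (~633 GiB)
fp8_gib = params * 1 / 2**30   # 1 byte per parameter in FP8 (~317 GiB)
print(f"BF16: {bf16_gib:.0f} GiB, FP8: {fp8_gib:.0f} GiB")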

NeMo Tools and Resources

Software Requirements

  • Use the latest NeMo Framework Training container

  • This playbook has been tested with nvcr.io/nvidia/nemo:24.05 and is expected to work similarly in other environments.

Hardware Requirements

  • NVIDIA DGX H100 and NVIDIA H100 GPUs based on the NVIDIA Hopper architecture.

Preparing NeMo checkpoint for Nemotron4-340B and Llama3-70B

Nemotron4-340B Checkpoint Preparation

Nemotron4-340B can be downloaded from the Hugging Face nvidia organization:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nvidia/Nemotron-4-340B-Base",
    local_dir="nemotron4-340b-base",
    local_dir_use_symlinks=False
)

Llama3-70B Checkpoint Preparation

Llama3-70B can be downloaded from the Hugging Face meta-llama organization:

You must first be approved under the Meta Llama 3 Community License Agreement in order to download the checkpoint. Use your Hugging Face API token to download the model.

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-70B",
    local_dir="llama3-70b-base",
    local_dir_use_symlinks=False,
    token="<YOUR HF TOKEN>"  # your Hugging Face access token
)
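
Optionally, instead of passing the token to every call, you can authenticate once per environment with huggingface_hub.login; a minimal sketch with a placeholder token:

from huggingface_hub import login

# Log in once; subsequent huggingface_hub calls pick up the stored token
login(token="<YOUR HF TOKEN>")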

Convert the Llama3-70B HF checkpoint into .nemo format

Run the container using the following command:

docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.05 bash

Convert the Hugging Face model to .nemo model:

python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=./llama3-70b-base/ --output_path=llama3-70b-base.nemo

Extract the .nemo file into a folder to avoid memory issues while loading the model:

mkdir -p llama3-70b-base-nemo && tar -xvf llama3-70b-base.nemo -C llama3-70b-base-nemo/

Convert NeMo Checkpoint to qnemo format

“.nemo” versus “.qnemo”

NeMo also offers a Post-Training Quantization workflow that converts regular .nemo models into TensorRT-LLM checkpoints, conventionally referred to as .qnemo checkpoints in NeMo. Such a checkpoint can be used with the NVIDIA TensorRT-LLM library for efficient inference.

As with .nemo checkpoints, a .qnemo checkpoint is a tar file that bundles the model configuration in a config.json file together with rank{i}.safetensors files storing the model weights for each rank separately. In addition, a tokenizer_config.yaml file is saved; it is simply the tokenizer section of the model_config.yaml file from the original NeMo model and defines the tokenizer for the given model.

For large quantized LLMs, using a directory rather than a tar file is recommended. This can be controlled with the compress flag when exporting quantized models in the PTQ configuration file.
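
For illustration, a minimal sketch (with a hypothetical file name) that lists what a .qnemo tar bundles:

import tarfile

# Hypothetical path to a .qnemo checkpoint exported as a single tar file
with tarfile.open("model.qnemo") as archive:
    # Expect config.json, per-rank rank{i}.safetensors files, and tokenizer_config.yaml
    print(archive.getnames())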

Running Calibration to Generate qnemo Model

Calibrating the Nemotron4-340B model requires at least two DGX H100 nodes (8x H100 GPUs each). The job can be submitted through NeMo-Framework-Launcher:

cd NeMo-Framework-Launcher/launcher_scripts

CALIB_PP=2
CALIB_TP=8
INFER_TP=8

python3 main.py \
   ptq=model/quantization \
   stages=["ptq"] \
   launcher_scripts_path=$(pwd) \
   base_results_dir=/results/base \
   "container='${CONTAINER}'" \
   container_mounts=[/models,/results] \
   cluster.partition=batch \
   cluster.job_name_prefix="${SLURM_ACCOUNT}-nemotron_340b_fp8:" \
   cluster.gpus_per_task=null \
   cluster.gpus_per_node=null \
   cluster.srun_args='["--no-container-mount-home", "--mpi=pmix"]' \
   ptq.run.model_train_name=nemotron_340b \
   ptq.run.time_limit=45 \
   ptq.run.results_dir=/results \
   ptq.quantization.algorithm=fp8 \
   ptq.export.decoder_type=gptnext \
   ptq.export.inference_tensor_parallel=${INFER_TP} \
   ptq.export.inference_pipeline_parallel=1 \
   ptq.trainer.precision=bf16 \
   ptq.model.restore_from_path=/models/nemotron4-340b-base \
   ptq.export.save_path=/results/nemotron4-340B-base-fp8-qnemo \
   ptq.model.tensor_model_parallel_size=${CALIB_TP} \
   ptq.model.pipeline_model_parallel_size=${CALIB_PP}

Note

Cluster settings might differ depending on your hardware environment; consult the NeMo-Framework-Launcher documentation for cluster-related settings.

Calibrating the Llama3-70B model requires at least eight H100 GPUs. The job can be launched directly through NeMo:

python examples/nlp/language_modeling/megatron_gpt_quantization.py \
    model.restore_from_path=llama3-70b-base-nemo \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.devices=8 \
    quantization.algorithm=fp8 \
    export.decoder_type=llama \
    export.inference_tensor_parallel=1 \
    export.model_save=llama3-70b-base-fp8-qnemo

Note

The above scripts should be run within the NeMo Docker Container nvcr.io/nvidia/nemo:24.05.

The output directory stores the following files in the case of Nemotron4 340B:

nemotron4-340B-base-fp8-qnemo
├── config.json
├── rank0.safetensors
├── rank1.safetensors
├── rank2.safetensors
├── rank3.safetensors
├── rank4.safetensors
├── rank5.safetensors
├── rank6.safetensors
├── rank7.safetensors
├── tokenizer.model
└── tokenizer_config.yaml

The output in the case of Llama3-70B is analogous.
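
As a quick sanity check, the exported checkpoint can be inspected directly; a minimal sketch assuming the Nemotron4 output directory shown above:

import json
from pathlib import Path

# Path to the Nemotron4 output directory produced by the calibration step
qnemo_dir = Path("nemotron4-340B-base-fp8-qnemo")

# Print the TensorRT-LLM checkpoint configuration and list the per-rank weight files
config = json.loads((qnemo_dir / "config.json").read_text())
print(json.dumps(config, indent=2))
print(sorted(p.name for p in qnemo_dir.glob("rank*.safetensors")))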

Exporting to TensorRT-LLM

Option 1: Through nemo.export

The TensorRT-LLM engine can be conveniently built and run using the TensorRTLLM class available in the nemo.export submodule:

from nemo.export import TensorRTLLM

# Export Nemotron4-340B model
trt_llm_exporter = TensorRTLLM(model_dir="NEMOTRON4-340B-base-fp8-trt-llm-engine")
trt_llm_exporter.export(
    nemo_checkpoint_path="nemotron4-340B-base-fp8-qnemo",
    model_type="gptnext",
)
trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])

# Export Llama3-70B model
trt_llm_exporter = TensorRTLLM(model_dir="llama3-70b-base-fp8-trt-llm-engine")
trt_llm_exporter.export(
    nemo_checkpoint_path="llama3-70b-base-fp8-qnemo",
    model_type="llama",
)
trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])

Option 2: Through trtllm-build

Alternatively, the engine can be built directly using the trtllm-build command; see the TensorRT-LLM documentation:

# Build Nemotron4-340B TRTLLM Engine
trtllm-build \
    --checkpoint_dir nemotron4-340B-base-fp8-qnemo \
    --output_dir NEMOTRON4-340B-base-fp8-trt-llm-engine \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --strongly_typed

The command for Llama3-70B is analogous.

In this example, the TensorRT-LLM engine files are stored in NEMOTRON4-340B-base-fp8-trt-llm-engine and llama3-70b-base-fp8-trt-llm-engine, respectively.

Deploy Nemotron/Llama TensorRT-LLM to Triton

You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton.

from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Deploy Nemotron Model
trt_llm_exporter = TensorRTLLM(model_dir="NEMOTRON4-340B-base-fp8-trt-llm-engine")

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron4-340b", port=8000)
nm.deploy()
nm.serve()

# Deploy Llama3 Model
trt_llm_exporter = TensorRTLLM(model_dir="llama3-70b-base-fp8-trt-llm-engine")

lm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="llama3-70b", port=8008)
lm.deploy()
lm.serve()

For convenience, the NeMo Framework provides NemoQueryLLM APIs to send queries to the Triton server. These APIs are only accessible from the NeMo Framework container.

from nemo.deploy.nlp import NemoQueryLLM

# Send a Query to nemotron server
nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron4-340b")
output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_token=10, top_k=1, top_p=0.0, temperature=1.0)
print(output)

# Send a query to llama server
nq = NemoQueryLLM(url="localhost:8008", model_name="llama3-70b")
output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_token=10, top_k=1, top_p=0.0, temperature=1.0)
print(output)

For more information on the various deployment options, refer to Deploy NeMo Framework Models.