NeMo Framework Post-Training Quantization (PTQ) with Nemotron4 and Llama3


Learning Goals

Post-training quantization (PTQ) is a technique in machine learning that reduces a trained model’s memory and computational footprint. In this playbook, you’ll learn how to apply PTQ to two large language models (LLMs), Nemotron4-340B and Llama3-70B, export them to TensorRT-LLM in FP8 precision, and deploy them via PyTriton for efficient serving.

NeMo Tools and Resources

Software Requirements

  • Use the latest NeMo Framework Training container

  • This playbook has been tested on: nvcr.io/nvidia/nemo:24.05. It is expected to work similarly in other environments.

Hardware Requirements

  • NVIDIA DGX H100 and NVIDIA H100 GPUs based on the NVIDIA Hopper architecture.

Nemotron4-340B Checkpoint Preparation

Nemotron4-340B can be downloaded from Hugging Face (nvidia/Nemotron-4-340B-Base):


from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nvidia/Nemotron-4-340B-Base",
    local_dir="nemotron4-340b-base",
    local_dir_use_symlinks=False,
)

Llama3-70B Checkpoint Preparation

Llama3-70B can be downloaded from Hugging Face (meta-llama/Meta-Llama-3-70B):

You must first be granted access under the Meta Llama 3 Community License Agreement in order to download the checkpoint. Use your Hugging Face API token to download the model.


from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-70B",
    local_dir="llama3-70b-base",
    local_dir_use_symlinks=False,
    token="<YOUR HF TOKEN>",
)
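
Alternatively, if the Hugging Face CLI is available in your environment, you can log in once and omit the explicit token argument, since snapshot_download picks up the cached credentials:

# Log in interactively; the token is cached for subsequent downloads
huggingface-cli login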

Convert the Llama3-70B HF checkpoint into .nemo format

Run the container using the following command:


docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 \
    --rm -it \
    -v ${PWD}:/workspace -w /workspace \
    -v ${PWD}/results:/results \
    nvcr.io/nvidia/nemo:24.05 bash

Convert the Hugging Face model to .nemo model:


python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --input_name_or_path=./llama3-70b-base/ \
    --output_path=llama3-70b-base.nemo

Extract the .nemo archive into a folder to avoid memory issues while loading the model:


mkdir -p llama3-70b-base-nemo
tar -xvf llama3-70b-base.nemo -C llama3-70b-base-nemo/

“.nemo” versus “.qnemo”

NeMo also offers a post-training quantization workflow that converts regular .nemo models into TensorRT-LLM checkpoints, conventionally referred to in NeMo as .qnemo checkpoints. Such a checkpoint can be used with the NVIDIA TensorRT-LLM library for efficient inference.

Much as in the case of .nemo checkpoints, a .qnemo checkpoint is a tar file that bundles the model configuration, given in a config.json file, together with rank{i}.safetensors files storing the model weights for each rank separately. Additionally, a tokenizer_config.yaml file is saved; it is simply the tokenizer section of the model_config.yaml file from the original NeMo model and defines the tokenizer for the given model.

For large quantized LLMs, using a directory rather than a tar file is recommended. This can be controlled with the compress flag when exporting quantized models in the PTQ configuration file.
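
For example, a minimal sketch of such an override, assuming the flag is exposed as export.compress in the quantization config (the exact key name may vary across NeMo Framework versions):

# Assumed override: save the quantized checkpoint as a directory instead of a tar file
# when launching through NeMo-Framework-Launcher:
ptq.export.compress=false
# when calling megatron_gpt_quantization.py directly:
export.compress=false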

Running Calibration to Generate qnemo Model

Calibrating the Nemotron4-340B model requires at least two DGX H100 nodes (8x H100 GPUs each). The job can be submitted through NeMo-Framework-Launcher:


cd NeMo-Framework-Launcher/launcher_scripts

CALIB_PP=2 CALIB_TP=8 INFER_TP=8 python3 main.py \
    ptq=model/quantization \
    stages=["ptq"] \
    launcher_scripts_path=$(pwd) \
    base_results_dir=/results/base \
    "container='${CONTAINER}'" \
    container_mounts=[/models,/results] \
    cluster.partition=batch \
    cluster.job_name_prefix="${SLURM_ACCOUNT}-nemotron_340b_fp8:" \
    cluster.gpus_per_task=null \
    cluster.gpus_per_node=null \
    cluster.srun_args='["--no-container-mount-home", "--mpi=pmix"]' \
    ptq.run.model_train_name=nemotron_340b \
    ptq.run.time_limit=45 \
    ptq.run.results_dir=/results \
    ptq.quantization.algorithm=fp8 \
    ptq.export.decoder_type=gptnext \
    ptq.export.inference_tensor_parallel=${INFER_TP} \
    ptq.export.inference_pipeline_parallel=1 \
    ptq.trainer.precision=bf16 \
    ptq.model.restore_from_path=/models/nemotron4-340b-base \
    ptq.export.save_path=/results/nemotron4-340B-base-fp8-qnemo \
    ptq.model.tensor_model_parallel_size=${CALIB_TP} \
    ptq.pipeline_model_parallel_size=${CALIB_PP}

Note

Cluster settings might differ depending on your hardware environment. Consult the NeMo-Framework-Launcher documentation for cluster-related settings.

Calibrating the Llama3-70B model requires at least eight H100 GPUs. The job can be launched directly through NeMo:


python examples/nlp/language_modeling/megatron_gpt_quantization.py \
    model.restore_from_path=llama3-70b-base-nemo \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.devices=8 \
    quantization.algorithm=fp8 \
    export.decoder_type=llama \
    export.inference_tensor_parallel=1 \
    export.model_save=llama3-70b-base-fp8-qnemo

Note

The above scripts should be run within the NeMo Docker container nvcr.io/nvidia/nemo:24.05.

In the case of Nemotron4-340B, the output directory stores the following files:


nemotron4-340B-base-fp8-qnemo
├── config.json
├── rank0.safetensors
├── rank1.safetensors
├── rank2.safetensors
├── rank3.safetensors
├── rank4.safetensors
├── rank5.safetensors
├── rank6.safetensors
├── rank7.safetensors
├── tokenizer.model
└── tokenizer_config.yaml

The output in the case of Llama3-70B is analogous.
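
As a rough sketch, with export.inference_tensor_parallel=1 as in the command above, a single weights file is expected; the exact tokenizer artifacts depend on the tokenizer bundled with the original .nemo checkpoint, so the listing below is illustrative rather than exact:

llama3-70b-base-fp8-qnemo
├── config.json
├── rank0.safetensors
└── tokenizer_config.yaml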

Option 1: Through nemo.export

The TensorRT-LLM engine can be conveniently built and run using the TensorRTLLM class available in the nemo.export submodule:


from nemo.export import TensorRTLLM

# Export Nemotron4-340B model
trt_llm_exporter = TensorRTLLM(model_dir="NEMOTRON4-340B-base-fp8-trt-llm-engine")
trt_llm_exporter.export(
    nemo_checkpoint_path="nemotron4-340B-base-fp8-qnemo",
    model_type="gptnext",
)
trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])

# Export Llama3-70B model
trt_llm_exporter = TensorRTLLM(model_dir="llama3-70b-base-fp8-trt-llm-engine")
trt_llm_exporter.export(
    nemo_checkpoint_path="llama3-70b-base-fp8-qnemo",
    model_type="llama",
)
trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])

Option 2: Through trtllm-build

Alternatively, the engine can be built directly using the trtllm-build command; see the TensorRT-LLM documentation:


# Build Nemotron4-340B TRTLLM Engine
trtllm-build \
    --checkpoint_dir nemotron4-340B-base-fp8-qnemo \
    --output_dir NEMOTRON4-340B-base-fp8-trt-llm-engine \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --strongly_typed

The command for Llama3-70B is analogous.
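
For example, a minimal sketch of the analogous Llama3-70B command, swapping in the checkpoint and engine directories used elsewhere in this playbook (adjust batch size and sequence lengths to your use case):

# Build Llama3-70B TRTLLM Engine
trtllm-build \
    --checkpoint_dir llama3-70b-base-fp8-qnemo \
    --output_dir llama3-70b-base-fp8-trt-llm-engine \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --strongly_typed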

The TensorRT-LLM engine files will be stored in NEMOTRON4-340B-base-fp8-trt-llm-engine and llama3-70b-base-fp8-trt-llm-engine, respectively.

You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton.


from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Deploy Nemotron Model
trt_llm_exporter = TensorRTLLM(model_dir="NEMOTRON4-340B-base-fp8-trt-llm-engine")
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron4-340b", port=8000)
nm.deploy()
nm.serve()

# Deploy Llama3 Model
trt_llm_exporter = TensorRTLLM(model_dir="llama3-70b-base-fp8-trt-llm-engine")
lm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="llama3-70b", port=8008)
lm.deploy()
lm.serve()

For convenience, the NeMo Framework provides the NemoQueryLLM APIs to send queries to the Triton server. These APIs are only accessible from the NeMo Framework container.


from nemo.deploy.nlp import NemoQueryLLM

# Send a query to the Nemotron server
nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron4-340b")
output = nq.query_llm(
    prompts=["What is the capital of United States?"],
    max_output_token=10,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print(output)

# Send a query to the Llama server
nq = NemoQueryLLM(url="localhost:8008", model_name="llama3-70b")
output = nq.query_llm(
    prompts=["What is the capital of United States?"],
    max_output_token=10,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print(output)

For more information on other deployment options, refer to Deploy NeMo Framework Models.
