Important
NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
NeMo Framework Post-Training Quantization (PTQ) with Nemotron4 and Llama3
Project Description
Learning Goals
Post-training quantization (PTQ) is a technique in machine learning that reduces a trained model's memory and computational footprint. In this playbook, you'll learn how to apply PTQ to two Large Language Models (LLMs), Nemotron4-340B and Llama3-70B, enabling export to TensorRT-LLM and deployment via PyTriton in FP8 precision for efficient serving.
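To build intuition for what FP8 quantization does, the short sketch below simulates a per-tensor scaled cast to FP8 and back. It is an illustration only, not NeMo's implementation: the E4M3 maximum of 448.0 is a property of the FP8 format, torch.float8_e4m3fn requires a recent PyTorch, and real PTQ calibrates the amax statistic over representative data rather than a random tensor.
import torch

# Toy per-tensor FP8 (E4M3) quantize/dequantize round trip.
x = torch.randn(4, 4)
amax = x.abs().max()                          # calibration statistic (per tensor)
scale = 448.0 / amax                          # map the observed range onto FP8's range
x_fp8 = (x * scale).to(torch.float8_e4m3fn)   # quantize to 1 byte per element
x_back = x_fp8.to(torch.float32) / scale      # dequantize for comparison
print((x - x_back).abs().max())               # small, nonzero quantization error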
NeMo Tools and Resources
NeMo Framework Training container:
nvcr.io/nvidia/nemo:24.07
Software Requirements
Use the latest NeMo Framework Training container
This playbook has been tested on:
nvcr.io/nvidia/nemo:24.07
It is expected to work similarly in other environments.
Hardware Requirements
NVIDIA DGX H100 and NVIDIA H100 GPUs based on the NVIDIA Hopper architecture.
Prepare the NeMo Checkpoint for Nemotron4-340B and Llama3-70B
Prepare the Nemotron4-340B Checkpoint
You can download Nemotron4-340B from huggingface/nvidia.
from huggingface_hub import snapshot_download

# Download the Nemotron4-340B base checkpoint from Hugging Face.
snapshot_download(
    repo_id="nvidia/Nemotron-4-340B-Base",
    local_dir="nemotron4-340b-base",
    local_dir_use_symlinks=False,
)
Prepare the Llama3-70B Checkpoint
You can download the Llama3-70B checkpoint from huggingface/meta-llama.
To download the checkpoint, you need to be approved by the Meta Llama3 Community License Agreement.
To download the model, use your Hugging Face API token.
from huggingface_hub import snapshot_download

# Download the Llama3-70B base checkpoint (requires an approved HF token).
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-70B",
    local_dir="llama3-70b-base",
    local_dir_use_symlinks=False,
    token="<YOUR HF TOKEN>",  # replace with your Hugging Face API token
)
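To avoid hard-coding the token in the script, you can read it from an environment variable instead (HF_TOKEN below is just an assumed variable name, exported beforehand with export HF_TOKEN=<your token>):
import os

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-70B",
    local_dir="llama3-70b-base",
    local_dir_use_symlinks=False,
    token=os.environ["HF_TOKEN"],  # assumed env var holding your HF API token
)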
Convert the Llama3-70B HF Checkpoint into .nemo Format
Run the container using the following command:
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.07 bash
Convert the Hugging Face model into .nemo format:
python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --input_name_or_path=./llama3-70b-base \
    --output_path=llama3-70b-base.nemo
Extract the .nemo checkpoint to a folder to avoid memory issues while loading the model.
mkdir llama3-70b-base-nemo && tar -xvf llama3-70b-base.nemo -C llama3-70b-base-nemo
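If tar is not available in your environment, the same extraction can be done from Python using only the standard library (a minimal sketch):
import os
import tarfile

# Unpack the .nemo archive into a directory so the model can be
# loaded without extracting the archive in memory.
os.makedirs("llama3-70b-base-nemo", exist_ok=True)
with tarfile.open("llama3-70b-base.nemo") as archive:
    archive.extractall("llama3-70b-base-nemo")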
Convert NeMo Checkpoint to qnemo Format
“.nemo” versus “.qnemo”
NeMo provides Post-Training Quantization support that allows you to convert regular .nemo models into TensorRT-LLM checkpoints, conventionally referred to as .qnemo checkpoints in NeMo. You can use these .qnemo checkpoints with the NVIDIA TensorRT-LLM library for efficient inference.
A .qnemo checkpoint, similar to a .nemo checkpoint, is a tar file that includes the model configuration in a config.json file and separate rank{i}.safetensors files containing the model weights for each rank. Additionally, it saves a tokenizer_config.yaml file, which corresponds to the tokenizer section of the original NeMo model's model_config.yaml. This configuration file defines the tokenizer used by the given model.
When exporting large models, it is recommended to use a directory instead of a tar file. You can control this behavior with the compress flag in the PTQ configuration file.
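Because an uncompressed .qnemo checkpoint is just these files in a directory, you can sanity-check an export with a short script. A minimal sketch, assuming the export directory produced later in this playbook (the exact keys inside config.json depend on the TensorRT-LLM version):
import json
from pathlib import Path

ckpt = Path("nemotron4-340b-base-fp8-qnemo")  # uncompressed .qnemo directory

# The TensorRT-LLM model configuration lives in config.json; after FP8 PTQ
# it should carry quantization-related settings.
config = json.loads((ckpt / "config.json").read_text())
print(config.get("quantization"))

# The tokenizer definition is saved separately in tokenizer_config.yaml.
print((ckpt / "tokenizer_config.yaml").read_text())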
Run Calibration to Generate the qnemo Model
Calibrate the Nemotron4-340B Model
Calibrating the Nemotron4-340B model requires at least two DGX H100 nodes (16 H100 GPUs in total).
To submit a job through the NeMo Framework Launcher:
cd NeMo-Framework-Launcher/launcher_scripts

CALIB_PP=2   # pipeline parallelism used during calibration
CALIB_TP=8   # tensor parallelism used during calibration
INFER_TP=8   # tensor parallelism of the exported checkpoint

# ${CONTAINER}, ${SLURM_PARTITION}, and ${SLURM_ACCOUNT} must be set for your cluster.
python3 main.py \
    ptq=model/quantization \
    stages=["ptq"] \
    launcher_scripts_path=$(pwd) \
    base_results_dir=/results/base \
    "container='${CONTAINER}'" \
    container_mounts=[/models,/results] \
    cluster.partition=${SLURM_PARTITION} \
    cluster.account=${SLURM_ACCOUNT} \
    cluster.job_name_prefix="${SLURM_ACCOUNT}-nemotron_340b_fp8:" \
    cluster.gpus_per_task=null \
    cluster.gpus_per_node=null \
    cluster.srun_args='["--no-container-mount-home", "--mpi=pmix"]' \
    ptq.run.model_train_name=nemotron_340b \
    ptq.run.time_limit=45 \
    ptq.run.results_dir=/results \
    ptq.quantization.algorithm=fp8 \
    ptq.export.decoder_type=gptnext \
    ptq.export.inference_tensor_parallel=${INFER_TP} \
    ptq.export.inference_pipeline_parallel=1 \
    ptq.trainer.precision=bf16 \
    ptq.model.restore_from_path=/models/nemotron4-340b-base \
    ptq.export.save_path=/results/nemotron4-340b-base-fp8-qnemo \
    ptq.model.tensor_model_parallel_size=${CALIB_TP} \
    ptq.model.pipeline_model_parallel_size=${CALIB_PP}
Note
Cluster settings might differ depending on your hardware environment. Consult the NeMo Framework Launcher documentation for cluster-related settings.
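As a quick sanity check on the parallelism settings above: calibration needs CALIB_TP x CALIB_PP GPUs, while the exported checkpoint is written with INFER_TP x 1 ranks for inference. A small sketch of the arithmetic:
# Parallelism settings from the launcher command above.
CALIB_TP, CALIB_PP = 8, 2    # tensor/pipeline parallelism during calibration
INFER_TP, INFER_PP = 8, 1    # parallelism of the exported checkpoint
GPUS_PER_NODE = 8            # one DGX H100 node

calib_gpus = CALIB_TP * CALIB_PP
print(f"calibration: {calib_gpus} GPUs = {calib_gpus // GPUS_PER_NODE} DGX H100 nodes")
print(f"inference:   {INFER_TP * INFER_PP} ranks, i.e. rank0..rank{INFER_TP * INFER_PP - 1} shards")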
Calibrate the Llama3-70B Model
Calibrating the Llama3-70B model requires at least eight H100 GPUs.
To submit the job through NeMo:
python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_ptq.py \
    model.restore_from_path=llama3-70b-base-nemo \
    model.tensor_model_parallel_size=4 \
    model.pipeline_model_parallel_size=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.devices=8 \
    quantization.algorithm=fp8 \
    export.decoder_type=llama \
    export.inference_tensor_parallel=2 \
    export.save_path=llama3-70b-base-fp8-qnemo
Note
Ensure that you run the above scripts within the NeMo Docker container nvcr.io/nvidia/nemo:24.07.
In the case of Nemotron4-340B, the output directory stores the following files:
nemotron4-340b-base-fp8-qnemo
├── config.json
├── rank0.safetensors
├── rank1.safetensors
├── rank2.safetensors
├── rank3.safetensors
├── rank4.safetensors
├── rank5.safetensors
├── rank6.safetensors
├── rank7.safetensors
├── tokenizer.model
└── tokenizer_config.yaml
The output for Llama3-70B is similar.
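A quick way to confirm that an export matches the requested inference parallelism is to count the weight shards: there should be inference_tensor_parallel x inference_pipeline_parallel of them. A minimal sketch for the Llama3 export above (inference_tensor_parallel=2, with pipeline parallelism assumed to default to 1):
from pathlib import Path

ckpt = Path("llama3-70b-base-fp8-qnemo")
shards = sorted(ckpt.glob("rank*.safetensors"))

# Expect 2 x 1 = 2 weight shards for this export.
assert len(shards) == 2, f"unexpected shard count: {len(shards)}"
print([p.name for p in shards])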
Export to TensorRT-LLM
Two options are available for exporting the quantized model to a TensorRT-LLM engine: the nemo.export module, or the TensorRT-LLM trtllm-build command directly.
Export to TensorRT-LLM through nemo.export
To build and run the TensorRT-LLM engine, use the TensorRTLLM class in the nemo.export submodule:
from nemo.export.tensorrt_llm import TensorRTLLM
# Export Nemotron4-340B model
trt_llm_exporter = TensorRTLLM(model_dir="nemotron4-340b-base-fp8-trt-llm-engine")
trt_llm_exporter.export(nemo_checkpoint_path="nemotron4-340b-base-fp8-qnemo")
trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
# Export Llama3-70B model
trt_llm_exporter = TensorRTLLM(model_dir="llama3-70b-base-fp8-trt-llm-engine")
trt_llm_exporter.export(nemo_checkpoint_path="llama3-70b-base-fp8-qnemo")
trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
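The forward call also accepts generation parameters. A hedged example: the parameter names below mirror those of NemoQueryLLM.query_llm shown later in this playbook and are assumed to be accepted by TensorRTLLM.forward as well.
# Assumed generation parameters, mirroring NemoQueryLLM.query_llm below;
# reuses the trt_llm_exporter created in the block above.
output = trt_llm_exporter.forward(
    ["What is the capital of United States?"],
    max_output_len=10,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print(output)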
Export to TensorRT-LLM using trtllm-build
To build the engine directly with the trtllm-build command, refer to the TensorRT-LLM documentation for more information:
# Build Nemotron4-340B TRTLLM Engine
trtllm-build \
    --checkpoint_dir nemotron4-340b-base-fp8-qnemo \
    --output_dir nemotron4-340b-base-fp8-trt-llm-engine \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --strongly_typed
The command for Llama3-70B is similar. In these examples, the TensorRT-LLM engine files are stored in nemotron4-340b-base-fp8-trt-llm-engine and llama3-70b-base-fp8-trt-llm-engine.
Deploy Nemotron/Llama TensorRT-LLM to Triton
To deploy a TensorRT-LLM model to Triton, use the APIs in the deploy module:
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton
# Deploy Nemotron Model
trt_llm_exporter = TensorRTLLM(model_dir="nemotron4-340b-base-fp8-trt-llm-engine")
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron4-340b", port=8000)
nm.deploy()
nm.serve()
# Deploy Llama3 Model
trt_llm_exporter = TensorRTLLM(model_dir="llama3-70b-base-fp8-trt-llm-engine")
lm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="llama3-70b", port=8008)
lm.deploy()
lm.serve()
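Note that serve() blocks the calling process to keep the Triton server running, so the two deployments above should be launched from separate processes. If you want to deploy and query from a single script, a sketch assuming DeployPyTriton's non-blocking run() and stop() methods (used in some NeMo deployment examples):
from nemo.deploy import DeployPyTriton
from nemo.export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="llama3-70b-base-fp8-trt-llm-engine")
lm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="llama3-70b", port=8008)
lm.deploy()
lm.run()    # start Triton without blocking, unlike serve()
# ... send queries from this process, e.g. with NemoQueryLLM as shown below ...
lm.stop()   # shut the server down when finished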
For convenience, NeMo Framework provides the NemoQueryLLM API for sending queries to the Triton server. These APIs are only accessible from the NeMo Framework container.
To send a query to the Triton server:
from nemo.deploy.nlp import NemoQueryLLM
# Send a query to the Nemotron server
nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron4-340b")
output = nq.query_llm(
prompts=["What is the capital of United States?"],
max_output_len=10,
top_k=1,
top_p=0.0,
temperature=1.0,
)
print(output)
# Send a query to the Llama3 server
nq = NemoQueryLLM(url="localhost:8008", model_name="llama3-70b")
output = nq.query_llm(
prompts=["What is the capital of United States?"],
max_output_len=10,
top_k=1,
top_p=0.0,
temperature=1.0,
)
print(output)
To learn more about various deployment methods, refer to the Deploy NeMo Framework Models documentation.