Customization#
Prerequisites#
Refer to the Support Matrix to make sure that you have the supported hardware and software stack.
An NGC personal API key. The NIM microservice uses the API key to download models from NVIDIA NGC. Refer to Generating a Personal API Key in the NVIDIA NGC User Guide for more information.
When you create an NGC personal API key, select at least NGC Catalog from the Services Included menu. You can specify more services to use the key for additional purposes.
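The commands in this guide read the key from the `NGC_API_KEY` environment variable, so export it once in your shell (the value shown is a placeholder):

```bash
# Placeholder; substitute your NGC personal API key.
export NGC_API_KEY="nvapi-..."
```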
Model-specific credentials#
To access the FLUX.1-dev model, read and accept the FLUX.1-dev, FLUX.1-Canny-dev, FLUX.1-Depth-dev, and FLUX.1-dev-onnx License Agreements and Acceptable Use Policy.
Create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
To access the FLUX.1-schnell model, read and accept the FLUX.1-schnell and FLUX.1-schnell-onnx License Agreements and Acceptable Use Policy.
Create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
To access the FLUX.1-Kontext-dev model, read and accept the FLUX.1-Kontext-dev and FLUX.1-Kontext-dev-onnx License Agreements and Acceptable Use Policy.
Create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
To access the Stable Diffusion 3.5 Large model, read and accept the Stable Diffusion 3.5 Large, Stable Diffusion 3.5 Large TensorRT, and Stable Diffusion 3.5 Large ControlNet TensorRT License Agreements and Acceptable Use Policy.
Create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
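The build and run commands later in this guide pass the token to the container through the `HF_TOKEN` environment variable, so export it once (the value shown is a placeholder):

```bash
# Placeholder; substitute your Hugging Face token.
export HF_TOKEN="hf_..."
```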
System requirements#
Customization has higher minimum system requirements than inference:
| Model | GPU Memory | RAM | OS | CPU |
|---|---|---|---|---|
| black-forest-labs/flux.1-dev | 16 GB | 50 GB | Linux/WSL2 | x86_64 |
| black-forest-labs/flux.1-schnell | 16 GB | 50 GB | Linux/WSL2 | x86_64 |
| black-forest-labs/flux.1-kontext-dev | 16 GB | 50 GB | Linux/WSL2 | x86_64 |
| stabilityai/stable-diffusion-3.5-large | 32 GB | 50 GB | Linux | x86_64 |
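Before building, you can check the host against these requirements with standard tools; this sketch only queries total GPU memory and system RAM:

```bash
# Compare against the GPU Memory column.
nvidia-smi --query-gpu=name,memory.total --format=csv
# Compare against the RAM column.
free -h
```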
About Customizing Models#
NVIDIA NIM for Visual Generative AI offers a range of customization options, including specific model precisions for inference pipeline components and specific output image resolutions for the best performance.
Building an Optimized TensorRT Engine#
You can build an optimized engine that is tuned to the specific GPU model in your host.
Create the cache directory on the host machine:

```bash
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod 1777 "$LOCAL_NIM_CACHE"
```
Create a directory to store the optimized engine and update the permissions:

```bash
export OUTPUT_DIR=exported_model_dir
mkdir -p "$OUTPUT_DIR"
chmod 1777 "$OUTPUT_DIR"
```
Build the optimized engine for your GPU model and host:
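The build commands below assume that `your_gpu_name` is set in your shell. The accepted values are the GPU names listed in the Support Matrix; the value here is a placeholder:

```bash
# Placeholder; replace with a GPU name from the Support Matrix.
export your_gpu_name="<gpu-name-from-support-matrix>"
```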
FLUX.1-dev, with Docker:

```bash
docker run -it --rm \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0 \
  optimize.py --gpu ${your_gpu_name} --export-path /output_dir
```

With Podman:

```bash
podman run -it --rm \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0 \
  optimize.py --gpu ${your_gpu_name} --export-path /output_dir
```
Refer to the Support Matrix for the precisions each model supports before adding the `--fp4`, `--fp8`, or `--build-t5-fp8` flags to these commands.

FLUX.1-schnell, with Docker:

```bash
docker run -it --rm \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0 \
  optimize.py --gpu ${your_gpu_name} --export-path /output_dir
```
With Podman:

```bash
podman run -it --rm \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0 \
  optimize.py --gpu ${your_gpu_name} --export-path /output_dir
```
FLUX.1-Kontext-dev, with Docker:

```bash
docker run -it --rm \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/black-forest-labs/flux.1-kontext-dev:1.0.0 \
  optimize.py --gpu ${your_gpu_name} --export-path /output_dir
```
With Podman:

```bash
podman run -it --rm \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/black-forest-labs/flux.1-kontext-dev:1.0.0 \
  optimize.py --gpu ${your_gpu_name} --export-path /output_dir
```
Stable Diffusion 3.5 Large, with Docker:

```bash
docker run -it --rm \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/stabilityai/stable-diffusion-3.5-large:1.0.0 \
  optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
```
With Podman:

```bash
podman run -it --rm \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/stabilityai/stable-diffusion-3.5-large:1.0.0 \
  optimize.py --gpu ${your_gpu_name} --export-path /output_dir
```
The `optimize.py` script creates the following directories and files for the engine:

```
$OUTPUT_DIR
├── metadata.json          <- metadata needed to run the NIM
├── trt_engines_dir        <- optimized TRT engines
├── framework_model_dir    <- configuration files for the model (e.g., diffusion scheduler config)
├── manifest.yaml          <- manifest file with the generated optimized profile, which can be used to override the default manifest
└── memory_profile.yaml    <- memory profile with the VRAM, SRAM, and buffer usage for each pipeline stage, used in offloading policy selection
```
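To confirm the export completed, you can list the output directory and skim the generated manifest (file names as shown above):

```bash
# Expect metadata.json, manifest.yaml, memory_profile.yaml, and the engine directories.
ls -l "$OUTPUT_DIR"
cat "$OUTPUT_DIR/manifest.yaml"
```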
Start the container with the optimized engine directory and manifest:
FLUX.1-dev, with Docker:

```bash
docker run -it --rm --name=nim-server \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0
```

With Podman:

```bash
podman run -it --rm --name=nim-server \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0
```
FLUX.1-schnell, with Docker:

```bash
docker run -it --rm --name=nim-server \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
  nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0
```

With Podman:

```bash
podman run -it --rm --name=nim-server \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
  nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0
```
FLUX.1-Kontext-dev, with Docker:

```bash
docker run -it --rm --name=nim-server \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
  nvcr.io/nim/black-forest-labs/flux.1-kontext-dev:1.0.0
```

With Podman:

```bash
podman run -it --rm --name=nim-server \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
  nvcr.io/nim/black-forest-labs/flux.1-kontext-dev:1.0.0
```
Stable Diffusion 3.5 Large, with Docker:

```bash
docker run -it --rm --name=nim-server \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
  nvcr.io/nim/stabilityai/stable-diffusion-3.5-large:1.0.0
```

With Podman:

```bash
podman run -it --rm --name=nim-server \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
  nvcr.io/nim/stabilityai/stable-diffusion-3.5-large:1.0.0
```
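After the server finishes loading the engines, you can check readiness from another shell. This assumes the standard NIM health endpoint on the published port:

```bash
# Returns HTTP 200 once the optimized engines are loaded and the server is ready.
curl -f http://localhost:8000/v1/health/ready
```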
Parameters for the Container#
| Flags | Description |
|---|---|
| `-it` | Run the container interactively with a terminal attached (see Docker docs) |
| `--rm` | Delete the container after it stops (see Docker docs) |
| `--name=nim-server` | Give a name to the NIM container. Use any preferred value |
| `--runtime=nvidia` | Ensure NVIDIA drivers are accessible in the container |
| `--gpus '"device=0"'` | Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, you need to specify which GPU to use. See GPU Enumeration for further information on mounting specific GPUs |
| `-e NGC_API_KEY` | Provide the container with the token necessary to download adequate models and resources from NGC |
| `-v "$LOCAL_NIM_CACHE:/opt/nim/.cache/"` | Mount the local `$LOCAL_NIM_CACHE` directory to the container cache path `/opt/nim/.cache/` |
| `--entrypoint "python3"` | Change the default entrypoint that starts the NIM server to the `python3` interpreter, so the container runs the optimization script instead |
| `optimize.py --gpu ${your_gpu_name} --export-path /output_dir` | Run the optimization script with its 2 required parameters |
Parameters for the Optimization Script#
| Parameter | Default Value | Description |
|---|---|---|
| `--export-path` | Required | The path to the optimization output directory where TRT engines are saved. |
| `--gpu` | Required | The GPU model the system uses. |
| `--height` | 1024 | The optimal height for generated images. Supported values are {512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280, 1344}. For FLUX.1-Kontext-dev, supported values are {672, 688, 720, 752, 800, 832, 880, 944, 1024, 1104, 1184, 1248, 1328, 1392, 1456, 1504, 1568}. |
| `--width` | 1024 | The optimal width for generated images. Supported values are the same as for `--height`. |
| `--min-height` | HEIGHT | The minimum height for generated images. If not specified, the system uses the `--height` value. Supported values are the same as for `--height`. |
| `--max-height` | HEIGHT | The maximum height for generated images. If not specified, the system uses the `--height` value. Supported values are the same as for `--height`. |
| `--min-width` | WIDTH | The minimum width for generated images. If not specified, the system uses the `--width` value. Supported values are the same as for `--height`. |
| `--max-width` | WIDTH | The maximum width for generated images. If not specified, the system uses the `--width` value. Supported values are the same as for `--height`. |
|  | base | A set of supported model variants (see Support Matrix). To specify multiple variants, use a space-separated list. |
| `--fp4` |  | Use the FP4 checkpoint. Available only for GPU compute capability 10.0 or higher (Blackwell). |
| `--fp8` |  | Use the FP8 checkpoint. Available only for GPU compute capability 8.9 or higher (Ada). |
| `--build-t5-fp8` |  | Use the FP8 T5 model checkpoint instead of the BF16 checkpoint. Runs independently of the `--fp4` and `--fp8` flags. Available only for GPU compute capability 8.9 or higher (Ada). |
|  | None | The percentage of T5 weights to stream from host to device to reduce device memory usage. Accepts values between 0 and 100. Supported only for specific variants; refer to the Support Matrix. |
|  | None | The percentage of Transformer (diffusion denoiser) weights to stream from host to device to reduce device memory usage. Accepts values between 0 and 100. Supported only for specific variants; refer to the Support Matrix. |
|  | local:///opt/nim/local | The NIM manifest entries required to start the NIM server. |
|  |  | Disable TRT optimization logs. |
| `--low-vram` |  | DEPRECATED: The system automatically selects the offloading policy based on the memory profile. |
|  |  | Disable the e2e pipeline run after building the engines. |
|  |  | Enforce building new engines by removing existing ones. |
|  |  | Disable memory profile generation. |
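To illustrate how these parameters combine, the following sketch builds FLUX.1-dev engines with a pinned resolution and the FP8 checkpoint; the GPU name is a placeholder, and `--fp8` assumes a GPU with compute capability 8.9 or higher (see the Support Matrix):

```bash
docker run -it --rm \
  --runtime=nvidia \
  --gpus '"device=0"' \
  -e NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  -v $(pwd)/$OUTPUT_DIR:/output_dir \
  --entrypoint "python3" \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0 \
  optimize.py \
    --gpu ${your_gpu_name} \
    --height 1024 --width 768 \
    --fp8 \
    --export-path /output_dir
```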