Customization#

Prerequisites#

  • Refer to the Support Matrix to make sure that you have the supported hardware and software stack.

  • An NGC personal API key. The NIM microservice uses the API key to download models from NVIDIA NGC. Refer to Generating a Personal API Key in the NVIDIA NGC User Guide for more information.

    When you create an NGC personal API key, select at least NGC Catalog from the Services Included menu. You can select additional services to use the key for other purposes.
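
    The commands later in this guide pass the key to the container through the NGC_API_KEY environment variable. A minimal sketch, where the value is a placeholder for your own key:

      # Export the NGC personal API key so that docker or podman can forward it into the container.
      export NGC_API_KEY=<PASTE_API_KEY_HERE>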

Model-specific credentials#

  • To access the FLUX.1-dev model, read and accept the FLUX.1-dev, FLUX.1-Canny-dev, FLUX.1-Depth-dev, and FLUX.1-dev-onnx License Agreements and Acceptable Use Policy.

  • To access the FLUX.1-schnell model, read and accept the FLUX.1-schnell and FLUX.1-schnell-onnx License Agreements and Acceptable Use Policy.

  • To access the FLUX.1-Kontext-dev model, read and accept the FLUX.1-Kontext-dev and FLUX.1-Kontext-dev-onnx License Agreements and Acceptable Use Policy.

  • To access the Stable Diffusion 3.5 Large model, read and accept the Stable Diffusion 3.5 Large, Stable Diffusion 3.5 Large TensorRT, and Stable Diffusion 3.5 Large ControlNet TensorRT License Agreements and Acceptable Use Policy.

After accepting the agreements for the model you plan to use, create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
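
The engine-build and server commands in this guide read the token from the HF_TOKEN environment variable. A minimal sketch, where the value is a placeholder for your own token:

    # Export the Hugging Face token that has access to the gated model repositories.
    export HF_TOKEN=<PASTE_HF_TOKEN_HERE>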

System requirements#

Customization has higher minimum system requirements than inference:

| Model | GPU Memory | RAM | OS | CPU |
|---|---|---|---|---|
| black-forest-labs/flux.1-dev | 16 GB | 50 GB | Linux/WSL2 | x86_64 |
| black-forest-labs/flux.1-schnell | 16 GB | 50 GB | Linux/WSL2 | x86_64 |
| black-forest-labs/flux.1-kontext-dev | 16 GB | 50 GB | Linux/WSL2 | x86_64 |
| stabilityai/stable-diffusion-3.5-large | 32 GB | 50 GB | Linux | x86_64 |

About Customizing Models#

NVIDIA NIM for Visual Generative AI offers a range of customization options, including specific model precisions for inference pipeline components and specific output image resolutions, to get the best performance.

Building an Optimized TensorRT Engine#

You can build an optimized TensorRT engine that provides GPU-model-specific optimizations for your host.

  1. Create the cache directory on the host machine.

       export LOCAL_NIM_CACHE=~/.cache/nim
       mkdir -p "$LOCAL_NIM_CACHE"
       chmod 1777 "$LOCAL_NIM_CACHE"
    
  2. Create a directory to store the optimized engine and update the permissions:

      export OUTPUT_DIR=exported_model_dir
      mkdir -p "$OUTPUT_DIR"
      chmod 1777 "$OUTPUT_DIR"
    
  3. Build the optimized engine for your GPU model and host, using either the docker or podman command and the image that corresponds to your model:

    docker run -it --rm \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/output_dir \
     --entrypoint "python3" \
     nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0 \
     optimize.py --gpu ${your_gpu_name} --export-path /output_dir
    
    podman run -it --rm \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/output_dir \
      --entrypoint "python3" \
      nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0 \
      optimize.py --gpu ${your_gpu_name} --export-path /output_dir
    

    Refer to the Support Matrix for the precisions that you can specify with the --fp4, --fp8, and --build-t5-fp8 flags.
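
    For example, a minimal sketch of the same FLUX.1-dev engine build with FP8 precision for the transformer and the T5 encoder, assuming your GPU supports these flags according to the Support Matrix:

    docker run -it --rm \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/output_dir \
     --entrypoint "python3" \
     nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0 \
     optimize.py --gpu ${your_gpu_name} --fp8 --build-t5-fp8 --export-path /output_dir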

    docker run -it --rm \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/output_dir \
     --entrypoint "python3" \
     nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0 \
     optimize.py --gpu ${your_gpu_name} --export-path /output_dir
    
    podman run -it --rm \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/output_dir \
      --entrypoint "python3" \
      nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0 \
      optimize.py --gpu ${your_gpu_name} --export-path /output_dir
    

    Refer to the Support Matrix for the precisions that you can specify with the --fp4, --fp8, and --build-t5-fp8 flags.

    docker run -it --rm \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/output_dir \
     --entrypoint "python3" \
     nvcr.io/nim/black-forest-labs/flux.1-kontext-dev:1.0.0 \
     optimize.py --gpu ${your_gpu_name} --export-path /output_dir
    
    podman run -it --rm \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/output_dir \
      --entrypoint "python3" \
      nvcr.io/nim/black-forest-labs/flux.1-kontext-dev:1.0.0 \
      optimize.py --gpu ${your_gpu_name} --export-path /output_dir
    

    Refer to the Support Matrix for the precisions that you can specify with the --fp4, --fp8, and --build-t5-fp8 flags.

    docker run -it --rm \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/output_dir \
     --entrypoint "python3" \
     nvcr.io/nim/stabilityai/stable-diffusion-3.5-large:1.0.0 \
     optimize.py --gpu ${your_gpu_name} --export-path /output_dir
    
    podman run -it --rm \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/output_dir \
      --entrypoint "python3" \
      nvcr.io/nim/stabilityai/stable-diffusion-3.5-large:1.0.0 \
      optimize.py --gpu ${your_gpu_name} --export-path /output_dir
    

    The optimize.py script creates the following directories and files for the engine:

    $OUTPUT_DIR
    ├── metadata.json        <- file with metadata needed to run the NIM
    ├── trt_engines_dir      <- directory with the optimized TRT engines
    ├── framework_model_dir  <- directory with configuration files for the model (e.g., diffusion scheduler config)
    ├── manifest.yaml        <- manifest file with the generated optimized profile, which can be used to override the default manifest
    └── memory_profile.yaml  <- memory profile with the VRAM, SRAM, and buffer usage for each pipeline stage, used for offloading policy selection
    
  4. Start the container with the optimized engine directory and manifest:

    docker run -it --rm --name=nim-server \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0
    
    podman run -it --rm --name=nim-server \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/black-forest-labs/flux.1-dev:1.1.0
    
    docker run -it --rm --name=nim-server \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0
    
    podman run -it --rm --name=nim-server \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0
    
    docker run -it --rm --name=nim-server \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/black-forest-labs/flux.1-kontext-dev:1.0.0
    
    podman run -it --rm --name=nim-server \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/black-forest-labs/flux.1-kontext-dev:1.0.0
    
    docker run -it --rm --name=nim-server \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/stabilityai/stable-diffusion-3.5-large:1.0.0
    
    podman run -it --rm --name=nim-server \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/stabilityai/stable-diffusion-3.5-large:1.0.0
    
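
Once the server reports that it is ready, you can optionally verify it before sending generation requests. A minimal sketch, assuming the standard NIM health endpoint on the published port:

    # Check readiness of the NIM server started in the previous step (standard NIM endpoint, assumed here).
    curl http://localhost:8000/v1/health/ready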

Parameters for the Container#

| Flags | Description |
|---|---|
| -it | --interactive + --tty (see the Docker docs) |
| --rm | Delete the container after it stops (see the Docker docs) |
| --name=<container-name> | Give a name to the NIM container. Use any preferred value. |
| --runtime=nvidia | Ensure that the NVIDIA drivers are accessible in the container. |
| --gpus '"device=0"' | Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, specify which GPU to use. See GPU Enumeration for further information on mounting specific GPUs. |
| -e NGC_API_KEY=$NGC_API_KEY | Provide the container with the token it needs to download the required models and resources from NGC. |
| -v $(pwd)/$OUTPUT_DIR:/output_dir | Mount the local $(pwd)/$OUTPUT_DIR directory to /output_dir inside the container. |
| --entrypoint "python3" | Change the default entrypoint, which starts the NIM server, to python3 so that the container runs the optimization script instead. |
| optimize.py --gpu ${your_gpu_name} --export-path /output_dir | Invocation of the optimization script with its two required parameters. |

Parameters for the Optimization Script#

| Parameter | Default Value | Description |
|---|---|---|
| --export-path EXPORT_PATH | Required | The path to the optimization output directory where the TRT engines are saved. |
| --gpu GPU | Required | The GPU model that the system uses. |
| --height HEIGHT | 1024 | The optimal height for generated images. Supported values: 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280, and 1344. For FLUX.1-Kontext-dev, supported values: 672, 688, 720, 752, 800, 832, 880, 944, 1024, 1104, 1184, 1248, 1328, 1392, 1456, 1504, and 1568. |
| --width WIDTH | 1024 | The optimal width for generated images. Supported values are the same as for --height. |
| --min-height MIN_HEIGHT | HEIGHT | The minimum height for generated images. If not specified, the system uses the --height value. Supported values are the same as for --height. |
| --max-height MAX_HEIGHT | HEIGHT | The maximum height for generated images. If not specified, the system uses the --height value. Supported values are the same as for --height. |
| --min-width MIN_WIDTH | WIDTH | The minimum width for generated images. If not specified, the system uses the --width value. Supported values are the same as for --width. |
| --max-width MAX_WIDTH | WIDTH | The maximum width for generated images. If not specified, the system uses the --width value. Supported values are the same as for --width. |
| --variant VARIANT_1 VARIANT_2 ... | base | A set of supported model variants (see the Support Matrix). To specify multiple variants, use a space-separated list. |
| --fp4 | | Use the FP4 checkpoint. Available only for GPUs with compute capability 10.0 or higher (Blackwell). |
| --fp8 | | Use the FP8 checkpoint. Available only for GPUs with compute capability 8.9 or higher (Ada). |
| --build-t5-fp8 | | Use the FP8 T5 model checkpoint instead of the BF16 checkpoint. Runs independently of the --fp4 and --fp8 flags. Available only for GPUs with compute capability 8.9 or higher (Ada). |
| --t5-ws-percentage | None | The percentage of T5 weights to stream from host to device to reduce device memory usage. Accepts values between 0 and 100. Supported only for black-forest-labs/flux.1-schnell and black-forest-labs/flux.1-kontext-dev. |
| --transformer-ws-percentage | None | The percentage of Transformer (diffusion denoiser) weights to stream from host to device to reduce device memory usage. Accepts values between 0 and 100. Supported only for black-forest-labs/flux.1-kontext-dev. |
| --profile-repository <repository-name> | local:///opt/nim/local | The repository with the NIM manifest entries required to start the NIM server. The local:// prefix instructs NIM to use only local files. Use /opt/nim/local as the container path where NIM mounts the profile data. |
| --silent-mode | | Disable TRT optimization logs. |
| --low-vram | | DEPRECATED: The system automatically selects the offloading policy based on the memory profile. |
| --no-perf-measurements | | Disable the end-to-end pipeline run after the engines are built. |
| --force-rebuild | | Force building new engines by removing existing ones. |
| --no-memory-profile | | Disable memory profile generation. |
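
As an illustration only, the following sketch shows the final optimize.py line of the step 3 commands with several of the resolution parameters above combined for a FLUX.1-dev build: the optimal resolution stays at 1024x1024, while the engines also accept any supported resolution between 768 and 1344 in each dimension. The GPU name is a placeholder that you replace with your own hardware.

    # Hypothetical resolution-range configuration; values are taken from the supported-values list above.
    optimize.py --gpu ${your_gpu_name} \
      --height 1024 --width 1024 \
      --min-height 768 --max-height 1344 \
      --min-width 768 --max-width 1344 \
      --export-path /output_dir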