Customization#

Prerequisites#

  • Refer to the Support Matrix to make sure that you have the supported hardware and software stack.

  • An NGC personal API key. The NIM microservice uses the API key to download models from NVIDIA NGC. Refer to Generating a Personal API Key in the NVIDIA NGC User Guide for more information.

    When you create an NGC personal API key, select at least NGC Catalog from the Services Included menu. You can specify more services to use the key for additional purposes.
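
    One way to make the key available to the commands in this guide, sketched below, is to export it as the NGC_API_KEY environment variable and log in to the NGC container registry before pulling the NIM image (the $oauthtoken user name is the literal string used for NGC registry logins):

      # Export the personal API key for the docker/podman run commands in this guide
      export NGC_API_KEY=<your-personal-api-key>

      # Log in to nvcr.io so the NIM container image can be pulled
      echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin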

Model-specific credentials#

To access the FLUX.1-dev model, read and accept the FLUX.1-dev, FLUX.1-Canny-dev, FLUX.1-Depth-dev, and FLUX.1-dev-onnx License Agreements and Acceptable Use Policy.

To access the FLUX.1-schnell model, read and accept the FLUX.1-schnell and FLUX.1-schnell-onnx License Agreements and Acceptable Use Policy.

For either model, create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
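
The commands in the following sections read the token from the HF_TOKEN environment variable, so export it once per shell session, for example:

    # Hugging Face token with read access to the gated FLUX.1 repositories
    export HF_TOKEN=<your-hugging-face-token>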

System requirements#

Customization has higher minimum system requirements than inference:

| Model                            | GPU Memory | RAM   | OS         | CPU    |
|----------------------------------|------------|-------|------------|--------|
| black-forest-labs/flux.1-dev     | 16 GB      | 50 GB | Linux/WSL2 | x86_64 |
| black-forest-labs/flux.1-schnell | 16 GB      | 50 GB | Linux/WSL2 | x86_64 |

About Customizing Models#

NVIDIA NIM for Visual Generative AI offers a range of customization options for the best performance, including specific model precisions for the inference pipeline components and specific output image resolutions.

Building an Optimized TensorRT Engine#

You can build an optimized TensorRT engine that applies GPU-model-specific optimizations for your host.

  1. Create the cache directory on the host machine.

       export LOCAL_NIM_CACHE=~/.cache/nim
       mkdir -p "$LOCAL_NIM_CACHE"
       chmod 1777 $LOCAL_NIM_CACHE
    
  2. Create a directory to store the optimized engine and update the permissions:

      export OUTPUT_DIR=exported_model_dir
      mkdir -p "$OUTPUT_DIR"
      chmod 1777 $OUTPUT_DIR
    
  3. Build the optimized engine for your GPU model and host:

    docker run -it --rm \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/output_dir \
     --entrypoint "python3" \
     nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1 \
     optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
    
    podman run -it --rm \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/output_dir \
      --entrypoint "python3" \
      nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1 \
      optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
    

    Refer to the Support Matrix for the precision options that you can select with the --fp4, --fp8, and --build-t5-fp8 flags.

    docker run -it --rm \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/output_dir \
     --entrypoint "python3" \
     nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0 \
     optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
    
    podman run -it --rm \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/output_dir \
      --entrypoint "python3" \
      nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0 \
      optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
    

    Refer to the Support Matrix for the precision options that you can select with the --fp4, --fp8, and --build-t5-fp8 flags.

    The optimize.py script creates the following directories and files for the engine:

    $OUTPUT_DIR
    ├── metadata.json        <- file with metadata needed to run the NIM
    ├── trt_engines_dir      <- directory with optimized trt engines
    ├── framework_model_dir  <- directory with configuration files for the model (e.g., diffusion scheduler config)
    ├── manifest.yaml        <- manifest file with the generated optimized profile that can be used to override the default manifest
    └── memory_profile.yaml  <- memory profile with the VRAM, SRAM, and buffer usage for each pipeline stage, used in offloading policy selection
    
  4. Start the container with the optimized engine directory and manifest (a readiness check is sketched after this procedure):

    docker run -it --rm --name=nim-server \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
     -p 8000:8000 \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
     nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
    
    podman run -it --rm --name=nim-server \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
    
    docker run -it --rm --name=nim-server \
     --runtime=nvidia \
     --gpus '"device=0"' \
     -e NGC_API_KEY \
     -e HF_TOKEN=$HF_TOKEN \
     -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
     -p 8000:8000 \
     -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
     -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
     nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0
    
    podman run -it --rm --name=nim-server \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
      nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0
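
After the container starts, you can confirm that the server is ready before sending requests. The following is a minimal sketch that assumes the standard NIM readiness endpoint (/v1/health/ready) and the port mapping used in the commands above (host port 8000); adjust the host and port if you mapped them differently:

    # Poll the readiness endpoint until the NIM server reports it is ready
    until curl -sf http://localhost:8000/v1/health/ready > /dev/null; do
      echo "Waiting for the NIM server to become ready..."
      sleep 5
    done
    echo "NIM server is ready"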
    

Parameters for the Container#

| Flags | Description |
|-------|-------------|
| -it | --interactive + --tty (see Docker docs) |
| --rm | Delete the container after it stops (see Docker docs) |
| --name=<container-name> | Give a name to the NIM container. Use any preferred value. |
| --runtime=nvidia | Ensure NVIDIA drivers are accessible in the container. |
| --gpus '"device=0"' | Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, you need to specify which GPU to use. See GPU Enumeration for further information on mounting specific GPUs, and the example after this table. |
| -e NGC_API_KEY=$NGC_API_KEY | Provide the container with the token necessary to download the required models and resources from NGC. |
| -v $(pwd)/$OUTPUT_DIR:/output_dir | Mount the local $(pwd)/$OUTPUT_DIR directory to /output_dir inside the container. |
| --entrypoint "python3" | Change the default entrypoint, which starts the NIM server, to python3 so that the optimization script runs instead. |
| optimize.py --gpu ${your_gpu_name} --export-path /output_dir | Invoke the optimization script with its two required parameters. |
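
For example, on a host with multiple GPUs you can first list the available devices and then expose a specific one to the container. The sketch below assumes the NVIDIA Container Toolkit is installed and uses a throwaway ubuntu container only to confirm which GPU is visible; the device index 1 is an illustration. With Podman and CDI, you would instead adjust the --device nvidia.com/gpu=<index> argument:

    # List the GPUs on the host together with their indices and UUIDs
    nvidia-smi -L

    # Expose only GPU 1 to a short-lived container and confirm what it sees
    docker run --rm --runtime=nvidia --gpus '"device=1"' \
      ubuntu nvidia-smi -L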

Parameters for the Optimization Script#

| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| --export-path EXPORT_PATH | Required | Path to the optimization output directory where the TRT engines are saved. |
| --gpu GPU | Required | The GPU model to build the engines for. |
| --height HEIGHT | 1024 | The optimal height of the generated images. Supported values: {512,576,640,704,768,832,896,960,1024,1088,1152,1216,1280,1344}. |
| --width WIDTH | 1024 | The optimal width of the generated images. Supported values: {512,576,640,704,768,832,896,960,1024,1088,1152,1216,1280,1344}. |
| --min-height MIN_HEIGHT | HEIGHT | The minimum height of generated images; if not specified, the value of --height is used. Supported values: {512,576,640,704,768,832,896,960,1024,1088,1152,1216,1280,1344}. |
| --max-height MAX_HEIGHT | HEIGHT | The maximum height of generated images; if not specified, the value of --height is used. Supported values: {512,576,640,704,768,832,896,960,1024,1088,1152,1216,1280,1344}. |
| --min-width MIN_WIDTH | WIDTH | The minimum width of generated images; if not specified, the value of --width is used. Supported values: {512,576,640,704,768,832,896,960,1024,1088,1152,1216,1280,1344}. |
| --max-width MAX_WIDTH | WIDTH | The maximum width of generated images; if not specified, the value of --width is used. Supported values: {512,576,640,704,768,832,896,960,1024,1088,1152,1216,1280,1344}. |
| --fp4 | | Use the FP4 checkpoint. Available only for GPUs with Compute Capability >= 10.0 (Blackwell). |
| --fp8 | | Use the FP8 checkpoint. Available only for GPUs with Compute Capability >= 8.9 (Ada). |
| --build-t5-fp8 | | Use the FP8 T5 model checkpoint instead of the BF16 one. Independent of the --fp4 and --fp8 flags. Available only for GPUs with Compute Capability >= 8.9 (Ada). |
| --t5-ws-percentage | None | The percentage of the T5 weights to stream from host to device. Useful for reducing device memory usage. The value must be between 0 and 100. Only supported for black-forest-labs/flux.1-schnell. |
| --profile-repository <repository-name> | local:///opt/nim/local | The repository information used in the NIM manifest to start the NIM server. local:// means that NIM uses local files only; /opt/nim/local is the path inside the container where the profile data is mounted. |
| --silent-mode | | Disables TRT optimization logs. |
| --low-vram | | DEPRECATED: the offloading policy is automatically selected based on the memory profile. |
| --no-perf-measurements | | Disables the end-to-end pipeline run after the engines are built. |
| --force-rebuild | | Forces building new engines by removing the old ones. |
| --no-memory-profile | | Disables memory profile generation. |
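
As an illustration, the hypothetical invocation below combines several of these parameters. It replaces the final optimize.py line of the docker run or podman run commands from step 3, builds engines tuned for 768x768 output while allowing any supported resolution between 512 and 1024 pixels per side, and requests the FP8 checkpoint, which assumes a GPU with Compute Capability 8.9 or later. The values shown are only an example; keep ${your_gpu_name} set to your GPU model as before:

    optimize.py --gpu ${your_gpu_name} --export-path /output_dir \
      --height 768 --width 768 \
      --min-height 512 --max-height 1024 \
      --min-width 512 --max-width 1024 \
      --fp8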