Customization#
Prerequisites#
Refer to the Support Matrix to make sure that you have the supported hardware and software stack.
An NGC personal API key. The NIM microservice uses the API key to download models from NVIDIA NGC. Refer to Generating a Personal API Key in the NVIDIA NGC User Guide for more information.
When you create an NGC API personal key, select at least NGC Catalog from the Services Included menu. You can specify more services to use the key for additional purposes.
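The container commands later in this section read the key from the `NGC_API_KEY` environment variable. One way to set it for the current shell (the key value below is a placeholder):

```bash
# Export the NGC personal API key so the docker/podman commands below can pass it
# into the container. Replace the placeholder with your actual key.
export NGC_API_KEY="nvapi-xxxxxxxxxxxxxxxx"   # hypothetical placeholder value
```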
Model-specific credentials#
To access the FLUX.1-dev model, read and accept the FLUX.1-dev, FLUX.1-Canny-dev, FLUX.1-Depth-dev, and FLUX.1-dev-onnx License Agreements and Acceptable Use Policy.
Then create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
To access the FLUX.1-schnell model, read and accept the FLUX.1-schnell and FLUX.1-schnell-onnx License Agreements and Acceptable Use Policy.
Then create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
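The build commands below pass this token to the container through the `HF_TOKEN` environment variable, for example (placeholder value):

```bash
# Export the Hugging Face token used to download the gated FLUX.1 checkpoints.
# Replace the placeholder with the token you created above.
export HF_TOKEN="hf_xxxxxxxxxxxxxxxx"   # hypothetical placeholder value
```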
System requirements#
Customization has higher minimum system requirements than inference:
| Model | GPU Memory | RAM | OS | CPU |
|---|---|---|---|---|
| black-forest-labs/flux.1-dev | 16 GB | 50 GB | Linux/WSL2 | x86_64 |
| black-forest-labs/flux.1-schnell | 16 GB | 50 GB | Linux/WSL2 | x86_64 |
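A quick way to confirm your host meets these requirements is to check total GPU memory and system RAM with standard tools, for example:

```bash
# Report the total memory of each GPU (should be at least 16 GB, i.e. ~16384 MiB)
nvidia-smi --query-gpu=name,memory.total --format=csv

# Report total system RAM in gigabytes (should be at least 50 GB)
free -g
```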
About Customizing Models#
NVIDIA NIM for Visual Generative AI offers a range of customization options, including specific model precisions for the inference pipeline components and specific output image resolutions for the best performance.
Building an Optimized TensorRT Engine#
You can build an optimized engine that provides GPU-model-specific optimizations for your host.

Create the cache directory on the host machine:
```bash
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod 1777 $LOCAL_NIM_CACHE
```
Create a directory to store the optimized engine and update the permissions:
```bash
export OUTPUT_DIR=exported_model_dir
mkdir -p "$OUTPUT_DIR"
chmod 1777 $OUTPUT_DIR
```
Build the optimized engine for your GPU model and host. For `flux.1-dev` with Docker:
```bash
docker run -it --rm \
    --runtime=nvidia \
    --gpus '"device=0"' \
    -e NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/output_dir \
    --entrypoint "python3" \
    nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1 \
    optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
```
With Podman:

```bash
podman run -it --rm \
    --device nvidia.com/gpu=all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/output_dir \
    --entrypoint "python3" \
    nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1 \
    optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
```
For `flux.1-schnell` with Docker:

```bash
docker run -it --rm \
    --runtime=nvidia \
    --gpus '"device=0"' \
    -e NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/output_dir \
    --entrypoint "python3" \
    nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0 \
    optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
```
With Podman:

```bash
podman run -it --rm \
    --device nvidia.com/gpu=all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/output_dir \
    --entrypoint "python3" \
    nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0 \
    optimize.py --gpu ${your_gpu_name} --low-vram --export-path /output_dir
```
Refer to the Support Matrix for the precisions you can select with the `--fp4`, `--fp8`, and `--build-t5-fp8` flags.

The `optimize.py` script creates the following directories and files for the engine:

```
$OUTPUT_DIR
├── metadata.json         <- file with metadata needed to run the NIM
├── trt_engines_dir       <- directory with the optimized TRT engines
├── framework_model_dir   <- directory with configuration files for the model (e.g., the diffusion scheduler config)
├── manifest.yaml         <- manifest file with the generated optimized profile that can be used to override the default manifest
└── memory_profile.yaml   <- memory profile with the VRAM, SRAM, and buffer usage of each pipeline stage, used for offloading policy selection
```
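Before starting the server, you can sanity-check that the export completed by listing the output directory and inspecting the generated manifest (file names as shown above):

```bash
# Verify that the optimized engines, metadata, and manifest were written
ls -R "$OUTPUT_DIR"

# Inspect the generated profile that overrides the default manifest
cat "$OUTPUT_DIR/manifest.yaml"
```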
Start the container with the optimized engine directory and manifest:
For `flux.1-dev` with Docker:

```bash
docker run -it --rm --name=nim-server \
    --runtime=nvidia \
    --gpus '"device=0"' \
    -e NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
    -p 8000:8000 \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
    nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
```

With Podman:

```bash
podman run -it --rm --name=nim-server \
    --device nvidia.com/gpu=all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
    -p 8000:8000 \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
    nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
```

For `flux.1-schnell` with Docker:

```bash
docker run -it --rm --name=nim-server \
    --runtime=nvidia \
    --gpus '"device=0"' \
    -e NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
    -p 8000:8000 \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
    nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0
```

With Podman:

```bash
podman run -it --rm --name=nim-server \
    --device nvidia.com/gpu=all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -e NIM_MANIFEST_PATH=/opt/nim/local/manifest.yaml \
    -p 8000:8000 \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/opt/nim/local \
    nvcr.io/nim/black-forest-labs/flux.1-schnell:1.0.0
```
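Once the container reports that it is ready, you can verify the server from another terminal. The sketch below assumes the standard NIM health endpoint `/v1/health/ready` on the mapped port 8000:

```bash
# Query the readiness endpoint; an HTTP 200 response means the server is up
curl -i http://localhost:8000/v1/health/ready
```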
Parameters for the Container#
| Flags | Description |
|---|---|
| `-it` | Run the container interactively with a terminal attached (see Docker docs) |
| `--rm` | Delete the container after it stops (see Docker docs) |
| `--name=nim-server` | Give a name to the NIM container. Use any preferred value |
| `--runtime=nvidia` | Ensure NVIDIA drivers are accessible in the container |
| `--gpus '"device=0"'` | Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, you need to specify which GPU to use. See GPU Enumeration for further information on mounting specific GPUs |
| `-e NGC_API_KEY` | Provide the container with the token necessary to download the required models and resources from NGC |
| `-v "$LOCAL_NIM_CACHE:/opt/nim/.cache/"` | Mount the local `$LOCAL_NIM_CACHE` directory as the model cache inside the container |
| `--entrypoint "python3"` | Change the default entrypoint that starts the NIM server to `python3` |
| `optimize.py --gpu ${your_gpu_name} --export-path /output_dir` | Invocation of the optimization script with its two required parameters |
Parameters for the Optimization Script#
| Parameter | Default Value | Description |
|---|---|---|
| `--export-path` | Required | Path to the output directory where the optimized TRT engines are saved |
| `--gpu` | Required | The GPU model to build the engines for |
| `--height` | 1024 | The optimal height of the generated images. Supported values: {512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280, 1344} |
| `--width` | 1024 | The optimal width of the generated images. Supported values: {512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280, 1344} |
| `--min-height` | HEIGHT | The minimum height of generated images. If not specified, the value of `--height` is used. Supported values: {512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280, 1344} |
| `--max-height` | HEIGHT | The maximum height of generated images. If not specified, the value of `--height` is used. Supported values: {512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280, 1344} |
| `--min-width` | WIDTH | The minimum width of generated images. If not specified, the value of `--width` is used. Supported values: {512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280, 1344} |
| `--max-width` | WIDTH | The maximum width of generated images. If not specified, the value of `--width` is used. Supported values: {512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280, 1344} |
| `--fp4` | | Use the fp4 checkpoint. Available only for GPU compute capability >= 10.0 (Blackwell) |
| `--fp8` | | Use the fp8 checkpoint. Available only for GPU compute capability >= 8.9 (Ada) |
| `--build-t5-fp8` | | Use the fp8 T5 model checkpoint instead of the bf16 one. Independent of the `--fp4` and `--fp8` flags. Available only for GPU compute capability >= 8.9 (Ada) |
| | None | The percentage of the T5 weights that should be streamed from host to device. Useful for reducing device memory usage. Value should be between 0 and 100. Only supported for |
| | local:///opt/nim/local | The source information that is used in the NIM manifest to start the NIM server |
| | | Disables TRT optimization logs |
| `--low-vram` | | DEPRECATED: the offloading policy is automatically selected based on the memory profile |
| | | Disables the end-to-end pipeline run after the engines are built |
| | | Enforce building new engines by removing the old ones |
| | | Disables memory profile generation |
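As an illustration of how these parameters combine, the hypothetical invocation below builds fp8 engines for `flux.1-dev` optimized for 768x1344 portrait images; the GPU name placeholder comes from the earlier commands, and the resolution and precision values are only examples you should adjust for your hardware and the Support Matrix:

```bash
# Example only: build fp8 engines tuned for a 768x1344 output resolution.
# ${your_gpu_name} is the same placeholder used in the build commands above.
docker run -it --rm \
    --runtime=nvidia \
    --gpus '"device=0"' \
    -e NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -v $(pwd)/$OUTPUT_DIR:/output_dir \
    --entrypoint "python3" \
    nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1 \
    optimize.py --gpu ${your_gpu_name} \
        --export-path /output_dir \
        --height 1344 --width 768 \
        --fp8
```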