Getting Started#

Prerequisites#

  • Refer to the Support Matrix to make sure that you have the supported hardware and software stack.

  • An NGC personal API key. The NIM microservice uses the API key to download models from NVIDIA NGC. Refer to Generating a Personal API Key in the NVIDIA NGC User Guide for more information.

    When you create an NGC personal API key, select at least NGC Catalog from the Services Included menu. You can select additional services to use the key for other purposes.

Model Specific Credentials#

To access the FLUX.1-dev models, read and accept the FLUX.1-dev, FLUX.1-Canny-dev, FLUX.1-Depth-dev, and FLUX.1-dev-onnx License Agreements and Acceptable Use Policy.

Create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
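
As an optional check that is not part of the NIM setup, you can confirm the token is valid by querying the Hugging Face whoami endpoint. This assumes the token is exported as HF_TOKEN, as described later in this guide:

    # Optional: prints your Hugging Face account details if the token is valid.
    curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2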

Running on Windows#

You can run NVIDIA NIM for Visual Generative AI on an RTX Windows system with Windows Subsystem for Linux (WSL).

Note

Support for Visual Generative AI NIMs on WSL is in Public Beta.

  1. Refer to the NVIDIA NIM on WSL documentation for setup instructions.

  2. Refer to the Supported Models to make sure hardware and software requirements are met.

    By default, WSL has access to half of the system RAM. To change the memory available to WSL, create a .wslconfig file in your home directory, C:\Users\<UserName>, with the following content:

    # Settings apply across all Linux distros running on WSL
    [wsl2]
    
    # Limit WSL to no more than 38GB of RAM; specify a whole number with a GB or MB suffix
    memory=38GB
    

    Restart WSL instances to apply the configuration:

    wsl --shutdown
    

    For further customization of your WSL setup, refer to WSL configuration. A quick way to verify the new memory limit from inside WSL is shown after this list.

  3. Use the podman command examples in the following section.
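
The following check is optional and not part of the official setup; it is a minimal way to confirm that the .wslconfig memory limit took effect after the restart:

    # Run inside the WSL distribution: the reported total should reflect the .wslconfig limit.
    free -h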

Starting the NIM Container#

  1. Export your personal credentials as environment variables:

    export NGC_API_KEY="..."
    export HF_TOKEN="..."
    

    A more secure alternative is to use a password manager, such as pass; a brief sketch appears after these steps.

  2. Log in to NVIDIA NGC so that you can pull the NIM container:

    echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
    
    echo "$NGC_API_KEY" | podman login nvcr.io --username '$oauthtoken' --password-stdin
    

    Use $oauthtoken as the user name and $NGC_API_KEY as the password. The $oauthtoken user name indicates that you authenticate with an API key and not a user name and password.

  3. Start the NIM container with one of the Visual Generative AI models:

    # Create the cache directory on the host machine.
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"
    chmod 777 "$LOCAL_NIM_CACHE"
    
    docker run -it --rm --name=nim-server \
       --runtime=nvidia \
       --gpus='"device=0"' \
       -e NGC_API_KEY=$NGC_API_KEY \
       -e HF_TOKEN=$HF_TOKEN \
       -p 8000:8000 \
       -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
       nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
    
    # Create the cache directory on the host machine.
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"
    chmod 777 "$LOCAL_NIM_CACHE"
    
    podman run -it --rm --name=nim-server \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
    

    You can specify the desired variant of FLUX by adding -e NIM_MODEL_VARIANT=<your variant>. Available variants are base, canny, depth, and their combinations, such as base+depth.

    When you run the preceding command, the container downloads the model, initializes a NIM inference pipeline, and performs a pipeline warm-up, which typically requires up to three minutes. The warm-up is complete when the container logs show Pipeline warmup: start/done.

  4. Optional: Confirm the service is ready to respond to inference requests:

    curl -X GET http://localhost:8000/v1/health/ready
    

    Example Output

    {"status":"ready"}
    
  5. Send an inference request:

    Select an example according to the deployed model variant: the first example is for the base variant, the second for canny, and the third for depth.

    invoke_url="http://localhost:8000/v1/infer"
    
    output_image_path="result.jpg"
    
    response=$(curl -X POST $invoke_url \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        -d '{
              "prompt": "A simple coffee shop interior",
              "mode": "base",
              "seed": 0,
              "steps": 50
            }')
    response_body=$(echo "$response" | awk '/{/,0')
    echo "$response_body" | jq -r '.artifacts[0].base64' | base64 --decode > "$output_image_path"
    
    invoke_url="http://localhost:8000/v1/infer"
    
    input_image_path="input.jpg"
    # download an example image
    curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
    image_b64=$(base64 -w 0 $input_image_path)
    
    echo '{
        "prompt": "A simple coffee shop interior",
        "mode": "canny",
        "image": "data:image/png;base64,'${image_b64}'",
        "preprocess_image": true,
        "seed": 0,
        "steps": 50
    }' > payload.json
    
    output_image_path="result.jpg"
    
    response=$(curl -X POST $invoke_url \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        -d @payload.json )
    response_body=$(echo "$response" | awk '/{/,0')
    echo "$response_body" | jq -r '.artifacts[0].base64' | base64 --decode > "$output_image_path"
    
    invoke_url="http://localhost:8000/v1/infer"
    
    input_image_path="input.jpg"
    # download an example image
    curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
    image_b64=$(base64 -w 0 $input_image_path)
    
    echo '{
        "prompt": "A simple coffee shop interior",
        "mode": "depth",
        "image": "data:image/png;base64,'${image_b64}'",
        "preprocess_image": true,
        "seed": 0,
        "steps": 50
    }' > payload.json
    
    output_image_path="result.jpg"
    
    response=$(curl -X POST $invoke_url \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        -d @payload.json )
    response_body=$(echo "$response" | awk '/{/,0')
    echo "$response_body" | jq -r '.artifacts[0].base64' | base64 --decode > "$output_image_path"
    

    The prompt parameter is the description of the image to generate. The image parameter takes an input image in base64 format, and preprocess_image indicates whether the image should be preprocessed into Canny edges or a depth map according to the mode. The seed parameter governs the generation process; use 0 to generate a new image on each call.

    Refer to the API Reference for parameter descriptions.
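
The following is a minimal sketch of the password-manager alternative mentioned in step 1. It assumes you use pass and have stored the keys under the hypothetical entry names nim/ngc-api-key and nim/hf-token:

    # Hypothetical pass entry names; substitute the names you actually used.
    export NGC_API_KEY="$(pass show nim/ngc-api-key)"
    export HF_TOKEN="$(pass show nim/hf-token)"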

Runtime Parameters for the Container#

| Flags | Description |
|-------|-------------|
| -it | --interactive + --tty (refer to the Docker documentation). |
| --rm | Delete the container after it stops (refer to the Docker documentation). |
| --name=<container-name> | Give a name to the NIM container. Use any preferred value. |
| --runtime=nvidia | Ensure NVIDIA drivers are accessible in the container. |
| --gpus '"device=0"' | Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, specify which GPU to use. Refer to GPU Enumeration for more information on mounting specific GPUs. |
| -e NGC_API_KEY=$NGC_API_KEY | Provide the container with the API key required to download models and resources from NGC. |
| -e NIM_MODEL_PROFILE=<profile> | Specify the profile to load. Refer to Models for information about the available profiles. |
| -e NIM_MODEL_VARIANT=<variant> | Specify the preferred model variant. By default, the container selects the first variant available for the host GPU. |
| -e NIM_OFFLOADING_POLICY=<offloading_policy> | Specify the preferred offloading policy: disk offloads all models to disk, system_ram offloads all models to system RAM (SRAM), and none disables offloading. Refer to NIM Offloading Policies for more information. |
| -p 8000:8000 | Forward the port where the NIM HTTP server is published inside the container so that it can be accessed from the host system. The left-hand side of : is the host port (8000 here); the right-hand side is the container port where the NIM HTTP server is published. |
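
As an illustration of how these flags combine, the following sketch extends the run command from the previous section with an explicit variant and offloading policy; the values shown (base+depth, system_ram) are examples rather than defaults:

    docker run -it --rm --name=nim-server \
       --runtime=nvidia \
       --gpus='"device=0"' \
       -e NGC_API_KEY=$NGC_API_KEY \
       -e HF_TOKEN=$HF_TOKEN \
       -e NIM_MODEL_VARIANT=base+depth \
       -e NIM_OFFLOADING_POLICY=system_ram \
       -p 8000:8000 \
       -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
       nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1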

NIM Offloading Policies#

Visual GenAI NIMs support multiple model offloading policies, allowing optimization of model deployment based on specific use cases and host system resources.

The following offloading policies are currently supported:

| Policy | Description | Performance Impact | SRAM Usage | VRAM Usage |
|--------|-------------|--------------------|------------|------------|
| disk | Offloads all models to the disk, reducing the memory footprint of the NIM. | High | Low | Low |
| system_ram | Offloads all models to the system RAM (SRAM), providing faster access to the models compared to disk storage. | Medium | High | Low |
| none | Disables offloading, storing all models in VRAM. | - | Low | High |
| default | Automatically selects the best offloading policy based on the host system's resources. | Varies | Varies | Varies |

The offloading policy can be selected with the NIM_OFFLOADING_POLICY environment variable (for example, -e NIM_OFFLOADING_POLICY=system_ram in the container run command). When you set this variable to one of the supported policies, the NIM uses the specified policy to manage model offloading.

For detailed information on VRAM and SRAM usage for each policy, refer to the Support Matrix.

Stopping the Container#

The following commands stop and remove the running NIM container. Use docker or podman according to your container runtime:

docker stop nim-server
docker rm nim-server
podman stop nim-server
podman rm nim-server

Next Steps#

  • Configuration for environment variables and command-line arguments.

  • Customization to build a custom engine for your GPU model and host.