Getting Started#
Prerequisites#
Refer to the Support Matrix to make sure that you have the supported hardware and software stack.
An NGC personal API key. The NIM microservice uses the API key to download models from NVIDIA NGC. Refer to Generating a Personal API Key in the NVIDIA NGC User Guide for more information.
When you create an NGC personal API key, select at least NGC Catalog from the Services Included menu. You can specify more services to use the key for additional purposes.
Model Specific Credentials#
To access the FLUX.1-dev model, read and accept the FLUX.1-dev, FLUX.1-Canny-dev, FLUX.1-Depth-dev, and FLUX.1-dev-onnx License Agreements and Acceptable Use Policy.
Create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
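Optionally, you can sanity-check the token against the public Hugging Face Hub whoami endpoint before starting the container. This check is independent of the NIM and assumes the token is exported as HF_TOKEN, as described later in Starting the NIM Container:

```
# Optional sanity check: confirm the Hugging Face token is valid.
# Queries the public Hugging Face Hub API; not part of the NIM itself.
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
```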
Running on Windows#
You can run NVIDIA NIM for Visual Generative AI on an RTX Windows system with Windows Subsystem for Linux (WSL).
Note
Support for Visual Generative AI NIMs on WSL is in Public Beta.
Refer to the NVIDIA NIM on WSL documentation for setup instructions.
Refer to the Supported Models to make sure hardware and software requirements are met.
By default, WSL has access to half of the system RAM. To change the memory available to WSL, create a .wslconfig file in the home directory (C:\Users\<UserName>) with the following content:

```
# Settings apply across all Linux distros running on WSL
[wsl2]

# Limits RAM memory to use no more than 38GB, this can be set as whole numbers using GB or MB
memory=38GB
```
Restart WSL instances to apply the configuration:
wsl --shutdown
For further customization of your WSL setup refer to WSL configuration.
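To confirm that the new limit is in effect, you can check the total memory visible inside the WSL distribution, for example:

```
# Run inside the WSL distribution; the reported total should reflect the
# memory limit set in .wslconfig (38GB in the example above).
free -h
```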
Use the podman command examples in the following section.
Starting the NIM Container#
Export your personal credentials as environment variables:
```
export NGC_API_KEY="..."
export HF_TOKEN="..."
```
A more secure alternative is to use a password manager, such as pass.
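For example, with pass you could keep the keys in your password store and load them into the environment on demand. The entry names below are only illustrative; substitute the paths used in your own store:

```
# Illustrative pass entry names; substitute the paths used in your own store.
export NGC_API_KEY="$(pass show nvidia/ngc-api-key)"
export HF_TOKEN="$(pass show huggingface/token)"
```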
Log in to NVIDIA NGC so that you can pull the NIM container:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
echo "$NGC_API_KEY" | podman login nvcr.io --username '$oauthtoken' --password-stdin
Use $oauthtoken as the user name and $NGC_API_KEY as the password. The $oauthtoken user name indicates that you authenticate with an API key and not a user name and password.

Start the NIM container with one of the Visual Generative AI models:
```
# Docker
# Create the cache directory on the host machine.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 $LOCAL_NIM_CACHE

docker run -it --rm --name=nim-server \
  --runtime=nvidia \
  --gpus='"device=0"' \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
```
```
# Podman
# Create the cache directory on the host machine.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 $LOCAL_NIM_CACHE

podman run -it --rm --name=nim-server \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
```
You can specify the desired variant of FLUX by adding -e NIM_MODEL_VARIANT=<your variant>. Available variants are base, canny, depth, and their combinations, such as base+depth.

When you run the preceding command, the container downloads the model, initializes a NIM inference pipeline, and performs a pipeline warm up. A pipeline warm up typically requires up to three minutes. The warm up is complete when the container logs show Pipeline warmup: start/done.

Optional: Confirm the service is ready to respond to inference requests:
$ curl -X GET http://localhost:8000/v1/health/ready
Example Output
{"status":"ready"}
Send an inference request:
Select an example according to the deployed model variant.
```
invoke_url="http://localhost:8000/v1/infer"
output_image_path="result.jpg"

response=$(curl -X POST $invoke_url \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A simple coffee shop interior",
    "mode": "base",
    "seed": 0,
    "steps": 50
  }')

response_body=$(echo "$response" | awk '/{/,EOF-1')
echo $response_body | jq .artifacts[0].base64 | tr -d '"' | base64 --decode > $output_image_path
```
```
invoke_url="http://localhost:8000/v1/infer"
input_image_path="input.jpg"

# download an example image
curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
image_b64=$(base64 -w 0 $input_image_path)

echo '{
  "prompt": "A simple coffee shop interior",
  "mode": "canny",
  "image": "data:image/png;base64,'${image_b64}'",
  "preprocess_image": true,
  "seed": 0,
  "steps": 50
}' > payload.json

output_image_path="result.jpg"

response=$(curl -X POST $invoke_url \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d @payload.json )

response_body=$(echo "$response" | awk '/{/,EOF-1')
echo $response_body | jq .artifacts[0].base64 | tr -d '"' | base64 --decode > $output_image_path
```
```
invoke_url="http://localhost:8000/v1/infer"
input_image_path="input.jpg"

# download an example image
curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
image_b64=$(base64 -w 0 $input_image_path)

echo '{
  "prompt": "A simple coffee shop interior",
  "mode": "depth",
  "image": "data:image/png;base64,'${image_b64}'",
  "preprocess_image": true,
  "seed": 0,
  "steps": 50
}' > payload.json

output_image_path="result.jpg"

response=$(curl -X POST $invoke_url \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d @payload.json )

response_body=$(echo "$response" | awk '/{/,EOF-1')
echo $response_body | jq .artifacts[0].base64 | tr -d '"' | base64 --decode > $output_image_path
```
The prompt parameter is the description of the image to generate. The image parameter takes an input image in base64 format, and preprocess_image indicates whether the image should be preprocessed into canny edges or a depth map according to the mode. The seed parameter governs the generation process; use 0 to generate a new image on each call.
Refer to the API Reference for parameter descriptions.
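For example, the following sketch (base variant, reusing the request shown above) fixes three different non-zero seed values, since 0 requests a new image on each call. It assumes jq is installed and the base variant is deployed:

```
# Sketch: request the base variant several times with different fixed seeds.
invoke_url="http://localhost:8000/v1/infer"

for seed in 1 2 3; do
  curl -s -X POST "$invoke_url" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "A simple coffee shop interior", "mode": "base", "seed": '"$seed"', "steps": 50}' \
    | jq -r '.artifacts[0].base64' | base64 --decode > "result_seed_${seed}.jpg"
done
```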
Runtime Parameters for the Container#
| Flags | Description |
|---|---|
| -it | Run the container with an interactive terminal (refer to Docker documentation). |
| --rm | Delete the container after it stops (refer to Docker documentation). |
| --name=nim-server | Give a name to the NIM container. Use any preferred value. |
| --runtime=nvidia | Ensure NVIDIA drivers are accessible in the container. |
| --gpus '"device=0"' | Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, you need to specify which GPU to use. See GPU Enumeration for further information on mounting specific GPUs. |
| -e NGC_API_KEY | Provide the container with the token necessary to download models and resources from NGC. |
| -e NIM_MODEL_PROFILE | Specify the profile to load. Refer to Models for information about the available profiles. |
| -e NIM_MODEL_VARIANT | Specify the preferred model variant to select. By default, the container selects the first variant available for the host GPU model. |
| -e NIM_OFFLOADING_POLICY | Specify the preferred offloading policy: disk, system_ram, none, or default. Refer to NIM Offloading Policies. |
| -p 8000:8000 | Forward the port where the NIM HTTP server is published inside the container so it can be accessed from the host system. The left-hand side of the mapping is the host port; the right-hand side is the port the server uses inside the container (8000 in the examples above). |
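As an illustration of combining these flags (values chosen arbitrarily for this sketch), the following starts the depth variant and publishes the server on host port 9000 instead of 8000:

```
# Sketch: run the depth variant and map host port 9000 to container port 8000.
docker run -it --rm --name=nim-server \
  --runtime=nvidia \
  --gpus='"device=0"' \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MODEL_VARIANT=depth \
  -p 9000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
```

With this mapping, the health and inference endpoints are reached at http://localhost:9000 on the host.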
NIM Offloading Policies#
Visual GenAI NIMs support multiple model offloading policies, allowing optimization of model deployment based on specific use cases and host system resources.
The following offloading policies are currently supported:
| Policy | Description | Performance Impact | SRAM Usage | VRAM Usage |
|---|---|---|---|---|
| disk | Offloads all models to the disk, reducing the memory footprint of the NIM. | High | Low | Low |
| system_ram | Offloads all models to the system RAM (SRAM), providing faster access to the models compared to disk storage. | Medium | High | Low |
| none | Disables offloading, storing all models in VRAM. | - | Low | High |
| default | Automatically selects the best offloading policy based on the host system's resources. | Varies | Varies | Varies |
Select the offloading policy with the NIM_OFFLOADING_POLICY environment variable. When you set this variable to one of the supported policies, the NIM uses the specified policy to manage model offloading.

For detailed information on VRAM and SRAM usage for each policy, refer to the Support Matrix.
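For example, the following sketch (Podman shown; the Docker form is analogous) pins the policy to system_ram when starting the container:

```
# Sketch: start the container with the system_ram offloading policy.
podman run -it --rm --name=nim-server \
  --device nvidia.com/gpu=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_OFFLOADING_POLICY=system_ram \
  -p 8000:8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
  nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
```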
Stopping the Container#
The following commands stop and remove the running container.
```
# Docker
docker stop nim-server
docker rm nim-server

# Podman
podman stop nim-server
podman rm nim-server
```
Next Steps#
Refer to Configuration for environment variables and command-line arguments.
Refer to Customization to build a custom engine for your GPU model and host.