Getting Started#

Prerequisites#

  • Refer to the Support Matrix to make sure that you have the supported hardware and software stack.

  • An NGC personal API key. The NIM microservice uses the API key to download models from NVIDIA NGC. Refer to Generating a Personal API Key in the NVIDIA NGC User Guide for more information.

    When you create an NGC personal API key, select at least NGC Catalog from the Services Included menu. You can select additional services to use the key for other purposes.

Model Specific Credentials#

To access the FLUX.1-dev models, read and accept the FLUX.1-dev, FLUX.1-Canny-dev, FLUX.1-Depth-dev, and FLUX.1-dev-onnx License Agreements and Acceptable Use Policy.

Create a new Hugging Face token with the "Read access to contents of all public gated repos you can access" permission.
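
As an optional check that is not part of the NIM setup, you can confirm the token is valid by querying the Hugging Face whoami endpoint. This assumes the token is exported as HF_TOKEN, as described later in this guide:

    # Optional: prints your Hugging Face account details if the token is valid.
    curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2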

Running on Windows#

You can run NVIDIA NIM for Visual Generative AI on an RTX Windows system with Windows Subsystem for Linux (WSL).

Note

Support for Visual Generative AI NIMs on WSL is in Public Beta.

  1. Refer to the NVIDIA NIM on WSL documentation for setup instructions.

  2. Refer to the Supported Models to make sure hardware and software requirements are met.

    By default, WSL has access to half of the system RAM. To change the memory available to WSL, create a .wslconfig file in your home directory, C:\Users\<UserName>, with the following content:

    # Settings apply across all Linux distros running on WSL
    [wsl2]
    
    # Limit WSL to no more than 38GB of RAM; specify a whole number with a GB or MB suffix
    memory=38GB
    

    Restart WSL instances to apply the configuration:

    wsl --shutdown
    

    For further customization of your WSL setup, refer to WSL configuration. A quick way to verify the new memory limit from inside WSL is shown after this list.

  3. Use the podman command examples in the following section.
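
The following check is optional and not part of the official setup; it is a minimal way to confirm that the .wslconfig memory limit took effect after the restart:

    # Run inside the WSL distribution: the reported total should reflect the .wslconfig limit.
    free -h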

Starting the NIM Container#

  1. Export your personal credentials as environment variables:

    export NGC_API_KEY="..."
    export HF_TOKEN="..."
    

    A more secure alternative is to use a password manager, such as pass; a brief sketch appears after these steps.

  2. Log in to NVIDIA NGC so that you can pull the NIM container:

    echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
    
    echo "$NGC_API_KEY" | podman login nvcr.io --username '$oauthtoken' --password-stdin
    

    Use $oauthtoken as the user name and $NGC_API_KEY as the password. The $oauthtoken user name indicates that you authenticate with an API key and not a user name and password.

  3. Start the NIM container with one of the Visual Generative AI models:

    # Create the cache directory on the host machine.
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"
    chmod 777 "$LOCAL_NIM_CACHE"
    
    docker run -it --rm --name=nim-server \
       --runtime=nvidia \
       --gpus='"device=0"' \
       -e NGC_API_KEY=$NGC_API_KEY \
       -e HF_TOKEN=$HF_TOKEN \
       -p 8000:8000 \
       -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
       nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
    
    # Create the cache directory on the host machine.
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"
    chmod 777 "$LOCAL_NIM_CACHE"
    
    podman run -it --rm --name=nim-server \
      --device nvidia.com/gpu=all \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e HF_TOKEN=$HF_TOKEN \
      -p 8000:8000 \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
      nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1
    

    You can specify the desired variant of FLUX by adding -e NIM_MODEL_VARIANT=<your variant>. Available variants are base, canny, depth, and their combinations, such as base+depth.

    When you run the preceding command, the container downloads the model, initializes a NIM inference pipeline, and performs a pipeline warm-up, which typically requires up to three minutes. The warm-up is complete when the container logs show Pipeline warmup: start/done.

  4. Optional: Confirm the service is ready to respond to inference requests:

    curl -X GET http://localhost:8000/v1/health/ready
    

    Example Output

    {"status":"ready"}
    
  5. Send an inference request:

    Select an example according to the deployed model variant: the first example is for the base variant, the second for canny, and the third for depth.

    invoke_url="http://localhost:8000/v1/infer"
    
    output_image_path="result.jpg"
    
    response=$(curl -X POST $invoke_url \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        -d '{
              "prompt": "A simple coffee shop interior",
              "mode": "base",
              "seed": 0,
              "steps": 50
            }')
    response_body=$(echo "$response" | awk '/{/,0')
    echo "$response_body" | jq -r '.artifacts[0].base64' | base64 --decode > "$output_image_path"
    
    invoke_url="http://localhost:8000/v1/infer"
    
    input_image_path="input.jpg"
    # download an example image
    curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
    image_b64=$(base64 -w 0 $input_image_path)
    
    echo '{
        "prompt": "A simple coffee shop interior",
        "mode": "canny",
        "image": "data:image/png;base64,'${image_b64}'",
        "preprocess_image": true,
        "seed": 0,
        "steps": 50
    }' > payload.json
    
    output_image_path="result.jpg"
    
    response=$(curl -X POST $invoke_url \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        -d @payload.json )
    response_body=$(echo "$response" | awk '/{/,0')
    echo "$response_body" | jq -r '.artifacts[0].base64' | base64 --decode > "$output_image_path"
    
    invoke_url="http://localhost:8000/v1/infer"
    
    input_image_path="input.jpg"
    # download an example image
    curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
    image_b64=$(base64 -w 0 $input_image_path)
    
    echo '{
        "prompt": "A simple coffee shop interior",
        "mode": "depth",
        "image": "data:image/png;base64,'${image_b64}'",
        "preprocess_image": true,
        "seed": 0,
        "steps": 50
    }' > payload.json
    
    output_image_path="result.jpg"
    
    response=$(curl -X POST $invoke_url \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        -d @payload.json )
    response_body=$(echo "$response" | awk '/{/,0')
    echo "$response_body" | jq -r '.artifacts[0].base64' | base64 --decode > "$output_image_path"
    

    The prompt parameter is the description of the image to generate. The image parameter takes an input image in base64 format, and preprocess_image indicates whether the image should be preprocessed into Canny edges or a depth map according to the mode. The seed parameter governs the generation process; use 0 to generate a new image on each call.

    Refer to the API Reference for parameter descriptions.
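
The following is a minimal sketch of the password-manager alternative mentioned in step 1. It assumes you use pass and have stored the keys under the hypothetical entry names nim/ngc-api-key and nim/hf-token:

    # Hypothetical pass entry names; substitute the names you actually used.
    export NGC_API_KEY="$(pass show nim/ngc-api-key)"
    export HF_TOKEN="$(pass show nim/hf-token)"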

Runtime Parameters for the Container#

| Flags | Description |
|-------|-------------|
| -it | --interactive + --tty (refer to the Docker documentation). |
| --rm | Delete the container after it stops (refer to the Docker documentation). |
| --name=<container-name> | Give a name to the NIM container. Use any preferred value. |
| --runtime=nvidia | Ensure NVIDIA drivers are accessible in the container. |
| --gpus '"device=0"' | Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, specify which GPU to use. Refer to GPU Enumeration for more information on mounting specific GPUs. |
| -e NGC_API_KEY=$NGC_API_KEY | Provide the container with the API key required to download models and resources from NGC. |
| -e NIM_MODEL_PROFILE=<profile> | Specify the profile to load. Refer to Models for information about the available profiles. |
| -e NIM_MODEL_VARIANT=<variant> | Specify the preferred model variant. By default, the container selects the first variant available for the host GPU. |
| -e NIM_OFFLOADING_POLICY=<offloading_policy> | Specify the preferred offloading policy: disk offloads all models to disk, system_ram offloads all models to system RAM (SRAM), and none disables offloading. Refer to NIM Offloading Policies for more information. |
| -p 8000:8000 | Forward the port where the NIM HTTP server is published inside the container so that it can be accessed from the host system. The left-hand side of : is the host port (8000 here); the right-hand side is the container port where the NIM HTTP server is published. |
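
As an illustration of how these flags combine, the following sketch extends the run command from the previous section with an explicit variant and offloading policy; the values shown (base+depth, system_ram) are examples rather than defaults:

    docker run -it --rm --name=nim-server \
       --runtime=nvidia \
       --gpus='"device=0"' \
       -e NGC_API_KEY=$NGC_API_KEY \
       -e HF_TOKEN=$HF_TOKEN \
       -e NIM_MODEL_VARIANT=base+depth \
       -e NIM_OFFLOADING_POLICY=system_ram \
       -p 8000:8000 \
       -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
       nvcr.io/nim/black-forest-labs/flux.1-dev:1.0.1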

NIM Offloading Policies#

Visual GenAI NIMs support multiple model offloading policies, allowing optimization of model deployment based on specific use cases and host system resources.

The following offloading policies are currently supported:

| Policy | Description | Performance Impact | SRAM Usage | VRAM Usage |
|--------|-------------|--------------------|------------|------------|
| disk | Offloads all models to the disk, reducing the memory footprint of the NIM. | High | Low | Low |
| system_ram | Offloads all models to the system RAM (SRAM), providing faster access to the models compared to disk storage. | Medium | High | Low |
| none | Disables offloading, storing all models in VRAM. | - | Low | High |
| default | Automatically selects the best offloading policy based on the host system's resources. | Varies | Varies | Varies |

The offloading policy can be selected with the NIM_OFFLOADING_POLICY environment variable (for example, -e NIM_OFFLOADING_POLICY=system_ram in the container run command). When you set this variable to one of the supported policies, the NIM uses the specified policy to manage model offloading.

For detailed information on VRAM and SRAM usage for each policy, refer to the Support Matrix.

Stopping the Container#

The following commands stop and remove the running NIM container. Use docker or podman according to your container runtime:

docker stop nim-server
docker rm nim-server
podman stop nim-server
podman rm nim-server

Next Steps#

  • Configuration for environment variables and command-line arguments.

  • Customization to build a custom engine for your GPU model and host.