Run:ai#
This page describes how to deploy NVIDIA NIM for LLMs on Run:ai.
Prerequisites#
Run:ai access (SaaS or self-hosted) with GPU capacity for inference workloads.
Access to a Run:ai project where you can create inference workloads.
NGC credentials for pulling model images and downloading artifacts.
Note
For Run:ai platform setup and operations guidance, refer to the Welcome to NVIDIA Run:ai Documentation page.
Deploy NIM on Run:ai#
For baseline Run:ai workflow details, refer to Deploy Run:ai Inference Workloads with NVIDIA NIM.
In the Run:ai UI, create an inference workload by performing the following steps:
Create a new inference workload.
Select NVIDIA NIM as the inference type.
Set the workload name and credentials.
Set the required GPU count for your model.
Optional: Use advanced settings to specify a custom NIM image.
Create the inference workload.
Tip
Start with one GPU and scale after you confirm successful model loading and readiness.
Optional: Enable LoRA on Run:ai#
Run:ai supports mounting a data source (for example, a Kubernetes PVC when using a Kubernetes-based cluster). You can use this flow to provide LoRA adapters to NIM.
Create a Data Source#
In the Run:ai UI, open Workload Manager > Assets > Data & Storage > Data Sources, and then create a new PVC-backed (or equivalent) data source.
When Run:ai uses a Kubernetes cluster, it creates a PVC for the data source. Populate that volume with LoRA adapter files by using the same directory structure described in Optional: Enable LoRA With Helm:
/loras/
adapter_name/
adapter_config.json
adapter_model.safetensors # or adapter_model.bin
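Before deploying, you can sanity-check that the populated volume follows this layout. The following is an illustrative sketch, not part of the Run:ai or NIM tooling; the root path and adapter names are placeholders.

```python
from pathlib import Path

def find_lora_adapters(root):
    """Return names of adapter directories under `root` that contain
    an adapter_config.json plus a safetensors or bin weights file,
    matching the layout NIM expects under NIM_PEFT_SOURCE."""
    adapters = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue
        has_config = (entry / "adapter_config.json").is_file()
        has_weights = (
            (entry / "adapter_model.safetensors").is_file()
            or (entry / "adapter_model.bin").is_file()
        )
        if has_config and has_weights:
            adapters.append(entry.name)
    return adapters

# Example: find_lora_adapters("/loras") -> ["adapter_name", ...]
```

Running this against the mounted volume lists only complete adapters, so a missing config or weights file is caught before the workload starts.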
Deploy NIM With LoRA Configuration#
Create or update the inference workload, and then:
Set NIM_PEFT_SOURCE to /loras in the runtime environment variables.
Attach the Run:ai data source and mount it at /loras.
Create the inference workload.
Note
If the mounted data source does not contain adapter files, NIM starts normally, but no LoRA adapters are available at runtime.
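One way to confirm which adapters were picked up at runtime is to query the model listing endpoint, where loaded LoRA adapters appear alongside the base model. This is a minimal sketch; the base URL is an assumption standing in for your workload's endpoint.

```python
import json
from urllib.request import urlopen

def model_ids(models_response):
    """Extract model IDs from a /v1/models JSON payload
    (accepts a parsed dict or a raw JSON string)."""
    if isinstance(models_response, str):
        models_response = json.loads(models_response)
    return [m["id"] for m in models_response.get("data", [])]

def list_models(base_url, timeout=10.0):
    """Fetch the model list from a running NIM endpoint.
    `base_url` is a placeholder for the workload's service URL."""
    with urlopen(base_url.rstrip("/") + "/v1/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

If an attached adapter does not show up in the returned IDs, recheck the mount path and the directory structure above.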
Verify Deployment#
Verify workload readiness in the Run:ai UI. A healthy deployment shows the inference workload in a ready state. To confirm the NIM service is responding, call the readiness endpoint (for example, from a client that can reach the workload endpoint). A healthy deployment returns an HTTP 200 response from /v1/health/ready.
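The readiness call can be scripted from any client that can reach the workload endpoint. A minimal sketch, assuming the base URL points at your deployed workload:

```python
from urllib.request import urlopen

READY_PATH = "/v1/health/ready"

def check_ready(base_url, timeout=5.0):
    """Return True if the NIM readiness endpoint answers HTTP 200,
    False on a non-200 response or a connection failure."""
    try:
        with urlopen(base_url.rstrip("/") + READY_PATH, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# Example: check_ready("http://<workload-endpoint>:8000")
```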