Run:ai#
This page describes how to deploy NVIDIA NIM for LLMs on Run:ai.
Prerequisites#
Run:ai access (SaaS or self-hosted) with GPU capacity for inference workloads.
Access to a Run:ai project where you can create inference workloads.
NGC credentials for pulling model images and downloading artifacts.
Note
For Run:ai platform setup and operations guidance, refer to the Welcome to NVIDIA Run:ai Documentation page.
Deploy NIM on Run:ai#
For baseline Run:ai workflow details, refer to Deploy Run:ai Inference Workloads with NVIDIA NIM.
In the Run:ai UI, create an inference workload by performing the following steps:
Create a new inference workload.
Select NVIDIA NIM as the inference type.
Set the workload name and credentials.
Set the required GPU count for your model.
Optional: Use advanced settings to specify a custom NIM image.
Create the inference workload.
Tip
Start with one GPU and scale after you confirm successful model loading and readiness.
Optional: Enable LoRA on Run:ai#
Run:ai supports mounting a data source (for example, a Kubernetes PVC when using a Kubernetes-based cluster). You can use this flow to provide LoRA adapters to NIM.
Create a Data Source#
In the Run:ai UI, open Workload Manager > Assets > Data & Storage > Data Sources, and then create a new PVC-backed (or equivalent) data source.
When Run:ai uses a Kubernetes cluster, it creates a PVC for the data source. Populate that volume with LoRA adapter files by using the same directory structure described in Optional: Enable LoRA With Helm:
/loras/
adapter_name/
adapter_config.json
adapter_model.safetensors # or adapter_model.bin
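Before deploying, you can sanity-check that the populated volume follows this layout. The following is an illustrative sketch, not part of the Run:ai or NIM tooling; the root path and adapter names are placeholders.

```python
from pathlib import Path

def find_lora_adapters(root):
    """Return names of adapter directories under `root` that contain
    an adapter_config.json plus a safetensors or bin weights file,
    matching the layout NIM expects under NIM_PEFT_SOURCE."""
    adapters = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue
        has_config = (entry / "adapter_config.json").is_file()
        has_weights = (
            (entry / "adapter_model.safetensors").is_file()
            or (entry / "adapter_model.bin").is_file()
        )
        if has_config and has_weights:
            adapters.append(entry.name)
    return adapters

# Example: find_lora_adapters("/loras") -> ["adapter_name", ...]
```

Running this against the mounted volume lists only complete adapters, so a missing config or weights file is caught before the workload starts.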
Deploy NIM With LoRA Configuration#
Create or update the inference workload, and then:
Set NIM_PEFT_SOURCE to /loras in the runtime environment variables.
Attach the Run:ai data source and mount it at /loras.
Create the inference workload.
Note
If the mounted data source does not contain adapter files, NIM starts normally, but no LoRA adapters are available at runtime.
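One way to confirm which adapters were picked up at runtime is to query the model listing endpoint, where loaded LoRA adapters appear alongside the base model. This is a minimal sketch; the base URL is an assumption standing in for your workload's endpoint.

```python
import json
from urllib.request import urlopen

def model_ids(models_response):
    """Extract model IDs from a /v1/models JSON payload
    (accepts a parsed dict or a raw JSON string)."""
    if isinstance(models_response, str):
        models_response = json.loads(models_response)
    return [m["id"] for m in models_response.get("data", [])]

def list_models(base_url, timeout=10.0):
    """Fetch the model list from a running NIM endpoint.
    `base_url` is a placeholder for the workload's service URL."""
    with urlopen(base_url.rstrip("/") + "/v1/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

If an attached adapter does not show up in the returned IDs, recheck the mount path and the directory structure above.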
Verify Deployment#
Verify workload readiness in the Run:ai UI. A healthy deployment shows the inference workload in a ready state. To confirm the NIM service is responding, call the readiness endpoint (for example, from a client that can reach the workload endpoint). A healthy deployment returns an HTTP 200 response from /v1/health/ready.
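The readiness call can be scripted from any client that can reach the workload endpoint. A minimal sketch, assuming the base URL points at your deployed workload:

```python
from urllib.request import urlopen

READY_PATH = "/v1/health/ready"

def check_ready(base_url, timeout=5.0):
    """Return True if the NIM readiness endpoint answers HTTP 200,
    False on a non-200 response or a connection failure."""
    try:
        with urlopen(base_url.rstrip("/") + READY_PATH, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# Example: check_ready("http://<workload-endpoint>:8000")
```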