Run:ai#
This page describes how to deploy NVIDIA NIM for LLMs on Run:ai.
Prerequisites#
Before deploying NIM on Run:ai, make sure you have the following:
Run:ai access (SaaS or self-hosted) with GPU capacity for inference workloads
Access to a Run:ai project where you can create inference workloads
An NGC API key for pulling NIM container images and downloading model artifacts
Note
For Run:ai platform setup and operations guidance, refer to the Welcome to NVIDIA Run:ai Documentation.
Deploy NIM on Run:ai#
For baseline Run:ai workflow details, refer to Deploy Run:ai Inference Workloads with NVIDIA NIM.
In the Run:ai UI, create an inference workload with the following settings:
Create a new inference workload.
Select NVIDIA NIM as the inference type.
Set the workload name and credentials.
Set the required GPU count for your model.
Optional: Use advanced settings to specify a custom NIM image.
Create the inference workload.
Tip
Start with one GPU and scale after you confirm successful model loading and readiness.
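After the workload reaches a ready state, you can exercise it with a single request to NIM's OpenAI-compatible chat completions API. The sketch below is a minimal example using only the Python standard library; the endpoint URL and model name are placeholders, so substitute the workload endpoint from the Run:ai UI and the model your NIM image serves.

```python
import json
import urllib.request

# Placeholder values -- replace with your Run:ai workload endpoint and
# the model name served by your NIM container.
NIM_ENDPOINT = "http://localhost:8000"
MODEL_NAME = "meta/llama-3.1-8b-instruct"

def chat(prompt: str, endpoint: str = NIM_ENDPOINT, model: str = MODEL_NAME) -> str:
    """Send one request to NIM's OpenAI-compatible /v1/chat/completions API."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        f"{endpoint}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # The first choice holds the generated assistant message.
    return data["choices"][0]["message"]["content"]
```

If the request succeeds, the model's reply confirms the workload is serving traffic end to end, not just passing the readiness probe.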
Optional: Enable LoRA on Run:ai#
Run:ai supports mounting a data source, such as a Kubernetes PersistentVolumeClaim (PVC) in a Kubernetes-based cluster. You can use this mechanism to provide LoRA adapters to NIM.
Create a Data Source#
In the Run:ai UI, open Workload Manager > Assets > Data & Storage > Data Sources, and then create a new PVC-backed (or equivalent) data source.
When Run:ai uses a Kubernetes cluster, it creates a PVC for the data source. Populate that volume with LoRA adapter files using the same directory structure described in Optional: Enable LoRA With Helm:
/loras/
adapter_name/
adapter_config.json
adapter_model.safetensors # or adapter_model.bin
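One way to stage this layout before copying it onto the PVC is a small script. This is a sketch only: the adapter name is a placeholder, and the files it creates are empty stand-ins for your real `adapter_config.json` and `adapter_model.safetensors` (or `adapter_model.bin`).

```python
from pathlib import Path

def stage_lora_adapter(root: str, adapter_name: str) -> Path:
    """Create the loras/<adapter_name>/ layout that NIM expects.

    `root` is a local staging directory (or the mounted PVC path itself).
    The files created here are empty placeholders -- copy your actual
    adapter_config.json and adapter_model.safetensors into place.
    """
    adapter_dir = Path(root) / adapter_name
    adapter_dir.mkdir(parents=True, exist_ok=True)
    (adapter_dir / "adapter_config.json").touch()
    (adapter_dir / "adapter_model.safetensors").touch()
    return adapter_dir

# Example: stage_lora_adapter("/mnt/loras-pvc", "my-adapter")
```

NIM discovers one adapter per subdirectory, so each adapter you want to serve needs its own directory containing both files.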
Deploy NIM With LoRA Configuration#
Create or update the inference workload, and then:
Set NIM_PEFT_SOURCE to /loras in runtime environment variables.
Attach the Run:ai data source and mount it at /loras.
Create the inference workload.
Note
If the mounted data source does not contain adapter files, NIM starts normally, but no LoRA adapters are available at runtime.
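To confirm which adapters were actually loaded, you can query NIM's OpenAI-compatible `/v1/models` endpoint, which lists the base model plus one entry per available LoRA adapter. The helper below is a minimal stdlib-only sketch; the endpoint URL is whatever address reaches your workload.

```python
import json
import urllib.request

def list_models(endpoint: str) -> list[str]:
    """Return the model IDs served by a NIM instance.

    With LoRA enabled, the list contains the base model ID plus one ID
    per loaded adapter; a list with only the base model means no
    adapters were found in NIM_PEFT_SOURCE.
    """
    with urllib.request.urlopen(f"{endpoint}/v1/models") as resp:
        data = json.load(resp)
    return [model["id"] for model in data["data"]]
```

To route a request to a specific adapter, pass its ID as the `model` field of the completion request instead of the base model ID.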
Verify Deployment#
Verify workload readiness in the Run:ai UI. A healthy deployment shows the inference workload in a ready state. To confirm that the NIM service is responding, call the readiness endpoint from a client that can reach the workload endpoint. A healthy deployment returns an HTTP 200 response from /v1/health/ready.
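The readiness check above can be scripted with the Python standard library. This is a minimal sketch: run it from a client that can reach the workload endpoint, substituting your workload's URL.

```python
import urllib.request
from urllib.error import URLError

def nim_ready(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if the NIM readiness endpoint answers HTTP 200."""
    try:
        with urllib.request.urlopen(
            f"{endpoint}/v1/health/ready", timeout=timeout
        ) as resp:
            return resp.status == 200
    except URLError:
        # Connection refused, DNS failure, or a non-2xx response:
        # the workload is not ready yet.
        return False

# Example: nim_ready("http://localhost:8000")
```

Model loading can take several minutes after the container starts, so it is normal for this check to fail initially and succeed once the workload reaches a ready state.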