Run:ai#

This page describes how to deploy NVIDIA NIM for LLMs on Run:ai.

Prerequisites#

Before deploying NIM on Run:ai, make sure you have the following:

  • Run:ai access (SaaS or self-hosted) with GPU capacity for inference workloads

  • Access to a Run:ai project where you can create inference workloads

  • An NGC API key for pulling NIM container images and downloading model artifacts

Note

For Run:ai platform setup and operations guidance, refer to the Welcome to NVIDIA Run:ai Documentation.
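The NGC API key from the prerequisites is the same key used to authenticate container pulls from nvcr.io, where the registry expects the literal username `$oauthtoken` and the key as the password. As an illustrative sketch (the registry name, the `NGC_API_KEY` environment variable, and the helper name are assumptions based on common NGC usage, not part of Run:ai):

```python
import os
import subprocess

def docker_login_cmd(registry: str = "nvcr.io") -> list[str]:
    """Build a docker login command for an NGC container registry.

    NGC registries authenticate with the literal username '$oauthtoken';
    the NGC API key is supplied as the password via stdin.
    """
    return ["docker", "login", registry, "--username", "$oauthtoken", "--password-stdin"]

if __name__ == "__main__":
    # Requires the docker CLI and NGC_API_KEY set in the environment.
    subprocess.run(
        docker_login_cmd(),
        input=os.environ["NGC_API_KEY"].encode(),
        check=True,
    )
```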

Deploy NIM on Run:ai#

For baseline Run:ai workflow details, refer to Deploy Run:ai Inference Workloads with NVIDIA NIM.

In the Run:ai UI, follow these steps to create the inference workload:

  1. Create a new inference workload.

  2. Select NVIDIA NIM as the inference type.

  3. Set the workload name and credentials.

  4. Set the required GPU count for your model.

  5. Optional: Use advanced settings to specify a custom NIM image.

  6. Create the inference workload.

Tip

Start with one GPU and scale after you confirm successful model loading and readiness.

Optional: Enable LoRA on Run:ai#

Run:ai supports mounting a data source into a workload, such as a Kubernetes PersistentVolumeClaim (PVC) on a Kubernetes-based cluster. You can use this mechanism to provide LoRA adapters to NIM.

Create a Data Source#

In the Run:ai UI, open Workload Manager > Assets > Data & Storage > Data Sources, and then create a new PVC-backed (or equivalent) data source.

When Run:ai uses a Kubernetes cluster, it creates a PVC for the data source. Populate that volume with LoRA adapter files using the same directory structure described in Optional: Enable LoRA With Helm:

/loras/
  adapter_name/
    adapter_config.json
    adapter_model.safetensors   # or adapter_model.bin
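Before attaching the data source, it can help to sanity-check that the populated volume matches this layout. A minimal sketch (the helper name is illustrative and not part of NIM or Run:ai):

```python
from pathlib import Path

def validate_lora_dir(root: str) -> list[str]:
    """Return adapter names under root that match the expected layout.

    Each adapter directory must contain adapter_config.json and either
    adapter_model.safetensors or adapter_model.bin.
    """
    valid = []
    for adapter in sorted(Path(root).iterdir()):
        if not adapter.is_dir():
            continue
        has_config = (adapter / "adapter_config.json").is_file()
        has_weights = any(
            (adapter / name).is_file()
            for name in ("adapter_model.safetensors", "adapter_model.bin")
        )
        if has_config and has_weights:
            valid.append(adapter.name)
    return valid
```

Running this against the directory that backs the data source lists the adapters NIM should be able to discover; anything missing from the output likely has an incomplete file set.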

Deploy NIM With LoRA Configuration#

Create or update the inference workload, and then:

  1. Set NIM_PEFT_SOURCE to /loras in runtime environment variables.

  2. Attach the Run:ai data source and mount it at /loras.

  3. Create the inference workload.
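Once the workload is running with NIM_PEFT_SOURCE set, a LoRA adapter is typically selected per request by passing the adapter's directory name as the model in NIM's OpenAI-compatible API. A hedged sketch of building such a request (the base URL and adapter name are placeholders; the helper name is illustrative):

```python
import json
from urllib import request

def lora_completion_request(base_url: str, adapter: str, prompt: str) -> request.Request:
    """Build a /v1/completions request that targets a specific LoRA adapter.

    With NIM, a loaded LoRA adapter is addressed by using its name
    as the "model" field of the request payload.
    """
    payload = {"model": adapter, "prompt": prompt, "max_tokens": 64}
    return request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires a reachable workload endpoint):
# req = lora_completion_request("http://<workload-endpoint>", "adapter_name", "Hello")
# print(request.urlopen(req).read())
```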

Note

If the mounted data source does not contain adapter files, NIM starts normally, but no LoRA adapters are available at runtime.

Verify Deployment#

Verify workload readiness in the Run:ai UI: a healthy deployment shows the inference workload in the Ready state. To confirm that the NIM service is responding, call the readiness endpoint from a client that can reach the workload endpoint; a healthy deployment returns an HTTP 200 response from /v1/health/ready.
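The readiness check above can be scripted. A minimal sketch (the base URL is a placeholder for your workload endpoint, and the helper name is illustrative):

```python
from urllib import error, request

READY_PATH = "/v1/health/ready"

def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the NIM readiness endpoint answers with HTTP 200."""
    try:
        with request.urlopen(base_url.rstrip("/") + READY_PATH, timeout=timeout) as resp:
            return resp.status == 200
    except error.URLError:
        return False

# Usage:
# print(is_ready("http://<workload-endpoint>"))
```

Polling this in a loop is a simple way to gate downstream traffic until the model has finished loading.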