Run:ai#

This page describes how to deploy NVIDIA NIM for LLMs on Run:ai.

Prerequisites#

Before deploying NIM on Run:ai, make sure you have the following:

  • Run:ai access (SaaS or self-hosted) with GPU capacity for inference workloads

  • Access to a Run:ai project where you can create inference workloads

  • An NGC API key for pulling NIM container images and downloading model artifacts

Note

For Run:ai platform setup and operations guidance, refer to the Welcome to NVIDIA Run:ai Documentation.
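The NGC API key from the prerequisites is the same key used to authenticate container pulls from nvcr.io, where the registry expects the literal username `$oauthtoken` and the key as the password. As an illustrative sketch (the registry name, the `NGC_API_KEY` environment variable, and the helper name are assumptions based on common NGC usage, not part of Run:ai):

```python
import os
import subprocess

def docker_login_cmd(registry: str = "nvcr.io") -> list[str]:
    """Build a docker login command for an NGC container registry.

    NGC registries authenticate with the literal username '$oauthtoken';
    the NGC API key is supplied as the password via stdin.
    """
    return ["docker", "login", registry, "--username", "$oauthtoken", "--password-stdin"]

if __name__ == "__main__":
    # Requires the docker CLI and NGC_API_KEY set in the environment.
    subprocess.run(
        docker_login_cmd(),
        input=os.environ["NGC_API_KEY"].encode(),
        check=True,
    )
```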

Deploy NIM on Run:ai#

For baseline Run:ai workflow details, refer to Deploy Run:ai Inference Workloads with NVIDIA NIM.

In the Run:ai UI, follow these steps to create the inference workload:

  1. Create a new inference workload.

  2. Select NVIDIA NIM as the inference type.

  3. Set the workload name and credentials.

  4. Set the required GPU count for your model.

  5. Optional: Use advanced settings to specify a custom NIM image.

  6. Create the inference workload.

Tip

Start with one GPU and scale after you confirm successful model loading and readiness.

Optional: Enable LoRA on Run:ai#

Run:ai supports mounting a data source into a workload, such as a Kubernetes PersistentVolumeClaim (PVC) on a Kubernetes-based cluster. You can use this mechanism to provide LoRA adapters to NIM.

Create a Data Source#

In the Run:ai UI, open Workload Manager > Assets > Data & Storage > Data Sources, and then create a new PVC-backed (or equivalent) data source.

When Run:ai uses a Kubernetes cluster, it creates a PVC for the data source. Populate that volume with LoRA adapter files using the same directory structure described in Optional: Enable LoRA With Helm:

/loras/
  adapter_name/
    adapter_config.json
    adapter_model.safetensors   # or adapter_model.bin
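Before attaching the data source, it can help to sanity-check that the populated volume matches this layout. A minimal sketch (the helper name is illustrative and not part of NIM or Run:ai):

```python
from pathlib import Path

def validate_lora_dir(root: str) -> list[str]:
    """Return adapter names under root that match the expected layout.

    Each adapter directory must contain adapter_config.json and either
    adapter_model.safetensors or adapter_model.bin.
    """
    valid = []
    for adapter in sorted(Path(root).iterdir()):
        if not adapter.is_dir():
            continue
        has_config = (adapter / "adapter_config.json").is_file()
        has_weights = any(
            (adapter / name).is_file()
            for name in ("adapter_model.safetensors", "adapter_model.bin")
        )
        if has_config and has_weights:
            valid.append(adapter.name)
    return valid
```

Running this against the directory that backs the data source lists the adapters NIM should be able to discover; anything missing from the output likely has an incomplete file set.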

Deploy NIM With LoRA Configuration#

Create or update the inference workload, and then:

  1. Set NIM_PEFT_SOURCE to /loras in runtime environment variables.

  2. Attach the Run:ai data source and mount it at /loras.

  3. Create the inference workload.
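Once the workload is running with NIM_PEFT_SOURCE set, a LoRA adapter is typically selected per request by passing the adapter's directory name as the model in NIM's OpenAI-compatible API. A hedged sketch of building such a request (the base URL and adapter name are placeholders; the helper name is illustrative):

```python
import json
from urllib import request

def lora_completion_request(base_url: str, adapter: str, prompt: str) -> request.Request:
    """Build a /v1/completions request that targets a specific LoRA adapter.

    With NIM, a loaded LoRA adapter is addressed by using its name
    as the "model" field of the request payload.
    """
    payload = {"model": adapter, "prompt": prompt, "max_tokens": 64}
    return request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires a reachable workload endpoint):
# req = lora_completion_request("http://<workload-endpoint>", "adapter_name", "Hello")
# print(request.urlopen(req).read())
```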

Note

If the mounted data source does not contain adapter files, NIM starts normally, but no LoRA adapters are available at runtime.

Verify Deployment#

Verify workload readiness in the Run:ai UI: a healthy deployment shows the inference workload in the Ready state. To confirm that the NIM service is responding, call the readiness endpoint from a client that can reach the workload endpoint; a healthy deployment returns an HTTP 200 response from /v1/health/ready.
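The readiness check above can be scripted. A minimal sketch (the base URL is a placeholder for your workload endpoint, and the helper name is illustrative):

```python
from urllib import error, request

READY_PATH = "/v1/health/ready"

def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the NIM readiness endpoint answers with HTTP 200."""
    try:
        with request.urlopen(base_url.rstrip("/") + READY_PATH, timeout=timeout) as resp:
            return resp.status == 200
    except error.URLError:
        return False

# Usage:
# print(is_ready("http://<workload-endpoint>"))
```

Polling this in a loop is a simple way to gate downstream traffic until the model has finished loading.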