Deploying NVIDIA NIMs

Run optimized NVIDIA Inference Microservices (NIMs) for production-ready AI inference on your Brev instance.

What are NIMs?

NVIDIA Inference Microservices (NIMs) are pre-built, optimized containers for running AI models with industry-standard APIs. They include:

  • CUDA, TensorRT, TensorRT-LLM, and Triton Inference Server
  • OpenAI-compatible API endpoints
  • Preconfigured for production workloads

A NIM container provides an interactive API for low-latency inference. Deploying a large language model NIM requires two pieces: the NIM container itself (the API server and runtime) and the model engine (the optimized weights, downloaded on first run).

Prerequisites

  • A Brev account and the Brev CLI installed locally
  • An NGC API key (generate one from your account at ngc.nvidia.com)

Setting Up Your Instance

1. Create a VM Mode Instance

In the Brev console, create a new instance:

  • Select VM Mode (not container mode)
  • Choose an appropriate GPU (L40S or A100 recommended)
  • Click Deploy
2. Connect to Your Instance

$ brev shell my-instance
3. Verify GPU Setup

$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

NGC Authentication

1. Set Your NGC API Key

$ export NGC_CLI_API_KEY=<your-ngc-api-key>
2. Persist the Key (Optional)

Add to your shell profile so it’s available on restart:

# For bash
$ echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.bashrc

# For zsh
$ echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.zshrc
3. Log In to the NGC Container Registry

$ echo "$NGC_CLI_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Using the NGC CLI (Optional)

The NGC CLI provides additional functionality for browsing and managing NIMs.

# Install the NGC CLI: download the installer for your platform from
# ngc.nvidia.com/setup (the CLI is not distributed via pip)

# Configure with your API key
$ ngc config set

# List available NIMs (NGC image paths omit the nvcr.io registry prefix)
$ ngc registry image list --format_type csv "nim/meta/*"

Refer to the NGC CLI documentation for more details.

Deploying a NIM

This example deploys the Llama 3 8B Instruct NIM. The same pattern applies to other NIMs.

1. Set Up Environment Variables

# Container name for bookkeeping
$ export CONTAINER_NAME=Llama3-8B-Instruct

# NIM image (check NGC for the latest version tag)
$ export IMG_NAME="nvcr.io/nim/meta/llama3-8b-instruct:1.0"

# Local cache for downloaded model weights
$ export LOCAL_NIM_CACHE=~/.cache/nim
$ mkdir -p "$LOCAL_NIM_CACHE"
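Before launching the container, it can help to confirm these variables are actually set. A minimal sketch (`check_env` is a hypothetical helper, not part of the NIM tooling; it assumes bash for the indirect expansion):

```shell
# Fail fast if any variable required by the docker run step is unset or empty.
# check_env is a hypothetical helper; ${!v} is bash indirect expansion.
check_env() {
  local v
  for v in CONTAINER_NAME IMG_NAME LOCAL_NIM_CACHE NGC_CLI_API_KEY; do
    if [ -z "${!v}" ]; then
      echo "Missing required variable: $v" >&2
      return 1
    fi
  done
}
```

Run `check_env` just before the `docker run` below; a nonzero exit tells you which variable to export first.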
2. Run the NIM Container

$ docker run -it --rm --name=$CONTAINER_NAME \
> --runtime=nvidia \
> --gpus all \
> --shm-size=16GB \
> -e NGC_API_KEY=$NGC_CLI_API_KEY \
> -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
> -u $(id -u) \
> -p 8000:8000 \
> $IMG_NAME

The first run downloads the model weights, which may take several minutes depending on model size. Subsequent runs use the cached weights.
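Rather than watching the container logs, you can poll the readiness endpoint until the model has finished loading. A sketch under the assumptions of this guide (`wait_for_ready` is a hypothetical helper; the default URL assumes the `-p 8000:8000` mapping above):

```shell
# Poll the NIM readiness endpoint until it returns HTTP 200, or give up
# after a fixed number of attempts. wait_for_ready is a hypothetical helper.
wait_for_ready() {
  local url="${1:-http://localhost:8000/v1/health/ready}"
  local tries="${2:-120}"   # ~20 minutes at 10s per attempt
  local i=0
  until curl -sf "$url" > /dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      return 1
    fi
    echo "waiting for NIM... ($i/$tries)"
    sleep 10
  done
}
```

Call `wait_for_ready` in a second terminal; once it returns, the API calls below will succeed immediately.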

3. Test the API

In a new terminal (or use curl from another machine):

$ curl -X POST 'http://localhost:8000/v1/completions' \
> -H 'accept: application/json' \
> -H 'Content-Type: application/json' \
> -d '{
> "model": "meta/llama3-8b-instruct",
> "prompt": "Once upon a time",
> "max_tokens": 225
> }'
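The same server also exposes the chat endpoint. A hedged sketch that wraps it in a reusable function (`chat` is a hypothetical helper; the model name assumes the Llama 3 8B Instruct NIM deployed above):

```shell
# chat sends a single user message to the chat completions endpoint and
# prints the raw JSON response. chat is a hypothetical helper; the model
# name assumes the Llama 3 8B Instruct NIM from this guide.
chat() {
  local prompt="$1"
  curl -s -X POST 'http://localhost:8000/v1/chat/completions' \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "meta/llama3-8b-instruct",
          "messages": [{"role": "user", "content": "'"$prompt"'"}],
          "max_tokens": 128
        }'
}
```

Usage: `chat "Write a haiku about GPUs"` once the container reports ready; pipe the output through `jq` to extract the message content.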

Exposing Your NIM Endpoint

NIMs expose port 8000 by default. To make your NIM accessible externally:

Option 1: Port Forwarding (Brev CLI)

Use port forwarding for direct API access without authentication:

# From your local machine
$ brev port-forward my-instance --port 8000:8000

# Access at localhost:8000
$ curl -X POST http://localhost:8000/v1/chat/completions ...

Option 2: Cloudflare Tunnels (Web Console)

Expose through the web console for shareable URLs:

  1. Go to your instance details in the Brev console
  2. In the Access section, find Using Tunnels
  3. Add port 8000 to create a public URL

Cloudflare Authentication: Tunnel URLs route through Cloudflare and require browser authentication on first access. For direct API access (e.g., from scripts or other services), use port forwarding instead.

API Endpoints

NIMs provide several OpenAI-compatible endpoints:

  • /v1/completions: Text completion
  • /v1/chat/completions: Chat completion
  • /v1/models: List available models
  • /v1/health/ready: Health check
  • /v1/metrics: Prometheus metrics
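The readiness and model-list endpoints together make a convenient smoke test. A sketch (`nim_status` is a hypothetical helper; the base URL defaults to the local port mapping used in this guide):

```shell
# nim_status checks readiness first, then lists the served models.
# nim_status is a hypothetical helper; the default base URL assumes the
# local -p 8000:8000 mapping from this guide.
nim_status() {
  local base="${1:-http://localhost:8000}"
  curl -sf "$base/v1/health/ready" > /dev/null || { echo "not ready" >&2; return 1; }
  curl -s "$base/v1/models"
}
```

`nim_status` exits nonzero while the model is still loading, and prints the `/v1/models` JSON once the server is up, so it also works as a health probe in scripts.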

Available NIMs

Explore the full catalog at build.nvidia.com. Popular NIMs include:

  • LLMs: Llama 3.1, Mistral, Mixtral, Nemotron
  • Vision: SegFormer, CLIP
  • Speech: Whisper, Riva
  • Embedding: NV-Embed

Troubleshooting

Permission errors: If Docker reports "permission denied", either add your user to the docker group (sudo usermod -aG docker $USER, then reconnect) or run with sudo:

$ sudo docker run -it --rm --name=$CONTAINER_NAME ...

Out of memory: Ensure your GPU has sufficient VRAM for the model. Check the NIM support matrix for requirements.
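To check available VRAM before pulling a large model, nvidia-smi's query flags can be scripted. A sketch (`vram_free_mib` is a hypothetical helper; the query flags themselves are standard nvidia-smi options):

```shell
# Print free memory in MiB for each GPU, one value per line.
# vram_free_mib is a hypothetical helper wrapping standard nvidia-smi flags.
vram_free_mib() {
  nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
}
```

Compare the output against the model's requirements in the support matrix before starting the container.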