Deploying NVIDIA NIMs

Run optimized NVIDIA Inference Microservices (NIMs) for production-ready AI inference on your Brev instance.

What are NIMs?

NVIDIA Inference Microservices (NIMs) are pre-built, optimized containers for running AI models with industry-standard APIs. They include:

  • CUDA, TensorRT, TensorRT-LLM, and Triton Inference Server
  • OpenAI-compatible API endpoints
  • Preconfigured for production workloads

A NIM container provides an interactive API for low-latency inference. Deploying a large language model NIM requires two pieces: the NIM container itself (the API server and runtime) and the model engine (the optimized weights, downloaded on first run).

Prerequisites

  • A Brev account and the Brev CLI installed locally
  • An NGC API key (generate one from your account at ngc.nvidia.com)

Setting Up Your Instance

1. Create a VM Mode Instance

In the Brev console, create a new instance:

  • Select VM Mode (not container mode)
  • Choose an appropriate GPU (L40S or A100 recommended)
  • Click Deploy
2. Connect to Your Instance

$ brev shell my-instance
3. Verify GPU Setup

$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

NGC Authentication

1. Set Your NGC API Key

$ export NGC_CLI_API_KEY=<your-ngc-api-key>
2. Persist the Key (Optional)

Add to your shell profile so it’s available on restart:

# For bash
$ echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.bashrc

# For zsh
$ echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.zshrc
3. Log In to the NGC Container Registry

$ echo "$NGC_CLI_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Using the NGC CLI (Optional)

The NGC CLI provides additional functionality for browsing and managing NIMs.

# Install the NGC CLI: download the installer for your platform from
# ngc.nvidia.com/setup (the CLI is not distributed via pip)

# Configure with your API key
$ ngc config set

# List available NIMs (NGC image paths omit the nvcr.io registry prefix)
$ ngc registry image list --format_type csv "nim/meta/*"

Refer to the NGC CLI documentation for more details.

Deploying a NIM

This example deploys the Llama 3 8B Instruct NIM. The same pattern applies to other NIMs.

1. Set Up Environment Variables

# Container name for bookkeeping
$ export CONTAINER_NAME=Llama3-8B-Instruct

# NIM image (check NGC for the latest version tag)
$ export IMG_NAME="nvcr.io/nim/meta/llama3-8b-instruct:1.0"

# Local cache for downloaded model weights
$ export LOCAL_NIM_CACHE=~/.cache/nim
$ mkdir -p "$LOCAL_NIM_CACHE"
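Before launching the container, it can help to confirm these variables are actually set. A minimal sketch (`check_env` is a hypothetical helper, not part of the NIM tooling; it assumes bash for the indirect expansion):

```shell
# Fail fast if any variable required by the docker run step is unset or empty.
# check_env is a hypothetical helper; ${!v} is bash indirect expansion.
check_env() {
  local v
  for v in CONTAINER_NAME IMG_NAME LOCAL_NIM_CACHE NGC_CLI_API_KEY; do
    if [ -z "${!v}" ]; then
      echo "Missing required variable: $v" >&2
      return 1
    fi
  done
}
```

Run `check_env` just before the `docker run` below; a nonzero exit tells you which variable to export first.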
2. Run the NIM Container

$ docker run -it --rm --name=$CONTAINER_NAME \
> --runtime=nvidia \
> --gpus all \
> --shm-size=16GB \
> -e NGC_API_KEY=$NGC_CLI_API_KEY \
> -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
> -u $(id -u) \
> -p 8000:8000 \
> $IMG_NAME

The first run downloads the model weights, which may take several minutes depending on model size. Subsequent runs use the cached weights.
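Rather than watching the container logs, you can poll the readiness endpoint until the model has finished loading. A sketch under the assumptions of this guide (`wait_for_ready` is a hypothetical helper; the default URL assumes the `-p 8000:8000` mapping above):

```shell
# Poll the NIM readiness endpoint until it returns HTTP 200, or give up
# after a fixed number of attempts. wait_for_ready is a hypothetical helper.
wait_for_ready() {
  local url="${1:-http://localhost:8000/v1/health/ready}"
  local tries="${2:-120}"   # ~20 minutes at 10s per attempt
  local i=0
  until curl -sf "$url" > /dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      return 1
    fi
    echo "waiting for NIM... ($i/$tries)"
    sleep 10
  done
}
```

Call `wait_for_ready` in a second terminal; once it returns, the API calls below will succeed immediately.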

3. Test the API

In a new terminal (or use curl from another machine):

$ curl -X POST 'http://localhost:8000/v1/completions' \
> -H 'accept: application/json' \
> -H 'Content-Type: application/json' \
> -d '{
> "model": "meta/llama3-8b-instruct",
> "prompt": "Once upon a time",
> "max_tokens": 225
> }'
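The same server also exposes the chat endpoint. A hedged sketch that wraps it in a reusable function (`chat` is a hypothetical helper; the model name assumes the Llama 3 8B Instruct NIM deployed above):

```shell
# chat sends a single user message to the chat completions endpoint and
# prints the raw JSON response. chat is a hypothetical helper; the model
# name assumes the Llama 3 8B Instruct NIM from this guide.
chat() {
  local prompt="$1"
  curl -s -X POST 'http://localhost:8000/v1/chat/completions' \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "meta/llama3-8b-instruct",
          "messages": [{"role": "user", "content": "'"$prompt"'"}],
          "max_tokens": 128
        }'
}
```

Usage: `chat "Write a haiku about GPUs"` once the container reports ready; pipe the output through `jq` to extract the message content.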

Exposing Your NIM Endpoint

NIMs expose port 8000 by default. To make your NIM accessible externally:

Option 1: Port Forwarding (Brev CLI)

Use port forwarding for direct API access without authentication:

# From your local machine
$ brev port-forward my-instance --port 8000:8000

# Access at localhost:8000
$ curl -X POST http://localhost:8000/v1/chat/completions ...

Option 2: Cloudflare Tunnels (Web Console)

Expose through the web console for shareable URLs:

  1. Go to your instance details in the Brev console
  2. In the Access section, find Using Tunnels
  3. Add port 8000 to create a public URL

Cloudflare Authentication: Tunnel URLs route through Cloudflare and require browser authentication on first access. For direct API access (e.g., from scripts or other services), use port forwarding instead.

API Endpoints

NIMs provide several OpenAI-compatible endpoints:

  • /v1/completions: Text completion
  • /v1/chat/completions: Chat completion
  • /v1/models: List available models
  • /v1/health/ready: Health check
  • /v1/metrics: Prometheus metrics
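The readiness and model-list endpoints together make a convenient smoke test. A sketch (`nim_status` is a hypothetical helper; the base URL defaults to the local port mapping used in this guide):

```shell
# nim_status checks readiness first, then lists the served models.
# nim_status is a hypothetical helper; the default base URL assumes the
# local -p 8000:8000 mapping from this guide.
nim_status() {
  local base="${1:-http://localhost:8000}"
  curl -sf "$base/v1/health/ready" > /dev/null || { echo "not ready" >&2; return 1; }
  curl -s "$base/v1/models"
}
```

`nim_status` exits nonzero while the model is still loading, and prints the `/v1/models` JSON once the server is up, so it also works as a health probe in scripts.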

Available NIMs

Explore the full catalog at build.nvidia.com. Popular NIMs include:

  • LLMs: Llama 3.1, Mistral, Mixtral, Nemotron
  • Vision: SegFormer, CLIP
  • Speech: Whisper, Riva
  • Embedding: NV-Embed

Troubleshooting

Permission errors: If Docker reports "permission denied", either add your user to the docker group (sudo usermod -aG docker $USER, then reconnect) or run with sudo:

$ sudo docker run -it --rm --name=$CONTAINER_NAME ...

Out of memory: Ensure your GPU has sufficient VRAM for the model. Check the NIM support matrix for requirements.
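To check available VRAM before pulling a large model, nvidia-smi's query flags can be scripted. A sketch (`vram_free_mib` is a hypothetical helper; the query flags themselves are standard nvidia-smi options):

```shell
# Print free memory in MiB for each GPU, one value per line.
# vram_free_mib is a hypothetical helper wrapping standard nvidia-smi flags.
vram_free_mib() {
  nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
}
```

Compare the output against the model's requirements in the support matrix before starting the container.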