---
title: Deploying NVIDIA NIMs
description: Deploy NVIDIA Inference Microservices (NIMs) on your Brev GPU instance.
---
Run optimized NVIDIA Inference Microservices (NIMs) for production-ready AI inference on your Brev instance.
## What are NIMs?
NVIDIA Inference Microservices (NIMs) are pre-built, optimized containers for running AI models with industry-standard APIs. They include:
* CUDA, TensorRT, TensorRT-LLM, and Triton Inference Server
* OpenAI-compatible API endpoints
* Preconfigured for production workloads
A NIM container exposes an HTTP API for low-latency inference. Deploying a large language model NIM involves two parts: the NIM container itself (API server and runtime) and the model engine, which the container downloads on first run.
## Prerequisites
* NVIDIA NGC API key (get one at [ngc.nvidia.com](https://ngc.nvidia.com))
* GPU with sufficient VRAM—see the [NIM support matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html) for requirements
* Recommended: L40S 48GB or A100 80GB GPU
## Setting Up Your Instance
In the Brev console, create a new instance:
* Select **VM Mode** (not container mode)
* Choose an appropriate GPU (L40S or A100 recommended)
* Click **Deploy**
Once the instance is running, connect to it:
```bash
brev shell my-instance
```
Verify that Docker can access the GPU:
```bash
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
## NGC Authentication
Set your NGC API key as an environment variable:
```bash
export NGC_CLI_API_KEY=<your-ngc-api-key>
```
Add to your shell profile so it's available on restart:
```bash
# For bash
echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.bashrc
# For zsh
echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.zshrc
```
Log in to the NVIDIA container registry. Note that `$oauthtoken` is the literal username, wrapped in single quotes so the shell does not expand it:
```bash
echo "$NGC_CLI_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```
## Using the NGC CLI (Optional)
The NGC CLI provides additional functionality for browsing and managing NIMs.
```bash
# Install NGC CLI
pip install ngc-cli
# Configure with your API key
ngc config set
# List available NIMs
ngc registry image list --format_type csv nvcr.io/nim/meta/*
```
Refer to the [NGC CLI documentation](https://docs.ngc.nvidia.com/cli/index.html) for more details.
## Deploying a NIM
This example deploys the Llama 3 8B Instruct NIM. The same pattern applies to other NIMs.
```bash
# Container name for bookkeeping
export CONTAINER_NAME=Llama3-8B-Instruct
# NIM image (check NGC for latest versions)
export IMG_NAME="nvcr.io/nim/meta/llama3-8b-instruct:1.0"
# Local cache for downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
```
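Before the first pull, it's worth confirming the cache directory exists and that its filesystem has room for the weights. A minimal preflight sketch:

```shell
# Ensure the cache directory exists, then show free space on its filesystem.
# An 8B-parameter NIM engine can take tens of GB on first download.
LOCAL_NIM_CACHE="${LOCAL_NIM_CACHE:-$HOME/.cache/nim}"
mkdir -p "$LOCAL_NIM_CACHE"
df -h "$LOCAL_NIM_CACHE"
```

If the `Avail` column is smaller than the model size listed in the support matrix, point `LOCAL_NIM_CACHE` at a larger volume before starting the container.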
Start the NIM container:
```bash
docker run -it --rm --name="$CONTAINER_NAME" \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_CLI_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u "$(id -u)" \
  -p 8000:8000 \
  "$IMG_NAME"
```
The first run downloads the model weights, which may take several minutes depending on model size. Subsequent runs use the cached weights.
Once the server reports it is ready, send a test request from a new terminal (or with `curl` from another machine):
```bash
curl -X POST 'http://localhost:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 225
  }'
```
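The chat endpoint follows the same pattern. A sketch that builds the request body and validates it as JSON locally before sending (the model name assumes the Llama 3 8B NIM deployed above):

```shell
# Build a chat-completions request body and confirm it parses as JSON
# before sending it to the server.
BODY='{
  "model": "meta/llama3-8b-instruct",
  "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
  "max_tokens": 64
}'
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send it to the running NIM (fails gracefully if the server is not up):
curl -X POST 'http://localhost:8000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d "$BODY" || echo "request failed (is the NIM running?)"
```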
## Exposing Your NIM Endpoint
NIMs expose port 8000 by default. To make your NIM accessible externally:
### Option 1: Port Forwarding (Recommended for API Access)
Use port forwarding for direct API access without authentication:
```bash
# From your local machine
brev port-forward my-instance --port 8000:8000
# Access at localhost:8000
curl -X POST http://localhost:8000/v1/chat/completions ...
```
### Option 2: Cloudflare Tunnels (Web Console)
Expose through the web console for shareable URLs:
1. Go to your instance details in the Brev console
2. In the **Access** section, find **Using Tunnels**
3. Add port 8000 to create a public URL
**Cloudflare Authentication**: Tunnel URLs route through Cloudflare and require browser authentication on first access. For direct API access (e.g., from scripts or other services), use port forwarding instead.
## API Endpoints
NIMs provide several OpenAI-compatible endpoints:
| Endpoint | Description |
| ---------------------- | --------------------- |
| `/v1/completions` | Text completion |
| `/v1/chat/completions` | Chat completion |
| `/v1/models` | List available models |
| `/v1/health/ready` | Health check |
| `/v1/metrics` | Prometheus metrics |
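The health endpoint is handy for scripting: the sketch below (assuming the NIM is listening on `localhost:8000`) reports whether the server has finished loading and is ready to accept requests.

```shell
# /v1/health/ready returns HTTP 200 once the model is loaded.
# curl -f makes a non-2xx response exit non-zero.
NIM_URL="${NIM_URL:-http://localhost:8000}"
if curl -sf "$NIM_URL/v1/health/ready" > /dev/null; then
  echo "NIM is ready"
else
  echo "NIM is not ready (still loading, or not running)"
fi
```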
## Available NIMs
Explore the full catalog at [build.nvidia.com](https://build.nvidia.com). Popular NIMs include:
* **LLMs**: Llama 3.1, Mistral, Mixtral, Nemotron
* **Vision**: SegFormer, CLIP
* **Speech**: Whisper, Riva
* **Embedding**: NV-Embed
## Troubleshooting
**Permission errors**: If Docker commands fail with a permissions error, add your user to the `docker` group (`sudo usermod -aG docker $USER`, then log out and back in), or run with `sudo`:
```bash
sudo docker run -it --rm --name=$CONTAINER_NAME ...
```
**Out of memory**: Ensure your GPU has sufficient VRAM for the model. Check the [NIM support matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html) for requirements.
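To compare your GPU against the support matrix, you can query total VRAM directly. A small sketch (requires the NVIDIA driver; falls back with a message if `nvidia-smi` is absent):

```shell
# Print each GPU's name and total memory in CSV form.
if command -v nvidia-smi > /dev/null; then
  nvidia-smi --query-gpu=name,memory.total --format=csv
else
  echo "nvidia-smi not found (is the NVIDIA driver installed?)"
fi
```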