> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/brev/llms.txt.
> For full documentation content, see https://docs.nvidia.com/brev/llms-full.txt.

# Deploying NVIDIA NIMs

> Deploy NVIDIA Inference Microservices (NIMs) on your Brev GPU instance.

Run optimized NVIDIA Inference Microservices (NIMs) for production-ready AI inference on your Brev instance.

## What are NIMs?

NVIDIA Inference Microservices (NIMs) are pre-built, optimized containers for running AI models with industry-standard APIs. They include:

* CUDA, TensorRT, TensorRT-LLM, and Triton Inference Server
* OpenAI-compatible API endpoints
* Preconfigured for production workloads

A NIM container provides an interactive API for blazing fast inference. Deploying a large language model NIM requires the NIM container (API, server, runtime) and the model engine.

## Prerequisites

* NVIDIA NGC API key (get one at [ngc.nvidia.com](https://ngc.nvidia.com))
* GPU with sufficient VRAM—see the [NIM support matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html) for requirements
* Recommended: L40S 48GB or A100 80GB GPU

## Setting Up Your Instance

<Steps>
  <Step title="Create a VM Mode Instance">
    In the Brev console, create a new instance:

    * Select **VM Mode** (not container mode)
    * Choose an appropriate GPU (L40S or A100 recommended)
    * Click **Deploy**
  </Step>

  <Step title="Connect to Your Instance">
    ```bash
    brev shell my-instance
    ```
  </Step>

  <Step title="Verify GPU Setup">
    ```bash
    docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
    ```
  </Step>
</Steps>

## NGC Authentication

<Steps>
  <Step title="Set Your NGC API Key">
    ```bash
    export NGC_CLI_API_KEY=<your-ngc-api-key>
    ```
  </Step>

  <Step title="Persist the Key (Optional)">
    Add to your shell profile so it's available on restart:

    ```bash
    # For bash
    echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.bashrc

    # For zsh
    echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.zshrc
    ```
  </Step>

  <Step title="Login to NGC Container Registry">
    ```bash
    echo "$NGC_CLI_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
    ```
  </Step>
</Steps>

## Using the NGC CLI (Optional)

The NGC CLI provides additional functionality for browsing and managing NIMs.

```bash
# Install NGC CLI
pip install ngc-cli

# Configure with your API key
ngc config set

# List available NIMs
ngc registry image list --format_type csv nvcr.io/nim/meta/*
```

Refer to the [NGC CLI documentation](https://docs.ngc.nvidia.com/cli/index.html) for more details.

## Deploying a NIM

This example deploys the Llama 3 8B Instruct NIM. The same pattern applies to other NIMs.

<Steps>
  <Step title="Set Up Environment Variables">
    ```bash
    # Container name for bookkeeping
    export CONTAINER_NAME=Llama3-8B-Instruct

    # NIM image (check NGC for latest versions)
    export IMG_NAME="nvcr.io/nim/meta/llama3-8b-instruct:1.0"

    # Local cache for downloaded models
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"
    ```
  </Step>

  <Step title="Run the NIM Container">
    ```bash
    docker run -it --rm --name=$CONTAINER_NAME \
      --runtime=nvidia \
      --gpus all \
      --shm-size=16GB \
      -e NGC_CLI_API_KEY \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
      -u $(id -u) \
      -p 8000:8000 \
      $IMG_NAME
    ```

    <Callout intent="info">
      The first run downloads the model weights, which may take several minutes depending on model size. Subsequent runs use the cached weights.
    </Callout>
  </Step>

  <Step title="Test the API">
    In a new terminal (or use `curl` from another machine):

    ```bash
    curl -X POST 'http://localhost:8000/v1/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "meta-llama3-8b-instruct",
        "prompt": "Once upon a time",
        "max_tokens": 225
      }'
    ```
  </Step>
</Steps>

## Exposing Your NIM Endpoint

NIMs expose port 8000 by default. To make your NIM accessible externally:

### Option 1: Port Forwarding (Recommended for API Access)

Use port forwarding for direct API access without authentication:

```bash
# From your local machine
brev port-forward my-instance --port 8000:8000

# Access at localhost:8000
curl -X POST http://localhost:8000/v1/chat/completions ...
```

### Option 2: Cloudflare Tunnels (Web Console)

Expose through the web console for shareable URLs:

1. Go to your instance details in the Brev console
2. In the **Access** section, find **Using Tunnels**
3. Add port 8000 to create a public URL

<Callout intent="warning">
  **Cloudflare Authentication**: Tunnel URLs route through Cloudflare and require browser authentication on first access. For direct API access (e.g., from scripts or other services), use port forwarding instead.
</Callout>

## API Endpoints

NIMs provide several OpenAI-compatible endpoints:

| Endpoint               | Description           |
| ---------------------- | --------------------- |
| `/v1/completions`      | Text completion       |
| `/v1/chat/completions` | Chat completion       |
| `/v1/models`           | List available models |
| `/v1/health/ready`     | Health check          |
| `/v1/metrics`          | Prometheus metrics    |

## Available NIMs

Explore the full catalog at [build.nvidia.com](https://build.nvidia.com). Popular NIMs include:

* **LLMs**: Llama 3.1, Mistral, Mixtral, Nemotron
* **Vision**: SegFormer, CLIP
* **Speech**: Whisper, RIVA
* **Embedding**: NV-Embed

## Troubleshooting

**Permission errors**: If you encounter permission issues, try running with `sudo`:

```bash
sudo docker run -it --rm --name=$CONTAINER_NAME ...
```

**Out of memory**: Ensure your GPU has sufficient VRAM for the model. Check the [NIM support matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html) for requirements.