---
title: Deploying NVIDIA NIMs
description: Deploy NVIDIA Inference Microservices (NIMs) on your Brev GPU instance.
---

Run optimized NVIDIA Inference Microservices (NIMs) for production-ready AI inference on your Brev instance.

## What are NIMs?

NVIDIA Inference Microservices (NIMs) are pre-built, optimized containers for running AI models behind industry-standard APIs. They include:

* CUDA, TensorRT, TensorRT-LLM, and Triton Inference Server
* OpenAI-compatible API endpoints
* Preconfigured for production workloads

A NIM container provides an interactive API for fast inference. Deploying a large language model NIM requires two pieces: the NIM container (API, server, runtime) and the model engine.

## Prerequisites

* NVIDIA NGC API key (get one at [ngc.nvidia.com](https://ngc.nvidia.com))
* GPU with sufficient VRAM—see the [NIM support matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html) for requirements
* Recommended: L40S 48GB or A100 80GB GPU

## Setting Up Your Instance

In the Brev console, create a new instance:

* Select **VM Mode** (not container mode)
* Choose an appropriate GPU (L40S or A100 recommended)
* Click **Deploy**

Once the instance is running, connect to it:

```bash
brev shell my-instance
```

Verify that Docker can access the GPU:

```bash
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

## NGC Authentication

Export your NGC API key, replacing the placeholder with your actual key:

```bash
export NGC_CLI_API_KEY=<your-ngc-api-key>
```

Add it to your shell profile so it's available after a restart:

```bash
# For bash
echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.bashrc

# For zsh
echo "export NGC_CLI_API_KEY=<your-ngc-api-key>" >> ~/.zshrc
```

Then log Docker in to the NGC registry:

```bash
echo "$NGC_CLI_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```

## Using the NGC CLI (Optional)

The NGC CLI provides additional functionality for browsing and managing NIMs.
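Before configuring the CLI, it can help to confirm that the key exported in the previous section is actually present in your current shell. A minimal sketch (the helper name `check_ngc_key` is illustrative, not part of any tool):

```bash
# Report whether the NGC API key is exported in the current shell
check_ngc_key() {
  if [ -z "${NGC_CLI_API_KEY:-}" ]; then
    echo "NGC_CLI_API_KEY is not set" >&2
    return 1
  fi
  echo "NGC_CLI_API_KEY is set (${#NGC_CLI_API_KEY} characters)"
}

check_ngc_key
```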
```bash
# Install NGC CLI
pip install ngc-cli

# Configure with your API key
ngc config set

# List available NIMs (quote the pattern so the shell doesn't expand it)
ngc registry image list --format_type csv 'nvcr.io/nim/meta/*'
```

Refer to the [NGC CLI documentation](https://docs.ngc.nvidia.com/cli/index.html) for more details.

## Deploying a NIM

This example deploys the Llama 3 8B Instruct NIM. The same pattern applies to other NIMs.

First, set up the environment variables:

```bash
# Container name for bookkeeping
export CONTAINER_NAME=Llama3-8B-Instruct

# NIM image (check NGC for the latest version)
export IMG_NAME="nvcr.io/nim/meta/llama3-8b-instruct:1.0"

# Local cache for downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
```

Then start the container:

```bash
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_CLI_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```

The first run downloads the model weights, which may take several minutes depending on model size. Subsequent runs use the cached weights.

Once the server is up, test it in a new terminal (or use `curl` from another machine):

```bash
curl -X POST 'http://localhost:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 225
  }'
```

## Exposing Your NIM Endpoint

NIMs expose port 8000 by default. To make your NIM accessible externally:

### Option 1: Port Forwarding (Recommended for API Access)

Use port forwarding for direct API access without authentication:

```bash
# From your local machine
brev port-forward my-instance --port 8000:8000

# Access at localhost:8000
curl -X POST http://localhost:8000/v1/chat/completions ...
```

### Option 2: Cloudflare Tunnels (Web Console)

Expose through the web console for shareable URLs:

1. Go to your instance details in the Brev console
2. In the **Access** section, find **Using Tunnels**
3.
Add port 8000 to create a public URL

**Cloudflare Authentication**: Tunnel URLs route through Cloudflare and require browser authentication on first access. For direct API access (e.g., from scripts or other services), use port forwarding instead.

## API Endpoints

NIMs provide several OpenAI-compatible endpoints:

| Endpoint               | Description           |
| ---------------------- | --------------------- |
| `/v1/completions`      | Text completion       |
| `/v1/chat/completions` | Chat completion       |
| `/v1/models`           | List available models |
| `/v1/health/ready`     | Health check          |
| `/v1/metrics`          | Prometheus metrics    |

## Available NIMs

Explore the full catalog at [build.nvidia.com](https://build.nvidia.com). Popular NIMs include:

* **LLMs**: Llama 3.1, Mistral, Mixtral, Nemotron
* **Vision**: SegFormer, CLIP
* **Speech**: Whisper, Riva
* **Embedding**: NV-Embed

## Troubleshooting

**Permission errors**: If you encounter permission issues, try running Docker with `sudo`:

```bash
sudo docker run -it --rm --name=$CONTAINER_NAME ...
```

**Out of memory**: Ensure your GPU has sufficient VRAM for the model. Check the [NIM support matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html) for requirements.
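When scripting against a freshly started NIM, it's useful to wait for the `/v1/health/ready` endpoint to respond before sending inference requests, since the first launch spends minutes downloading weights. A minimal sketch (the function name and defaults are illustrative; it assumes the NIM is reachable on localhost:8000):

```bash
# Poll the NIM readiness endpoint until it returns HTTP 200, then report success
wait_for_nim() {
  url="${1:-http://localhost:8000/v1/health/ready}"
  tries="${2:-30}"     # how many attempts before giving up
  delay="${3:-5}"      # seconds between attempts
  for _ in $(seq 1 "$tries"); do
    # curl prints "000" when the server is not reachable yet
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "NIM is ready"
      return 0
    fi
    sleep "$delay"
  done
  echo "NIM did not become ready" >&2
  return 1
}
```

After `docker run`, something like `wait_for_nim && curl http://localhost:8000/v1/models` blocks until the endpoint answers and then issues the first request.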