Deploying NVIDIA NIMs
Run optimized NVIDIA Inference Microservices (NIMs) for production-ready AI inference on your Brev instance.
What are NIMs?
NVIDIA Inference Microservices (NIMs) are pre-built, optimized containers for running AI models with industry-standard APIs. They include:
- CUDA, TensorRT, TensorRT-LLM, and Triton Inference Server
- OpenAI-compatible API endpoints
- Preconfigured for production workloads
A NIM container exposes an interactive, OpenAI-compatible API for low-latency inference. Deploying a large language model NIM involves two pieces: the NIM container itself (API server and runtime) and the model engine, which is downloaded and cached on first launch.
Prerequisites
- NVIDIA NGC API key (get one at ngc.nvidia.com)
- GPU with sufficient VRAM (see the NIM support matrix for requirements)
- Recommended: an L40S (48 GB) or A100 (80 GB) GPU
Setting Up Your Instance
NGC Authentication
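Before pulling a NIM container, authenticate Docker against the NGC registry. A minimal sketch, assuming your API key is stored in the `NGC_API_KEY` environment variable:

```shell
# Export your NGC API key (generated at ngc.nvidia.com)
export NGC_API_KEY=<your-ngc-api-key>

# Log Docker in to the NGC registry (nvcr.io).
# The username is the literal string $oauthtoken, quoted so the shell
# does not expand it; the API key is passed on stdin.
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```

Passing the key via `--password-stdin` keeps it out of your shell history and process list.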
Using the NGC CLI (Optional)
The NGC CLI provides additional functionality for browsing and managing NIMs.
Refer to the NGC CLI documentation for more details.
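As a rough sketch of the workflow (the exact registry search path for NIM images may differ from the `nim/*` pattern assumed here):

```shell
# Configure the CLI with your API key (interactive prompt)
ngc config set

# Browse container images in the NGC registry matching the NIM namespace
ngc registry image list "nim/*"
```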
Deploying a NIM
This example deploys the Llama 3 8B Instruct NIM. The same pattern applies to other NIMs.
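A minimal launch sketch, based on NVIDIA's documented quickstart pattern (the image tag and cache path are assumptions; check the NIM catalog for the current tag):

```shell
export NGC_API_KEY=<your-ngc-api-key>

# Local directory where the NIM caches its downloaded model engine,
# so restarts don't re-download it
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -u "$(id -u)" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest
```

The first launch downloads and caches the model engine, so expect it to take several minutes; subsequent launches reuse the cache mounted at `/opt/nim/.cache`.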
Exposing Your NIM Endpoint
NIMs expose port 8000 by default. To make your NIM accessible externally:
Option 1: Port Forwarding (Recommended for API Access)
Use port forwarding for direct API access without authentication:
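One way to do this with the Brev CLI is sketched below (the instance name is a placeholder, and the exact flag syntax may vary by CLI version):

```shell
# Forward local port 8000 to port 8000 on the instance
brev port-forward my-nim-instance --port 8000:8000
```

While the forward is active, requests to `http://localhost:8000` on your local machine reach the NIM directly, with no browser authentication step.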
Option 2: Cloudflare Tunnels (Web Console)
Expose through the web console for shareable URLs:
- Go to your instance details in the Brev console
- In the Access section, find Using Tunnels
- Add port 8000 to create a public URL
Cloudflare Authentication: Tunnel URLs route through Cloudflare and require browser authentication on first access. For direct API access (e.g., from scripts or other services), use port forwarding instead.
API Endpoints
NIMs provide several OpenAI-compatible endpoints:
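For example, with the Llama 3 8B Instruct NIM running on port 8000, you can query it like any OpenAI-compatible server:

```shell
# List the models this NIM serves
curl http://localhost:8000/v1/models

# OpenAI-compatible chat completion
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
    }'
```

Because the API is OpenAI-compatible, existing OpenAI client libraries generally work by pointing their base URL at the NIM endpoint.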
Available NIMs
Explore the full catalog at build.nvidia.com. Popular NIMs include:
- LLMs: Llama 3.1, Mistral, Mixtral, Nemotron
- Vision: SegFormer, CLIP
- Speech: Whisper, Riva
- Embedding: NV-Embed
Troubleshooting
Permission errors: If you encounter permission issues, try running with sudo:
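For example (image tag assumed; adding yourself to the `docker` group is the longer-term fix):

```shell
# One-off: run the container with elevated privileges
sudo docker run --gpus all -p 8000:8000 nvcr.io/nim/meta/llama3-8b-instruct:latest

# Or add your user to the docker group (log out and back in afterwards)
sudo usermod -aG docker "$USER"
```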
Out of memory: Ensure your GPU has sufficient VRAM for the model. Check the NIM support matrix for requirements.
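You can confirm how much GPU memory is free before launching:

```shell
# Report total and used memory per GPU
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```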