Serverless Deployments
Deploy NVIDIA Inference Microservices (NIMs) as serverless API endpoints that scale automatically based on demand.
OpenAI-compatible API: All deployments expose a Chat Completions API compatible with the OpenAI schema. Use your existing OpenAI client libraries by changing the base URL.
Creating a Deployment
Select a Model
Choose from available NIMs:
- Llama 3.1 (8B, 70B, 405B)
- Nemotron (various sizes)
- Mistral and other supported models
Configure Scaling
Set your scaling parameters:
- Min workers: Minimum instances (0 for scale-to-zero)
- Max workers: Maximum instances for peak load
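As a sketch, the two parameters above might be captured in a configuration like the following (the field names are illustrative, not the exact Brev API schema):

```python
# Hypothetical scaling configuration -- field names are illustrative,
# not the actual deployment API schema.
scaling_config = {
    "min_workers": 0,  # 0 enables scale-to-zero when the endpoint is idle
    "max_workers": 4,  # upper bound on instances during peak load
}
```

Setting `min_workers` to 0 trades cold-start latency for zero idle cost; see Scaling Behavior below.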
Monitoring
Metrics
The deployment details page shows real-time metrics for your deployment.
Logs
View real-time logs from your deployment:
- Model loading status
- Inference requests
- Errors and warnings
Access logs from the Logs tab in the deployment details page.
API Integration
Endpoint URL
Each deployment gets a unique endpoint URL, shown on the deployment details page. Point your client at this URL as the API base.
Using the OpenAI Python Client
Using curl
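You can also call the endpoint directly over HTTP. A sketch with curl, where the endpoint URL, model name, and `$API_KEY` are placeholders for your deployment's values:

```shell
curl https://your-deployment-endpoint/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is a NIM?"}]
  }'
```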
Getting Your API Key
- Navigate to your account settings in the Brev console
- Generate an API key under the API Keys section
- Store it securely; it won't be shown again
Scaling Behavior
Scale-to-Zero
Set min_workers: 0 to scale to zero when idle:
- Pro: No cost when not in use
- Con: Cold start latency (30-60 seconds) for first request
Always-On
Set min_workers: 1 or higher for consistent latency:
- Pro: No cold starts, consistent response times
- Con: Continuous cost even during idle periods
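To make the tradeoff concrete, here is a back-of-the-envelope comparison. The $2.00/GPU-hour rate and 2 busy hours per day are made-up numbers for illustration, not actual pricing:

```python
# Hypothetical figures for illustration only.
gpu_hour_rate = 2.00    # $/GPU-hour (made-up rate)
busy_hours_per_day = 2  # hours/day the endpoint actually serves traffic

# Scale-to-zero: pay only while a worker is running.
scale_to_zero_daily = gpu_hour_rate * busy_hours_per_day  # $4.00/day

# Always-on (min_workers = 1): one worker billed around the clock.
always_on_daily = gpu_hour_rate * 24  # $48.00/day

print(f"scale-to-zero: ${scale_to_zero_daily:.2f}/day")
print(f"always-on:     ${always_on_daily:.2f}/day")
```

The gap narrows as utilization rises; an endpoint busy most of the day gains little from scale-to-zero.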
Auto-Scale Triggers
Workers scale up when:
- Request queue exceeds threshold
- Response latency increases beyond target
Workers scale down when:
- Queue is empty for sustained period
- Current capacity exceeds demand
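The triggers above can be sketched as a simple control loop. The thresholds here are hypothetical, not the service's actual values, and a real autoscaler waits for a sustained idle period before scaling down rather than reacting to a single empty-queue observation:

```python
def desired_workers(queue_len: int, p95_latency_ms: float, current: int,
                    min_workers: int, max_workers: int) -> int:
    """Toy autoscaler: scale up on queue depth or latency, down when idle.

    Thresholds are illustrative; the real service tunes these internally.
    """
    QUEUE_THRESHOLD = 10       # pending requests before scaling up
    LATENCY_TARGET_MS = 500.0  # p95 latency target

    if queue_len > QUEUE_THRESHOLD or p95_latency_ms > LATENCY_TARGET_MS:
        return min(current + 1, max_workers)  # scale up, capped at max
    if queue_len == 0 and current > min_workers:
        return current - 1                    # scale down toward min
    return current                            # hold steady
```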
Cost Calculation
Deployment costs are based on the time your workers spend running: you are billed per active worker instance, and pay nothing while a deployment is scaled to zero.
Use scale-to-zero for development and testing. Switch to always-on for production workloads requiring consistent latency.
Frequently Asked Questions
Can I use my own fine-tuned model?
Currently, deployments support NVIDIA NIMs from the catalog. Custom model support is on the roadmap.
What’s the maximum request size?
The maximum request size is bounded by the model's context length, which ranges from roughly 8K to 128K tokens depending on the model.
How do I update a deployment?
Modify settings (workers, GPU type) from the deployment details page. Changes take effect within minutes.
Can I deploy to multiple regions?
Deployments currently run in a single region. Multi-region support is planned.