---
title: Serverless Deployments
description: >-
  Deploy NVIDIA Inference Microservices (NIMs) as serverless endpoints with
  automatic scaling.
---
Deploy NVIDIA Inference Microservices (NIMs) as serverless API endpoints that scale automatically based on demand.
**OpenAI-compatible API**: All deployments expose a Chat Completions API compatible with the OpenAI schema. Use your existing OpenAI client libraries by changing the base URL.
## Key Features
| Feature | Description |
| ----------------- | ------------------------------------------------------------ |
| **Serverless** | No infrastructure to manage—deploy models with a few clicks. |
| **Auto-scaling** | Scale from 0 to N instances based on traffic. |
| **Pay-per-use** | Only pay for compute time when processing requests. |
| **GPU-optimized** | NIMs are optimized for NVIDIA GPUs with TensorRT-LLM. |
## Creating a Deployment
From the Brev console, click **Deployments** in the sidebar.
Click the **Create Deployment** button.
Choose from available NIMs:
* **Llama 3.1** (8B, 70B, 405B)
* **Nemotron** (various sizes)
* **Mistral** and other supported models
Set your scaling parameters:
* **Min workers**: Minimum instances (0 for scale-to-zero)
* **Max workers**: Maximum instances for peak load
Choose the GPU type based on model size:
* **L40S**: Smaller models (8B parameters)
* **A100**: Medium models (70B parameters)
* **H100**: Large models (405B+ parameters)
Click **Create** to start the deployment. Initial provisioning takes 2-5 minutes.
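The GPU guidance above can be expressed as a small helper. This is a rough rule of thumb only; the thresholds and function name are illustrative, not an official sizing guide:

```python
def suggest_gpu(param_count_billions: float) -> str:
    """Suggest a GPU type from model parameter count (illustrative thresholds)."""
    if param_count_billions <= 8:
        return "L40S"  # smaller models, e.g. Llama 3.1 8B
    elif param_count_billions <= 70:
        return "A100"  # medium models, e.g. Llama 3.1 70B
    else:
        return "H100"  # large models, e.g. Llama 3.1 405B

print(suggest_gpu(8))    # L40S
print(suggest_gpu(405))  # H100
```

For borderline cases, memory headroom for long context windows may push you up a tier.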
## Monitoring
### Metrics
The deployment details page shows:
| Metric | Description |
| ------------------------- | ----------------------------------- |
| **Invocations** | Total API calls over time |
| **Latency (p50/p95/p99)** | Response time percentiles |
| **Error Rate** | Percentage of failed requests |
| **Active Workers** | Current number of running instances |
### Logs
View real-time logs from your deployment:
* Model loading status
* Inference requests
* Errors and warnings
Access logs from the **Logs** tab in the deployment details page.
## API Integration
### Endpoint URL
Each deployment gets a unique endpoint:
```
https://api.brev.dev/v1/deployments/<deployment-id>
```
### Using the OpenAI Python Client
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.brev.dev/v1/deployments/<deployment-id>",
    api_key="<your-api-key>",
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
```
### Using curl
```bash
curl https://api.brev.dev/v1/deployments/<deployment-id>/chat/completions \
  -H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
### Getting Your API Key
1. Navigate to your account settings in the Brev console
2. Generate an API key under the **API Keys** section
3. Store it securely—it won't be shown again
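Rather than hard-coding the key in source files, a common pattern is to read it from an environment variable. The variable name `BREV_API_KEY` here is an example convention, not an official one:

```python
import os

def load_api_key(var: str = "BREV_API_KEY") -> str:
    """Read the API key from the environment; fail fast if it is unset."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable first.")
    return key
```

You can then pass `load_api_key()` as the `api_key` argument when constructing the client.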
## Scaling Behavior
### Scale-to-Zero
Set `min_workers: 0` to scale to zero when idle:
* **Pro**: No cost when not in use
* **Con**: Cold start latency (30-60 seconds) for first request
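With scale-to-zero, clients should be prepared to ride out the cold-start window on the first request. A minimal retry-with-backoff sketch (the retry counts and delays are illustrative, not tuned recommendations):

```python
import time

def call_with_retry(fn, retries: int = 4, base_delay: float = 2.0):
    """Call fn(), retrying with exponential backoff to ride out a cold start."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```

Wrap the request in a callable, e.g. `call_with_retry(lambda: client.chat.completions.create(...))`.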
### Always-On
Set `min_workers: 1` or higher for consistent latency:
* **Pro**: No cold starts, consistent response times
* **Con**: Continuous cost even during idle periods
### Auto-Scale Triggers
Workers scale up when:
* Request queue exceeds threshold
* Response latency increases beyond target
Workers scale down when:
* Queue is empty for sustained period
* Current capacity exceeds demand
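Conceptually, the triggers above amount to a decision like the following. This is a simplified sketch of the idea with made-up thresholds, not Brev's actual autoscaler:

```python
def scaling_decision(queue_depth: int, p95_latency_s: float, active_workers: int,
                     max_queue: int = 10, target_latency_s: float = 2.0) -> str:
    """Toy autoscaling decision from queue depth and latency (illustrative thresholds)."""
    if queue_depth > max_queue or p95_latency_s > target_latency_s:
        return "scale_up"
    if queue_depth == 0 and active_workers > 0:
        return "scale_down"  # in practice, never below min_workers
    return "hold"
```

A real autoscaler would also require the queue to stay empty for a sustained period before scaling down, to avoid thrashing.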
## Cost Calculation
Deployment costs are based on:
| Factor | Billing |
| ---------------- | -------------------------------------------- |
| **GPU time** | Per-second billing while workers are active. |
| **GPU type** | Different rates for L40S, A100, H100. |
| **Worker count** | Cost multiplied by number of active workers. |
Use scale-to-zero for development and testing. Switch to always-on for production workloads requiring consistent latency.
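The three billing factors combine multiplicatively, which makes rough planning easy to script. The hourly rates below are placeholders for illustration, not published pricing; check the console for actual rates:

```python
# Placeholder hourly rates (USD) -- not actual Brev pricing.
HOURLY_RATES = {"L40S": 1.00, "A100": 2.50, "H100": 4.00}

def estimated_cost(gpu: str, active_seconds: float, workers: int) -> float:
    """Estimate cost: per-second GPU rate x active time x worker count."""
    per_second = HOURLY_RATES[gpu] / 3600
    return per_second * active_seconds * workers

# Example: two A100 workers active for one hour.
print(round(estimated_cost("A100", 3600, 2), 2))  # 5.0
```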
## Frequently Asked Questions
**Can I use my own fine-tuned model?**
Currently, deployments support NVIDIA NIMs from the catalog. Custom model support is on the roadmap.
**What's the maximum request size?**
The maximum context length depends on the model; most supported models accept between 8K and 128K tokens.
**How do I update a deployment?**
Modify settings (workers, GPU type) from the deployment details page. Changes take effect within minutes.
**Can I deploy to multiple regions?**
Deployments currently run in a single region. Multi-region support is planned.
## What's Next
* Step-by-step guide for deploying specific NIM models.
* Compare GPU options for your deployment.