Serverless Deployments
Deploy NVIDIA Inference Microservices (NIMs) as serverless API endpoints that scale automatically based on demand.
OpenAI-compatible API: All deployments expose a Chat Completions API compatible with the OpenAI schema. Use your existing OpenAI client libraries by changing the base URL.
Creating a Deployment
Select a Model
Choose from available NIMs:
- Llama 3.1 (8B, 70B, 405B)
- Nemotron (various sizes)
- Mistral and other supported models
Configure Scaling
Set your scaling parameters:
- Min workers: Minimum instances (0 for scale-to-zero)
- Max workers: Maximum instances for peak load
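As a sketch, the two parameters above might be captured in a configuration like the following (the field names are illustrative, not the exact Brev API schema):

```python
# Hypothetical scaling configuration -- field names are illustrative,
# not the actual deployment API schema.
scaling_config = {
    "min_workers": 0,  # 0 enables scale-to-zero when the endpoint is idle
    "max_workers": 4,  # upper bound on instances during peak load
}
```

Setting `min_workers` to 0 trades cold-start latency for zero idle cost; see Scaling Behavior below.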
Monitoring
Metrics
The deployment details page shows real-time metrics for your deployment.
Logs
View real-time logs from your deployment:
- Model loading status
- Inference requests
- Errors and warnings
Access logs from the Logs tab in the deployment details page.
API Integration
Endpoint URL
Each deployment gets a unique endpoint URL, shown on the deployment details page. Point your client at this URL as the API base.
Using the OpenAI Python Client
Using curl
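You can also call the endpoint directly over HTTP. A sketch with curl, where the endpoint URL, model name, and `$API_KEY` are placeholders for your deployment's values:

```shell
curl https://your-deployment-endpoint/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is a NIM?"}]
  }'
```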
Getting Your API Key
- Navigate to your account settings in the Brev console
- Generate an API key under the API Keys section
- Store it securely; it won't be shown again
Scaling Behavior
Scale-to-Zero
Set min_workers: 0 to scale to zero when idle:
- Pro: No cost when not in use
- Con: Cold start latency (30-60 seconds) for first request
Always-On
Set min_workers: 1 or higher for consistent latency:
- Pro: No cold starts, consistent response times
- Con: Continuous cost even during idle periods
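To make the tradeoff concrete, here is a back-of-the-envelope comparison. The $2.00/GPU-hour rate and 2 busy hours per day are made-up numbers for illustration, not actual pricing:

```python
# Hypothetical figures for illustration only.
gpu_hour_rate = 2.00    # $/GPU-hour (made-up rate)
busy_hours_per_day = 2  # hours/day the endpoint actually serves traffic

# Scale-to-zero: pay only while a worker is running.
scale_to_zero_daily = gpu_hour_rate * busy_hours_per_day  # $4.00/day

# Always-on (min_workers = 1): one worker billed around the clock.
always_on_daily = gpu_hour_rate * 24  # $48.00/day

print(f"scale-to-zero: ${scale_to_zero_daily:.2f}/day")
print(f"always-on:     ${always_on_daily:.2f}/day")
```

The gap narrows as utilization rises; an endpoint busy most of the day gains little from scale-to-zero.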
Auto-Scale Triggers
Workers scale up when:
- Request queue exceeds threshold
- Response latency increases beyond target
Workers scale down when:
- Queue is empty for sustained period
- Current capacity exceeds demand
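The triggers above can be sketched as a simple control loop. The thresholds here are hypothetical, not the service's actual values, and a real autoscaler waits for a sustained idle period before scaling down rather than reacting to a single empty-queue observation:

```python
def desired_workers(queue_len: int, p95_latency_ms: float, current: int,
                    min_workers: int, max_workers: int) -> int:
    """Toy autoscaler: scale up on queue depth or latency, down when idle.

    Thresholds are illustrative; the real service tunes these internally.
    """
    QUEUE_THRESHOLD = 10       # pending requests before scaling up
    LATENCY_TARGET_MS = 500.0  # p95 latency target

    if queue_len > QUEUE_THRESHOLD or p95_latency_ms > LATENCY_TARGET_MS:
        return min(current + 1, max_workers)  # scale up, capped at max
    if queue_len == 0 and current > min_workers:
        return current - 1                    # scale down toward min
    return current                            # hold steady
```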
Cost Calculation
Deployment costs are based on the time your workers spend running: you are billed per active worker instance, and pay nothing while a deployment is scaled to zero.
Use scale-to-zero for development and testing. Switch to always-on for production workloads requiring consistent latency.
Frequently Asked Questions
Can I use my own fine-tuned model?
Currently, deployments support NVIDIA NIMs from the catalog. Custom model support is on the roadmap.
What’s the maximum request size?
The maximum request size is bounded by the model's context length, which ranges from roughly 8K to 128K tokens depending on the model.
How do I update a deployment?
Modify settings (workers, GPU type) from the deployment details page. Changes take effect within minutes.
Can I deploy to multiple regions?
Deployments currently run in a single region. Multi-region support is planned.