Serverless Deployments


Deploy NVIDIA Inference Microservices (NIMs) as serverless API endpoints that scale automatically based on demand.

OpenAI-compatible API: All deployments expose a Chat Completions API compatible with the OpenAI schema. Use your existing OpenAI client libraries by changing the base URL.

Key Features

| Feature | Description |
| --- | --- |
| Serverless | No infrastructure to manage; deploy models with a few clicks. |
| Auto-scaling | Scale from 0 to N instances based on traffic. |
| Pay-per-use | Only pay for compute time when processing requests. |
| GPU-optimized | NIMs are optimized for NVIDIA GPUs with TensorRT-LLM. |

Creating a Deployment

1

Open the Deployments Page

Navigate to the Deployments page in the Brev console.

2

Click Create Deployment

Click the Create Deployment button.

3

Select a Model

Choose from available NIMs:

  • Llama 3.1 (8B, 70B, 405B)
  • Nemotron (various sizes)
  • Mistral and other supported models

4

Configure Scaling

Set your scaling parameters:

  • Min workers: Minimum instances (0 for scale-to-zero)
  • Max workers: Maximum instances for peak load

5

Select GPU Type

Choose the GPU type based on model size:

  • L40S: Smaller models (8B parameters)
  • A100: Medium models (70B parameters)
  • H100: Large models (405B+ parameters)

6

Deploy

Click Create to start the deployment. Initial provisioning takes 2-5 minutes.

Monitoring

Metrics

The deployment details page shows:

| Metric | Description |
| --- | --- |
| Invocations | Total API calls over time |
| Latency (p50/p95/p99) | Response time percentiles |
| Error Rate | Percentage of failed requests |
| Active Workers | Current number of running instances |

Logs

View real-time logs from your deployment:

  • Model loading status
  • Inference requests
  • Errors and warnings

Access logs from the Logs tab in the deployment details page.

API Integration

Endpoint URL

Each deployment gets a unique endpoint:

https://api.brev.dev/v1/deployments/<deployment-id>

Using the OpenAI Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.brev.dev/v1/deployments/<deployment-id>",
    api_key="<your-brev-api-key>",
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
```

Using curl

```shell
curl https://api.brev.dev/v1/deployments/<deployment-id>/chat/completions \
  -H "Authorization: Bearer <your-brev-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Getting Your API Key

  1. Navigate to your account settings in the Brev console
  2. Generate an API key under the API Keys section
  3. Store it securely—it won’t be shown again

Scaling Behavior

Scale-to-Zero

Set min_workers: 0 to scale to zero when idle:

  • Pro: No cost when not in use
  • Con: Cold start latency (30-60 seconds) for first request
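With scale-to-zero, clients should be written to tolerate the cold-start window. A minimal retry wrapper along these lines can ride it out; the function name and timings here are our own illustration, not an SDK feature:

```python
import time

def call_with_cold_start_retry(call, max_wait_s=90, interval_s=5):
    """Keep retrying `call` until it succeeds or max_wait_s elapses.

    After an idle period with min_workers=0, the first request can fail or
    time out for 30-60 seconds while a worker spins up; retrying within a
    deadline absorbs that window.
    """
    deadline = time.monotonic() + max_wait_s
    while True:
        try:
            return call()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # still failing after the cold-start budget
            time.sleep(interval_s)
```

Wrap the first request after an idle period, e.g. `call_with_cold_start_retry(lambda: client.chat.completions.create(...))`.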

Always-On

Set min_workers: 1 or higher for consistent latency:

  • Pro: No cold starts, consistent response times
  • Con: Continuous cost even during idle periods

Auto-Scale Triggers

Workers scale up when:

  • Request queue exceeds threshold
  • Response latency increases beyond target

Workers scale down when:

  • Queue is empty for sustained period
  • Current capacity exceeds demand
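The triggers above amount to a simple control loop. A toy version, purely to make the behavior concrete (thresholds and names are made up; this is not Brev's actual controller):

```python
def desired_workers(queue_len, p95_latency_ms, current,
                    queue_threshold=10, latency_target_ms=2000,
                    min_workers=0, max_workers=4):
    """Illustrative autoscaling rule mirroring the triggers above."""
    # Scale up: request backlog or latency beyond target
    if queue_len > queue_threshold or p95_latency_ms > latency_target_ms:
        return min(current + 1, max_workers)
    # Scale down: nothing queued, shed a worker toward min_workers
    if queue_len == 0:
        return max(current - 1, min_workers)
    # Otherwise hold steady
    return current
```

With `min_workers=0` this rule eventually reaches zero during idle periods, which is exactly the scale-to-zero behavior described earlier.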

Cost Calculation

Deployment costs are based on:

| Factor | Billing |
| --- | --- |
| GPU time | Per-second billing while workers are active. |
| GPU type | Different rates for L40S, A100, H100. |
| Worker count | Cost multiplied by the number of active workers. |

Use scale-to-zero for development and testing. Switch to always-on for production workloads requiring consistent latency.
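These factors multiply out as rate × active time × worker count. A back-of-envelope estimate, with placeholder rates (not Brev's actual pricing; check the console):

```python
# Hypothetical per-hour GPU rates in USD, for illustration only.
HOURLY_RATE_USD = {"L40S": 1.00, "A100": 2.50, "H100": 4.00}

def estimate_cost_usd(gpu_type: str, active_hours: float, workers: int) -> float:
    """Deployment cost = GPU rate x hours workers were active x worker count."""
    return HOURLY_RATE_USD[gpu_type] * active_hours * workers

# Example: one A100 worker active 8 hours/day for 30 days
monthly = estimate_cost_usd("A100", active_hours=8 * 30, workers=1)  # 600.0 at the placeholder rate
```

The same workload at scale-to-zero is billed only for those active hours, while an always-on deployment accrues the hourly rate around the clock.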

Frequently Asked Questions

Can I use my own fine-tuned model?

Currently, deployments support NVIDIA NIMs from the catalog. Custom model support is on the roadmap.

What’s the maximum request size?

The maximum context length depends on the model; most support between 8K and 128K tokens.
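To sanity-check whether a prompt fits before sending it, the common four-characters-per-token rule of thumb for English text is often good enough. This heuristic is our own approximation, not the model's real tokenizer:

```python
def rough_token_estimate(text: str) -> int:
    """Estimate token count using the ~4-chars-per-token rule of thumb."""
    return max(1, len(text) // 4)

def fits_context(text: str, context_limit: int = 8192) -> bool:
    # Conservative check against the smallest common limit (8K tokens)
    return rough_token_estimate(text) <= context_limit
```

For exact counts, use the tokenizer that ships with the specific model.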

How do I update a deployment?

Modify settings (workers, GPU type) from the deployment details page. Changes take effect within minutes.

Can I deploy to multiple regions?

Deployments currently run in a single region. Multi-region support is planned.

What’s Next