---
title: Serverless Deployments
description: >-
  Deploy NVIDIA Inference Microservices (NIMs) as serverless endpoints with
  automatic scaling.
---

Deploy NVIDIA Inference Microservices (NIMs) as serverless API endpoints that scale automatically based on demand.

**OpenAI-compatible API**: All deployments expose a Chat Completions API compatible with the OpenAI schema. Use your existing OpenAI client libraries by changing the base URL.

## Key Features

| Feature           | Description                                                  |
| ----------------- | ------------------------------------------------------------ |
| **Serverless**    | No infrastructure to manage: deploy models with a few clicks. |
| **Auto-scaling**  | Scale from 0 to N instances based on traffic.                |
| **Pay-per-use**   | Only pay for compute time when processing requests.          |
| **GPU-optimized** | NIMs are optimized for NVIDIA GPUs with TensorRT-LLM.        |

## Creating a Deployment

1. From the Brev console, click **Deployments** in the sidebar.
2. Click the **Create Deployment** button.
3. Choose from available NIMs:
   * **Llama 3.1** (8B, 70B, 405B)
   * **Nemotron** (various sizes)
   * **Mistral** and other supported models
4. Set your scaling parameters:
   * **Min workers**: Minimum instances (0 for scale-to-zero)
   * **Max workers**: Maximum instances for peak load
5. Choose the GPU type based on model size:
   * **L40S**: Smaller models (8B parameters)
   * **A100**: Medium models (70B parameters)
   * **H100**: Large models (405B+ parameters)
6. Click **Create** to start the deployment. Initial provisioning takes 2-5 minutes.
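The GPU guidance in step 5 can be expressed as a small lookup. This is an illustrative sketch only: the `pick_gpu` helper and its thresholds are derived from the sizing table above and are not part of any Brev API.

```python
def pick_gpu(model_params_billions: float) -> str:
    """Suggest a GPU type from the sizing guidance above.

    Thresholds are illustrative, taken from this guide:
    L40S for ~8B models, A100 for ~70B models, H100 for 405B+.
    """
    if model_params_billions <= 8:
        return "L40S"
    if model_params_billions <= 70:
        return "A100"
    return "H100"


# Example: choosing hardware for each Llama 3.1 size
for size in (8, 70, 405):
    print(f"Llama 3.1 {size}B -> {pick_gpu(size)}")
```

For models between the listed sizes, start with the smaller GPU and move up if you hit out-of-memory errors or unacceptable latency.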
## Monitoring

### Metrics

The deployment details page shows:

| Metric                    | Description                         |
| ------------------------- | ----------------------------------- |
| **Invocations**           | Total API calls over time           |
| **Latency (p50/p95/p99)** | Response time percentiles           |
| **Error Rate**            | Percentage of failed requests       |
| **Active Workers**        | Current number of running instances |

### Logs

View real-time logs from your deployment:

* Model loading status
* Inference requests
* Errors and warnings

Access logs from the **Logs** tab in the deployment details page.

## API Integration

### Endpoint URL

Each deployment gets a unique endpoint:

```
https://api.brev.dev/v1/deployments/YOUR_DEPLOYMENT_ID
```

### Using the OpenAI Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.brev.dev/v1/deployments/YOUR_DEPLOYMENT_ID",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
```

### Using curl

```bash
curl https://api.brev.dev/v1/deployments/YOUR_DEPLOYMENT_ID/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Getting Your API Key

1. Navigate to your account settings in the Brev console
2. Generate an API key under the **API Keys** section
3. Store it securely; it won't be shown again

## Scaling Behavior

### Scale-to-Zero

Set `min_workers: 0` to scale to zero when idle:

* **Pro**: No cost when not in use
* **Con**: Cold start latency (30-60 seconds) for the first request

### Always-On

Set `min_workers: 1` or higher for consistent latency:

* **Pro**: No cold starts, consistent response times
* **Con**: Continuous cost even during idle periods

### Auto-Scale Triggers

Workers scale up when:

* The request queue exceeds its threshold
* Response latency increases beyond the target

Workers scale down when:

* The queue has been empty for a sustained period
* Current capacity exceeds demand

## Cost Calculation

Deployment costs are based on:

| Factor           | Billing                                          |
| ---------------- | ------------------------------------------------ |
| **GPU time**     | Per-second billing while workers are active.     |
| **GPU type**     | Different rates for L40S, A100, H100.            |
| **Worker count** | Cost multiplied by the number of active workers. |

Use scale-to-zero for development and testing. Switch to always-on for production workloads that require consistent latency.

## Frequently Asked Questions

**Can I use my own fine-tuned model?**

Currently, deployments support NVIDIA NIMs from the catalog. Custom model support is on the roadmap.

**What's the maximum request size?**

The default context length depends on the model. Most models support between 8K and 128K tokens.

**How do I update a deployment?**

Modify settings (workers, GPU type) from the deployment details page. Changes take effect within minutes.

**Can I deploy to multiple regions?**

Deployments currently run in a single region. Multi-region support is planned.

## What's Next

* Step-by-step guide for deploying specific NIM models.
* Compare GPU options for your deployment.
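As a rough illustration of the billing model described under Cost Calculation (per-second GPU billing multiplied by active workers), here is a back-of-the-envelope estimator. The hourly rate is a placeholder, not Brev's actual pricing; check the console for real GPU rates.

```python
def estimate_cost(hourly_rate_usd: float, active_workers: int, active_seconds: int) -> float:
    """Estimate cost as per-second GPU billing x active workers.

    hourly_rate_usd is a made-up placeholder; actual L40S/A100/H100
    rates come from the Brev console, not this guide.
    """
    per_second = hourly_rate_usd / 3600
    return per_second * active_workers * active_seconds


# Example: 2 workers on a hypothetical $2.50/hr GPU, active for 30 minutes
cost = estimate_cost(2.50, active_workers=2, active_seconds=1800)
print(f"${cost:.2f}")  # 2.50/3600 * 2 * 1800 = $2.50
```

Note how scale-to-zero shows up in this model: with zero active workers, active GPU time accrues no cost at all.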