Deployments#

Overview#

Brev Deployments is a new feature that lets you deploy NVIDIA Inference Microservices (NIMs) as serverless functions backed by NVIDIA Cloud Functions (NVCF). With Deployments, you can spin up and scale large language models (LLMs) and other NIMs in minutes, monitor their performance, and integrate them into your applications with minimal effort.


Key Features#

  • Serverless NIM Hosting: Deploy NIMs (currently LLMs) as serverless endpoints, managed by NVIDIA Cloud Functions.

  • Scalable Inference: Set the minimum and maximum number of workers (nodes) to automatically scale with your application’s needs.

  • Cost Estimation: Instantly calculate the cost of hosting your model based on the number of workers and GPU type before deploying.

  • Metrics & Logs: Access real-time metrics and logs for your deployments to monitor usage and performance.

  • Code Snippets: Get ready-to-use code snippets (Python, Node, Shell, etc.) to integrate your deployment into your application.

  • Chat Interface: Interact with your privately hosted model directly from the Brev console.


How It Works#

  1. Create a Deployment:

    • Go to the Deployments tab in the Brev console.

    • Click “Create New Deployment”.

    • Select a model (e.g., Llama 3, Nemotron, etc.).

    • Choose your GPU type and set the minimum and maximum number of instances (workers).

    • Name your deployment and review the cost calculation.

    • Click “Deploy” to launch your NIM endpoint.

  2. Monitor & Manage:

    • View invocation activity and instance usage in real time.

    • Access logs for debugging and performance monitoring.

    • Adjust scaling parameters as needed.

  3. Integrate with Your Application:

    • Copy code snippets in your preferred language from the deployment details page.

    • Use the provided API endpoint and key to connect your app to your private NIM deployment.

    • Example (Python):

from openai import OpenAI

# Point the OpenAI client at your private Brev deployment.
client = OpenAI(
    base_url="https://api.brev.dev/v1",
    api_key="<your-api-key>",
)

# Request a streamed chat completion from the deployed model.
completion = client.chat.completions.create(
    model="<your-deployment-model-id>",
    messages=[{"role": "user", "content": "Write a limerick about GPUs."}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full response.
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Note: Brev Deployments use the Chat Completions API schema, making it easy to integrate with existing tools and libraries that support this standard. You can use familiar parameters such as model, messages, temperature, top_p, and more.
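Because the endpoint follows the Chat Completions schema, you can also call it without any SDK. The sketch below builds the raw HTTP request using only the Python standard library, so you can see exactly what goes over the wire; the placeholder URL, model ID, and key mirror the Python example above and must be replaced with your deployment's values.

```python
import json
import urllib.request

# Placeholders -- substitute your deployment's endpoint, model ID, and key.
BASE_URL = "https://api.brev.dev/v1"
API_KEY = "<your-api-key>"
MODEL_ID = "<your-deployment-model-id>"

# Chat Completions request body: the same fields the OpenAI SDK sends.
payload = {
    "model": MODEL_ID,
    "messages": [{"role": "user", "content": "Write a limerick about GPUs."}],
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 1024,
    "stream": False,  # set True for server-sent-event streaming
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send the request; for non-streaming
# calls, the response JSON carries the text in choices[0].message.content.
```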


Cost Calculation#

  • The Deployments UI provides a real-time cost estimate based on your selected GPU type and the number of instances.

  • You can set both minimum and maximum instances to control scaling and cost.

  • Example: If you select 4x H100 GPUs and set min=0, max=1, the cost per instance is displayed, along with the minimum and maximum daily cost.
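The arithmetic behind the daily-cost range shown in the UI can be sketched as follows. The hourly rate here is a made-up illustrative figure, not actual Brev pricing; check the Deployments UI for real numbers.

```python
# HOURLY_RATE_PER_INSTANCE is hypothetical, for illustration only.
HOURLY_RATE_PER_INSTANCE = 12.00  # e.g. one 4x H100 instance, $/hour


def daily_cost_range(min_instances: int, max_instances: int, hourly_rate: float):
    """Return the (minimum, maximum) daily cost for a deployment that
    autoscales between min_instances and max_instances workers."""
    hours_per_day = 24
    return (
        min_instances * hourly_rate * hours_per_day,
        max_instances * hourly_rate * hours_per_day,
    )


low, high = daily_cost_range(0, 1, HOURLY_RATE_PER_INSTANCE)
print(f"Daily cost: ${low:.2f} - ${high:.2f}")
```

With min=0, the minimum daily cost is $0: the deployment scales to zero when idle, and you only pay while workers are running.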


Benefits#

  • Fast & Easy: Deploy production-grade LLMs and other NIMs in minutes.

  • Scalable: Automatically scale up or down based on demand.

  • Transparent Pricing: Know your costs before you deploy.

  • Integrated Monitoring: Built-in metrics and logs for easy management.

  • Developer Friendly: Copy-paste code snippets for instant integration.


FAQ#

Q: What models can I deploy? A: Currently, Deployments supports a range of LLM NIMs (e.g., Llama 3, Nemotron). More are on the way, with a focus on additional LLM NIMs, VLMs, and other multimodal models.

Q: How do I access my deployed endpoint? A: After deployment, you’ll receive an endpoint URL and API key. Use these in your application or with the provided code snippets. All endpoints follow the Chat Completions API schema for maximum compatibility.
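When wiring the endpoint and key into an application, a common practice is to keep the key out of source code. The sketch below reads it from an environment variable; the name BREV_API_KEY is a convention chosen for this example, not something Brev requires.

```python
import os


def load_api_key(var: str = "BREV_API_KEY") -> str:
    """Fetch the deployment API key from the environment.

    BREV_API_KEY is an arbitrary variable name used in this sketch;
    Brev does not mandate any particular name.
    """
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} to your deployment's API key before running.")
    return key
```

Pass the result as the api_key argument when constructing the OpenAI client from the integration example above.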

Q: How do I monitor my deployment? A: Metrics and logs are available in the deployment dashboard for real-time monitoring and debugging.


Get Started#

Head to the Deployments tab in the Brev console and launch your first NIM today!

For more details, see the official documentation or reach out to our support team.