# Frontend

The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.

## Feature Matrix

| Feature | Status |
|---|---|
| OpenAI Chat Completions API (`/v1/chat/completions`) | ✅ Supported |
| OpenAI Completions API (`/v1/completions`) | ✅ Supported |
| OpenAI Embeddings API (`/v1/embeddings`) | ✅ Supported |
| OpenAI Responses API (`/v1/responses`) | ✅ Supported |
| OpenAI Models API (`/v1/models`) | ✅ Supported |
| Image Generation (`/v1/images/generations`) | ✅ Supported |
| Video Generation (`/v1/videos/generations`) | ✅ Supported |
| Anthropic Messages API (`/v1/messages`) | 🧪 Experimental |
| KServe gRPC v2 API | ✅ Supported |
| Streaming responses (SSE) | ✅ Supported |
| Multi-model serving | ✅ Supported |
| Integrated KV-aware routing | ✅ Supported |
| Tool calling | ✅ Supported |
| TLS (HTTPS) | ✅ Supported |
| Swagger UI (`/docs`) | ✅ Supported |
| NVIDIA request extensions (`nvext`) | ✅ Supported |
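Streaming responses use the OpenAI-style SSE wire format: each event is a `data: {json}` line, and the stream ends with a `data: [DONE]` sentinel. A minimal sketch of collecting the streamed text on the client side (the parsing helper here is illustrative, not part of Dynamo itself):

```python
import json

def collect_stream(lines):
    """Accumulate assistant text from OpenAI-style SSE chat chunks."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and SSE comments/keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Example with pre-captured SSE lines:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # → Hello
```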

## Quick Start

### Prerequisites

  • Dynamo platform installed
  • `etcd` and `nats-server -js` running
  • At least one backend worker registered

### HTTP Frontend

```bash
python -m dynamo.frontend --http-port 8000
```

This starts an OpenAI-compatible HTTP server with integrated pre/post-processing and routing. Backends are auto-discovered when they call `register_model`.

The frontend performs the pre- and post-processing, which requires access to the model's configuration files: `config.json`, `tokenizer.json`, `tokenizer_config.json`, and so on. It does not need the weights.

The frontend downloads the files it needs from Hugging Face automatically, so no setup is required. However, we recommend setting up `modelexpress-server` and a shared folder such as a Kubernetes PVC, which ensures each model is downloaded only once across the whole cluster.

If the model is not available on Hugging Face (for example, a private or customized model), you must make the model files available locally at the same file path as on the backend: the path passed as the backend's `--model-path <here>` must also exist on the frontend and contain at least the configuration (JSON) files.
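Once a backend has registered, the frontend can be exercised with any OpenAI-compatible client. A minimal sketch using only the standard library; the base URL assumes the server started above, and the model name is a placeholder for whatever your backend registered:

```python
import json
import urllib.request

def build_chat_request(model, prompt, stream=False):
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def send_chat(base_url, payload):
    """POST the payload to the frontend's chat completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("my-registered-model", "Hello!")
print(json.dumps(body, indent=2))
# To actually send it (requires a running frontend on port 8000):
# send_chat("http://localhost:8000", body)
```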

### KServe gRPC Frontend

```bash
python -m dynamo.frontend --kserve-grpc-server
```

See the Frontend Guide for KServe-specific configuration and message formats.

## Kubernetes

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: frontend-example
spec:
  graphs:
    - name: frontend
      replicas: 1
  services:
    - name: Frontend
      image: nvcr.io/nvidia/dynamo/dynamo-vllm:latest
      command:
        - python
        - -m
        - dynamo.frontend
        - --http-port
        - "8000"
```

## Configuration

| Parameter | Default | Description |
|---|---|---|
| `--http-port` | `8000` | HTTP server port |
| `--kserve-grpc-server` | `false` | Enable KServe gRPC server |
| `--router-mode` | `round-robin` | Routing strategy: `round-robin`, `random`, `kv`, `direct` |
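To make the `--router-mode` options concrete, here is an illustrative sketch of the round-robin and random strategies over a set of discovered workers. This is not Dynamo's router implementation, just the selection policy each flag names; `kv` routing additionally weighs KV-cache overlap and is covered in the Router Documentation.

```python
import itertools
import random

workers = ["worker-0", "worker-1", "worker-2"]  # hypothetical registered backends

# round-robin: cycle through workers in a fixed order
rr = itertools.cycle(workers)
picks = [next(rr) for _ in range(4)]
print(picks)  # → ['worker-0', 'worker-1', 'worker-2', 'worker-0']

# random: pick any worker uniformly per request
print(random.choice(workers))
```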

See the Frontend Guide for full configuration options.

## Next Steps

| Document | Description |
|---|---|
| Configuration Reference | All CLI arguments, env vars, and HTTP endpoints |
| Frontend Guide | KServe gRPC configuration and integration |
| NVIDIA Request Extensions (`nvext`) | Custom request fields for routing hints and cache control |
| Router Documentation | KV-aware routing configuration |