# Frontend

The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.

## Feature Matrix

| Feature | Status |
|---|---|
| OpenAI Chat Completions API (`/v1/chat/completions`) | ✅ Supported |
| OpenAI Completions API (`/v1/completions`) | ✅ Supported |
| OpenAI Embeddings API (`/v1/embeddings`) | ✅ Supported |
| OpenAI Responses API (`/v1/responses`) | ✅ Supported |
| OpenAI Models API (`/v1/models`) | ✅ Supported |
| Image Generation (`/v1/images/generations`) | ✅ Supported |
| Video Generation (`/v1/videos/generations`) | ✅ Supported |
| Anthropic Messages API (`/v1/messages`) | 🧪 Experimental |
| KServe gRPC v2 API | ✅ Supported |
| Streaming responses (SSE) | ✅ Supported |
| Multi-model serving | ✅ Supported |
| Integrated KV-aware routing | ✅ Supported |
| Tool calling | ✅ Supported |
| TLS (HTTPS) | ✅ Supported |
| Swagger UI (`/docs`) | ✅ Supported |
| NVIDIA request extensions (`nvext`) | ✅ Supported |
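Streaming responses use the OpenAI-style SSE wire format: each event is a `data: {json}` line, and the stream ends with a `data: [DONE]` sentinel. A minimal sketch of collecting the streamed text on the client side (the parsing helper here is illustrative, not part of Dynamo itself):

```python
import json

def collect_stream(lines):
    """Accumulate assistant text from OpenAI-style SSE chat chunks."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and SSE comments/keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Example with pre-captured SSE lines:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # → Hello
```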

## Quick Start

### Prerequisites

  • Dynamo platform installed
  • `etcd` and `nats-server -js` running
  • At least one backend worker registered

### HTTP Frontend

```bash
python -m dynamo.frontend --http-port 8000
```

This starts an OpenAI-compatible HTTP server with integrated pre/post-processing and routing. Backends are auto-discovered when they call `register_model`.

The frontend performs the pre- and post-processing, which requires access to the model's configuration files: `config.json`, `tokenizer.json`, `tokenizer_config.json`, and so on. It does not need the weights.

The frontend downloads the files it needs from Hugging Face automatically, so no setup is required. However, we recommend setting up `modelexpress-server` and a shared folder such as a Kubernetes PVC, which ensures each model is downloaded only once across the whole cluster.

If the model is not available on Hugging Face (for example, a private or customized model), you must make the model files available locally at the same file path as on the backend: the path passed as the backend's `--model-path <here>` must also exist on the frontend and contain at least the configuration (JSON) files.
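Once a backend has registered, the frontend can be exercised with any OpenAI-compatible client. A minimal sketch using only the standard library; the base URL assumes the server started above, and the model name is a placeholder for whatever your backend registered:

```python
import json
import urllib.request

def build_chat_request(model, prompt, stream=False):
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def send_chat(base_url, payload):
    """POST the payload to the frontend's chat completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("my-registered-model", "Hello!")
print(json.dumps(body, indent=2))
# To actually send it (requires a running frontend on port 8000):
# send_chat("http://localhost:8000", body)
```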

### KServe gRPC Frontend

```bash
python -m dynamo.frontend --kserve-grpc-server
```

See the Frontend Guide for KServe-specific configuration and message formats.

## Kubernetes

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: frontend-example
spec:
  graphs:
    - name: frontend
      replicas: 1
  services:
    - name: Frontend
      image: nvcr.io/nvidia/dynamo/dynamo-vllm:latest
      command:
        - python
        - -m
        - dynamo.frontend
        - --http-port
        - "8000"
```

## Configuration

| Parameter | Default | Description |
|---|---|---|
| `--http-port` | `8000` | HTTP server port |
| `--kserve-grpc-server` | `false` | Enable KServe gRPC server |
| `--router-mode` | `round-robin` | Routing strategy: `round-robin`, `random`, `kv`, `direct` |
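To make the `--router-mode` options concrete, here is an illustrative sketch of the round-robin and random strategies over a set of discovered workers. This is not Dynamo's router implementation, just the selection policy each flag names; `kv` routing additionally weighs KV-cache overlap and is covered in the Router Documentation.

```python
import itertools
import random

workers = ["worker-0", "worker-1", "worker-2"]  # hypothetical registered backends

# round-robin: cycle through workers in a fixed order
rr = itertools.cycle(workers)
picks = [next(rr) for _ in range(4)]
print(picks)  # → ['worker-0', 'worker-1', 'worker-2', 'worker-0']

# random: pick any worker uniformly per request
print(random.choice(workers))
```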

See the Frontend Guide for full configuration options.

## Next Steps

| Document | Description |
|---|---|
| Configuration Reference | All CLI arguments, env vars, and HTTP endpoints |
| Frontend Guide | KServe gRPC configuration and integration |
| NVIDIA Request Extensions (`nvext`) | Custom request fields for routing hints and cache control |
| Router Documentation | KV-aware routing configuration |