For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
  • Kubernetes Deployment
    • Deployment Guide
  • User Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Dynamo Benchmarking
    • Multimodal
    • Diffusion (Preview)
    • Tool Calling
    • LoRA Adapters
    • Agents
    • Observability (Local)
    • Fault Tolerance
    • Writing Python Workers
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
      • Frontend Guide
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • SGLang HiCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
    • Blog
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Feature Matrix
  • Quick Start
  • Prerequisites
  • HTTP Frontend
  • KServe gRPC Frontend
  • Kubernetes
  • Configuration
  • Next Steps
Components

Frontend

||View as Markdown|
Edit this page
Previous

vLLM

Next

Frontend Guide

The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.

Feature Matrix

FeatureStatus
OpenAI Chat Completions API (/v1/chat/completions)✅ Supported
OpenAI Completions API (/v1/completions)✅ Supported
OpenAI Embeddings API (/v1/embeddings)✅ Supported
OpenAI Responses API (/v1/responses)✅ Supported
OpenAI Models API (/v1/models)✅ Supported
Image Generation (/v1/images/generations)✅ Supported
Video Generation (/v1/videos/generations)✅ Supported
Anthropic Messages API (/v1/messages)🧪 Experimental
KServe gRPC v2 API✅ Supported
Streaming responses (SSE)✅ Supported
Multi-model serving✅ Supported
Integrated KV-aware routing✅ Supported
Tool calling✅ Supported
TLS (HTTPS)✅ Supported
Swagger UI (/docs)✅ Supported
NVIDIA request extensions (nvext)✅ Supported

Quick Start

Prerequisites

  • Dynamo platform installed
  • etcd and nats-server -js running
  • At least one backend worker registered

HTTP Frontend

$python -m dynamo.frontend --http-port 8000

This starts an OpenAI-compatible HTTP server with integrated pre/post processing and routing. Backends are auto-discovered when they call register_model.

The frontend does the pre and post processing. To do this it will need access to the model configuration files: config.json, tokenizer.json, tokenizer_config.json, etc. It does not need the weights.

Frontend will download the files it needs from Hugging Face, no setup is required. However we recommend setting up modelexpress-server and a shared folder such as a Kubernetes PVC. This ensures the model is only downloaded once across the whole cluster.

If the model is not available on Hugging Face, such as a private or customized model, you will need to make the model files available locally at the same file path as on the backend. The backend’s --model-path <here> will need to exist on the frontend and contain at least the configuration (JSON) files.

KServe gRPC Frontend

$python -m dynamo.frontend --kserve-grpc-server

See the Frontend Guide for KServe-specific configuration and message formats.

Kubernetes

1apiVersion: nvidia.com/v1alpha1
2kind: DynamoGraphDeployment
3metadata:
4 name: frontend-example
5spec:
6 graphs:
7 - name: frontend
8 replicas: 1
9 services:
10 - name: Frontend
11 image: nvcr.io/nvidia/dynamo/dynamo-vllm:latest
12 command:
13 - python
14 - -m
15 - dynamo.frontend
16 - --http-port
17 - "8000"

Configuration

ParameterDefaultDescription
--http-port8000HTTP server port
--kserve-grpc-serverfalseEnable KServe gRPC server
--router-moderound-robinRouting strategy: round-robin, random, kv, direct

See the Frontend Guide for full configuration options.

Next Steps

DocumentDescription
Configuration ReferenceAll CLI arguments, env vars, and HTTP endpoints
Frontend GuideKServe gRPC configuration and integration
NVIDIA Request Extensions (nvext)Custom request fields for routing hints and cache control
Router DocumentationKV-aware routing configuration