SGLang Kubernetes Deployment Configurations#
This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the DynamoGraphDeployment resource.
Available Deployment Patterns#
1. Aggregated Deployment (agg.yaml)#
Basic deployment pattern with a frontend and a single worker that handles both prefill and decode.
Architecture:
Frontend: OpenAI-compatible API server
SGLangDecodeWorker: Single worker handling both prefill and decode
2. Aggregated Router Deployment (agg_router.yaml)#
Enhanced aggregated deployment with KV cache routing capabilities.
Architecture:
Frontend: OpenAI-compatible API server with router mode enabled (--router-mode kv)
SGLangDecodeWorker: Single worker handling both prefill and decode
3. Disaggregated Deployment (disagg.yaml)#
High-performance deployment with separated prefill and decode workers.
Architecture:
Frontend: HTTP API server coordinating between workers
SGLangDecodeWorker: Specialized decode-only worker (--disaggregation-mode decode)
SGLangPrefillWorker: Specialized prefill-only worker (--disaggregation-mode prefill)
Communication via NIXL transfer backend (--disaggregation-transfer-backend nixl)
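The split between the two workers can be sketched as a manifest. The snippet below is an illustration, not the contents of disagg.yaml: it uses only the fields and flags named in this document, the deployment name and image are placeholders, and the worker entry point and model arguments are deliberately left out. It writes the sketch to a scratch file through a heredoc and applies nothing to the cluster.

```shell
# Sketch only: write an illustrative disaggregated layout to a scratch file.
# "sglang-disagg-sketch" and the image are placeholders; the shipped
# disagg.yaml may set additional fields that are not shown here.
DISAGG_SKETCH="${DISAGG_SKETCH:-$(mktemp)}"
cat > "$DISAGG_SKETCH" <<'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-disagg-sketch
spec:
  services:
    Frontend:
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag
    SGLangDecodeWorker:
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag
          args:
            # entry point and model arguments omitted in this sketch
            - "--disaggregation-mode"
            - "decode"
            - "--disaggregation-transfer-backend"
            - "nixl"
    SGLangPrefillWorker:
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag
          args:
            # entry point and model arguments omitted in this sketch
            - "--disaggregation-mode"
            - "prefill"
            - "--disaggregation-transfer-backend"
            - "nixl"
EOF
echo "wrote $DISAGG_SKETCH"
```

Both workers name the same transfer backend, and only the value passed to --disaggregation-mode differs between them.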
CRD Structure#
All templates use the DynamoGraphDeployment CRD:
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
Key Configuration Options#
Resource Management:
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
Container Configuration:
extraPodSpec:
  mainContainer:
    image: my-registry/sglang-runtime:my-tag
    workingDir: /workspace/components/backends/sglang
    args:
      - "python3"
      - "-m"
      - "dynamo.sglang.worker"
      # Model-specific arguments
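Assembled into one file, the CRD skeleton and the two fragments above might look like the following. This is a sketch under assumptions: the metadata name sglang-agg-sketch is a placeholder, placing resources and extraPodSpec directly under the service is inferred from the fragments in this section, and the shipped templates may set further fields. The heredoc writes to a scratch file so the result can be reviewed before any kubectl apply.

```shell
# Sketch only: combine the CRD skeleton, resource block, and container block.
# The deployment name is a placeholder, and the nesting of "resources" and
# "extraPodSpec" under the service is an assumption based on this section.
AGG_SKETCH="${AGG_SKETCH:-$(mktemp)}"
cat > "$AGG_SKETCH" <<'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg-sketch
spec:
  services:
    SGLangDecodeWorker:
      envFromSecret: hf-token-secret
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag
          workingDir: /workspace/components/backends/sglang
          args:
            - "python3"
            - "-m"
            - "dynamo.sglang.worker"
            - "--model-path"
            - "your-org/your-model"
            - "--served-model-name"
            - "your-org/your-model"
EOF
echo "wrote $AGG_SKETCH"
```

Keeping requests equal to limits, as the fragment above does, gives the pod a fixed CPU and memory reservation.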
Prerequisites#
Before using these templates, ensure you have:
Dynamo Cloud Platform installed - See Installing Dynamo Cloud
Kubernetes cluster with GPU support
Container registry access for SGLang runtime images
HuggingFace token secret (referenced as envFromSecret: hf-token-secret)
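A quick way to see which of these prerequisites are already in place is a small shell helper. The function below is a hypothetical convenience, not part of the templates. It assumes kubectl is configured for the target cluster, that the CRD is registered as dynamographdeployments.nvidia.com (plural kind plus API group, an assumption), and that GPUs are advertised under the nvidia.com/gpu resource name, as the NVIDIA device plugin does.

```shell
# Hypothetical helper: report which prerequisites appear to be satisfied.
# Returns 0 only if every check passes.
check_prereqs() {
  if [ -z "${NAMESPACE:-}" ]; then
    echo "NAMESPACE is not set" >&2
    return 1
  fi
  prereq_status=0

  # CRD name is an assumption; adjust it if your installation differs.
  if kubectl get crd dynamographdeployments.nvidia.com >/dev/null 2>&1; then
    echo "ok: DynamoGraphDeployment CRD found"
  else
    echo "missing: DynamoGraphDeployment CRD"
    prereq_status=1
  fi

  # The secret name matches the envFromSecret reference in the templates.
  if kubectl get secret hf-token-secret -n "$NAMESPACE" >/dev/null 2>&1; then
    echo "ok: hf-token-secret exists in $NAMESPACE"
  else
    echo "missing: hf-token-secret in $NAMESPACE"
    prereq_status=1
  fi

  # Collect allocatable GPU counts per node; empty output means none reported.
  gpus=$(kubectl get nodes \
    -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}' 2>/dev/null)
  if [ -n "$gpus" ]; then
    echo "ok: allocatable GPUs per node: $gpus"
  else
    echo "missing: no node reports nvidia.com/gpu"
    prereq_status=1
  fi

  return "$prereq_status"
}
```

Run it as `NAMESPACE=my-namespace check_prereqs`. A missing secret at this stage is expected if the token secret has not been created yet.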
Usage#
1. Choose Your Template#
Select the deployment pattern that matches your requirements:
Use agg.yaml for development/testing
Use agg_router.yaml for production with load balancing
Use disagg.yaml for maximum performance
2. Customize Configuration#
Edit the template to match your environment:
# Update image registry and tag
image: your-registry/sglang-runtime:your-tag
# Configure your model
args:
  - "--model-path"
  - "your-org/your-model"
  - "--served-model-name"
  - "your-org/your-model"
3. Deploy#
Deployment takes two steps. Both assume that NAMESPACE is set to the target Kubernetes namespace.
First, create a secret for the HuggingFace token.
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
Then, deploy the model using the deployment file.
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
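The commands in this step read NAMESPACE without setting it, and an empty value would make kubectl fail or target the wrong namespace. A minimal guard, sketched here as a hypothetical wrapper named deploy_template, checks the inputs before calling the same kubectl apply and then lists the pods so startup progress is visible.

```shell
# Hypothetical wrapper: validate inputs, apply the template, list the pods.
# Falls back to DEPLOYMENT_FILE, then agg.yaml, when no file is given.
deploy_template() {
  file="${1:-${DEPLOYMENT_FILE:-agg.yaml}}"
  if [ -z "${NAMESPACE:-}" ]; then
    echo "NAMESPACE is not set" >&2
    return 1
  fi
  if [ ! -f "$file" ]; then
    echo "deployment file not found: $file" >&2
    return 1
  fi
  kubectl apply -f "$file" -n "$NAMESPACE" || return 1
  # Show the pods in the namespace so startup progress can be followed.
  kubectl get pods -n "$NAMESPACE"
}
```

For example, `NAMESPACE=my-namespace deploy_template agg.yaml`.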
4. Using Custom Dynamo Frameworks Image for SGLang#
To use a custom Dynamo frameworks image for SGLang, rewrite the image field of every service in the deployment file with yq, then apply the generated file:
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>
yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
Model Configuration#
All templates use DeepSeek-R1-Distill-Llama-8B as the default model, but you can pass any SGLang argument or configuration.
Monitoring and Health#
Frontend health endpoint: http://<frontend-service>:8000/health
Liveness probes: Check process health every 60s
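Because model loading can take a while, a single request to the health endpoint right after deployment may fail even when nothing is wrong. The helper below is a hypothetical sketch that polls a URL with curl until it answers successfully or the retries run out. The retry count and delay defaults are arbitrary choices, not values taken from the templates.

```shell
# Hypothetical helper: poll a health URL until it succeeds or retries run out.
# Usage: wait_for_health URL [RETRIES] [DELAY_SECONDS]
wait_for_health() {
  url="$1"
  retries="${2:-30}"
  delay="${3:-10}"
  if [ -z "$url" ]; then
    echo "usage: wait_for_health URL [RETRIES] [DELAY_SECONDS]" >&2
    return 2
  fi
  attempt=0
  while [ "$attempt" -lt "$retries" ]; do
    # -f makes curl return non-zero on HTTP error statuses.
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy: $url"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not healthy after $retries attempts: $url" >&2
  return 1
}

# From inside the cluster:
#   wait_for_health "http://<frontend-service>:8000/health"
# From a workstation, forward the port first:
#   kubectl port-forward svc/<frontend-service> 8000:8000 -n "$NAMESPACE" &
#   wait_for_health "http://localhost:8000/health"
```

The service name stays a placeholder here, as in the endpoint above. `kubectl get svc -n "$NAMESPACE"` shows the actual name.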
Further Reading#
Deployment Guide: Creating Kubernetes Deployments
Quickstart: Deployment Quickstart
Platform Setup: Dynamo Cloud Installation
Examples: Deployment Examples
Kubernetes CRDs: Custom Resources Documentation
Troubleshooting#
Common issues and solutions:
Pod fails to start: Check image registry access and HuggingFace token secret
GPU not allocated: Verify cluster has GPU nodes and proper resource limits
Health check failures: Review model loading logs and increase initialDelaySeconds
Out of memory: Increase memory limits or reduce model batch size
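Most of the issues above can be narrowed down with three standard kubectl commands. The function below is a hypothetical wrapper that runs them for one pod. The name diagnose_pod and the 100-line log tail are illustrative choices.

```shell
# Hypothetical helper: gather common diagnostics for a single pod.
# Usage: diagnose_pod POD_NAME [NAMESPACE]
diagnose_pod() {
  pod="$1"
  ns="${2:-${NAMESPACE:-default}}"
  if [ -z "$pod" ]; then
    echo "usage: diagnose_pod POD_NAME [NAMESPACE]" >&2
    return 2
  fi
  echo "== describe =="
  # Shows scheduling, image pull, probe, and container termination details.
  kubectl describe pod "$pod" -n "$ns"
  echo "== recent logs =="
  # Model loading progress and errors usually appear here.
  kubectl logs "$pod" -n "$ns" --tail=100
  echo "== events =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```

In the describe output, image pull errors point to registry access, unschedulable pods point to missing GPU capacity, and a container terminated with reason OOMKilled points to the memory limit.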
For additional support, refer to the deployment troubleshooting guide.