The scripts in the examples/<backend>/launch folder, such as agg.sh, demonstrate how to serve your models locally.
The corresponding YAML files, such as agg.yaml, show how to create a Kubernetes deployment for your inference graph.
This guide explains how to create your own deployment files.
Before choosing a template, understand the different architecture patterns:
Pattern: Prefill and decode on the same GPU in a single process.
Suggested to use for:
Tradeoffs:
Example: agg.yaml
Pattern: Load balancer routing across multiple aggregated worker instances.
Suggested to use for:
Tradeoffs:
Example: agg_router.yaml
Pattern: Separate prefill and decode workers with specialized optimization.
Suggested to use for:
Tradeoffs:
Example: disagg_router.yaml
Select the architecture pattern that best fits your use case and use its YAML file as your template.
For example, when using the vLLM backend:
Development / Testing: Use agg.yaml as the base configuration.
Production with Load Balancing: Use agg_router.yaml to enable scalable, load-balanced inference.
High Performance / Disaggregated Deployment: Use disagg_router.yaml for maximum throughput and modular scalability.
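The selection above can be captured as a small lookup, which may be handy in tooling that generates deployment files. This is only an illustrative sketch: the file names come from this guide, while the use-case keys and the pick_template helper are hypothetical names introduced here.

```python
# File names are taken from this guide; the dictionary keys are
# hypothetical labels chosen for this example.
TEMPLATES = {
    "development": "agg.yaml",
    "production_load_balanced": "agg_router.yaml",
    "high_performance_disaggregated": "disagg_router.yaml",
}


def pick_template(use_case: str) -> str:
    """Return the example YAML file to start from for a given use case."""
    try:
        return TEMPLATES[use_case]
    except KeyError:
        known = ", ".join(sorted(TEMPLATES))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(pick_template("development"))
```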
You can run the Frontend on one machine (for example, a CPU node) and the worker on a different machine (a GPU node). The Frontend serves as a framework-agnostic HTTP entry point and is unlikely to need many changes.
It serves the following roles:
/v1/chat/completions endpoint
You should then pick a worker and specialize the config. For example,
Here's a template structure based on the examples:
Consult the corresponding .sh file. Each Python command that launches a component goes into your YAML spec under
extraPodSpec: -> mainContainer: -> args:
The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
Each worker is launched with a command of the form python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags.
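To make the nesting concrete, here is a minimal Python sketch that builds the launch commands described above and places each one under extraPodSpec -> mainContainer -> args. The helper names are hypothetical, "vllm" is used only as an example backend, and storing the command as a single string in args is an assumption; check the example YAML for your backend to see whether it expects one shell string or separate tokens. The output is printed as JSON, which is also valid YAML, so it can be compared against a template by eye.

```python
import json
import shlex


def frontend_command(http_port=None, router_mode=None):
    """Build the frontend launch command; both flags are optional."""
    parts = ["python3", "-m", "dynamo.frontend"]
    if http_port is not None:
        parts += ["--http-port", str(http_port)]
    if router_mode is not None:
        parts += ["--router-mode", router_mode]
    return shlex.join(parts)


def worker_command(backend, model, extra_flags=()):
    """Build a worker launch command for the given backend module."""
    parts = ["python", "-m", f"dynamo.{backend}", "--model", model, *extra_flags]
    return shlex.join(parts)


def pod_spec_fragment(command):
    """Nest a launch command under extraPodSpec -> mainContainer -> args.

    Assumption: args holds the whole command as one string.
    """
    return {"extraPodSpec": {"mainContainer": {"args": [command]}}}


frontend = pod_spec_fragment(frontend_command(http_port=8000, router_mode="kv"))
worker = pod_spec_fragment(worker_command("vllm", "YOUR_MODEL"))

if __name__ == "__main__":
    print(json.dumps({"frontend": frontend, "worker": worker}, indent=2))
```

Keeping command construction separate from the spec fragment makes it easy to copy a command from the .sh file, adjust its flags, and drop the result into the matching component.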
By default, the Dynamo operator automatically discovers and injects image pull secrets based on container registry host matching. The operator scans Docker config secrets within the same namespace and matches their registry hostnames to the container image URLs, automatically injecting the appropriate secrets into the pod's imagePullSecrets.
Disabling Automatic Discovery: To disable this behavior for a component and manually control image pull secrets:
When disabled, you can manually specify secrets as you would for a normal pod spec via:
This automatic discovery eliminates the need to manually configure image pull secrets for each deployment.
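The registry host matching described above can be pictured with a short sketch. This is not the operator's implementation, only an illustration of the idea: the function names, the pre-parsed secrets mapping, and the secret and image names are all hypothetical. Host extraction follows the common Docker convention that the first path component is a registry only if it contains a dot or a colon, or is localhost.

```python
def registry_host(image: str) -> str:
    """Extract the registry host from an image reference.

    The first path component counts as a registry host only if it contains
    '.' or ':' or equals 'localhost'; otherwise docker.io is assumed.
    """
    first, sep, _rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def matching_secrets(image, secrets):
    """Return the names of secrets whose registry hosts include the image's host.

    `secrets` maps a secret name to the set of registry hosts listed in its
    Docker config (a hypothetical, already-parsed form).
    """
    host = registry_host(image)
    return sorted(name for name, hosts in secrets.items() if host in hosts)


# Hypothetical secrets in the deployment's namespace.
secrets = {
    "nvcr-secret": {"nvcr.io"},
    "internal-secret": {"registry.example.com:5000"},
}

if __name__ == "__main__":
    print(matching_secrets("nvcr.io/example/image:tag", secrets))
```

An image whose host matches no secret yields an empty list, which corresponds to a pod that needs either a public image or manually specified secrets.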
After your base model deployment is running, you can deploy LoRA adapters using the DynamoModel custom resource. This lets you serve fine-tuned variants of your models without modifying the base deployment.
To add a LoRA adapter to your deployment, link it using modelRef in your worker configuration:
Then create a DynamoModel resource for your LoRA:
For complete details on managing models and LoRA adapters, see the Managing Models with DynamoModel Guide.