Working with Dynamo Kubernetes Operator
Overview
The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It continuously reconciles custom resources so that the cluster converges on the state you declare. It suits users who want to manage complex deployments with declarative YAML definitions and Kubernetes-native tooling.
Architecture
Operator Deployment: Deployed as a Kubernetes Deployment in a specific namespace.
Controllers:
DynamoGraphDeploymentController: Watches DynamoGraphDeployment CRs and orchestrates graph deployments.
DynamoComponentDeploymentController: Watches DynamoComponentDeployment CRs and handles individual component deployments.
DynamoComponentController: Watches DynamoComponent CRs and manages image builds and artifact tracking.
Workflow:
A custom resource is created by the user or API server.
The corresponding controller detects the change and runs reconciliation.
Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
Status fields are updated to reflect the current state.
Custom Resource Definitions (CRDs)
CRD: DynamoGraphDeployment
| Field | Type | Description | Required | Default |
|---|---|---|---|---|
| dynamoComponent | string | Reference to the dynamoComponent identifier | Yes | |
| services | map | Map of service names to runtime configurations. This allows the user to override the service configuration defined in the DynamoComponentDeployment. | No | |

API Version: nvidia.com/v1alpha1
Scope: Namespaced
Example

    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeployment
    metadata:
      name: disagg
    spec:
      dynamoComponent: frontend:jh2o6dqzpsgfued4
      envs:
        - name: GLOBAL_ENV_VAR
          value: some_global_value
      services:
        Frontend:
          replicas: 1
          envs:
            - name: SPECIFIC_ENV_VAR
              value: some_specific_value
        Processor:
          replicas: 1
          envs:
            - name: SPECIFIC_ENV_VAR
              value: some_specific_value
        VllmWorker:
          replicas: 1
          envs:
            - name: SPECIFIC_ENV_VAR
              value: some_specific_value
        PrefillWorker:
          replicas: 1
          envs:
            - name: SPECIFIC_ENV_VAR
              value: some_specific_value
        Router:
          replicas: 0
          envs:
            - name: SPECIFIC_ENV_VAR
              value: some_specific_value

CRD: DynamoComponentDeployment
| Field | Type | Description | Required | Default |
|---|---|---|---|---|
| dynamoNamespace | string | Namespace of the DynamoComponent | Yes | |
| dynamoComponent | string | Name of the dynamoComponent artifact | Yes | |
| dynamoTag | string | FQDN of the service to run | Yes | |
| serviceName | string | Logical name of the service being deployed | Yes | |
| envs | array | Environment variables for runtime | No | |
| externalServices | map | External service dependencies | No | |
| | map | Additional metadata annotations for the pod | No | |
| | map | Custom labels applied to the deployment and pod | No | |
| resources | object | Resource limits and requests (CPU, memory, GPU) | No | |
| | object | Autoscaling rules for the deployment | No | |
| | string | Reference to a secret for injecting env vars | No | |
| pvc | object | Persistent volume claim configuration | No | |
| | object | Ingress configuration for exposing the service | No | |
| | object | Additional labels and annotations for the pod | No | |
| | object | Custom PodSpec fields to merge into the generated pod | No | |
| | object | Kubernetes liveness probe | No | |
| | object | Kubernetes readiness probe | No | |
| replicas | int | Number of replicas to run | No | |

API Version: nvidia.com/v1alpha1
Scope: Namespaced
Example

    apiVersion: nvidia.com/v1alpha1
    kind: DynamoComponentDeployment
    metadata:
      name: test-41fa991-vllmworker
    spec:
      dynamoNamespace: dynamo
      dynamoComponent: frontend:jh2o6dqzpsgfued4
      dynamoTag: graphs.disagg:Frontend
      envs:
        - name: DYN_DEPLOYMENT_CONFIG
          value: '<long JSON config>'
      externalServices:
        PrefillWorker:
          deploymentSelectorKey: dynamo
          deploymentSelectorValue: PrefillWorker/dynamo
      resources:
        limits:
          cpu: "10"
          gpu: "1"
          memory: 20Gi
        requests:
          cpu: "500m"
          gpu: "1"
          memory: 20Gi
      serviceName: Frontend

CRD: DynamoComponent
| Field | Type | Description | Required | Default |
|---|---|---|---|---|
| dynamoComponent | string | Name of the dynamoComponent artifact | Yes | |
| | string | Custom container image. If not specified, an image will be built | No | |
| | Duration | Timeout duration for the image building process | No | |
| | []string | Additional arguments to pass to the container image build process | No | |
| | ExtraPodMetadata | Additional metadata to add to the image builder pod | No | |
| | ExtraPodSpec | Additional pod spec configurations for the image builder pod | No | |
| | []EnvVar | Additional environment variables for the image builder container | No | |
| | ResourceRequirements | Resource requirements (CPU, memory) for the image builder container | No | |
| | []LocalObjectReference | Secrets required for pulling private container images | No | |
| | string | Name of the secret containing Docker registry credentials | No | |
| | []EnvFromSource | Environment variables to be sourced for the downloader container | No | |

API Version: nvidia.com/v1alpha1
Scope: Namespaced
Example

    apiVersion: nvidia.com/v1alpha1
    kind: DynamoComponent
    metadata:
      name: frontend--jh2o6dqzpsgfued4
    spec:
      dynamoComponent: frontend:jh2o6dqzpsgfued4

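
In the example above, the resource name looks like the tag with ":" replaced by "--", which keeps it a valid Kubernetes object name. The sketch below renders such a manifest from a tag. The helper name `render_dynamo_component` is hypothetical, and whether the operator requires this exact naming scheme is an assumption drawn only from the example.

```shell
# Print a DynamoComponent manifest for a given tag.
# Assumption: the name is the tag with ":" replaced by "--", as in the example.
render_dynamo_component() {
  tag=$1
  name=$(printf '%s' "$tag" | sed 's/:/--/')
  cat <<EOF
apiVersion: nvidia.com/v1alpha1
kind: DynamoComponent
metadata:
  name: ${name}
spec:
  dynamoComponent: ${tag}
EOF
}

render_dynamo_component "frontend:jh2o6dqzpsgfued4"
```

The output could be piped to `kubectl apply -f -` or written to a file tracked in Git.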
Installation
GitOps Deployment with FluxCD
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We’ll use the aggregated vLLM example to demonstrate the workflow.
Prerequisites
A Kubernetes cluster with Dynamo Cloud installed
FluxCD installed in your cluster
A Git repository to store your deployment configurations
Dynamo CLI installed locally
Workflow Overview
The GitOps workflow for Dynamo deployments consists of three main steps:
Build and push a pipeline to the Dynamo API store
Create and commit a DynamoGraphDeployment custom resource for initial deployment
Update the pipeline by building a new version and updating the CR for subsequent updates
Step 1: Build and Push Pipeline
First, build and push your pipeline using the Dynamo CLI:

    # Set your project root directory
    export PROJECT_ROOT=$(pwd)
    # Configure environment variables
    export KUBE_NS=dynamo-cloud
    export DYNAMO_CLOUD=http://localhost:8080 # If using port-forward
    # OR
    # export DYNAMO_CLOUD=https://dynamo-cloud.nvidia.com # If using Ingress/VirtualService
    # Build and push the service
    cd "$PROJECT_ROOT/examples/llm"
    DYNAMO_TAG=$(dynamo build --push graphs.agg:Frontend | grep "Successfully built" | awk '{ print $NF }' | sed 's/\.$//')

The --push flag ensures the pipeline is pushed to the remote API store, making it available for deployment.
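
The DYNAMO_TAG assignment above chains grep, awk, and sed to pull the tag out of the build output. The sketch below runs the same three filters on a made-up sample so each stage can be checked without a real build. The sample text is an assumption inferred from the filters themselves (a line containing "Successfully built" that ends in the tag followed by a period), and `extract_dynamo_tag` is a hypothetical helper name.

```shell
# Hypothetical build output; the real wording printed by `dynamo build`
# is assumed here, not taken from the CLI.
sample_output='Building graph...
Successfully built frontend:jh2o6dqzpsgfued4.'

# Apply the same filters as the Step 1 command to whatever arrives on stdin:
# keep the success line, take its last field, strip the trailing period.
extract_dynamo_tag() {
  grep "Successfully built" | awk '{ print $NF }' | sed 's/\.$//'
}

DYNAMO_TAG=$(printf '%s\n' "$sample_output" | extract_dynamo_tag)
echo "$DYNAMO_TAG"

# Warn if nothing matched, so an empty tag never reaches the CR.
[ -n "$DYNAMO_TAG" ] || echo "could not extract a tag from the build output" >&2
```

Checking for an empty result is worthwhile because the pipeline itself exits successfully even when grep finds no matching line.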
Step 2: Create Initial Deployment
Create a new file in your Git repository (e.g., deployments/llm-agg.yaml) with the following content:

    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeployment
    metadata:
      name: llm-agg
    spec:
      dynamoComponent: frontend:jh2o6dqzpsgfued4 # Use the tag from Step 1
      services:
        Frontend:
          replicas: 1
          envs:
            - name: SPECIFIC_ENV_VAR
              value: some_specific_value
        Processor:
          replicas: 1
          envs:
            - name: SPECIFIC_ENV_VAR
              value: some_specific_value
        VllmWorker:
          replicas: 1
          envs:
            - name: SPECIFIC_ENV_VAR
              value: some_specific_value
          # Add PVC for model storage
          pvc:
            name: vllm-model-storage
            mountPath: /models
            size: 100Gi

Commit and push this file to your Git repository. FluxCD will detect the new CR and create the initial deployment in your cluster. The operator will:
Create the specified PVCs
Build container images for all components
Deploy the services with the configured resources
Step 3: Update Existing Deployment
To update your pipeline:
Build and push a new version of your pipeline:

    DYNAMO_TAG=$(dynamo build --push graphs.agg:Frontend | grep "Successfully built" | awk '{ print $NF }' | sed 's/\.$//')

Update the dynamoComponent field in your CR with the new tag:

    spec:
      dynamoComponent: frontend:new_tag_here # Use the tag printed by the new build

Commit and push the changes to your Git repository.
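
The edit-and-commit step can be scripted. The following is a minimal sketch, assuming the CR lives in a single YAML file with one dynamoComponent line; `set_dynamo_component` is a hypothetical helper, and the demonstration runs on a temporary file rather than a real repository. Note that the replacement drops any trailing comment on that line, and a tag containing "|" or "&" would need escaping.

```shell
# Rewrite the dynamoComponent line of a CR file (portable: avoids `sed -i`).
# Arguments: $1 = path to the CR file, $2 = new tag, e.g. "$DYNAMO_TAG".
set_dynamo_component() {
  cr_file=$1
  new_tag=$2
  tmp_file=$(mktemp)
  # Keep the original indentation and key, replace everything after the key.
  sed "s|^\([[:space:]]*dynamoComponent:\).*|\1 ${new_tag}|" "$cr_file" > "$tmp_file" &&
    mv "$tmp_file" "$cr_file"
}

# Demonstration on a throwaway file holding the start of the Step 2 spec.
demo_file=$(mktemp)
printf 'spec:\n  dynamoComponent: frontend:jh2o6dqzpsgfued4 # Use the tag from Step 1\n' > "$demo_file"
set_dynamo_component "$demo_file" "frontend:new_tag_here"
cat "$demo_file"

# In a real repository the change would then be committed, for example:
#   git add deployments/llm-agg.yaml
#   git commit -m "Update llm-agg to $DYNAMO_TAG"
#   git push
```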
The Dynamo operator will:
Detect the updated CR
Build new container images for the updated components
Perform a rolling update of the deployments once the new images are built and the updated components are ready to serve traffic
Preserve existing PVCs and their data
Monitoring the Deployment
You can monitor the deployment status using:

    # Check the DynamoGraphDeployment status
    kubectl get dynamographdeployment llm-agg -n $KUBE_NS
    # Check the component deployments
    kubectl get dynamocomponentdeployment -n $KUBE_NS

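
For scripted checks, for instance in CI after a Git push, the status can be polled instead of read by eye. This sketch assumes the .status.state field described under Reconciliation Logic and reads it with kubectl's jsonpath output. The function name `wait_for_graph_state` is hypothetical, and because the possible state strings are not listed here, the wanted value is passed in as a parameter rather than hard-coded.

```shell
# Poll .status.state of a DynamoGraphDeployment until it equals the wanted value.
# Arguments: $1 = resource name, $2 = namespace, $3 = wanted state string,
# $4 = max attempts (default 30), $5 = seconds between attempts (default 10).
wait_for_graph_state() {
  name=$1 ns=$2 wanted=$3 attempts=${4:-30} interval=${5:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(kubectl get dynamographdeployment "$name" -n "$ns" -o jsonpath='{.status.state}')
    if [ "$state" = "$wanted" ]; then
      echo "state=$state"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for state=$wanted (last seen: $state)" >&2
  return 1
}
```

A call might look like `wait_for_graph_state llm-agg "$KUBE_NS" <state>`, with `<state>` replaced by whichever value your operator version reports for a healthy deployment.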
Deploying a Dynamo Pipeline using the Operator
Reconciliation Logic
DynamoGraphDeployment
Actions:
Create a DynamoComponent CR to build the Docker image
Create a DynamoComponentDeployment CR for each component defined in the Dynamo graph being deployed
Status Management:
.status.conditions: Reflects readiness, failure, and progress states
.status.state: Overall state of the deployment, based on the state of the DynamoComponentDeployments
DynamoComponentDeployment
Actions:
Create a Deployment, Service, and Ingress for the service
Status Management:
.status.conditions: Reflects readiness, failure, and progress states
DynamoComponent
Actions:
Create a Job to build the Docker image
Status Management:
.status.conditions: Reflects readiness, failure, and progress states
Configuration
Environment Variables:
| Name | Description | Default |
|---|---|---|
| | Adds namespace prefix to image names | |
| | Engine used for building images | |
| | BuildKit daemon URL | |
| | Repository name for dynamo images | |
| | Use secure connection for registry | |
| | Docker registry server address | |
| | Registry authentication username | |
| | Enable eStargz image optimization | |
| | BuildKit image | |
| | Logging verbosity level | |
| | API store service endpoint | |
| | Namespace for image building | |
| | System namespace | |

Flags:
| Flag | Description | Default |
|---|---|---|
| --natsAddr | Address of NATS server | "" |
| --etcdAddr | Address of etcd server | "" |

Troubleshooting
| Symptom | Possible Cause | Solution |
|---|---|---|
| Resource not created | RBAC missing | Ensure correct ClusterRole/Binding |
| Status not updated | CRD schema mismatch | Regenerate CRDs with kubebuilder |
| Image build hangs | Misconfigured DynamoComponent | Check image build logs |
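
The checks in the table can be started from the command line. The sketch below wraps standard kubectl subcommands (`auth can-i`, `describe`, `get events`) in two hypothetical helpers. The operator's service account name is not given in this document, so it is a parameter, and using `deployments` as the probed resource is only an illustration of the RBAC check.

```shell
# Ask the API server whether a service account may create Deployments.
# Arguments: $1 = service account name (not specified in this guide), $2 = namespace.
check_operator_rbac() {
  kubectl auth can-i create deployments -n "$2" --as "system:serviceaccount:$2:$1"
}

# Show the conditions and recent events for a DynamoComponent whose build hangs.
# Arguments: $1 = DynamoComponent name, $2 = namespace.
diagnose_component() {
  # Status conditions recorded on the CR.
  kubectl describe dynamocomponent "$1" -n "$2"
  # Namespace events sorted by time, which usually surface build pod failures.
  kubectl get events -n "$2" --sort-by=.lastTimestamp
}
```

For example, `diagnose_component frontend--jh2o6dqzpsgfued4 "$KUBE_NS"` would inspect the DynamoComponent from the earlier example.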
Development
Code Structure:
The operator is built using Kubebuilder and the operator-sdk, with the following structure:
controllers/: Reconciliation logic
api/v1alpha1/: CRD types
config/: Manifests and Helm charts