Working with Dynamo Kubernetes Operator#

Overview#

The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It continuously reconciles custom resources so that the cluster converges on the desired state you declare. It is aimed at users who want to manage complex deployments with declarative YAML definitions and Kubernetes-native tooling.

Architecture#

  • Operator Deployment: Deployed as a Kubernetes Deployment in a specific namespace.

  • Controllers:

    • DynamoGraphDeploymentController: Watches DynamoGraphDeployment CRs and orchestrates graph deployments.

    • DynamoComponentDeploymentController: Watches DynamoComponentDeployment CRs and handles individual component deployments.

  • Workflow:

    1. A custom resource is created by the user or API server.

    2. The corresponding controller detects the change and runs reconciliation.

    3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.

    4. Status fields are updated to reflect the current state.

Custom Resource Definitions (CRDs)#

CRD: DynamoGraphDeployment#

| Field | Type | Description | Required | Default |
|---|---|---|---|---|
| services | map | Map of service names to runtime configurations. This allows the user to override the service configuration defined in the DynamoComponentDeployment. | Yes | |
| envs | list | List of global environment variables. | No | |

API Version: nvidia.com/v1alpha1

Scope: Namespaced

Example#

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: disagg
spec:
  envs:
  - name: GLOBAL_ENV_VAR
    value: some_global_value
  services:
    Frontend:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    Processor:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    VllmWorker:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    PrefillWorker:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value

Installation#

See installation steps

GitOps Deployment with FluxCD#

This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We’ll use the aggregated vLLM example to demonstrate the workflow.

Prerequisites#

  • A Kubernetes cluster with Dynamo Cloud installed

  • FluxCD installed in your cluster

  • A Git repository to store your deployment configurations

Workflow Overview#

The GitOps workflow for Dynamo deployments consists of three main steps:

  1. Build and push the Dynamo Operator

  2. Create and commit a DynamoGraphDeployment custom resource (CR) for the initial deployment

  3. For subsequent updates, build a new version of the graph and update the CR

Step 1: Build and Push Dynamo Cloud Operator#

First, follow the steps in Install Dynamo Cloud.

Step 2: Create Initial Deployment#

Create a new file in your Git repository (e.g., deployments/llm-agg.yaml) with the following content:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-agg
spec:
  services:
    Frontend:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    Processor:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    VllmWorker:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
      # Add PVC for model storage
      pvc:
        name: vllm-model-storage
        mountPath: /models
        size: 100Gi

Commit and push this file to your Git repository. FluxCD will detect the new CR and create the initial deployment in your cluster. The operator will:

  • Create the specified PVCs

  • Build container images for all components

  • Deploy the services with the configured resources
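For FluxCD to detect the committed file, it needs a source that points at the repository and a Kustomization that applies the directory containing the CR. The manifests below are a minimal sketch and are not taken from the Dynamo documentation: the repository URL, branch, resource names, the `flux-system` namespace, and the `targetNamespace` value are all placeholders to adapt to your setup.

```yaml
# Hypothetical FluxCD source: tells Flux which Git repository to poll.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: dynamo-deployments          # placeholder name
  namespace: flux-system            # assumes the default Flux namespace
spec:
  interval: 1m                      # how often Flux checks the repository
  url: https://example.com/your-org/your-repo.git   # placeholder URL
  ref:
    branch: main                    # placeholder branch
---
# Hypothetical Kustomization: applies everything under ./deployments.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: dynamo-deployments          # placeholder name
  namespace: flux-system
spec:
  interval: 5m                      # how often Flux re-applies the manifests
  path: ./deployments               # directory holding llm-agg.yaml
  prune: true                       # remove cluster objects deleted from Git
  targetNamespace: dynamo-cloud     # placeholder: namespace for the CRs
  sourceRef:
    kind: GitRepository
    name: dynamo-deployments
```

A private repository would additionally need a `secretRef` on the GitRepository; that is omitted here to keep the sketch short.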

Step 3: Update Existing Deployment#

To update your pipeline, update the associated DynamoGraphDeployment CR in your Git repository and push the change. The Dynamo operator will automatically reconcile it.
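As an illustration, scaling the VllmWorker service could be a one-line change to the file from Step 2. The excerpt below is only a sketch: the new replica count is arbitrary, and the other services and the pvc block from the original file are assumed to stay in place unchanged.

```yaml
# deployments/llm-agg.yaml (excerpt; only the changed service is shown)
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-agg
spec:
  services:
    VllmWorker:
      replicas: 2   # was 1 in the initial deployment
```

Do not commit the excerpt on its own; edit the value inside the complete file so the remaining services are preserved.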

Monitoring the Deployment#

You can monitor the deployment status using:

export NAMESPACE=<namespace-with-the-dynamo-cloud-operator>

# Check the DynamoGraphDeployment status
kubectl get dynamographdeployment llm-agg -n $NAMESPACE

Reconciliation Logic#

DynamoGraphDeployment#

  • Actions:

    • Create a DynamoComponent CR to build the Docker image

    • Create a DynamoComponentDeployment CR for each component defined in the Dynamo graph being deployed

  • Status Management:

    • .status.conditions: Reflects readiness, failure, and progress states

    • .status.state: Overall state of the deployment, derived from the states of its DynamoComponentDeployments

DynamoComponentDeployment#

  • Actions:

    • Create a Deployment, Service, and Ingress for the service

  • Status Management:

    • .status.conditions: Reflects readiness, failure, and progress states

Configuration#

  • Environment Variables:

| Name | Description | Default |
|---|---|---|
| LOG_LEVEL | Logging verbosity level | info |
| DYNAMO_SYSTEM_NAMESPACE | System namespace | dynamo |

  • Flags:

| Flag | Description | Default |
|---|---|---|
| --natsAddr | Address of NATS server | "" |
| --etcdAddr | Address of etcd server | "" |
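These settings would typically be supplied on the operator's own Deployment. The fragment below is a sketch under several assumptions: the container name, the `debug` log level, the NATS and etcd addresses, and the `--flag=value` argument syntax are illustrative and not confirmed by this page. Only the variable and flag names come from the tables above.

```yaml
# Excerpt of the operator Deployment's pod template (hypothetical values)
spec:
  template:
    spec:
      containers:
      - name: manager                   # hypothetical container name
        env:
        - name: LOG_LEVEL
          value: debug                  # assumed level; the default is info
        - name: DYNAMO_SYSTEM_NAMESPACE
          value: dynamo                 # the documented default
        args:
        - --natsAddr=nats://nats.dynamo.svc:4222   # placeholder address
        - --etcdAddr=etcd.dynamo.svc:2379          # placeholder address
```

If the operator is installed through a Helm chart, the chart's values are likely the right place to set these rather than editing the Deployment directly.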

Troubleshooting#

| Symptom | Possible Cause | Solution |
|---|---|---|
| Resource not created | RBAC missing | Ensure correct ClusterRole/Binding |
| Status not updated | CRD schema mismatch | Regenerate CRDs with kubebuilder |
| Image build hangs | Misconfigured DynamoComponent | Check image build logs |
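For the "RBAC missing" case, it can help to compare the operator's permissions against what its controllers need to create. The manifests below are a rough sketch, not the roles shipped with the operator: the role name, the service account name and namespace, and the plural resource names under `nvidia.com` are assumptions inferred from the CR kinds and the resources listed under Reconciliation Logic.

```yaml
# Hypothetical ClusterRole covering the CRs and the resources the
# controllers are described as creating (Deployments, Services,
# Ingresses, PVCs).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynamo-operator-example         # placeholder name
rules:
- apiGroups: ["nvidia.com"]
  resources:                            # assumed plural resource names
  - dynamographdeployments
  - dynamographdeployments/status
  - dynamocomponentdeployments
  - dynamocomponentdeployments/status
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["services", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dynamo-operator-example         # placeholder name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dynamo-operator-example
subjects:
- kind: ServiceAccount
  name: dynamo-operator                 # hypothetical service account
  namespace: dynamo                     # placeholder namespace
```

In practice, prefer the RBAC manifests that come with the operator's installation and use a sketch like this only to spot which rule is missing.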

Development#

  • Code Structure:

The operator is built with Kubebuilder and the Operator SDK and is organized as follows:

  • controllers/: Reconciliation logic

  • api/v1alpha1/: CRD types

  • config/: Manifests and Helm charts

References#