Developing the Operator with Tilt

Fast, live-reload development loop for the Dynamo Kubernetes operator

Overview

Tilt provides a live-reload development environment for the Dynamo Kubernetes operator. Instead of manually building images, pushing to a registry, and redeploying on every change, Tilt watches your source files and automatically recompiles the Go binary, syncs it into the running container, and restarts the process — all in seconds.

Under the hood, the Tiltfile:

  1. Compiles the Go manager binary locally (CGO_ENABLED=0).
  2. Builds a minimal Docker image containing only the binary.
  3. Renders the production Helm chart (deploy/helm/charts/platform) with helm template, applies CRDs via kubectl, and deploys all rendered resources.
  4. Live-updates the binary inside the running container on every code change — no full image rebuild required.

This gives you a fully working cluster where you can apply DynamoGraphDeployment and DynamoGraphDeploymentRequest resources and have them reconcile into real workloads, while iterating on controller logic with feedback in seconds.

Prerequisites

| Tool | Version | Purpose |
|------|---------|---------|
| Tilt | v0.33+ | Development orchestration |
| Helm | v3 | Chart rendering |
| Go | 1.25+ | Compiling the operator |
| kubectl | | Cluster access |
| A Kubernetes cluster | | kind, minikube, or remote cluster |

You also need a container registry that is accessible to your cluster’s nodes, so they can pull the operator image Tilt builds. If you use a local cluster like kind with a local registry, Tilt can push there directly.
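
Before launching Tilt, it can help to confirm that the tools from the table above are on your PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, and it only checks for presence, not for the minimum versions listed above.

```shell
#!/bin/sh
# Print the name of every listed tool that is not on PATH, one per line.
# Prints nothing when all tools are found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage (run manually):
#   missing_tools tilt helm go kubectl
```

Empty output means every tool was found. Version checks (for example `tilt version` or `go version`) still need to be done by hand.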

Quick Start

cd deploy/operator

# Create your personal settings file (gitignored)
cat > tilt-settings.yaml <<EOF
allowed_contexts:
  - my-cluster-context
registry: docker.io/myuser
skip_codegen: true
EOF

# Launch
tilt up

Tilt opens a terminal UI and a web dashboard at http://localhost:10350. The dashboard shows resource status, build logs, and port-forwards.

Press Space in the terminal to open the web UI. Press Ctrl-C to stop Tilt; deployed resources stay in the cluster until you run tilt down.

Figure: Tilt web UI showing the operator, CRDs, and infrastructure resources.

Configuration

All configuration is optional. The Tiltfile defines sensible defaults for every setting, and tilt-settings.yaml is gitignored so your personal values (cluster context, registry, etc.) never leak into the repo.

Create deploy/operator/tilt-settings.yaml with any of the settings below:

# Kubernetes contexts Tilt is allowed to connect to.
# Safety guard: prevents accidental deployments to production clusters.
allowed_contexts:
  - my-cluster-context

# Container registry for the operator image.
# Can also be set via the REGISTRY env var (env var takes precedence).
registry: docker.io/myuser

# Skip running `make generate && make manifests` before applying CRDs.
# Set to true when you haven't changed API types (faster iteration).
skip_codegen: true

# Target namespace for the operator and related resources.
# namespace: dynamo-system

# Subchart toggles
# enable_nats: true            # Required for DGD/DGDR workloads (default: true)
# enable_etcd: false           # Only if discoveryBackend is "etcd"
# enable_kai_scheduler: false  # GPU-aware scheduling for multi-node
# enable_grove: false          # PodClique-based multi-node orchestration

# Extra Helm value overrides (applied on top of subchart toggles)
# helm_values:
#   dynamo-operator.discoveryBackend: kubernetes
#   dynamo-operator.natsAddr: "nats://external-nats:4222"

Settings Reference

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| allowed_contexts | list | (none) | Kubernetes contexts Tilt may connect to. Prevents accidental production deploys. |
| registry | string | "" | Container registry prefix (e.g. docker.io/myuser). Also settable via REGISTRY env var, which takes precedence. |
| namespace | string | dynamo-system | Namespace for the operator Deployment and related resources. |
| skip_codegen | bool | false | Skip make generate && make manifests before applying CRDs. Set to true when you haven't changed API types. |
| enable_nats | bool | true | Deploy NATS subchart. Required for DGD/DGDR workloads (workers use it for communication). |
| enable_etcd | bool | false | Deploy etcd subchart. Only needed when discoveryBackend is etcd. |
| enable_kai_scheduler | bool | false | Deploy kai-scheduler for GPU-aware scheduling in multi-node setups. |
| enable_grove | bool | false | Deploy Grove for PodClique-based multi-node orchestration. |
| image_pull_secret | string | "" | Name of a docker-registry Secret for pulling images from private registries. |
| helm_values | map | {} | Arbitrary --set overrides passed to helm template. |
| operator_version | string | (from Chart.yaml) | Operator --operator-version flag. Defaults to appVersion from the operator subchart. |
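
To make the helm_values row concrete, here is a rough sketch of how a map of overrides could be turned into repeated `--set` flags for helm template. The helper name `helm_set_args` is hypothetical, and the Tiltfile's real translation may differ; the keys used in the usage comment are the ones from the sample settings file.

```shell
#!/bin/sh
# Turn key=value pairs into a single string of repeated --set flags.
# Values containing spaces are not handled by this simple version.
helm_set_args() {
  args=""
  for kv in "$@"; do
    args="$args --set $kv"
  done
  # Strip the leading space left by the first iteration.
  printf '%s\n' "${args# }"
}

# Usage (run manually; <release> is a placeholder for your release name):
#   helm template <release> deploy/helm/charts/platform \
#     $(helm_set_args dynamo-operator.discoveryBackend=kubernetes)
```

Each entry in helm_values corresponds to one `--set key=value` pair, applied after the subchart toggles.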

Registry Configuration

The operator image needs to be pullable by your cluster’s nodes. The registry is resolved in priority order:

  1. REGISTRY env var, for example: REGISTRY=docker.io/myuser tilt up
  2. registry in tilt-settings.yaml

The image is pushed as {registry}/controller:tilt-dev.

If no registry is configured, the image is only available locally. This works with kind using a local registry but will fail on remote clusters.
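
The priority order above can be sketched in shell as follows. This is an illustration only, not the Tiltfile's code: `resolve_registry` and `image_ref` are hypothetical names, the sed-based parsing only understands a plain top-level `registry: value` line, and the bare image name used when no registry is set is an assumption.

```shell
#!/bin/sh
# Resolve the registry: the REGISTRY env var wins, then the `registry:`
# key in the settings file (default: tilt-settings.yaml).
resolve_registry() {
  settings_file="${1:-tilt-settings.yaml}"
  if [ -n "${REGISTRY:-}" ]; then
    printf '%s\n' "$REGISTRY"
  elif [ -f "$settings_file" ]; then
    # Simplified parsing: no support for quotes or trailing comments.
    sed -n 's/^registry:[[:space:]]*//p' "$settings_file" | head -n 1
  fi
}

# Build the image reference in the {registry}/controller:tilt-dev form.
# Falling back to a bare name without a registry is an assumption.
image_ref() {
  registry="$(resolve_registry "$@")"
  if [ -n "$registry" ]; then
    printf '%s/controller:tilt-dev\n' "$registry"
  else
    printf 'controller:tilt-dev\n'
  fi
}
```

With `registry: docker.io/myuser` in the settings file and no env var set, `image_ref` prints docker.io/myuser/controller:tilt-dev; setting REGISTRY overrides the file.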

How It Works

When you run tilt up, the following resources are created in order:

  1. manager-build: compile the Go binary locally.
  2. crds: apply CRDs via server-side apply.
  3. operator: deploy the operator pod (live-updated).

The operator handles webhook certificate generation, CA bundle injection, and MPI SSH key provisioning at runtime — no external setup needed.

What Each Resource Does

manager-build — Runs CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build to compile the operator binary. Re-runs on changes to api/, cmd/, internal/, go.mod, or go.sum.

crds — Applies CRDs from the Helm chart via kubectl apply --server-side. When skip_codegen is false, runs make generate && make manifests first.

operator — The operator Deployment itself. Tilt watches the compiled binary and uses live_update to sync it into the running container and restart the process — no image rebuild needed. On startup, the operator’s built-in cert controller generates a self-signed TLS certificate, injects the CA bundle into webhook configurations, and creates the MPI SSH secret — matching production behavior exactly.
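
For orientation, the first two resources correspond roughly to the manual commands below. This is a dry-run sketch under stated assumptions: the output path `bin/manager`, the package path `./cmd`, and the CRD directory argument are hypothetical, and the helper names are not taken from the Tiltfile. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Echo each command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# manager-build: build flags come from the description above; the output
# path and package path are assumptions.
manager_build() {
  run env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o bin/manager ./cmd
}

# crds: optionally run codegen, then apply CRDs with server-side apply.
# The CRD directory is passed in because its location is an assumption.
crds() {
  crd_dir="$1"
  if [ "${SKIP_CODEGEN:-false}" != "true" ]; then
    run make generate && run make manifests || return 1
  fi
  run kubectl apply --server-side -f "$crd_dir"
}
```

The operator resource itself (the rendered Helm chart plus live_update) is left out, since that part depends on Tilt's own machinery.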

Live Update Cycle

The inner development loop looks like this:

  1. You edit Go source files under deploy/operator/.
  2. Tilt detects the change and recompiles the binary (~2-5 seconds).
  3. The new binary is synced into the running container via live_update.
  4. The process restarts automatically.
  5. Your controller changes are live — test by applying a DGD/DGDR.

No docker build, no docker push, no kubectl rollout restart.
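
To illustrate the "detect a change, then rebuild" part of this loop, here is a small polling sketch. It is a stand-in for illustration, not how Tilt works internally (Tilt watches the filesystem and also syncs the binary into the container, which this sketch does not do). The function names are hypothetical.

```shell
#!/bin/sh
# Fingerprint every Go source file under a directory. Any content change,
# addition, or removal yields a different value.
source_fingerprint() {
  find "$1" -type f -name '*.go' -exec cksum {} + | sort | cksum
}

# Re-run the given build command whenever the fingerprint changes.
# Polls once per second and runs until interrupted.
watch_and_rebuild() {
  dir="$1"; shift
  last=""
  while :; do
    current="$(source_fingerprint "$dir")"
    if [ "$current" != "$last" ]; then
      "$@"
      last="$current"
    fi
    sleep 1
  done
}

# Usage (run manually; the build command is whatever you normally use):
#   watch_and_rebuild internal make build
```

The first pass always triggers a build, because no previous fingerprint exists yet.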

Webhook Certificates

The operator handles webhook TLS certificates automatically at runtime using a built-in cert controller (based on OPA cert-controller). On startup it:

  1. Creates a self-signed CA and webhook serving certificate.
  2. Stores them in the webhook-server-cert Secret.
  3. Injects the CA bundle into ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources.

This matches production behavior and requires no external tooling. For alternative certificate management (cert-manager or external certs), see the webhook documentation and configure via helm_values in tilt-settings.yaml.

Typical Workflows

Iterating on Controller Logic

The most common workflow — you’re modifying reconciliation logic and want fast feedback:

# tilt-settings.yaml
allowed_contexts: [my-cluster]
registry: docker.io/myuser
skip_codegen: true

tilt up
# Edit files under internal/controller/
# Tilt auto-recompiles and live-updates
# Apply test resources:
kubectl apply -f examples/backends/vllm/deploy/agg.yaml

Changing API Types (CRDs)

When you modify files under api/, you need codegen to run:

# tilt-settings.yaml
skip_codegen: false # or omit; false is the default

Tilt will run make generate && make manifests and re-apply CRDs whenever api/ files change.
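
If you want to decide by hand whether a set of edits calls for codegen, the rule above reduces to "did anything under api/ change?". A minimal sketch, with `needs_codegen` as a hypothetical helper name:

```shell
#!/bin/sh
# Succeed (exit 0) if any of the given paths lives under api/, where the
# API types that drive CRD generation are kept. Fail (exit 1) otherwise.
needs_codegen() {
  for path in "$@"; do
    case "$path" in
      api/*) return 0 ;;
    esac
  done
  return 1
}

# Usage (run manually from deploy/operator; --relative makes git print
# paths relative to the current directory):
#   if needs_codegen $(git diff --name-only --relative); then
#     make generate && make manifests
#   fi
```

When this check fails, skip_codegen: true is safe to use for the current round of changes.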

Testing Multi-Node Features

Enable the necessary subcharts:

# tilt-settings.yaml
enable_grove: true
enable_kai_scheduler: true

Using Environment Variables

You can override the registry without editing the settings file:

REGISTRY=ghcr.io/myorg tilt up

Tilt UI

The web UI at http://localhost:10350 shows:

  • Resource status — green/red/pending for each resource
  • Build logs — compilation output and errors
  • Runtime logs — operator logs streamed in real time
  • Port forwards — the health endpoint is forwarded to localhost:8081

Resources are grouped by label (operator and infrastructure) to keep the UI organized.

Cleanup

# Stop Tilt and leave resources deployed
# (Ctrl-C in the terminal)

# Stop Tilt and tear down all resources
tilt down

Troubleshooting

Image Pull Errors

If pods show ImagePullBackOff:

  • Verify registry is set in tilt-settings.yaml or via REGISTRY env var.
  • Ensure your cluster nodes can pull from that registry.
  • For kind with a local registry, follow the kind local registry guide.
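
To find the affected pods quickly, you can filter the pod list by status. The sketch below assumes the default column layout of `kubectl get pods --no-headers` (name, ready, status, restarts, age) and the default dynamo-system namespace; `pods_with_pull_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the names of pods whose STATUS column reports an image pull
# failure. The namespace defaults to dynamo-system.
pods_with_pull_errors() {
  kubectl -n "${1:-dynamo-system}" get pods --no-headers |
    awk '$3 == "ImagePullBackOff" || $3 == "ErrImagePull" { print $1 }'
}

# Usage (run manually):
#   pods_with_pull_errors
#   kubectl -n dynamo-system describe pod <pod-name>
```

The Events section of `kubectl describe pod` usually names the exact image reference and the registry error, which tells you whether the registry setting or the node's pull access is at fault.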

Webhook TLS Errors

If applying a DGD/DGDR fails with x509: certificate signed by unknown authority:

  • Check the operator logs in the Tilt UI — the cert controller logs its progress on startup.
  • Verify the webhook-server-cert Secret exists and has been populated:
    kubectl -n dynamo-system get secret webhook-server-cert
  • The operator may need a few seconds after startup to generate certs and inject the CA bundle. Wait for the cert-controller log messages before applying resources.
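
Rather than watching the logs, you could poll the Secret until it is populated. This is a sketch under assumptions: the `tls.crt` key name follows the usual layout of TLS Secrets and is not confirmed by this document, and `wait_for_webhook_cert` is a hypothetical helper name. It also does not verify that the CA bundle has been injected into the webhook configurations.

```shell
#!/bin/sh
# Poll until the webhook-server-cert Secret has a non-empty tls.crt entry.
# Arguments: namespace (default dynamo-system), attempts (default 30).
# Waits one second between attempts; exits 1 if the cert never appears.
wait_for_webhook_cert() {
  namespace="${1:-dynamo-system}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    cert="$(kubectl -n "$namespace" get secret webhook-server-cert \
      -o jsonpath='{.data.tls\.crt}' 2>/dev/null || true)"
    if [ -n "$cert" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage (run manually):
#   wait_for_webhook_cert && kubectl apply -f examples/backends/vllm/deploy/agg.yaml
```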

CRD Codegen Failures

If crds fails with codegen errors:

  • Ensure controller-gen is installed: make controller-gen
  • Try running codegen manually: make generate && make manifests
  • Set skip_codegen: true temporarily to bypass if you haven’t changed API types.

Context Safety Guard

If Tilt refuses to start with a context error, add your cluster context to allowed_contexts in tilt-settings.yaml:

allowed_contexts:
  - my-cluster-context
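
To see which value belongs in that list, compare your current context against the allowed ones. The check below is a sketch of the idea behind the guard, not the Tiltfile's actual implementation; `context_allowed` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed only if the first argument (the current context) appears among
# the remaining arguments (the allowed contexts).
context_allowed() {
  current="$1"; shift
  for allowed in "$@"; do
    if [ "$current" = "$allowed" ]; then
      return 0
    fi
  done
  return 1
}

# Usage (run manually):
#   context_allowed "$(kubectl config current-context)" my-cluster-context \
#     || echo "current context is not in allowed_contexts"
```

`kubectl config current-context` prints the exact name to add to allowed_contexts.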