---
title: Deploying Your First Model
---

# Deploying Your First Model

End-to-end tutorial for deploying `Qwen/Qwen3-0.6B` on Kubernetes using Dynamo's recommended `DynamoGraphDeploymentRequest` (DGDR) workflow, from zero to your first inference response.

This guide assumes you have already completed the [platform installation](/dynamo/dev/kubernetes-deployment/deployment-guide/detailed-installation-guide) and that the Dynamo operator and CRDs are running in your cluster.

## What is a DynamoGraphDeploymentRequest?

A `DynamoGraphDeploymentRequest` (DGDR) is Dynamo's **deploy-by-intent** API. You describe what you want to run and your performance targets; Dynamo's profiler determines the optimal configuration automatically, then creates the live deployment for you.

| | DGDR (this guide) | DGD (manual) |
|---|---|---|
| **You provide** | Model + optional SLA targets | Full deployment spec |
| **Profiling** | Automated | You bring your own config |
| **Best for** | Getting started, SLA-driven deployments | Fine-grained control |

For a deeper comparison, see [Understanding Dynamo's Custom Resources](/dynamo/dev/kubernetes-deployment/deployment-guide#understanding-dynamos-custom-resources).

## Prerequisites

Before starting, confirm:

- Platform installed: `kubectl get pods -n ${NAMESPACE}` shows operator pods `Running`
- CRDs present: `kubectl get crd | grep dynamo` shows `dynamographdeploymentrequests.nvidia.com`
- `kubectl` and `helm` available in your shell

Set these variables once; they are referenced throughout the guide:

```bash
export NAMESPACE=dynamo-system   # namespace where the platform is installed
export RELEASE_VERSION=1.x.x     # match the installed platform version (e.g. 1.0.0)
export HF_TOKEN=                 # HuggingFace token
```

`Qwen/Qwen3-0.6B` is a public model. A HuggingFace token is not strictly required to download it, but is recommended to avoid rate limiting.
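The checks above can be bundled into a small preflight script. The following is a minimal sketch, not part of Dynamo: the helper names `missing_vars` and `missing_tools` are hypothetical. It uses `printenv` so that a variable counts as present only if it is exported, which matters later because `envsubst` reads values from the environment.

```bash
# Preflight sketch. missing_vars and missing_tools are illustrative helper
# names, not Dynamo commands.

# Print the name of every variable that is not exported or is empty.
missing_vars() {
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
    fi
  done
}

# Print the name of every command that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example usage (HF_TOKEN is left out because it is optional for public models):
#   missing_vars NAMESPACE RELEASE_VERSION
#   missing_tools kubectl helm
```

Both helpers print nothing when everything is in place, so an empty result means the shell is ready for Step 1.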
## Step 1: Configure Namespace and Secrets

```bash
# Create the namespace (idempotent: safe to run even if it already exists)
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -

# Create the HuggingFace token secret for model download
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n ${NAMESPACE}
```

Verify the secret was created:

```bash
kubectl get secret hf-token-secret -n ${NAMESPACE}
```

## Step 2: Create the DynamoGraphDeploymentRequest

Save the following as `qwen3-first-model.yaml`:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-first-model
spec:
  # Model to profile and deploy
  model: Qwen/Qwen3-0.6B

  # Container image for the profiling job; must match your installed platform version.
  # This is the same dynamo-frontend image used by the deployed inference service.
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:${RELEASE_VERSION}"
```

Apply it (uses `envsubst` to substitute the `RELEASE_VERSION` shell variable into the YAML):

```bash
envsubst < qwen3-first-model.yaml | kubectl apply -f - -n ${NAMESPACE}
```

### Field reference

| Field | Required | Default | Purpose |
|---|---|---|---|
| `model` | Yes | none | HuggingFace model ID (e.g. `Qwen/Qwen3-0.6B`) |
| `image` | No | none | Container image for the profiling job (`dynamo-frontend`) |
| `backend` | No | `auto` | Inference engine (`auto`, `vllm`, `sglang`, `trtllm`) |
| `searchStrategy` | No | `rapid` | Profiling depth: `rapid` (~30s, AIC simulation) or `thorough` (2–4h, real GPUs) |
| `autoApply` | No | `true` | Automatically create and start the deployment after profiling |
| `sla` | No | none | Target latency (TTFT, ITL in ms) for profiler optimization |
| `workload` | No | none | Expected traffic shape (ISL, OSL, request rate) |
| `hardware` | No | auto-detected | GPU SKU and count override; required when GPU discovery is disabled. When not set, the auto-discovered GPU count is capped at 32; set `hardware.totalGpus` explicitly to use more. |

For the full spec reference, see the [DGDR API Reference](/dynamo/dev/additional-resources/api-reference-k-8-s) and [Profiler Guide](/dynamo/dev/components/profiler/profiler-guide).

If you are using a **namespace-scoped operator** with GPU discovery disabled, you must also provide explicit hardware info or the DGDR will be rejected at admission:

```yaml
spec:
  ...
  hardware:
    numGpusPerNode: 1
    gpuSku: "H100-SXM5-80GB"
    vramMb: 81920
```

See the [installation guide](/dynamo/dev/kubernetes-deployment/deployment-guide/detailed-installation-guide#gpu-discovery-for-dynamographdeploymentrequests-with-namespace-scoped-operators) for details.

## Step 3: Monitor Profiling Progress

Profiling is the automated step where Dynamo sweeps across candidate configurations (parallelism, batching, scheduling strategies) to find the one that best meets your SLA and hardware, so you don't have to tune it manually.

Watch the DGDR status in real time:

```bash
kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} -w
```

The `PHASE` column progresses through:

| Phase | What is happening |
|---|---|
| `Pending` (condition: `DiscoveringHardware`) | Spec validated; operator is discovering GPU hardware and preparing the profiling job |
| `Profiling` | Profiling job is running (AIC simulation or real-GPU sweep) |
| `Ready` | Profiling complete; optimal config stored in `.status`. Terminal state when `autoApply: false` |
| `Deploying` | Creating the `DynamoGraphDeployment` (only when `autoApply: true`) |
| `Deployed` | DGD is running and healthy |
| `Failed` | Unrecoverable error; check events for details |

`Deployed` is the success terminal state when `autoApply: true` (the default). If you set `autoApply: false`, the phase stops at `Ready`: profiling is complete and the generated DGD spec is stored in `.status`, but no deployment is created automatically.
To inspect and deploy it manually:

```bash
# View the generated DGD spec
kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} \
  -o jsonpath='{.status.profilingResults.selectedConfig}' | python3 -m json.tool

# Save it and apply
kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} \
  -o jsonpath='{.status.profilingResults.selectedConfig}' > generated-dgd.yaml
kubectl apply -f generated-dgd.yaml -n ${NAMESPACE}
```

For a full status summary and events:

```bash
kubectl describe dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
```

To follow the profiling job logs:

```bash
# Find the profiling pod
kubectl get pods -n ${NAMESPACE} -l nvidia.com/dgdr-name=qwen3-first-model

# Stream its logs
kubectl logs -f -n ${NAMESPACE}
```

With `searchStrategy: rapid`, profiling typically completes in under 15 minutes on a single GPU.

## Step 4: Verify the Deployment

Once the DGDR reaches `Deployed`, the `DynamoGraphDeployment` has been created automatically.
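For scripted runs, the wait for `Deployed` can be automated with a polling loop. The sketch below is one possible approach, not part of Dynamo: `is_terminal_phase` and `wait_for_dgdr` are hypothetical helper names, and reading the phase from `.status.phase` is an assumption based on the `PHASE` column shown in Step 3.

```bash
# Return 0 if the phase is terminal according to the phase table in Step 3.
# "Ready" is terminal only when autoApply is false, so it is opt-in ($2=true).
is_terminal_phase() {
  case "$1" in
    Deployed|Failed) return 0 ;;
    Ready) [ "${2:-false}" = "true" ] ;;
    *) return 1 ;;
  esac
}

# Poll the DGDR until it reaches a terminal phase or the timeout expires.
# Usage: wait_for_dgdr NAME NAMESPACE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 only if the final phase is Deployed.
wait_for_dgdr() {
  dgdr_name="$1"; dgdr_ns="$2"; timeout_s="${3:-1800}"; interval_s="${4:-10}"
  elapsed=0
  while [ "$elapsed" -le "$timeout_s" ]; do
    # ASSUMPTION: the PHASE column is backed by the .status.phase field.
    # Lookup errors are treated as "no phase yet" and polling continues.
    current="$(kubectl get dynamographdeploymentrequest "$dgdr_name" \
      -n "$dgdr_ns" -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    echo "phase: ${current:-unknown}"
    if is_terminal_phase "$current"; then
      if [ "$current" = "Deployed" ]; then return 0; else return 1; fi
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "timed out after ${timeout_s}s" >&2
  return 1
}

# Example usage:
#   wait_for_dgdr qwen3-first-model "${NAMESPACE}" 1800 10
```

The default timeout of 1800 seconds is an arbitrary choice for this sketch; a `thorough` search would need a much larger value.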
Check that everything is running:

```bash
# See the auto-created DGD
kubectl get dynamographdeployment -n ${NAMESPACE}

# Confirm all pods are Running
kubectl get pods -n ${NAMESPACE}
```

Wait until pods are ready:

```bash
kubectl wait --for=condition=ready pod \
  -l nvidia.com/dynamo-deployment=qwen3-first-model \
  -n ${NAMESPACE} \
  --timeout=600s
```

Find the frontend service name:

```bash
kubectl get svc -n ${NAMESPACE} | grep frontend
```

## Step 5: Send Your First Request

Port-forward to the frontend and send an inference request:

```bash
# Start port-forward (replace with the name from Step 4)
kubectl port-forward svc/ 8000:8000 -n ${NAMESPACE} &

# Confirm the model is available
curl http://localhost:8000/v1/models

# Send a chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
    "max_tokens": 200
  }'
```

A successful response looks like:

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "Qwen/Qwen3-0.6B",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "NVIDIA Dynamo is a high-performance inference framework..."
    }
  }]
}
```

Your first model is now live.

## Cleanup

To remove the DGDR and its profiling artifacts:

```bash
kubectl delete dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
```

Deleting a DGDR does **not** delete the `DynamoGraphDeployment` it created. The DGD persists independently so it can continue serving traffic. To tear down the running deployment as well, delete that `DynamoGraphDeployment` separately.

## Troubleshooting

**DGDR stuck in `Pending`**

```bash
kubectl describe dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
# Check the Events section at the bottom
```

Common causes: no available GPU nodes, image pull failure (check image tag; NGC credentials are optional but may be needed if you hit rate limits pulling from public NGC), missing `hardware` config for a namespace-scoped operator.
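When `kubectl describe` output is too long to scan, narrower queries can help separate these causes. The helpers below are a sketch under stated assumptions: the function names are hypothetical, the `kubectl` flags are standard ones, and the GPU column assumes your nodes advertise GPUs under the `nvidia.com/gpu` resource name.

```bash
# Hypothetical helpers for narrowing down a Pending DGDR. Only the kubectl
# flags are standard; the function names are made up for this sketch.

# Events that reference the named object, sorted by last timestamp.
# Usage: dgdr_events NAME NAMESPACE
dgdr_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Pods in the namespace that are still in the Pending phase.
# Usage: pending_pods NAMESPACE
pending_pods() {
  kubectl get pods -n "$1" --field-selector status.phase=Pending
}

# Allocatable GPU count per node.
# ASSUMPTION: GPUs are exposed as the nvidia.com/gpu resource.
gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Example usage:
#   dgdr_events qwen3-first-model "${NAMESPACE}"
#   pending_pods "${NAMESPACE}"
#   gpu_nodes
```

If `gpu_nodes` shows no node with a GPU count, the scheduler has nowhere to place the profiling job, which points to the first cause in the list.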
**GPU node taints** are a frequent cause of pods staying `Pending`. Many clusters (including GKE by default and most shared/HPC environments) taint GPU nodes with `nvidia.com/gpu:NoSchedule` so that only GPU-aware workloads land on them. If the profiling job pod is stuck with a `0/N nodes are available: … node(s) had untolerated taint` event, add a toleration to your DGDR via `overrides.profilingJob`. The operator and profiler automatically forward it to every candidate and deployed pod:

```yaml
spec:
  ...
  overrides:
    profilingJob:
      template:
        spec:
          containers: []   # required placeholder; leave empty to inherit defaults
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
```

**Profiling job fails**

```bash
kubectl get pods -n ${NAMESPACE} -l nvidia.com/dgdr-name=qwen3-first-model
kubectl logs -n ${NAMESPACE}
# If the pod has already exited:
kubectl logs -n ${NAMESPACE} --previous
```

**Pods not starting after profiling**

```bash
kubectl describe pod -n ${NAMESPACE}
# Look for ImagePullBackOff, OOMKilled, or Insufficient resources
```

**Model not responding after port-forward**

```bash
# Check frontend is ready
kubectl get pods -n ${NAMESPACE} | grep frontend

# Check frontend logs
kubectl logs -n ${NAMESPACE}
```

## Next Steps

- **Tune for production SLAs**: Add `sla` (TTFT, ITL) and `workload` (ISL, OSL) targets to your DGDR so the profiler optimizes for your specific traffic. See the [Profiler Guide](/dynamo/dev/components/profiler/profiler-guide) for the full configuration reference and picking modes. For ready-to-use YAML (including SLA targets, private models, MoE, and overrides), see [DGDR Examples](/dynamo/dev/components/profiler/profiler-examples).
- **Scale the deployment**: [Autoscaling guide](/dynamo/dev/kubernetes-deployment/deployment-guide/autoscaling)
- **SLA-aware autoscaling**: Enable the Planner via `features.planner` in the DGDR; see the [Planner Guide](/dynamo/dev/components/planner/planner-guide).
- **Inspect the generated config**: Set `autoApply: false` and extract the DGD spec with `kubectl get dgdr -o jsonpath='{.status.profilingResults.selectedConfig}'` before deploying.
- **Direct control**: [Creating Deployments](/dynamo/dev/additional-resources/creating-deployments): write your own `DynamoGraphDeployment` spec for full customization.
- **Monitor performance**: [Observability](/dynamo/dev/kubernetes-deployment/observability-k-8-s/metrics)
- **Try specific backends**: [vLLM](/dynamo/dev/backends/v-llm), [SGLang](/dynamo/dev/backends/sg-lang), [TensorRT-LLM](/dynamo/dev/backends/tensor-rt-llm)