Deploying Your First Model



End-to-end tutorial for deploying Qwen/Qwen3-0.6B on Kubernetes using Dynamo’s recommended DynamoGraphDeploymentRequest (DGDR) workflow — from zero to your first inference response.

This guide assumes you have already completed the platform installation and that the Dynamo operator and CRDs are running in your cluster.

What is a DynamoGraphDeploymentRequest?

A DynamoGraphDeploymentRequest (DGDR) is Dynamo’s deploy-by-intent API. You describe what you want to run and your performance targets; Dynamo’s profiler determines the optimal configuration automatically, then creates the live deployment for you.

|             | DGDR (this guide)                        | DGD (manual)              |
|-------------|------------------------------------------|---------------------------|
| You provide | Model + optional SLA targets             | Full deployment spec      |
| Profiling   | Automated                                | You bring your own config |
| Best for    | Getting started, SLA-driven deployments  | Fine-grained control      |

For a deeper comparison, see Understanding Dynamo’s Custom Resources.

Prerequisites

Before starting, confirm:

  • Platform installed: kubectl get pods -n ${NAMESPACE} shows operator pods Running
  • CRDs present: kubectl get crd | grep dynamo shows dynamographdeploymentrequests.nvidia.com
  • kubectl and helm available in your shell

Set these variables once — they are referenced throughout the guide:

$export NAMESPACE=dynamo-system # namespace where the platform is installed
$export RELEASE_VERSION=1.x.x # match the installed platform version (e.g. 1.0.0)
$export HF_TOKEN=<your-hf-token> # HuggingFace token

Qwen/Qwen3-0.6B is a public model. A HuggingFace token is not strictly required to download it, but is recommended to avoid rate limiting.

Step 1: Configure Namespace and Secrets

$# Create the namespace (idempotent — safe to run even if it already exists)
$kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
$
$# Create the HuggingFace token secret for model download
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="${HF_TOKEN}" \
> -n ${NAMESPACE}

Verify the secret was created:

$kubectl get secret hf-token-secret -n ${NAMESPACE}

Step 2: Create the DynamoGraphDeploymentRequest

Save the following as qwen3-first-model.yaml:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-first-model
spec:
  # Model to profile and deploy
  model: Qwen/Qwen3-0.6B

  # Container image for the profiling job; must match your installed platform version.
  # This is the same dynamo-frontend image used by the deployed inference service.
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:${RELEASE_VERSION}"

Apply it (uses envsubst to substitute the RELEASE_VERSION shell variable into the YAML):

$envsubst < qwen3-first-model.yaml | kubectl apply -f - -n ${NAMESPACE}

Field reference

| Field            | Required | Default       | Purpose |
|------------------|----------|---------------|---------|
| `model`          | Yes      |               | HuggingFace model ID (e.g. Qwen/Qwen3-0.6B) |
| `image`          | No       |               | Container image for the profiling job (dynamo-frontend) |
| `backend`        | No       | `auto`        | Inference engine (`auto`, `vllm`, `sglang`, `trtllm`) |
| `searchStrategy` | No       | `rapid`       | Profiling depth: `rapid` (~30s, AIC simulation) or `thorough` (2–4h, real GPUs) |
| `autoApply`      | No       | `true`        | Automatically create and start the deployment after profiling |
| `sla`            | No       |               | Target latency (TTFT, ITL in ms) for profiler optimization |
| `workload`       | No       |               | Expected traffic shape (ISL, OSL, request rate) |
| `hardware`       | No       | auto-detected | GPU SKU and count override; required when GPU discovery is disabled. When not set, the auto-discovered GPU count is capped at 32; set `hardware.totalGpus` explicitly to use more. |

For the full spec reference, see the DGDR API Reference and Profiler Guide.
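
As an example of SLA-driven profiling, a DGDR that adds latency targets and an expected traffic shape might look like the sketch below. The nested key names (`ttft`, `itl`, `isl`, `osl`, `requestRate`) are illustrative assumptions based on the field descriptions above; confirm the exact spelling in the DGDR API Reference before using this in a cluster:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-sla-tuned
spec:
  model: Qwen/Qwen3-0.6B
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:${RELEASE_VERSION}"
  backend: vllm              # pin an engine instead of auto
  searchStrategy: thorough   # real-GPU sweep (2-4h) for production tuning
  sla:
    ttft: 200                # target time-to-first-token, ms (illustrative key name)
    itl: 20                  # target inter-token latency, ms (illustrative key name)
  workload:
    isl: 1024                # expected input sequence length (illustrative key name)
    osl: 256                 # expected output sequence length (illustrative key name)
    requestRate: 10          # expected requests/s (illustrative key name)
```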

If you are using a namespace-scoped operator with GPU discovery disabled, you must also provide explicit hardware info or the DGDR will be rejected at admission:

spec:
  ...
  hardware:
    numGpusPerNode: 1
    gpuSku: "H100-SXM5-80GB"
    vramMb: 81920

See the installation guide for details.

Step 3: Monitor Profiling Progress

Profiling is the automated step where Dynamo sweeps across candidate configurations (parallelism, batching, scheduling strategies) to find the one that best meets your SLA and hardware — so you don’t have to tune it manually.

Watch the DGDR status in real time:

$kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} -w

The PHASE column progresses through:

| Phase | What is happening |
|-------|-------------------|
| `Pending` (condition: `DiscoveringHardware`) | Spec validated; operator is discovering GPU hardware and preparing the profiling job |
| `Profiling` | Profiling job is running (AIC simulation or real-GPU sweep) |
| `Ready` | Profiling complete; optimal config stored in `.status`. Terminal state when `autoApply: false` |
| `Deploying` | Creating the DynamoGraphDeployment (only when `autoApply: true`) |
| `Deployed` | DGD is running and healthy |
| `Failed` | Unrecoverable error; check events for details |

Deployed is the success terminal state when autoApply: true (the default). If you set autoApply: false, the phase stops at Ready — profiling is complete and the generated DGD spec is stored in .status, but no deployment is created automatically. To inspect and deploy it manually:

$# View the generated DGD spec
$kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} \
> -o jsonpath='{.status.profilingResults.selectedConfig}' | python3 -m json.tool
$
$# Save it and apply
$kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} \
> -o jsonpath='{.status.profilingResults.selectedConfig}' > generated-dgd.yaml
$kubectl apply -f generated-dgd.yaml -n ${NAMESPACE}

For a full status summary and events:

$kubectl describe dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}

To follow the profiling job logs:

$# Find the profiling pod
$kubectl get pods -n ${NAMESPACE} -l nvidia.com/dgdr-name=qwen3-first-model
$
$# Stream its logs
$kubectl logs -f <profiling-pod-name> -n ${NAMESPACE}

With searchStrategy: rapid, profiling typically completes in under 15 minutes on a single GPU.
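
If you prefer to script the wait instead of watching interactively, a small polling helper can shell out to kubectl until the DGDR reaches a terminal phase. This is a sketch assuming kubectl is on your PATH and the phase names from the table above; `wait_for_dgdr` is a hypothetical helper, not part of Dynamo:

```python
import subprocess
import time


def success_phase(auto_apply: bool) -> str:
    """Phase that counts as success: Deployed when autoApply is true, Ready otherwise."""
    return "Deployed" if auto_apply else "Ready"


def get_phase(name: str, namespace: str) -> str:
    """Read .status.phase of a DGDR via kubectl."""
    return subprocess.check_output(
        ["kubectl", "get", "dynamographdeploymentrequest", name,
         "-n", namespace, "-o", "jsonpath={.status.phase}"],
        text=True,
    ).strip()


def wait_for_dgdr(name: str, namespace: str, auto_apply: bool = True,
                  timeout_s: int = 1800, poll_s: int = 15) -> str:
    """Poll until the DGDR succeeds, fails, or the timeout expires."""
    target = success_phase(auto_apply)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        phase = get_phase(name, namespace)
        print(f"phase={phase or 'unknown'}")
        if phase in (target, "Failed"):
            return phase
        time.sleep(poll_s)
    raise TimeoutError(f"{name} did not reach {target} within {timeout_s}s")
```

For this guide you would call `wait_for_dgdr("qwen3-first-model", "dynamo-system")` and check that the returned phase is `Deployed`.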

Step 4: Verify the Deployment

Once the DGDR reaches Deployed, the DynamoGraphDeployment has been created automatically. Check that everything is running:

$# See the auto-created DGD
$kubectl get dynamographdeployment -n ${NAMESPACE}
$
$# Confirm all pods are Running
$kubectl get pods -n ${NAMESPACE}

Wait until pods are ready:

$kubectl wait --for=condition=ready pod \
> -l nvidia.com/dynamo-deployment=qwen3-first-model \
> -n ${NAMESPACE} \
> --timeout=600s

Find the frontend service name:

$kubectl get svc -n ${NAMESPACE} | grep frontend

Step 5: Send Your First Request

Port-forward to the frontend and send an inference request:

$# Start port-forward (replace <frontend-service-name> with the name from Step 4)
$kubectl port-forward svc/<frontend-service-name> 8000:8000 -n ${NAMESPACE} &
$
$# Confirm the model is available
$curl http://localhost:8000/v1/models
$
$# Send a chat completion request
$curl http://localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{
> "model": "Qwen/Qwen3-0.6B",
> "messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
> "max_tokens": 200
> }'

A successful response looks like:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "Qwen/Qwen3-0.6B",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "NVIDIA Dynamo is a high-performance inference framework..."
    }
  }]
}
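
The same request can be scripted from Python against the port-forwarded frontend using only the standard library. This is a sketch assuming the port-forward from above is still running on localhost:8000; `build_chat_payload` and `extract_reply` are illustrative helpers, not part of Dynamo:

```python
import json
import urllib.request

# Assumed endpoint from the port-forward in Step 5; adjust if you mapped a different port.
BASE_URL = "http://localhost:8000"


def build_chat_payload(model: str, prompt: str, max_tokens: int = 200) -> dict:
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    return response_json["choices"][0]["message"]["content"]


def chat(prompt: str, model: str = "Qwen/Qwen3-0.6B") -> str:
    """Send one chat request through the port-forwarded frontend."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

With the port-forward active, `chat("What is NVIDIA Dynamo?")` should return the same kind of answer as the curl request above.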

Your first model is now live.

Cleanup

To remove the deployment and profiling artifacts:

$kubectl delete dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}

Deleting a DGDR does not delete the DynamoGraphDeployment it created. The DGD persists independently so it can continue serving traffic.

Troubleshooting

DGDR stuck in Pending

$kubectl describe dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
$# Check the Events section at the bottom

Common causes: no available GPU nodes, image pull failure (check image tag; NGC credentials are optional but may be needed if you hit rate limits pulling from public NGC), missing hardware config for a namespace-scoped operator.

GPU node taints are a frequent cause of pods staying Pending. Many clusters (including GKE by default and most shared/HPC environments) taint GPU nodes with nvidia.com/gpu:NoSchedule so that only GPU-aware workloads land on them. If the profiling job pod is stuck with a 0/N nodes are available: … node(s) had untolerated taint event, add a toleration to your DGDR via overrides.profilingJob. The operator and profiler then propagate the toleration automatically to every candidate and deployed pod:

spec:
  ...
  overrides:
    profilingJob:
      template:
        spec:
          containers: [] # required placeholder; leave empty to inherit defaults
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule

Profiling job fails

$kubectl get pods -n ${NAMESPACE} -l nvidia.com/dgdr-name=qwen3-first-model
$kubectl logs <profiling-pod-name> -n ${NAMESPACE}
$# If the pod has already exited:
$kubectl logs <profiling-pod-name> -n ${NAMESPACE} --previous

Pods not starting after profiling

$kubectl describe pod <pod-name> -n ${NAMESPACE}
$# Look for ImagePullBackOff, OOMKilled, or Insufficient resources

Model not responding after port-forward

$# Check frontend is ready
$kubectl get pods -n ${NAMESPACE} | grep frontend
$
$# Check frontend logs
$kubectl logs <frontend-pod-name> -n ${NAMESPACE}

Next Steps

  • Tune for production SLAs: Add sla (TTFT, ITL) and workload (ISL, OSL) targets to your DGDR so the profiler optimizes for your specific traffic. See the Profiler Guide for the full configuration reference and picking modes. For ready-to-use YAML — including SLA targets, private models, MoE, and overrides — see DGDR Examples.
  • Scale the deployment: Autoscaling guide
  • SLA-aware autoscaling: Enable the Planner via features.planner in the DGDR — see the Planner Guide.
  • Inspect the generated config: Set autoApply: false and extract the DGD spec with kubectl get dgdr <name> -o jsonpath='{.status.profilingResults.selectedConfig}' before deploying.
  • Direct control: Creating Deployments — write your own DynamoGraphDeployment spec for full customization.
  • Monitor performance: Observability
  • Try specific backends: vLLM, SGLang, TensorRT-LLM