# Qwen3-235B-A22B FP8

Each target below is a validated TensorRT-LLM deployment of Qwen3-235B-A22B-FP8, a 235B-parameter Mixture-of-Experts model with \~22B active parameters per token, on 16 GPUs with KV-aware routing, benchmarked at 4K ISL / 200 OSL (input / output sequence length in tokens). Hopper and Blackwell need different MoE backends, and each architecture can serve either aggregated or with prefill/decode disaggregation, which gives four targets. Pick your GPU architecture and serving topology, then use the commands for that target in each section.

Choose your deployment target:

* GPU: Blackwell (B100/B200, recommended) or Hopper (H100/H200)
* Topology: aggregated or disaggregated

<b>Target</b> Blackwell, aggregated

<b>Checkpoint</b> Qwen/Qwen3-235B-A22B-FP8

<b>GPUs</b> 16x B100/B200, 4 workers

<b>Parallelism</b> TP4 x EP4 per worker

<b>MoE backend</b> DEEPGEMM (required on SM100+)

<b>Routing</b> KV-aware

<b>Workload</b> 4K ISL / 200 OSL, concurrency 32

<b>Target</b> Blackwell, disaggregated

<b>Checkpoint</b> Qwen/Qwen3-235B-A22B-FP8

<b>GPUs</b> 16x B100/B200

<b>Parallelism</b> 6x prefill (TP2) + 1x decode (TP4 x EP4)

<b>MoE backend</b> DEEPGEMM (required on SM100+)

<b>Routing</b> KV-aware

<b>Workload</b> 4K ISL / 200 OSL, concurrency 32

<b>Target</b> Hopper, aggregated

<b>Checkpoint</b> Qwen/Qwen3-235B-A22B-FP8

<b>GPUs</b> 16x H100/H200, 4 workers

<b>Parallelism</b> TP4 x EP4 per worker

<b>MoE backend</b> CUTLASS (TRT-LLM default)

<b>Routing</b> KV-aware

<b>Workload</b> 4K ISL / 200 OSL, concurrency 32

<b>Target</b> Hopper, disaggregated

<b>Checkpoint</b> Qwen/Qwen3-235B-A22B-FP8

<b>GPUs</b> 16x H100/H200

<b>Parallelism</b> 6x prefill (TP2) + 1x decode (TP4 x EP4)

<b>MoE backend</b> CUTLASS (TRT-LLM default)

<b>Routing</b> KV-aware

<b>Workload</b> 4K ISL / 200 OSL, concurrency 32

## Prerequisites

* A Kubernetes cluster with the Dynamo platform installed and 16 GPUs available (\~1.3 TB total GPU VRAM): **16x B100/B200** for the Blackwell targets or **16x H100/H200** for the Hopper targets.
* A Hugging Face token with access to `Qwen/Qwen3-235B-A22B-FP8`.

Create the namespace and token secret:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}
```

Edit namespace, storage class, image tags, node selectors, resource claims, and cluster-specific placement in the manifests before applying them.

## Deploy

Prepare the model cache and download the checkpoint (30-60 minutes):

```bash
# Edit storageClassName in model-cache/model-cache.yaml to match your cluster first
# (kubectl get storageclass).
kubectl apply -f recipes/qwen3-235b-a22b-fp8/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
```

Then deploy the manifest for your target.

Blackwell, aggregated:

```bash
kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/agg/blackwell/deploy.yaml -n ${NAMESPACE}
```

Blackwell, disaggregated:

```bash
kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/disagg/blackwell/deploy.yaml -n ${NAMESPACE}
```

Hopper, aggregated:

```bash
kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/agg/hopper/deploy.yaml -n ${NAMESPACE}
```

Hopper, disaggregated:

```bash
kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/disagg/hopper/deploy.yaml -n ${NAMESPACE}
```
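The four manifest paths above differ only in topology and architecture, so a small shell helper can build them. This is a minimal sketch: `recipe_manifest` is a hypothetical function name, not something the recipe ships, and it assumes you run it from the repository root.

```bash
# Hypothetical helper (not part of the recipe): print the manifest path
# for a target. Usage: recipe_manifest <agg|disagg> <blackwell|hopper> [deploy|perf]
recipe_manifest() {
  local topology="$1" arch="$2" kind="${3:-deploy}"

  # Reject anything outside the four validated targets.
  case "$topology" in
    agg|disagg) ;;
    *) echo "unknown topology: $topology" >&2; return 1 ;;
  esac
  case "$arch" in
    blackwell|hopper) ;;
    *) echo "unknown architecture: $arch" >&2; return 1 ;;
  esac
  case "$kind" in
    deploy|perf) ;;
    *) echo "unknown manifest kind: $kind" >&2; return 1 ;;
  esac

  printf 'recipes/qwen3-235b-a22b-fp8/trtllm/%s/%s/%s.yaml\n' \
    "$topology" "$arch" "$kind"
}

# Example: print the deploy manifest for the disaggregated Hopper target.
recipe_manifest disagg hopper
```

With the helper defined, a deployment could then look like `kubectl apply -f "$(recipe_manifest agg blackwell)" -n "${NAMESPACE}"`.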

## Smoke Test

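Forward the frontend service for your topology to a local port. `kubectl port-forward` runs in the foreground, so leave it running and send the request from a second terminal.

Aggregated:

```bash
kubectl port-forward svc/qwen3-235b-a22b-agg-frontend 8000:8000 -n ${NAMESPACE}
```

Disaggregated:

```bash
kubectl port-forward svc/qwen3-235b-a22b-disagg-frontend 8000:8000 -n ${NAMESPACE}
```

Then send a test request to verify that the deployment serves traffic: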

```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen3-235B-A22B-FP8","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
```
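If the request fails because the workers are still loading the checkpoint, retrying is a reasonable next step. The sketch below wraps the same request in a retry loop; `wait_until` and `smoke_probe` are hypothetical helper names, and the attempt count and delay in the usage note are assumptions rather than values from the recipe.

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Probe: succeeds when the chat endpoint answers with HTTP 200.
# Assumes the port-forward to localhost:8000 is already running.
smoke_probe() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"Qwen/Qwen3-235B-A22B-FP8","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}')
  [ "$code" = "200" ]
}
```

For example, `wait_until 30 10 smoke_probe` would try for up to roughly five minutes before giving up.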

## Benchmark

Every target ships a `perf.yaml` AIPerf Job sized at 4K ISL / 200 OSL and concurrency 32 (2 per GPU x 16 GPUs) with `--request-count 320`, so aggregated vs disaggregated results are comparable within the same hardware architecture. Artifacts land on the `model-cache` PVC under `/model-cache/perf`.

The aggregated Job wraps this AIPerf run:

```bash
aiperf profile \
  --model Qwen/Qwen3-235B-A22B-FP8 --tokenizer Qwen/Qwen3-235B-A22B-FP8 \
  --endpoint-type chat --endpoint /v1/chat/completions \
  --url http://qwen3-235b-a22b-agg-frontend:8000 --streaming \
  --synthetic-input-tokens-mean 4000 --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 200 --output-tokens-stddev 0 \
  --extra-inputs max_tokens:200 --extra-inputs min_tokens:200 --extra-inputs ignore_eos:true \
  --concurrency 32 --request-count 320 --warmup-request-count 32 \
  --random-seed 100
```

The disaggregated Job wraps this AIPerf run:

```bash
aiperf profile \
  --model Qwen/Qwen3-235B-A22B-FP8 --tokenizer Qwen/Qwen3-235B-A22B-FP8 \
  --endpoint-type chat --endpoint /v1/chat/completions \
  --url http://qwen3-235b-a22b-disagg-frontend:8000 --streaming \
  --synthetic-input-tokens-mean 4000 --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 200 --output-tokens-stddev 0 \
  --extra-inputs max_tokens:200 --extra-inputs min_tokens:200 --extra-inputs ignore_eos:true \
  --concurrency 32 --request-count 320 --warmup-request-count 32 \
  --random-seed 100
```
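The flag values in both runs follow from simple arithmetic on the workload shape. As a quick sanity check (a sketch only; the variable names are illustrative and are not AIPerf options):

```bash
# Illustrative arithmetic; values are taken from the AIPerf flags above.
gpus=16
per_gpu_concurrency=2
request_count=320   # --request-count
isl=4000            # --synthetic-input-tokens-mean
osl=200             # --output-tokens-mean

concurrency=$((gpus * per_gpu_concurrency))         # matches --concurrency
requests_per_slot=$((request_count / concurrency))  # requests per concurrent slot
input_tokens=$((request_count * isl))               # total prompt tokens requested
output_tokens=$((request_count * osl))              # total generated tokens requested

echo "concurrency=${concurrency}"
echo "requests_per_slot=${requests_per_slot}"
echo "input_tokens=${input_tokens}"
echo "output_tokens=${output_tokens}"
```

This prints a concurrency of 32, 10 requests per concurrent slot, 1280000 input tokens, and 64000 output tokens for the 320 profiled requests.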

Apply the manifest matching your deployed target.

Blackwell, aggregated:

```bash
kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/agg/blackwell/perf.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/qwen3-235b-a22b-bench -n ${NAMESPACE} --timeout=7200s
```

Blackwell, disaggregated:

```bash
kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/disagg/blackwell/perf.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/qwen3-235b-a22b-disagg-bench -n ${NAMESPACE} --timeout=7200s
```

Hopper, aggregated:

```bash
kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/agg/hopper/perf.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/qwen3-235b-a22b-bench -n ${NAMESPACE} --timeout=7200s
```

Hopper, disaggregated:

```bash
kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/disagg/hopper/perf.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/qwen3-235b-a22b-disagg-bench -n ${NAMESPACE} --timeout=7200s
```

## Compare All Targets

All four targets serve `Qwen/Qwen3-235B-A22B-FP8` on TensorRT-LLM (PyTorch backend) over 16 GPUs with KV-aware routing and the same 4K ISL / 200 OSL benchmark shape:

|                     | Blackwell aggregated  | Blackwell disaggregated                  | Hopper aggregated     | Hopper disaggregated                     |
| ------------------- | --------------------- | ---------------------------------------- | --------------------- | ---------------------------------------- |
| **GPUs**            | 16x B100/B200         | 16x B100/B200                            | 16x H100/H200         | 16x H100/H200                            |
| **Topology**        | 4x workers, TP4 x EP4 | 6x prefill (TP2) + 1x decode (TP4 x EP4) | 4x workers, TP4 x EP4 | 6x prefill (TP2) + 1x decode (TP4 x EP4) |
| **MoE backend**     | DEEPGEMM              | DEEPGEMM                                 | CUTLASS (default)     | CUTLASS (default)                        |
| **Chunked prefill** | Enabled               | Disabled                                 | Enabled               | Disabled                                 |
| **Workload**        | 4K / 200, conc. 32    | 4K / 200, conc. 32                       | 4K / 200, conc. 32    | 4K / 200, conc. 32                       |
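Both topologies in the table fill the same 16 GPUs. The sketch below checks that arithmetic; it assumes expert parallelism (EP4) runs on the same four GPUs as the TP4 group rather than multiplying the GPU count, which is the only reading consistent with the 16-GPU totals above.

```bash
# Illustrative GPU accounting for the two topologies.
# Assumption: EP4 shares the TP4 group's GPUs, so it adds no extra devices.
agg_workers=4
agg_tp=4
agg_gpus=$((agg_workers * agg_tp))   # 4 workers x TP4

prefill_workers=6
prefill_tp=2
decode_workers=1
decode_tp=4
disagg_gpus=$((prefill_workers * prefill_tp + decode_workers * decode_tp))  # 12 prefill + 4 decode

echo "aggregated GPUs: ${agg_gpus}"
echo "disaggregated GPUs: ${disagg_gpus}"
```

Both lines report 16, with the disaggregated layout splitting them into 12 prefill GPUs and 4 decode GPUs.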

## Notes

* The Hopper/Blackwell split is required with TRT-LLM 1.3.x: the default CUTLASS MoE backend falls through to a Hopper-specific JIT path on SM100 and crashes, so Blackwell needs `moe_config.backend: DEEPGEMM`. DEEPGEMM in turn crashes on Hopper due to a scale-factor dtype mismatch — hence two separate variants per topology.
* Chunked prefill is enabled for the aggregated targets and disabled for the disaggregated targets.
* All targets use KV-aware routing (`--router-mode kv`) at the frontend.
* Model download may take 30-60 minutes; update `storageClassName` in `model-cache/model-cache.yaml` before deploying.
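The backend rule in the first note can be expressed as a small lookup keyed on GPU compute capability. This is a sketch under stated assumptions: `moe_backend_for` is a hypothetical helper, Hopper is taken to be compute capability 9.x and "SM100+" to mean 10.0 and above, and anything older is rejected because this page validates no target for it.

```bash
# Hypothetical helper: map a compute capability such as "9.0" or "10.0"
# to the MoE backend this page uses for that architecture.
moe_backend_for() {
  local cap="$1"
  local major="${cap%%.*}"   # keep the part before the first dot

  # Reject empty or non-numeric input.
  case "$major" in
    ''|*[!0-9]*) echo "invalid compute capability: $cap" >&2; return 1 ;;
  esac

  if [ "$major" -ge 10 ]; then
    echo DEEPGEMM            # SM100+ (Blackwell)
  elif [ "$major" -eq 9 ]; then
    echo CUTLASS             # SM90 (Hopper), the TRT-LLM default
  else
    echo "no validated backend for compute capability $cap" >&2
    return 1
  fi
}

moe_backend_for 9.0    # Hopper
moe_backend_for 10.0   # Blackwell
```

The two calls print `CUTLASS` and `DEEPGEMM`. On a GPU node, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` should report the capability to pass in, assuming a driver recent enough to expose that field.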

## Source

* Source README: [recipes/qwen3-235b-a22b-fp8/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/README.md)
* Aggregated Blackwell: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/trtllm/agg/blackwell/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/trtllm/agg/blackwell/perf.yaml)
* Disaggregated Blackwell: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/trtllm/disagg/blackwell/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/trtllm/disagg/blackwell/perf.yaml)
* Aggregated Hopper: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/trtllm/agg/hopper/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/trtllm/agg/hopper/perf.yaml)
* Disaggregated Hopper: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/trtllm/disagg/hopper/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/trtllm/disagg/hopper/perf.yaml)
* Setup assets: [model-cache/model-cache.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/model-cache/model-cache.yaml) and [model-cache/model-download.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-235b-a22b-fp8/model-cache/model-download.yaml)