
# Llama-3.3-70B FP8

Three validated vLLM topologies for `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`: aggregated TP4, single-node prefill/decode split, and two-node prefill/decode. All three are benchmarked with the same 8K ISL / 1K OSL traffic at 16 concurrency per GPU, so you can compare normalized TPS/GPU across footprints. Pick your topology, then follow the commands for that target in each section.

Choose your deployment target:

* Aggregated (4 GPU)
* Disagg single-node (8 GPU), recommended
* Disagg multi-node (16 GPU)

**Aggregated (4 GPU)**

* **Checkpoint:** `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`
* **Precision:** FP8 dynamic
* **GPUs:** 4x H100/H200, one TP4 worker
* **Techniques:** Aggregated serving
* **Workload:** 8K ISL / 1K OSL, 64 concurrency

**Disagg single-node (8 GPU)**

* **Checkpoint:** `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`
* **Precision:** FP8 dynamic
* **GPUs:** 8x H100/H200: 2x TP2 prefill + 1x TP4 decode
* **Techniques:** Disaggregated P/D over NIXL
* **Workload:** 8K ISL / 1K OSL, 128 concurrency

**Disagg multi-node (16 GPU)**

* **Checkpoint:** `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`
* **Precision:** FP8 dynamic
* **GPUs:** 16x H100/H200: 1x TP8 prefill + 1x TP8 decode
* **Techniques:** Disaggregated P/D over NIXL
* **Workload:** 8K ISL / 1K OSL, 256 concurrency

## Prerequisites

* A Kubernetes cluster with the Dynamo platform installed and enough GPUs for your target:
  * Aggregated: **4x H100 or H200** available on one node.
  * Disagg single-node: **8x H100 or H200** available on one node.
  * Disagg multi-node: **16x H100 or H200** available across two nodes.
* A Hugging Face token with access to `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`.

Create the namespace and token secret:

```bash
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}
```

Update `storageClassName` in `model-cache/model-cache.yaml` to match your cluster, and edit namespace, image tags, node selectors, and Hugging Face secrets before applying these manifests. Model download takes approximately 15-30 minutes depending on network speed.

## Deploy

Prepare the model cache and download the checkpoint (shared by all three targets):

```bash
# Update storageClassName in model-cache.yaml first!
kubectl apply -f recipes/llama-3-70b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
```

Then deploy the manifest for your target.

Aggregated:

```bash
kubectl apply -f recipes/llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}
```

Disagg single-node:

```bash
kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/deploy.yaml -n ${NAMESPACE}
```

Disagg multi-node:

```bash
kubectl apply -f recipes/llama-3-70b/vllm/disagg-multi-node/deploy.yaml -n ${NAMESPACE}
```

## Smoke Test

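To verify that the deployment serves traffic, port-forward the frontend service for your target and then send a test request.

Aggregated:

```bash
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
```

Disagg single-node:

```bash
kubectl port-forward svc/llama3-70b-disagg-sn-frontend 8000:8000 -n ${NAMESPACE}
```

Disagg multi-node:

```bash
kubectl port-forward svc/llama3-70b-disagg-mn-frontend 8000:8000 -n ${NAMESPACE}
```

The port-forward runs in the foreground, so send the request from a second terminal: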

```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
```

## Benchmark

Each target ships a `perf.yaml` Kubernetes Job that waits for the model to come up, then runs AIPerf with the same traffic shape: ISL=8192, OSL=1024, 16 concurrency per GPU (so total concurrency scales with the target's GPU count), and a request count of 10x total concurrency.
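
The scaling rule above can be sketched in a few lines of shell. This is a minimal illustration, not part of the recipe: the function names `total_concurrency` and `request_count` are made up here, and the two constants are taken from the traffic shape just described.

```bash
# Derive the benchmark load for a target from its GPU count.
# Both constants come from the traffic shape described above.
CONCURRENCY_PER_GPU=16
REQUESTS_PER_CONCURRENCY=10

# Hypothetical helpers (not part of the recipe or of AIPerf).
total_concurrency() { echo $(( $1 * CONCURRENCY_PER_GPU )); }
request_count() { echo $(( $(total_concurrency "$1") * REQUESTS_PER_CONCURRENCY )); }

# Print the AIPerf load flags for the 4-, 8-, and 16-GPU targets.
for gpus in 4 8 16; do
  printf '%2d GPUs: --concurrency %d --request-count %d\n' \
    "$gpus" "$(total_concurrency "$gpus")" "$(request_count "$gpus")"
done
```

The printed values should match the `--concurrency` and `--request-count` flags in the per-target AIPerf commands in this section.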

For the aggregated target, the Job wraps the AIPerf run below (64 concurrency, 640 requests); the second block applies the Job and follows its logs:

```bash
aiperf profile \
  --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
  --endpoint-type chat --endpoint /v1/chat/completions --streaming \
  --url http://llama3-70b-agg-frontend:8000 \
  --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1024 --output-tokens-stddev 0 \
  --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
  --concurrency 64 --request-count 640
```

```bash
kubectl apply -f recipes/llama-3-70b/vllm/agg/perf.yaml -n ${NAMESPACE}
kubectl logs -f job/llama3-70b-agg-perf -n ${NAMESPACE}
```

For the disagg single-node target, the Job wraps the AIPerf run below (128 concurrency, 1,280 requests); the second block applies the Job and follows its logs:

```bash
aiperf profile \
  --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
  --endpoint-type chat --endpoint /v1/chat/completions --streaming \
  --url http://llama3-70b-disagg-sn-frontend:8000 \
  --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1024 --output-tokens-stddev 0 \
  --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
  --concurrency 128 --request-count 1280
```

```bash
kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/perf.yaml -n ${NAMESPACE}
kubectl logs -f job/llama3-70b-disagg-sn-perf -n ${NAMESPACE}
```

For the disagg multi-node target, the Job wraps the AIPerf run below (256 concurrency, 2,560 requests); the second block applies the Job and follows its logs:

```bash
aiperf profile \
  --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
  --endpoint-type chat --endpoint /v1/chat/completions --streaming \
  --url http://llama3-70b-disagg-mn-frontend:8000 \
  --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1024 --output-tokens-stddev 0 \
  --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
  --concurrency 256 --request-count 2560
```

```bash
kubectl apply -f recipes/llama-3-70b/vllm/disagg-multi-node/perf.yaml -n ${NAMESPACE}
kubectl logs -f job/llama3-70b-disagg-mn-perf -n ${NAMESPACE}
```

## Compare All Targets

All three targets serve `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` on the vLLM runtime (`nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.1`) and are benchmarked at 16 concurrency per GPU with 8K ISL / 1K OSL traffic:

|                       | Aggregated         | Disagg single-node             | Disagg multi-node              |
| --------------------- | ------------------ | ------------------------------ | ------------------------------ |
| **GPUs**              | 4x H100/H200       | 8x H100/H200                   | 16x H100/H200                  |
| **Nodes**             | 1                  | 1                              | 2                              |
| **Workers**           | 1x TP4             | 2x TP2 prefill + 1x TP4 decode | 1x TP8 prefill + 1x TP8 decode |
| **Technique**         | Aggregated serving | Disaggregated prefill/decode   | Disaggregated prefill/decode   |
| **Total concurrency** | 64                 | 128                            | 256                            |
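
As a sanity check on the **Workers** and **GPUs** rows, each target's GPU count is the sum of replicas times tensor-parallel size over its workers. The sketch below assumes a hypothetical `REPLICASxTP` shorthand (for example `2x2` for two TP2 workers) and a hypothetical helper `gpus_for_layout`; neither appears in the recipe manifests.

```bash
# Sum replicas x tensor-parallel size over a list of "REPLICASxTP" worker specs.
# Hypothetical helper for illustration only.
gpus_for_layout() {
  local total=0 spec replicas tp
  for spec in "$@"; do
    replicas=${spec%%x*}   # text before the first "x"
    tp=${spec##*x}         # text after the last "x"
    total=$(( total + replicas * tp ))
  done
  echo "$total"
}

gpus_for_layout 1x4        # aggregated: one TP4 worker
gpus_for_layout 2x2 1x4    # disagg single-node: 2x TP2 prefill + 1x TP4 decode
gpus_for_layout 1x8 1x8    # disagg multi-node: 1x TP8 prefill + 1x TP8 decode
```

The three calls should print 4, 8, and 16, matching the **GPUs** row of the table.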

## Related Feature Benchmarks

* [Llama-3.3-70B topology benchmark](/dynamo/dev/benchmarks/llama-3-70b-topology) — how the three topologies compare when normalized by GPU.

## Notes

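* The served model is `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`. The checkpoint ships FP8-quantized weights; "dynamic" refers to the activation scales, which are computed at runtime rather than calibrated offline.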
* GPU counts differ by target (4 / 8 / 16), so compare both total throughput and TPS/GPU. The normalized topology comparison lives on the page linked under Related Feature Benchmarks.
* A GAIE (Gateway API Inference Extension) integration example is included: apply the manifests under [vllm/agg/gaie/](https://github.com/ai-dynamo/dynamo/tree/main/recipes/llama-3-70b/vllm/agg/gaie) (or [vllm/disagg-single-node/gaie/](https://github.com/ai-dynamo/dynamo/tree/main/recipes/llama-3-70b/vllm/disagg-single-node/gaie)) to front the deployment with an inference gateway. These are integration artifacts, not separate recipe targets.
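
The per-GPU normalization in the second note is a plain division of total output throughput by GPU count. A minimal sketch, assuming a hypothetical helper `tps_per_gpu`; the input in the example call is a placeholder, not a measured result.

```bash
# Normalize total tokens/sec by GPU count so targets of different sizes are comparable.
# Hypothetical helper. Usage: tps_per_gpu TOTAL_TPS GPU_COUNT
tps_per_gpu() {
  local total_tps=$1 gpus=$2
  if [ "$gpus" -le 0 ]; then
    echo "GPU count must be positive" >&2
    return 1
  fi
  # awk handles the floating-point division that shell arithmetic lacks.
  awk -v t="$total_tps" -v g="$gpus" 'BEGIN { printf "%.1f\n", t / g }'
}

# Placeholder throughput of 1000 tokens/sec on the 4-GPU aggregated target.
tps_per_gpu 1000 4   # prints 250.0
```

Substitute the total output token throughput that AIPerf reports for your run, together with the target's GPU count (4, 8, or 16).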

## Source

* Source README: [recipes/llama-3-70b/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/README.md)
* vLLM aggregated: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/vllm/agg/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/vllm/agg/perf.yaml)
* vLLM disagg single-node: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/vllm/disagg-single-node/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/vllm/disagg-single-node/perf.yaml)
* vLLM disagg multi-node: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/vllm/disagg-multi-node/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/vllm/disagg-multi-node/perf.yaml)
* GAIE integration: [vllm/agg/gaie/](https://github.com/ai-dynamo/dynamo/tree/main/recipes/llama-3-70b/vllm/agg/gaie)
* Setup assets: [recipes/llama-3-70b/model-cache/model-cache.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/model-cache/model-cache.yaml) and [recipes/llama-3-70b/model-cache/model-download.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/llama-3-70b/model-cache/model-download.yaml)