Llama-3.3-70B FP8 | NVIDIA Dynamo Documentation

Three validated vLLM topologies for RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic (aggregated TP4, single-node prefill/decode split, and two-node prefill/decode), all benchmarked with the same 8K ISL / 1K OSL traffic at 16 concurrency per GPU so you can compare normalized TPS/GPU across footprints. Each step below lists the commands for all three topologies; use the ones for the topology you pick.

Choose your deployment target

Topologies:

Aggregated (4 GPU)
Disagg single-node (8 GPU), recommended
Disagg multi-node (16 GPU)

Aggregated (4 GPU)
Checkpoint: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
Precision: FP8 dynamic
GPUs: 4x H100/H200, one TP4 worker
Techniques: Aggregated serving
Workload: 8K ISL / 1K OSL, 64 concurrency

Disagg single-node (8 GPU)
Checkpoint: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
Precision: FP8 dynamic
GPUs: 8x H100/H200: 2x TP2 prefill + 1x TP4 decode
Techniques: Disaggregated P/D over NIXL
Workload: 8K ISL / 1K OSL, 128 concurrency

Disagg multi-node (16 GPU)
Checkpoint: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
Precision: FP8 dynamic
GPUs: 16x H100/H200: 1x TP8 prefill + 1x TP8 decode
Techniques: Disaggregated P/D over NIXL
Workload: 8K ISL / 1K OSL, 256 concurrency

Prerequisites

A Kubernetes cluster with the Dynamo platform installed and enough GPUs for your chosen topology:

Aggregated: 4x H100 or H200 available on one node.
Disagg single-node: 8x H100 or H200 available on one node.
Disagg multi-node: 16x H100 or H200 available across two nodes.

A Hugging Face token with access to RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic.

Create the namespace and token secret:

$ export NAMESPACE=dynamo-demo
$ kubectl create namespace ${NAMESPACE}
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="your-token-here" \
>   -n ${NAMESPACE}

Update storageClassName in model-cache/model-cache.yaml to match your cluster, and edit namespace, image tags, node selectors, and Hugging Face secrets before applying these manifests. Model download takes approximately 15-30 minutes depending on network speed.

Deploy

Prepare the model cache and download the checkpoint (shared by all three targets):

$ # Update storageClassName in model-cache.yaml first!
$ kubectl apply -f recipes/llama-3-70b/model-cache/ -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

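Then apply the deploy.yaml for your chosen topology (aggregated, disagg single-node, or disagg multi-node, in that order):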

$ kubectl apply -f recipes/llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

$ kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/deploy.yaml -n ${NAMESPACE}

$ kubectl apply -f recipes/llama-3-70b/vllm/disagg-multi-node/deploy.yaml -n ${NAMESPACE}

Smoke Test

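Send a test request to verify the deployment serves traffic. Port-forward the frontend service for the topology you deployed (the command keeps running in the foreground), then send the request from a second terminal: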

$ kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}

$ kubectl port-forward svc/llama3-70b-disagg-sn-frontend 8000:8000 -n ${NAMESPACE}

$ kubectl port-forward svc/llama3-70b-disagg-mn-frontend 8000:8000 -n ${NAMESPACE}

$ curl http://localhost:8000/v1/chat/completions \
>   -H 'Content-Type: application/json' \
>   -d '{"model":"RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
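For scripted checks it can help to reduce the smoke test to a single status code. The following is a minimal sketch, assuming a hypothetical helper named smoke_status (not part of the recipe) that sends the same request body as the curl command above and prints only the HTTP status; the default base URL assumes the port-forward shown earlier is still running.

```shell
# Hypothetical helper, not part of the recipe: sends the same chat request as
# the curl command above and prints only the HTTP status code.
smoke_status() {
  # Assumption: the frontend is reachable through the local port-forward.
  base_url="${1:-http://localhost:8000}"
  curl -s -o /dev/null -w '%{http_code}' \
    "${base_url}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model":"RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
}

# Usage (with the port-forward still running in another terminal):
#   [ "$(smoke_status)" = "200" ] && echo "frontend is serving"
```

A 200 here only shows that the frontend accepted and answered the request; inspect the full response with the original curl command if you want to read the generated text.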

Benchmark

Each target ships a perf.yaml Kubernetes Job that waits for the model to come up, then runs AIPerf with the same traffic shape: ISL=8192, OSL=1024, 16 concurrency per GPU (so total concurrency scales with the target’s GPU count), and a request count of 10x total concurrency.
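The scaling rule above is plain arithmetic. As a minimal sketch, assuming a hypothetical helper named bench_params (not part of the recipe), the AIPerf load flags for each target can be derived from its GPU count:

```shell
# Hypothetical helper, not part of the recipe: derives the AIPerf load flags
# from a target's GPU count using the rule described above.
bench_params() {
  n_gpus="$1"
  total_concurrency=$(( n_gpus * 16 ))          # 16 concurrent requests per GPU
  request_count=$(( total_concurrency * 10 ))   # 10x total concurrency
  echo "--concurrency ${total_concurrency} --request-count ${request_count}"
}

# Print the flags for the three GPU footprints on this page.
for gpus in 4 8 16; do
  echo "GPUs=${gpus}: $(bench_params "${gpus}")"
done
```

The printed values match the concurrency and request counts used in the three AIPerf commands below (64/640, 128/1280, 256/2560).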

Aggregated: the Job wraps this AIPerf run (64 concurrency, 640 requests):

$ aiperf profile \
>   --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
>   --endpoint-type chat --endpoint /v1/chat/completions --streaming \
>   --url http://llama3-70b-agg-frontend:8000 \
>   --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 1024 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
>   --concurrency 64 --request-count 640

$ kubectl apply -f recipes/llama-3-70b/vllm/agg/perf.yaml -n ${NAMESPACE}
$ kubectl logs -f job/llama3-70b-agg-perf -n ${NAMESPACE}

Disagg single-node: the Job wraps this AIPerf run (128 concurrency, 1,280 requests):

$ aiperf profile \
>   --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
>   --endpoint-type chat --endpoint /v1/chat/completions --streaming \
>   --url http://llama3-70b-disagg-sn-frontend:8000 \
>   --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 1024 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
>   --concurrency 128 --request-count 1280

$ kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/perf.yaml -n ${NAMESPACE}
$ kubectl logs -f job/llama3-70b-disagg-sn-perf -n ${NAMESPACE}

Disagg multi-node: the Job wraps this AIPerf run (256 concurrency, 2,560 requests):

$ aiperf profile \
>   --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
>   --endpoint-type chat --endpoint /v1/chat/completions --streaming \
>   --url http://llama3-70b-disagg-mn-frontend:8000 \
>   --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 1024 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
>   --concurrency 256 --request-count 2560

$ kubectl apply -f recipes/llama-3-70b/vllm/disagg-multi-node/perf.yaml -n ${NAMESPACE}
$ kubectl logs -f job/llama3-70b-disagg-mn-perf -n ${NAMESPACE}

Compare All Targets

All three targets serve RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic on the vLLM runtime (nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.1) and are benchmarked at 16 concurrency per GPU with 8K ISL / 1K OSL traffic:

Target	Aggregated	Disagg single-node	Disagg multi-node
GPUs	4x H100/H200	8x H100/H200	16x H100/H200
Nodes	1	1	2
Workers	1x TP4	2x TP2 prefill + 1x TP4 decode	1x TP8 prefill + 1x TP8 decode
Technique	Aggregated serving	Disaggregated prefill/decode	Disaggregated prefill/decode
Total concurrency	64	128	256

Related: Llama-3.3-70B topology benchmark, which shows how the three topologies compare when normalized by GPU.

Notes

FP8 dynamic quantization is applied at runtime; the served model is RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic.
GPU counts differ by target (4 / 8 / 16), so compare total throughput and TPS/GPU — the topology comparison lives on the related Feature Benchmarks page.
A GAIE (Gateway API Inference Extension) integration example is included: apply the manifests under vllm/agg/gaie/ (or vllm/disagg-single-node/gaie/) to front the deployment with an inference gateway. These are integration artifacts, not separate recipe targets.
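To make the TPS/GPU comparison from the notes above concrete, here is a minimal sketch, assuming a hypothetical helper named tps_per_gpu (not part of the recipe) that divides a total output-token throughput by the GPU count. The throughput value in the usage line is a placeholder for illustration only, not a measured result; substitute the figure reported by your own AIPerf run.

```shell
# Hypothetical helper, not part of the recipe: normalizes total throughput
# (tokens per second) by GPU count so different footprints can be compared.
tps_per_gpu() {
  awk -v tps="$1" -v gpus="$2" 'BEGIN { printf "%.1f\n", tps / gpus }'
}

# Placeholder input only, NOT a benchmark result: replace TOTAL_TPS with the
# output-token throughput from your AIPerf logs and GPUS with 4, 8, or 16.
TOTAL_TPS=1000
GPUS=4
tps_per_gpu "${TOTAL_TPS}" "${GPUS}"
```

Running the helper once per target with that target's GPU count gives figures that can be compared directly, which total throughput alone does not allow.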

Source

Source README: recipes/llama-3-70b/README.md
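vLLM aggregated: recipes/llama-3-70b/vllm/agg/deploy.yaml and recipes/llama-3-70b/vllm/agg/perf.yaml
vLLM disagg single-node: recipes/llama-3-70b/vllm/disagg-single-node/deploy.yaml and recipes/llama-3-70b/vllm/disagg-single-node/perf.yaml
vLLM disagg multi-node: recipes/llama-3-70b/vllm/disagg-multi-node/deploy.yaml and recipes/llama-3-70b/vllm/disagg-multi-node/perf.yaml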
GAIE integration: vllm/agg/gaie/
Setup assets: recipes/llama-3-70b/model-cache/model-cache.yaml and recipes/llama-3-70b/model-cache/model-download.yaml