Llama-3.3-70B FP8

Serve Llama-3.3-70B FP8 with Dynamo and vLLM, sized from one node to two.


This page covers three validated vLLM topologies for RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic: aggregated TP4, single-node prefill/decode split, and two-node prefill/decode. All three are benchmarked with the same 8K ISL / 1K OSL traffic at 16 concurrency per GPU, so you can compare normalized TPS/GPU across footprints. Where a step lists one command per topology, run only the one for your target.

Choose your deployment target

Topology: Aggregated
  • Checkpoint: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
  • Precision: FP8 dynamic
  • GPUs: 4x H100/H200, one TP4 worker
  • Techniques: Aggregated serving
  • Workload: 8K ISL / 1K OSL, 64 concurrency

Topology: Disagg single-node
  • Checkpoint: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
  • Precision: FP8 dynamic
  • GPUs: 8x H100/H200: 2x TP2 prefill + 1x TP4 decode
  • Techniques: Disaggregated P/D over NIXL
  • Workload: 8K ISL / 1K OSL, 128 concurrency

Topology: Disagg multi-node
  • Checkpoint: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
  • Precision: FP8 dynamic
  • GPUs: 16x H100/H200: 1x TP8 prefill + 1x TP8 decode
  • Techniques: Disaggregated P/D over NIXL
  • Workload: 8K ISL / 1K OSL, 256 concurrency

Prerequisites

  • Aggregated: a Kubernetes cluster with the Dynamo platform installed and 4x H100 or H200 available on one node.
  • Disagg single-node: a Kubernetes cluster with the Dynamo platform installed and 8x H100 or H200 available on one node.
  • Disagg multi-node: a Kubernetes cluster with the Dynamo platform installed and 16x H100 or H200 available across two nodes.
  • A Hugging Face token with access to RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic.

Create the namespace and token secret:

$export NAMESPACE=dynamo-demo
$kubectl create namespace ${NAMESPACE}
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="your-token-here" \
> -n ${NAMESPACE}

Update storageClassName in model-cache/model-cache.yaml to match your cluster, and edit namespace, image tags, node selectors, and Hugging Face secrets before applying these manifests. Model download takes approximately 15-30 minutes depending on network speed.

Deploy

Prepare the model cache and download the checkpoint (shared by all three targets):

$# Update storageClassName in model-cache.yaml first!
$kubectl apply -f recipes/llama-3-70b/model-cache/ -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

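Then deploy your chosen target. Apply only one of the following manifests (aggregated, disagg single-node, or disagg multi-node, in that order):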

$kubectl apply -f recipes/llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}
$kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/deploy.yaml -n ${NAMESPACE}
$kubectl apply -f recipes/llama-3-70b/vllm/disagg-multi-node/deploy.yaml -n ${NAMESPACE}

Smoke Test

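Send a test request to verify the deployment serves traffic. Run only the port-forward line for your target (aggregated, disagg single-node, or disagg multi-node, in that order) and keep it running in a separate terminal, since kubectl port-forward blocks while it is active: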

$kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
$kubectl port-forward svc/llama3-70b-disagg-sn-frontend 8000:8000 -n ${NAMESPACE}
$kubectl port-forward svc/llama3-70b-disagg-mn-frontend 8000:8000 -n ${NAMESPACE}
$curl http://localhost:8000/v1/chat/completions \
> -H 'Content-Type: application/json' \
> -d '{"model":"RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'

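A large checkpoint can take a while to load, so the first curl may fail even when the deployment is healthy. The following is a minimal sketch of a retry loop, assuming the frontend answers GET /v1/models once it is up (an assumption about the OpenAI-compatible frontend, not something this page states). The helper name wait_for_frontend and its default attempt count and delay are hypothetical.

```shell
# wait_for_frontend URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls URL/v1/models until curl succeeds or attempts run out.
# The /v1/models path is an assumption; swap in any endpoint your frontend serves.
wait_for_frontend() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -f: treat HTTP errors (>= 400) as failures
    if curl -sf -o /dev/null "${url}/v1/models"; then
      echo "frontend ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "frontend not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

With the port-forward running, `wait_for_frontend http://localhost:8000` would block until the endpoint responds or the attempts are exhausted, after which the chat completion request above can be sent.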
Benchmark

Each target ships a perf.yaml Kubernetes Job that waits for the model to come up, then runs AIPerf with the same traffic shape: ISL=8192, OSL=1024, 16 concurrency per GPU (so total concurrency scales with the target’s GPU count), and a request count of 10x total concurrency.
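The sizing rule above is simple arithmetic: total concurrency is 16 times the GPU count, and the request count is 10 times the total concurrency. A small sketch of that calculation, where bench_params is a hypothetical helper and not part of the recipe:

```shell
# bench_params GPUS
# Hypothetical helper: derives the AIPerf load flags from the GPU count,
# using the rule stated in the text (16 concurrency per GPU, 10x requests).
bench_params() {
  gpus="$1"
  concurrency=$((gpus * 16))
  requests=$((concurrency * 10))
  echo "--concurrency ${concurrency} --request-count ${requests}"
}

bench_params 4    # aggregated
bench_params 8    # disagg single-node
bench_params 16   # disagg multi-node
```

For 4, 8, and 16 GPUs this yields 64/640, 128/1280, and 256/2560, which should match the values each target's perf.yaml uses.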

For the aggregated target, the Job wraps this AIPerf run (64 concurrency, 640 requests):

$aiperf profile \
> --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
> --endpoint-type chat --endpoint /v1/chat/completions --streaming \
> --url http://llama3-70b-agg-frontend:8000 \
> --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
> --output-tokens-mean 1024 --output-tokens-stddev 0 \
> --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
> --concurrency 64 --request-count 640
$kubectl apply -f recipes/llama-3-70b/vllm/agg/perf.yaml -n ${NAMESPACE}
$kubectl logs -f job/llama3-70b-agg-perf -n ${NAMESPACE}

For the disagg single-node target, the Job wraps this AIPerf run (128 concurrency, 1,280 requests):

$aiperf profile \
> --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
> --endpoint-type chat --endpoint /v1/chat/completions --streaming \
> --url http://llama3-70b-disagg-sn-frontend:8000 \
> --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
> --output-tokens-mean 1024 --output-tokens-stddev 0 \
> --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
> --concurrency 128 --request-count 1280
$kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/perf.yaml -n ${NAMESPACE}
$kubectl logs -f job/llama3-70b-disagg-sn-perf -n ${NAMESPACE}

For the disagg multi-node target, the Job wraps this AIPerf run (256 concurrency, 2,560 requests):

$aiperf profile \
> --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
> --endpoint-type chat --endpoint /v1/chat/completions --streaming \
> --url http://llama3-70b-disagg-mn-frontend:8000 \
> --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
> --output-tokens-mean 1024 --output-tokens-stddev 0 \
> --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 --extra-inputs ignore_eos:true \
> --concurrency 256 --request-count 2560
$kubectl apply -f recipes/llama-3-70b/vllm/disagg-multi-node/perf.yaml -n ${NAMESPACE}
$kubectl logs -f job/llama3-70b-disagg-mn-perf -n ${NAMESPACE}

Compare All Targets

All three targets serve RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic on the vLLM runtime (nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.1) and are benchmarked at 16 concurrency per GPU with 8K ISL / 1K OSL traffic:

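                  | Aggregated         | Disagg single-node             | Disagg multi-node
------------------+--------------------+--------------------------------+-------------------------------
GPUs              | 4x H100/H200       | 8x H100/H200                   | 16x H100/H200
Nodes             | 1                  | 1                              | 2
Workers           | 1x TP4             | 2x TP2 prefill + 1x TP4 decode | 1x TP8 prefill + 1x TP8 decode
Technique         | Aggregated serving | Disaggregated prefill/decode   | Disaggregated prefill/decode
Total concurrency | 64                 | 128                            | 256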
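Each target's GPU count follows from its worker layout: every worker uses as many GPUs as its tensor-parallel (TP) degree, so the total is the sum of replicas times TP over all worker groups. A minimal sketch of that arithmetic, where gpu_count and its "replicas:tp" argument format are hypothetical conventions chosen for illustration:

```shell
# gpu_count REPLICAS:TP [REPLICAS:TP ...]
# Hypothetical helper: sums replicas * tensor-parallel degree over worker groups.
gpu_count() {
  total=0
  for worker in "$@"; do
    replicas="${worker%%:*}"   # text before the colon
    tp="${worker##*:}"         # text after the colon
    total=$((total + replicas * tp))
  done
  echo "$total"
}

gpu_count 1:4        # aggregated: 1x TP4
gpu_count 2:2 1:4    # disagg single-node: 2x TP2 prefill + 1x TP4 decode
gpu_count 1:8 1:8    # disagg multi-node: 1x TP8 prefill + 1x TP8 decode
```

The three calls print 4, 8, and 16, matching the GPU counts listed for the targets.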

Notes

  • The served model is RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic. In the FP8 dynamic scheme the checkpoint ships FP8 weights, and activation scales are computed at runtime.
  • GPU counts differ by target (4 / 8 / 16), so compare both total throughput and TPS/GPU. The topology comparison itself lives on the related Feature Benchmarks page.
  • A GAIE (Gateway API Inference Extension) integration example is included: apply the manifests under vllm/agg/gaie/ (or vllm/disagg-single-node/gaie/) to front the deployment with an inference gateway. These are integration artifacts, not separate recipe targets.
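The TPS/GPU normalization mentioned in the notes divides total output throughput by the target's GPU count. A minimal sketch follows, with tps_per_gpu as a hypothetical helper; the throughput values passed in are placeholders for illustration, not measured results.

```shell
# tps_per_gpu TOTAL_TPS GPUS
# Hypothetical helper: normalizes total tokens/second by GPU count.
tps_per_gpu() {
  total_tps="$1"
  gpus="$2"
  awk -v t="$total_tps" -v g="$gpus" 'BEGIN { printf "%.1f\n", t / g }'
}

# Placeholder inputs only; substitute the throughput AIPerf reports for your run.
tps_per_gpu 1000 4
tps_per_gpu 1000 8
```

Feeding in the total output throughput from each target's AIPerf report, together with 4, 8, or 16 GPUs, gives figures that can be compared across footprints.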
