GPT-OSS-120B

Serve openai/gpt-oss-120b with Dynamo and TensorRT-LLM on Blackwell.


Two validated TensorRT-LLM targets cover the two traffic shapes this model is most often deployed for: aggregated expert-parallel (EP4, attention-DP) serving for short-prompt, high-concurrency traffic, and a prefill/decode split for long-context generation. They are deployment targets for different workloads, not an agg-vs-disagg comparison. Pick the target that matches your workload and use its commands throughout this page.

Choose your deployment target

Target: Aggregated
  • Checkpoint: openai/gpt-oss-120b
  • GPUs: 4x GB200 (ARM64), TP4, EP4 + attention-DP
  • Workload: Short prompts, long outputs, high concurrency (128 ISL / 1000 OSL)
  • Runtime: TensorRT-LLM (tensorrtllm-runtime:1.2.1)

Target: Disaggregated P/D
  • Checkpoint: openai/gpt-oss-120b
  • GPUs: 5x GB200/B200 (TP1 prefill + TP4 decode)
  • Workload: Long-context generation (8K ISL / 1K OSL)
  • Quantization: W4A8_MXFP4_MXFP8
  • Runtime: TensorRT-LLM (tensorrtllm-runtime:1.2.1)

Prerequisites

  • Aggregated target: a Kubernetes cluster with the Dynamo platform installed and 4x GB200 available on ARM64 nodes. This target will not run on x86 Hopper/Ampere hardware.
  • Disaggregated target: a Kubernetes cluster with the Dynamo platform installed and 5x GB200 or B200 available (1 prefill + 4 decode GPUs).
  • Both targets: a Hugging Face token with access to openai/gpt-oss-120b.

Create the namespace and token secret:

export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}

Before deploying, set storageClassName in model-cache/model-cache.yaml to a storage class available in your cluster, and update the container image tag in deploy.yaml to match your Dynamo release. Also edit the namespace, node selectors, and any cluster-specific placement.
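
The storageClassName edit can be scripted. The following is a minimal sketch, not part of the recipe: the helper name set_storage_class and the class name fast-nfs are hypothetical, and it assumes the manifest carries storageClassName on a line of its own, which you should confirm in your copy of the file. Running kubectl get storageclass lists the classes your cluster offers.

```shell
# Hypothetical helper: rewrite every "storageClassName:" line in a manifest.
# Assumes the key sits on its own line; leading indentation is preserved.
# The class name must not contain "|" or "&", which sed would interpret.
set_storage_class() {
  file="$1"
  class="$2"
  tmp="$(mktemp)"
  # Write to a temp file first, then copy back so the original file's
  # permissions are kept and a failed sed never truncates the manifest.
  sed "s|^\([[:space:]]*storageClassName:\).*|\1 ${class}|" "${file}" > "${tmp}" \
    && cat "${tmp}" > "${file}"
  status=$?
  rm -f "${tmp}"
  return "${status}"
}

# Example call (path assembled from the recipe layout on this page;
# "fast-nfs" is a placeholder class name):
# set_storage_class recipes/gpt-oss-120b/model-cache/model-cache.yaml fast-nfs
```

Review the result with git diff before applying the manifest.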

Deploy

Prepare the model cache (shared by both targets):

kubectl apply -f recipes/gpt-oss-120b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

Then deploy the target you chose. For the aggregated target:

kubectl apply -f recipes/gpt-oss-120b/trtllm/agg/deploy.yaml -n ${NAMESPACE}

For the disaggregated target, model loading takes roughly 15-30 minutes depending on storage speed. Apply the manifest and watch the pods come up:

kubectl apply -f recipes/gpt-oss-120b/trtllm/disagg/deploy.yaml -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -l nvidia.com/dynamo-graph-deployment-name=gpt-oss-disagg -w
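
Watching pods with -w is interactive. For scripts or CI, a blocking wait may be more convenient. The sketch below wraps kubectl wait around the same label selector; the helper name wait_for_pods and the 1800s default timeout are assumptions chosen to cover the 15-30 minute load time, not values from the recipe. Note that kubectl wait fails immediately if no pods match the selector yet, so call it after the pods have been created.

```shell
# Hypothetical helper: block until every pod of a Dynamo graph deployment
# reports the Ready condition, or until the timeout expires.
# Usage: wait_for_pods <deployment-name> [timeout]
wait_for_pods() {
  deployment="$1"
  timeout="${2:-1800s}"   # assumed default; tune to your storage speed
  kubectl wait --for=condition=Ready pod \
    -l "nvidia.com/dynamo-graph-deployment-name=${deployment}" \
    -n "${NAMESPACE}" --timeout="${timeout}"
}

# Example, using the deployment name from the command above:
# wait_for_pods gpt-oss-disagg
```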

Smoke Test

Send a test request to verify the deployment serves traffic. First forward the frontend service for your target. The port-forward runs in the foreground, so start it in a separate terminal. For the aggregated target:

kubectl port-forward svc/gpt-oss-agg-frontend 8000:8000 -n ${NAMESPACE}

For the disaggregated target:

kubectl port-forward svc/gpt-oss-disagg-frontend 8000:8000 -n ${NAMESPACE}

Then send the request:

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'

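To turn the smoke test into a pass/fail check for a script, one option is to look only at the HTTP status code. The sketch below sends the same request as above and succeeds only on HTTP 200; the helper name smoke_test is hypothetical, and treating any non-200 status as a failure is an assumption about what counts as "serving traffic".

```shell
# Hypothetical helper: returns 0 only if the chat completions endpoint
# answers HTTP 200. curl prints "000" when it cannot connect at all.
# Usage: smoke_test [base-url]
smoke_test() {
  base_url="${1:-http://localhost:8000}"
  status="$(curl -s -o /dev/null -w '%{http_code}' \
    "${base_url}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}')"
  [ "${status}" = "200" ]
}

# Example, with the port-forward from above still running:
# smoke_test && echo "frontend is serving" || echo "frontend is not ready"
```
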
Benchmark

Each target ships a perf.yaml Kubernetes Job that waits for the model to come up, then runs AIPerf with the target’s traffic shape and a request count of 10x total concurrency.

Aggregated traffic shape: ISL 128 / OSL 1000 at a concurrency of 900 per GPU x 4 GPUs = 3,600 total. The Job wraps this AIPerf run:

aiperf profile \
  --model openai/gpt-oss-120b \
  --endpoint-type chat --endpoint /v1/chat/completions --streaming \
  --url http://gpt-oss-agg-frontend:8000 \
  --synthetic-input-tokens-mean 128 --output-tokens-mean 1000 \
  --extra-inputs ignore_eos:true \
  --concurrency 3600 --request-count 36000

Launch the Job and follow its logs:

kubectl apply -f recipes/gpt-oss-120b/trtllm/agg/perf.yaml -n ${NAMESPACE}
kubectl logs -f -l job-name=gpt-oss-120b-bench -n ${NAMESPACE}
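
The --concurrency and --request-count values in the aggregated AIPerf command follow from the per-GPU figure and the 10x request-count rule described above. As a worked example, the arithmetic looks like this; the variable names are illustrative and are not taken from perf.yaml.

```shell
# Values taken from the aggregated traffic shape on this page.
per_gpu_concurrency=900
gpu_count=4
requests_per_slot=10   # request count is 10x total concurrency

total_concurrency=$((per_gpu_concurrency * gpu_count))
request_count=$((total_concurrency * requests_per_slot))

# Prints: --concurrency 3600 --request-count 36000
echo "--concurrency ${total_concurrency} --request-count ${request_count}"
```

If you scale the deployment to a different GPU count, recomputing both flags this way keeps the per-GPU load comparable.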

Disaggregated traffic shape: ISL 8192 / OSL 1024 at 1,536 total concurrency. The Job wraps this AIPerf run:

aiperf profile \
  --model openai/gpt-oss-120b \
  --endpoint-type chat --endpoint /v1/chat/completions --streaming \
  --url http://gpt-oss-disagg-frontend:8000 \
  --synthetic-input-tokens-mean 8192 --output-tokens-mean 1024 \
  --extra-inputs ignore_eos:true \
  --concurrency 1536 --request-count 15360

Launch the Job and follow its logs:

kubectl apply -f recipes/gpt-oss-120b/trtllm/disagg/perf.yaml -n ${NAMESPACE}
kubectl logs -f -l job-name=gpt-oss-120b-disagg-bench -n ${NAMESPACE}

Compare All Targets

               Aggregated                              Disaggregated P/D
GPUs           4x GB200 (ARM64 required)               5x GB200/B200
Topology       TP4, EP4 + attention-DP                 TP1 prefill + TP4 decode
Workload       128 ISL / 1000 OSL, 3,600 concurrency   8K ISL / 1K OSL, 1,536 concurrency
Quantization   Checkpoint default                      W4A8_MXFP4_MXFP8
KV transfer    -                                       UCX cache transceiver

Notes

  • The aggregated target requires ARM64 (GB200) nodes; the disaggregated target accepts GB200 or B200.
  • Do not read the two targets as an aggregated-vs-disaggregated benchmark; their traffic shapes differ by design.
  • The disaggregated deployment uses 5 GPUs (1x TP1 prefill + 1x TP4 decode), while its perf.yaml computes total concurrency from a 6-GPU count (256 x 6 = 1,536); adjust DEPLOYMENT_GPU_COUNT if you want strict per-GPU normalization.
  • Disaggregated engine configs differ per role: prefill runs TP1 with max_batch_size=64 and the overlap scheduler disabled; decode runs TP4 with max_batch_size=1280 and the overlap scheduler enabled. KV transfer uses the UCX-based cache transceiver (max_tokens_in_buffer=9216).
  • The disaggregated target uses W4A8_MXFP4_MXFP8 quantization via the OVERRIDE_QUANT_ALGO environment variable.
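
To make the GPU-count note concrete, the sketch below compares the total concurrency produced by the 6-GPU count with what strict normalization to the 5 deployed GPUs would give. The per-GPU figure of 256 and the 10x request-count rule come from this page; the helper name total_concurrency is hypothetical, and how perf.yaml reads DEPLOYMENT_GPU_COUNT is not shown here, so check the file before editing it.

```shell
per_gpu_concurrency=256   # per-GPU figure from the note above

# Hypothetical helper: total concurrency for a given GPU count.
total_concurrency() {
  echo $((per_gpu_concurrency * $1))
}

# Compare the shipped 6-GPU count with the 5 GPUs actually deployed.
for gpus in 6 5; do
  total="$(total_concurrency "${gpus}")"
  echo "DEPLOYMENT_GPU_COUNT=${gpus} concurrency=${total} request_count=$((total * 10))"
done
```

With 6 the script reports a concurrency of 1536 and 15360 requests, matching the shipped benchmark; with 5 it reports 1280 and 12800.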

Source