
# GPT-OSS-120B

Two validated TensorRT-LLM targets cover the two traffic shapes this model is most often deployed for: aggregated expert-parallel serving (EP4 with attention-DP) for short-prompt, high-concurrency traffic, and a prefill/decode split for long-context generation. They target different workloads and are not an aggregated-vs-disaggregated comparison. Pick a target, then use the commands that reference its recipe directory (`trtllm/agg` or `trtllm/disagg`).

**Aggregated (recommended)**

<b>Checkpoint</b> openai/gpt-oss-120b

<b>GPUs</b> 4x GB200 (ARM64), TP4, EP4 + attention-DP

<b>Workload</b> Short prompts, long outputs, high concurrency (128 ISL / 1000 OSL)

<b>Runtime</b> TensorRT-LLM (tensorrtllm-runtime:1.2.1)

**Disaggregated P/D**

<b>Checkpoint</b> openai/gpt-oss-120b

<b>GPUs</b> 5x GB200/B200 (TP1 prefill + TP4 decode)

<b>Workload</b> Long-context generation (8K ISL / 1K OSL)

<b>Quantization</b> W4A8\_MXFP4\_MXFP8

<b>Runtime</b> TensorRT-LLM (tensorrtllm-runtime:1.2.1)

## Prerequisites

Aggregated target:

* A Kubernetes cluster with the Dynamo platform installed and **4x GB200 available on ARM64 nodes**. The aggregated target will not run on x86 Hopper/Ampere hardware.

Disaggregated target:

* A Kubernetes cluster with the Dynamo platform installed and **5x GB200 or B200** available (1 prefill + 4 decode GPUs).

Both targets:

* A Hugging Face token with access to `openai/gpt-oss-120b`.
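The ARM64 GPU requirement can be checked before deploying. The sketch below is a minimal helper, not part of the recipe: `count_arm64_gpus` is a hypothetical name, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed. It sums allocatable GPUs, which reflects node capacity and not what is currently free, and it does not confirm that the GPUs are GB200.

```shell
#!/usr/bin/env bash
# Sum the allocatable GPU counts reported for arm64 nodes.
# Input: one line per node, "<architecture> <gpu-count>". A non-numeric
# count such as "<none>" is treated as zero by awk.
count_arm64_gpus() {
  awk '$1 == "arm64" { total += $2 } END { print total + 0 }'
}

# Against a live cluster (needs kubectl access), feed it the node list:
#   kubectl get nodes --no-headers \
#     -o custom-columns='ARCH:.status.nodeInfo.architecture,GPUS:.status.allocatable.nvidia\.com/gpu' \
#     | count_arm64_gpus
```

A result of 4 or more suggests the cluster has enough ARM64 GPU capacity for the aggregated target, subject to the caveats above.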

Create the namespace and token secret:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}
```

Before deploying, set `storageClassName` in `model-cache/model-cache.yaml` to a storage class available in your cluster, and update the container image tag in `deploy.yaml` to match your Dynamo release. Also adjust the namespace, node selectors, and any other cluster-specific placement settings.

## Deploy

Prepare the model cache (shared by both targets):

```bash
kubectl apply -f recipes/gpt-oss-120b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
```
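`kubectl wait --for=condition=Complete` keeps waiting until the timeout if the download Job fails. As a hedged alternative, the sketch below polls the Job's conditions and returns early on either outcome. `wait_for_job` is a hypothetical helper, and the polling interval is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Poll a Job until it reports Complete or Failed, or until the timeout expires.
# Returns 0 on completion, 1 on failure, 2 on timeout.
wait_for_job() {
  local job=$1 namespace=$2 timeout=${3:-3600} interval=${4:-10}
  local deadline=$(( $(date +%s) + timeout ))
  local conditions
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # List the types of all conditions whose status is "True".
    conditions=$(kubectl get job "$job" -n "$namespace" \
      -o jsonpath='{.status.conditions[?(@.status=="True")].type}')
    case "$conditions" in
      *Complete*) echo "job/$job completed"; return 0 ;;
      *Failed*)   echo "job/$job failed" >&2; return 1 ;;
    esac
    sleep "$interval"
  done
  echo "timed out waiting for job/$job" >&2
  return 2
}

# Usage (assumes the Job name used above):
#   wait_for_job model-download "${NAMESPACE}" 3600
```

If the helper reports a failure, `kubectl logs job/model-download -n ${NAMESPACE}` is a reasonable first place to look.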

Then deploy the target you chose. For the aggregated target:

```bash
kubectl apply -f recipes/gpt-oss-120b/trtllm/agg/deploy.yaml -n ${NAMESPACE}
```

For the disaggregated target, apply the manifest and watch the pods come up. Model loading takes roughly 15-30 minutes depending on storage speed:

```bash
kubectl apply -f recipes/gpt-oss-120b/trtllm/disagg/deploy.yaml -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -l nvidia.com/dynamo-graph-deployment-name=gpt-oss-disagg -w
```
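For scripted deployments, blocking until the pods are Ready can replace watching them by eye. The sketch below assumes the `nvidia.com/dynamo-graph-deployment-name` label from the watch command above. `wait_for_graph_ready` is a hypothetical helper, and the 1800s default is taken from the upper end of the 15-30 minute loading estimate, so slow storage or image pulls may need more.

```shell
#!/usr/bin/env bash
# Block until every pod carrying the graph-deployment label reports Ready.
# Note: kubectl wait exits with an error if no pods match the selector yet,
# so run this only after the pods have been created.
wait_for_graph_ready() {
  local deployment=$1 namespace=$2 timeout=${3:-1800s}
  kubectl wait pod \
    --for=condition=Ready \
    -l "nvidia.com/dynamo-graph-deployment-name=${deployment}" \
    -n "$namespace" \
    --timeout="$timeout"
}

# Usage for the disaggregated target:
#   wait_for_graph_ready gpt-oss-disagg "${NAMESPACE}"
```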

## Smoke Test

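Forward the frontend service for your target to local port 8000. For the aggregated target:

```bash
kubectl port-forward svc/gpt-oss-agg-frontend 8000:8000 -n ${NAMESPACE}
```

For the disaggregated target:

```bash
kubectl port-forward svc/gpt-oss-disagg-frontend 8000:8000 -n ${NAMESPACE}
```

`kubectl port-forward` runs in the foreground, so send a test request from a second terminal to verify that the deployment serves traffic: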

```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
```
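To use the same request as a pass/fail check in a script, the HTTP status can be captured separately from the response body. This is a minimal sketch under the assumption that a healthy deployment answers with HTTP 200. `smoke_test` is a hypothetical helper, and the default URL matches the port-forward above.

```shell
#!/usr/bin/env bash
# Send the smoke-test request and succeed only on HTTP 200.
# The response body is written to a temp file and printed on failure.
smoke_test() {
  local base_url=${1:-http://localhost:8000}
  local body status
  body=$(mktemp)
  # -o writes the body to the temp file; -w prints only the status code.
  status=$(curl -sS -o "$body" -w '%{http_code}' \
    "$base_url/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}')
  if [ "$status" = "200" ]; then
    echo "OK: HTTP $status"
    rm -f "$body"
    return 0
  fi
  echo "FAILED: HTTP $status" >&2
  cat "$body" >&2
  rm -f "$body"
  return 1
}

# Usage, with the port-forward running in another terminal:
#   smoke_test http://localhost:8000
```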

## Benchmark

Each target ships a `perf.yaml` Kubernetes Job that waits for the model to come up, then runs AIPerf with the target's traffic shape and a request count of 10x total concurrency.

Aggregated traffic shape: ISL 128 / OSL 1000 at 900 concurrent requests per GPU x 4 GPUs = 3,600 total concurrency. The Job wraps this AIPerf run:

```bash
aiperf profile \
  --model openai/gpt-oss-120b \
  --endpoint-type chat --endpoint /v1/chat/completions --streaming \
  --url http://gpt-oss-agg-frontend:8000 \
  --synthetic-input-tokens-mean 128 --output-tokens-mean 1000 \
  --extra-inputs ignore_eos:true \
  --concurrency 3600 --request-count 36000
```

Launch the aggregated benchmark Job and follow its logs:

```bash
kubectl apply -f recipes/gpt-oss-120b/trtllm/agg/perf.yaml -n ${NAMESPACE}
kubectl logs -f -l job-name=gpt-oss-120b-bench -n ${NAMESPACE}
```

Disaggregated traffic shape: ISL 8192 / OSL 1024 at 1,536 total concurrency. The Job wraps this AIPerf run:

```bash
aiperf profile \
  --model openai/gpt-oss-120b \
  --endpoint-type chat --endpoint /v1/chat/completions --streaming \
  --url http://gpt-oss-disagg-frontend:8000 \
  --synthetic-input-tokens-mean 8192 --output-tokens-mean 1024 \
  --extra-inputs ignore_eos:true \
  --concurrency 1536 --request-count 15360
```

Launch the disaggregated benchmark Job and follow its logs:

```bash
kubectl apply -f recipes/gpt-oss-120b/trtllm/disagg/perf.yaml -n ${NAMESPACE}
kubectl logs -f -l job-name=gpt-oss-120b-disagg-bench -n ${NAMESPACE}
```
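The `--concurrency` and `--request-count` values in both AIPerf commands follow the same arithmetic: per-GPU concurrency times GPU count, then ten requests per unit of concurrency. The sketch below reproduces it. `bench_totals` is a hypothetical helper, and the 256-per-GPU figure for the disaggregated target comes from the Notes section. The last call illustrates strict 5-GPU normalization and is not a value the recipe ships.

```shell
#!/usr/bin/env bash
# Print "<total concurrency> <request count>" for a per-GPU concurrency
# and a GPU count, using the 10x request-count rule described above.
bench_totals() {
  local per_gpu=$1 gpu_count=$2
  local concurrency=$(( per_gpu * gpu_count ))
  echo "$concurrency $(( concurrency * 10 ))"
}

bench_totals 900 4   # aggregated: prints "3600 36000"
bench_totals 256 6   # disaggregated, perf.yaml's 6-GPU count: prints "1536 15360"
bench_totals 256 5   # disaggregated, normalized to the 5 deployed GPUs: prints "1280 12800"
```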

## Compare All Targets

|                  | Aggregated                            | Disaggregated P/D                  |
| ---------------- | ------------------------------------- | ---------------------------------- |
| **GPUs**         | 4x GB200 (ARM64 required)             | 5x GB200/B200                      |
| **Topology**     | TP4, EP4 + attention-DP               | TP1 prefill + TP4 decode           |
| **Workload**     | 128 ISL / 1000 OSL, 3,600 concurrency | 8K ISL / 1K OSL, 1,536 concurrency |
| **Quantization** | Checkpoint default                    | W4A8\_MXFP4\_MXFP8                 |
| **KV transfer**  | —                                     | UCX cache transceiver              |

## Notes

* The aggregated target requires ARM64 (GB200) nodes; the disaggregated target accepts GB200 or B200.
* Do not read the two targets as an aggregated-vs-disaggregated benchmark; their traffic shapes differ by design.
* The disaggregated deployment uses 5 GPUs (1x TP1 prefill + 1x TP4 decode), while its `perf.yaml` computes total concurrency from a 6-GPU count (256 x 6 = 1,536); adjust `DEPLOYMENT_GPU_COUNT` if you want strict per-GPU normalization.
* Disaggregated engine configs differ per role: prefill runs TP1 with `max_batch_size=64` and the overlap scheduler disabled; decode runs TP4 with `max_batch_size=1280` and the overlap scheduler enabled. KV transfer uses the UCX-based cache transceiver (`max_tokens_in_buffer=9216`).
* The disaggregated target uses `W4A8_MXFP4_MXFP8` quantization via the `OVERRIDE_QUANT_ALGO` environment variable.

## Source

* Source README: [recipes/gpt-oss-120b/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/gpt-oss-120b/README.md)
* Disaggregated README: [recipes/gpt-oss-120b/trtllm/disagg/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/gpt-oss-120b/trtllm/disagg/README.md)
* Aggregated: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/gpt-oss-120b/trtllm/agg/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/gpt-oss-120b/trtllm/agg/perf.yaml)
* Disaggregated: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/gpt-oss-120b/trtllm/disagg/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/gpt-oss-120b/trtllm/disagg/perf.yaml)