Running gpt-oss-120b Disaggregated with SGLang

The gpt-oss-120b guide for SGLang is largely identical to the guide for vLLM. Please use the vLLM guide as a reference, substituting the SGLang-specific deployment steps highlighted below:

Launch the Deployment

Note that GPT-OSS is a reasoning model with tool calling support. To ensure responses are parsed correctly, each worker should be launched with the appropriate --dyn-reasoning-parser and --dyn-tool-call-parser flags.

Start the frontend:

```bash
python3 -m dynamo.frontend --http-port 8000 &
```

Run the decode worker:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m dynamo.sglang \
  --model-path openai/gpt-oss-120b \
  --served-model-name openai/gpt-oss-120b \
  --tp 4 \
  --trust-remote-code \
  --skip-tokenizer-init \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --dyn-reasoning-parser gpt_oss \
  --dyn-tool-call-parser harmony
```

Run the prefill worker:

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.sglang \
  --model-path openai/gpt-oss-120b \
  --served-model-name openai/gpt-oss-120b \
  --tp 4 \
  --trust-remote-code \
  --skip-tokenizer-init \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --dyn-reasoning-parser gpt_oss \
  --dyn-tool-call-parser harmony
```
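Once the frontend and both workers are up, you can exercise the reasoning and tool-call parsers with an OpenAI-style chat completions request. The sketch below only builds the request payload; the get_weather tool schema is hypothetical, and the /v1/chat/completions endpoint path assumes the frontend's standard OpenAI-compatible API on the port configured above.

```python
# Build a chat completions payload that includes a tool definition, so the
# response exercises both the gpt_oss reasoning parser and the harmony
# tool-call parser. The get_weather tool is illustrative only.
import json

payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [
        {"role": "user", "content": "What is the weather in Santa Clara?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "max_tokens": 256,
    "stream": False,
}

# Print the JSON body; POST it to the frontend with e.g.
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @payload.json
print(json.dumps(payload))
```

A tool-calling response should contain a tool_calls entry naming get_weather with a JSON-encoded city argument, confirming the harmony parser is active.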