Additional ResourcesTensorRT-LLM Details
GPT-OSS
GPT-OSS
GPT-OSS
For general TensorRT-LLM features and configuration, see the Reference Guide.
Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
This deployment uses disaggregated serving in TensorRT-LLM where:
The disaggregated approach optimizes for both low-latency (maximizing tokens per second per user) and high-throughput (maximizing total tokens per GPU per second) use cases by separating the compute-intensive prefill phase from the memory-bound decode phase.
Ensure that the etcd and nats services are running with the following command:
Set the container image:
Launch the Dynamo TensorRT-LLM container with the necessary configurations:
This command:
--rm)--ipc=host)-it)/model/workspace/dynamoThe deployment uses configuration files and command-line arguments to control behavior:
Prefill Configuration (examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml):
enable_attention_dp: false - Attention data parallelism disabled for prefillenable_chunked_prefill: true - Enables efficient chunked prefill processingmoe_config.backend: CUTLASS - Uses optimized CUTLASS kernels for MoE layerscache_transceiver_config.backend: ucx - Uses UCX for efficient KV cache transfercuda_graph_config.max_batch_size: 32 - Maximum batch size for CUDA graphsDecode Configuration (examples/backends/trtllm/engine_configs/gpt-oss-120b/decode.yaml):
enable_attention_dp: true - Attention data parallelism enabled for decodedisable_overlap_scheduler: false - Enables overlapping for decode efficiencymoe_config.backend: CUTLASS - Uses optimized CUTLASS kernels for MoE layerscache_transceiver_config.backend: ucx - Uses UCX for efficient KV cache transfercuda_graph_config.max_batch_size: 128 - Maximum batch size for CUDA graphsBoth workers receive these key arguments:
--tensor-parallel-size 4 - Uses 4 GPUs for tensor parallelism--expert-parallel-size 4 - Expert parallelism across 4 GPUs--free-gpu-memory-fraction 0.9 - Allocates 90% of GPU memoryPrefill-specific arguments:
--max-num-tokens 20000 - Maximum tokens for prefill processing--max-batch-size 32 - Maximum batch size for prefillDecode-specific arguments:
--max-num-tokens 16384 - Maximum tokens for decode processing--max-batch-size 128 - Maximum batch size for decodeNote that GPT-OSS is a reasoning model with tool calling support. To ensure the response is being processed correctly, the worker should be launched with proper --dyn-reasoning-parser and --dyn-tool-call-parser.
You can use the provided launch script or run the components manually:
Poll the /health endpoint to verify that both the prefill and decode worker endpoints have started:
Make sure that both of the endpoints are available before sending an inference request:
If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
Send a test request to verify the deployment:
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like max_tokens, temperature, and others according to your needs.
Dynamo has supported reasoning and tool calling in OpenAI Chat Completion endpoint. A typical workflow for application built on top of Dynamo is that the application has a set of tools to aid the assistant provide accurate answer, and it is usually multi-turn as it involves tool selection and generation based on the tool result.
In addition, the reasoning effort can be configured through chat_template_args. Increasing the reasoning effort makes the model more accurate but also slower. It supports three levels: low, medium, and high.
Below is an example of sending multi-round requests to complete a user query with reasoning and tool calling: Application setup (pseudocode)
First request with tools
First response with tool choice
Second request with tool calling result
Second response with final message
The Dynamo container includes AIPerf, NVIDIA’s tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment.
Run the following benchmark from inside the container (after completing the deployment steps above):
This command:
/tmp/benchmark-results for analysisKey parameters you can adjust:
--concurrency: Number of simultaneous requests (impacts GPU utilization)--synthetic-input-tokens-mean: Average input length (tests prefill capacity)--output-tokens-mean: Average output length (tests decode throughput)--request-count: Total number of requests for the benchmarkIf you prefer to run benchmarks from outside the container:
The disaggregated architecture separates prefill and decode phases:
CUDA Out-of-Memory Errors
--max-num-tokens in the launch commands (currently 20000 for prefill, 16384 for decode)--free-gpu-memory-fraction from 0.9 to 0.8 or 0.7Workers Not Connecting
docker ps | grep -E "(etcd|nats)"Performance Issues
nvidia-smi while the deployment is runningContainer Startup Issues
Token Repetition / Generation Won’t Stop
reasoning_effort: high, the model may produce repeated tokens and fail to stoptop_p=1 in your request. These are the recommended sampling parameters from OpenAI