Get Started with GPT-OSS-120b-Turbo#

This container provides optimized serving profiles for RAG and agentic workloads on B200 and H200 GPUs.

Prerequisites#

Before deploying a NIM LLM container, ensure your environment meets the following requirements:

Hardware Requirements#

This NIM offers optimized performance through a custom vLLM backend for a limited set of GPUs.

GPUs built on the ARM architecture are not supported.

The GPU Memory and Disk Space values are in GB.

GPT-OSS-120b-Turbo Support Matrix#
GPU	GPU Memory (GB)	Precision	Number of GPUs	Disk Space (GB)
B200	192	MXFP4	1	99
B200	192	MXFP4	2	99
H200	141	MXFP4	4	107
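As a rough pre-flight check, the matrix above can be encoded in a small shell helper. This is only a sketch: the function name `nim_supported_config` is hypothetical, and the substring match assumes that `nvidia-smi` reports GPU names containing `B200` or `H200`.

```shell
# Hypothetical helper: succeeds when the GPU model ($1) and the number of
# GPUs you plan to assign ($2) match a row of the support matrix
# (B200 x1, B200 x2, H200 x4).
nim_supported_config() {
  local name="$1" count="$2"
  case "$name" in
    *B200*) [ "$count" -eq 1 ] || [ "$count" -eq 2 ] ;;
    *H200*) [ "$count" -eq 4 ] ;;
    *)      return 1 ;;
  esac
}

# Example (requires the NVIDIA driver on the host):
#   name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)
#   nim_supported_config "$name" 2 && echo "in the matrix" || echo "not in the matrix"
```

The check only covers GPU model and count; it does not verify GPU memory or disk space.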

Software Requirements#

The following table lists the minimum required versions of the supported software components.

Requirement	Specification
Operating System	Ubuntu 22.04 LTS or later recommended
Container Toolkit	1.14.0 or later
CUDA SDK	12.9 or later
GPU Driver	580 or later
Docker	24.0 or later
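To compare installed versions against these minimums, a version-aware comparison is handy. The sketch below assumes GNU `sort` with the `-V` option (available on Ubuntu); the helper name `version_ge` is hypothetical.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to
# version $2. Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Examples (each requires the corresponding tool on the host):
#   version_ge "$(docker version --format '{{.Server.Version}}')" 24.0 \
#     || echo "Docker is older than 24.0"
#   version_ge "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" 580 \
#     || echo "GPU driver is older than 580"
```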

Operating System#

While other Linux distributions can be compatible with NIM, they have not been officially validated.

We recommend using Ubuntu 22.04 LTS or later for the best experience.

CUDA SDK#

Install CUDA SDK by following the CUDA installation guide for Linux.

GPU Drivers#

Install the NVIDIA GPU drivers by following the NVIDIA Driver Installation Guide.

Docker#

Docker is required to run the containerized NIM services.

Install Docker Engine for your Linux distribution by following the Docker Engine installation guide.
Verify that the Docker daemon is running and that your user can execute docker commands without sudo. Add your user to the docker group if needed:
```
sudo groupadd docker
sudo usermod -aG docker $USER
```
Log out and back in for the group change to take effect.

Container Toolkit#

The NVIDIA Container Toolkit enables Docker containers to access the host GPU.

Install the toolkit by following the NVIDIA Container Toolkit installation guide.
Configure Docker to use the NVIDIA runtime by following the Docker configuration steps.
Restart the Docker daemon after configuration:
```
sudo systemctl restart docker
```

NIM Container Access#

To download and deploy NIM containers, you need one of the following:

A free NVIDIA Developer Program membership.
An NVIDIA AI Enterprise license. To request a free 90-day evaluation license, refer to Ways to Get Started With NVIDIA AI Enterprise and Activate Your NVIDIA AI Enterprise License.

Generate Access Credentials#

An NGC Personal API Key is required to access NVIDIA NIM containers and models hosted on NGC.

Generate the Personal API Key on the Setup API Keys page.
When creating the Personal API key, select at least NGC Catalog from the Services Included list. You can also include additional services if you want to use the same key for other purposes.

Warning

Legacy API keys are not supported by NIM. Always use a Personal API Key.

Verify NVIDIA Runtime Access#

To ensure that your setup is correct, run the following command:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This command should produce output similar to the following, where you can confirm the driver version, the CUDA version, and the available GPUs.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:00:07.0 Off |                    0 |
| N/A   36C    P0            112W /  700W |   0MiB /  81559MiB     |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Configuration#

Use environment variables to control authentication and model caching.

Export NGC API Key#

To use your API key when starting the NIM container, you must make it available as an environment variable.

Export the variable in your shell (temporary), replacing YOUR_API_KEY with your actual API key:
```
export NGC_API_KEY="YOUR_API_KEY"
```

Persist the variable (optional):

If using bash:

echo "export NGC_API_KEY=$NGC_API_KEY" >> ~/.bashrc

If using zsh:

echo "export NGC_API_KEY=$NGC_API_KEY" >> ~/.zshrc

Verify the variable is set:
```
echo "$NGC_API_KEY"
```
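Echoing the key leaves it in the terminal scrollback. As an alternative, the following sketch reports whether the variable is set without printing its value. The helper name `check_ngc_key` is hypothetical, and the `nvapi-` prefix check is an assumption about the Personal API Key format, so it only produces a warning.

```shell
# Hypothetical helper: reports whether NGC_API_KEY is set without printing it.
check_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set"
    return 1
  fi
  case "$NGC_API_KEY" in
    # Assumption: Personal API Keys start with "nvapi-".
    nvapi-*) echo "NGC_API_KEY is set (${#NGC_API_KEY} characters)" ;;
    *)       echo "NGC_API_KEY is set, but does not start with nvapi-" ;;
  esac
}

# Example:
#   check_ngc_key || echo "Export NGC_API_KEY before starting the container"
```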

Model Cache#

NIM downloads model weights to a cache on the host that you mount into the container. Artifacts persist across restarts, so you do not pull the full model on every run.

Local Cache#

The cache path is the most important host setting to configure. Docker maps this host directory into the container, so assets such as model weights are downloaded to the host and persist across container restarts. Configuring a local cache is strongly recommended because it avoids re-downloading large model files on later runs. The environment variable that holds the path can have any name; this guide uses LOCAL_NIM_CACHE.

Create the cache directory and export an environment variable:

export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Optionally make the cache world-writable (with the sticky bit set) to avoid write failures when the container runs as a different user
chmod -R a+rwxt "$LOCAL_NIM_CACHE"

When you start the NIM container, you must map your host machine’s local cache directory ($LOCAL_NIM_CACHE) to the container’s internal cache path (/opt/nim/.cache) using a Docker volume mount, such as -v "$LOCAL_NIM_CACHE:/opt/nim/.cache". This mapping ensures that the large model weights downloaded by the container are saved to your host machine. Because containers are ephemeral, any data stored only inside the container is lost when it stops. By using a volume mount, subsequent container runs detect the existing model files in your local cache and skip the lengthy download process, allowing the NIM to start up faster.
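Because the weights land on the host filesystem, it can help to confirm that the cache location has enough free space before the first run. The following is a minimal sketch: the helper name `has_free_gb` is hypothetical, and it treats 1 GB as 1024 x 1024 KiB, which is slightly conservative.

```shell
# Hypothetical helper: succeeds when the filesystem holding directory $1
# has at least $2 GB available. `df -Pk` prints one line per filesystem in
# 1024-byte blocks; column 4 is the available space.
has_free_gb() {
  local dir="$1" needed_gb="$2" avail_kb
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( needed_gb * 1024 * 1024 )) ]
}

# Example, using the largest Disk Space value from the support matrix:
#   has_free_gb "$LOCAL_NIM_CACHE" 107 || echo "Not enough free space for the model cache"
```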

Cache Directory Permissions#

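The NIM container runs as a non-root user with GID 0 (root group), so the cache directory on your host must be writable by GID 0. The following example uses a separate variable, NIM_CACHE_PATH, for illustration; if you already created LOCAL_NIM_CACHE, apply the same commands to that directory: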

export NIM_CACHE_PATH=/tmp/nim-cache
mkdir -p "$NIM_CACHE_PATH"
sudo chgrp -R 0 "$NIM_CACHE_PATH"
sudo chmod -R g+rwX "$NIM_CACHE_PATH"

Run the container with the cache mounted:

docker run --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  ...

To run as a custom user (e.g., your host user), pass -u <uid>:0:

docker run --gpus all -u $(id -u):0 \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  ...

Important

When using -u <uid>, you must include :0 to set GID 0 (e.g., -u $(id -u):0). The container’s writable directories are group-owned by GID 0. Without it, the container will fail with PermissionError when writing to cache, config, or log paths.
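To confirm that a cache directory satisfies this requirement before starting the container, you can inspect its group and mode. The sketch below is an assumption-laden convenience, not part of NIM: the helper name `group0_writable` is hypothetical, and it expects the numeric GID and octal mode in the form printed by GNU `stat -c '%g %a'`.

```shell
# Hypothetical helper: given a numeric GID ($1) and an octal mode ($2),
# succeeds when the directory is group-owned by GID 0 and the group has
# read, write, and execute permission (group digit 7).
group0_writable() {
  local gid="$1" mode="$2" group_digit
  group_digit="${mode: -2:1}"   # second-to-last octal digit is the group digit
  [ "$gid" -eq 0 ] && [ "$group_digit" -eq 7 ]
}

# Example (GNU stat; replace /path/to/cache with your cache directory):
#   group0_writable $(stat -c '%g %a' /path/to/cache) || echo "Cache is not writable by GID 0"
```

The helper checks only the top-level directory; files created earlier by another user inside the cache can still need the recursive `chgrp` and `chmod` commands.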

Tip

To make this setting permanent across terminal sessions, you can add export LOCAL_NIM_CACHE=~/.cache/nim to your ~/.bashrc or ~/.zshrc profile.

Installation#

Before running a NIM container, you must authenticate with your deployment source, accept the governing terms, and pull the container image.

Docker Login#

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and the value of NGC_API_KEY as the password. The literal $oauthtoken username is a special name that indicates that you are authenticating with an API key rather than a user name and password, which is why the command wraps it in single quotes.

Accept the Governing Terms#

Before you download a given NIM for the first time, you must accept the governing terms in the browser. Navigate to the NGC Catalog page for the NIM and click the Accept Terms button: GPT-OSS-120b-Turbo on NGC

Pull the Container Image#

After you generate your API key and authenticate with your deployment source, you can download the NIM container image to your host machine.

Export the container image as a shell variable so that you can reuse it in later docker commands:

export NIM_IMAGE=nvcr.io/nim/openai/gpt-oss-120b-turbo:1.0.0

Use the docker pull command to fetch the NIM container image.

docker pull $NIM_IMAGE

Run NIM#

Run the container using your NGC API Key to authenticate and download the model.

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  $NIM_IMAGE

Recommended Runtime Settings#

This NIM provides four recommended serving profiles, one for each combination of GPU (B200 or H200) and workload (RAG or Agentic). Choose the profile that matches your GPU and workload. The workloads are characterized as follows, with all lengths in tokens:

RAG: per-request input sequence length 2,400, output sequence length 1,000, and shared system prompt length 5,600
Agentic: per-request input sequence length 6,400, output sequence length 400, and shared system prompt length 57,600

GPT-OSS-120b-Turbo Runtime Profiles#
GPU	Use Case	GPUs per Instance	Parallelism	Max Num Seqs	Max Batched Tokens	KV Cache Dtype
B200	RAG	1	TP=1	512	32,768	FP8
B200	Agentic	2	TP=2	72	49,152	FP8
H200	RAG	4	TP=4	512	65,536	Auto
H200	Agentic	4	TP=4	72	65,536	Auto

Set the environment variables and NIM_PASSTHROUGH_ARGS for one of the following profiles before starting the container. For more information about how NIM_PASSTHROUGH_ARGS is processed, refer to Advanced Configuration on the NIM documentation site.

B200 RAG Profile#

Set the following environment variables and arguments for this profile:

export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export NIM_PASSTHROUGH_ARGS="\
  --tensor-parallel-size 1 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-cudagraph-capture-size 1024 \
  --kv-cache-dtype fp8 \
  --async-scheduling \
  --stream-interval 20 \
  --compilation-config '{\"pass_config\":{\"fuse_allreduce_rms\":true,\"eliminate_noops\":true}}'"

B200 Agentic Profile#

Set the following environment variables and arguments for this profile:

export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export NIM_PASSTHROUGH_ARGS="\
  --tensor-parallel-size 2 \
  --max-num-seqs 72 \
  --max-num-batched-tokens 49152 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-cudagraph-capture-size 144 \
  --kv-cache-dtype fp8 \
  --stream-interval 20 \
  --compilation-config '{\"pass_config\":{\"fuse_allreduce_rms\":true,\"eliminate_noops\":true}}'"

H200 RAG Profile#

Set the following environment variables and arguments for this profile:

export VLLM_MXFP4_USE_MARLIN=1
export NIM_PASSTHROUGH_ARGS="\
  --served-model-name openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --kv-cache-dtype auto \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 65536 \
  --max-num-seqs 512 \
  --max-cudagraph-capture-size 1024 \
  --stream-interval 20 \
  --async-scheduling"

H200 Agentic Profile#

Set the following environment variables and arguments for this profile:

export VLLM_MXFP4_USE_MARLIN=1
export NIM_PASSTHROUGH_ARGS="\
  --served-model-name openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --kv-cache-dtype auto \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 65536 \
  --max-num-seqs 72 \
  --max-cudagraph-capture-size 144 \
  --stream-interval 20 \
  --async-scheduling"

Note

The H200 RAG and H200 Agentic configurations use the same profile, vllm-nvidia-h200-mxfp4-tp4-pp1. They differ only in the NIM_PASSTHROUGH_ARGS values.

For H200 deployments, if startup fails during NCCL network plugin initialization, set NCCL_IB_DISABLE=1 before launching the container.
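In the profiles above, the tensor-parallel size always equals the number of GPUs per instance, so it is worth checking the exported arguments before launch. The following sketch extracts a flag's value from NIM_PASSTHROUGH_ARGS. The helper name `passthrough_value` is hypothetical, and it is meant only for flags that take a value.

```shell
# Hypothetical helper: prints the word that follows flag $1 in the argument
# string $2, or fails if the flag is absent.
passthrough_value() {
  local flag="$1" args="$2" prev="" word
  # $args is intentionally unquoted so that it splits on whitespace.
  for word in $args; do
    if [ "$prev" = "$flag" ]; then
      printf '%s\n' "$word"
      return 0
    fi
    prev="$word"
  done
  return 1
}

# Example:
#   tp=$(passthrough_value --tensor-parallel-size "$NIM_PASSTHROUGH_ARGS")
#   echo "Expose $tp GPU(s) to the container"
```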

Example: Pass Settings to Docker Run#

After you export the variables for the selected profile, pass them to the container. The following example uses the B200 RAG profile on one GPU.

The --gpus flag selects the devices, --ipc=host and --shm-size=64g provide shared memory for large batches, and the --ulimit flags allow locked memory and a larger stack.

docker run --rm \
  --gpus '"device=0"' \
  --ipc=host --shm-size=64g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e VLLM_USE_FLASHINFER \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 \
  -e NIM_PASSTHROUGH_ARGS \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  $NIM_IMAGE

For B200 Agentic, use two GPUs, for example --gpus '"device=0,1"'. For H200 profiles, use four GPUs, for example --gpus '"device=0,1,2,3"', and pass the H200-specific environment variables shown previously.
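The device lists above follow a simple pattern, so the `--gpus` value can be generated from the GPU count. This is a sketch with a hypothetical helper name, `gpus_flag`, and it assumes you want devices 0 through N-1.

```shell
# Hypothetical helper: prints a --gpus value that selects devices 0..N-1,
# including the literal double quotes that surround the device list.
gpus_flag() {
  local n="$1" ids="0" i
  for (( i = 1; i < n; i++ )); do
    ids="$ids,$i"
  done
  printf '"device=%s"\n' "$ids"
}

# Example: equivalent to --gpus '"device=0,1"'
#   docker run --rm --gpus "$(gpus_flag 2)" ...
```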

Interact with the API#

There are two main inference endpoints:

Chat Completions: /v1/chat/completions
Text Completions: /v1/completions

Tip

Both endpoints support streaming.

Verify Health Endpoints#

You can verify that the NIM container is running and ready to accept requests by checking its health endpoints. By default, these endpoints are served on port 8000.

Live Endpoint#

Perform a liveness check to determine whether the server is running:

curl -v http://localhost:8000/v1/health/live

Example response:

GET /v1/health/live HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 61
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "live",
  "status": "live"
}

Ready Endpoint#

Perform a readiness check to determine whether the model is fully loaded and ready for inference:

curl -v http://localhost:8000/v1/health/ready

Example response:

GET /v1/health/ready HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 63
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "ready",
  "status": "ready"
}
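On first start the container downloads the model weights, so the ready endpoint can take a while to respond successfully. The following sketch polls it until it succeeds or a timeout expires. The helper name `wait_for_ready` and the default timeout and interval are hypothetical, and the sketch assumes the endpoint refuses the connection or returns an error status until the model is loaded.

```shell
# Hypothetical helper: polls the ready endpoint until it returns a success
# status, or gives up after the timeout.
#   $1: URL (default http://localhost:8000/v1/health/ready)
#   $2: timeout in seconds (default 1800, an arbitrary choice)
#   $3: polling interval in seconds (default 10)
wait_for_ready() {
  local url="${1:-http://localhost:8000/v1/health/ready}"
  local timeout="${2:-1800}" interval="${3:-10}"
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # --fail makes curl exit non-zero on HTTP error statuses.
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Example:
#   wait_for_ready && echo "NIM is ready" || echo "Timed out waiting for NIM"
```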

Send a Chat Completion Request#

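After the server reports ready, you can send a request to the chat completions endpoint. The model field must match the name under which the model is served; if you start the container with --served-model-name (as the H200 profiles do with openai/gpt-oss-120b), use that name in place of the one shown here: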

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Write a concise summary of retrieval augmented generation."
      }
    ],
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 1.0
  }'

Streaming#

To receive responses incrementally as they are generated, you can enable streaming by adding "stream": true to your request payload. This is supported on both inference endpoints.

When streaming is enabled, the API returns a sequence of Server-Sent Events (SSE). The stream terminates with a data: [DONE] message.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b-turbo",
    "messages": [
      {
        "role": "user",
        "content": "List three ways agents use tools."
      }
    ],
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 1.0,
    "stream": true
  }'
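To post-process the stream in a script, you can strip the SSE framing and stop at the terminator. The following is a minimal sketch that assumes each event arrives on a single `data: ` line, as described above; the helper name `sse_payloads` is hypothetical.

```shell
# Hypothetical helper: reads an SSE stream on stdin, prints the JSON payload
# of each "data: " line, and stops at the "data: [DONE]" terminator.
sse_payloads() {
  local line
  while IFS= read -r line; do
    line="${line%$'\r'}"          # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*)      printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Example: add -sN to the streaming curl command so that curl stays quiet
# and does not buffer output, then pipe it through the helper:
#   curl -sN http://localhost:8000/v1/chat/completions ... | sse_payloads
```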

Function (Tool) Calling#

To enable tool calling, refer to the NIM documentation site.

Performance Benchmarks#

For more information on performance benchmarks, refer to the Performance Explorer.