Get Started with Nemotron 3 Ultra 550B-A55B#

Nemotron 3 Ultra 550B-A55B, the largest of the Nemotron 3 models, provides state-of-the-art accuracy and reasoning performance. This text-only, reasoning-capable model has 550B total parameters and up to 55B active parameters per token. It uses a hybrid Mamba-Transformer (Nemotron-H) mixture-of-experts architecture. The NIM exposes OpenAI-compatible and Anthropic-compatible APIs, so existing clients can use the same container for Chat Completions, Responses, Anthropic Messages, tool calling, and agentic workflows.

This page is the model-specific Day-0 guide for Nemotron 3 Ultra 550B-A55B. For the generic NIM LLM onboarding path, refer to Quickstart. For full endpoint examples, refer to API Reference. For tool calling and MCP integration, refer to Tool Calling and MCP Integration.

Prerequisites#

Before deploying a NIM LLM container, ensure your environment meets the following requirements:

Hardware Requirements#

The following are the minimum required specifications for supported hardware components:

Requirement	Specification
CPU	AMD64, ARM64
GPU	Refer to Support Matrix for NIM Day 0

Software Requirements#

The following are the minimum required versions for supported software components:

Requirement	Specification
Operating System	Ubuntu 22.04 LTS or later recommended
Container Toolkit	1.14.0 or later
CUDA SDK	12.9 or later
GPU Driver	580 or later
Docker	24.0 or later

Operating System#

While other Linux distributions can be compatible with NIM, they have not been officially validated.

We recommend using Ubuntu 22.04 LTS or later for the best experience.

CUDA SDK#

Install CUDA SDK by following the CUDA installation guide for Linux.

GPU Drivers#

Install the NVIDIA GPU drivers by following the NVIDIA Driver Installation Guide.

Docker#

Docker is required to run the containerized NIM services.

Install Docker Engine for your Linux distribution by following the Docker Engine installation guide.
Verify that the Docker daemon is running and that your user can execute docker commands without sudo. Add your user to the docker group if needed:
```
sudo groupadd docker
sudo usermod -aG docker $USER
```
Log out and back in for the group change to take effect.

Container Toolkit#

The NVIDIA Container Toolkit enables Docker containers to access the host GPU.

Install the toolkit by following the NVIDIA Container Toolkit installation guide.
Configure Docker to use the NVIDIA runtime by following the Docker configuration steps.
Restart the Docker daemon after configuration:
```
sudo systemctl restart docker
```

NIM Container Access#

To download and deploy NIM containers, you need one of the following:

A free NVIDIA Developer Program membership.
An NVIDIA AI Enterprise license. To request a free 90-day evaluation license, refer to Ways to Get Started With NVIDIA AI Enterprise and Activate Your NVIDIA AI Enterprise License.

Generate Access Credentials#

An NGC Personal API Key is required to access NVIDIA NIM containers and models hosted on NGC.

Generate the Personal API Key on the Setup API Keys page.
When creating the Personal API key, select at least NGC Catalog from the Services Included list. You can also include additional services if you want to use the same key for other purposes.

Warning

Legacy API keys are not supported by NIM. Always use a Personal API Key.

Verify NVIDIA Runtime Access#

To ensure that your setup is correct, run the following command:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This command should produce output similar to the following, where you can confirm the driver version, the CUDA version, and the available GPUs.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   36C    P0            112W /  700W |   78489MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
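The version check above can also be scripted. The following is a minimal sketch, not part of NIM, that parses the header line of the nvidia-smi output and compares it with the minimums from the Software Requirements table. The function names are hypothetical, and note that the CUDA version in this header is the version reported by the driver, not necessarily the installed CUDA SDK.

```python
import re

MIN_DRIVER_MAJOR = 580   # GPU Driver minimum from the Software Requirements table
MIN_CUDA = (12, 9)       # CUDA minimum from the Software Requirements table


def parse_smi_header(text):
    """Return (driver_major, (cuda_major, cuda_minor)) from nvidia-smi output."""
    driver = re.search(r"Driver Version:\s*(\d+)", text)
    cuda = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    if not driver or not cuda:
        raise ValueError("could not find driver or CUDA version in the output")
    return int(driver.group(1)), (int(cuda.group(1)), int(cuda.group(2)))


def meets_minimums(text):
    """True when the reported driver and CUDA versions satisfy the minimums."""
    driver_major, cuda = parse_smi_header(text)
    return driver_major >= MIN_DRIVER_MAJOR and cuda >= MIN_CUDA


# Header line copied from the sample output on this page.
sample = "| NVIDIA-SMI 580.95.05   Driver Version: 580.95.05   CUDA Version: 12.9 |"
print(meets_minimums(sample))
```

To check a live host, you could pass the captured output of the docker run command above to meets_minimums.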

Configuration#

Use environment variables to control authentication and model caching.

Export NGC API Key#

Export the variable in your shell (temporary), replacing <VALUE> with your actual API key:
```
export NGC_API_KEY=<VALUE>
```

Persist the variable (optional):

If using bash:

echo "export NGC_API_KEY=$NGC_API_KEY" >> ~/.bashrc

If using zsh:

echo "export NGC_API_KEY=$NGC_API_KEY" >> ~/.zshrc

Verify the variable is set:
```
echo "$NGC_API_KEY"
```

Model Cache#

NIM downloads model weights to a cache on the host that you mount into the container. Artifacts persist across restarts, so you do not pull the full model on every run.

Local Cache#

The most important setting to configure on your host system is the path to the cache directory. This directory is mapped from the host machine into the container. Assets such as model weights are downloaded to this host directory and persist across container restarts. Configuring a local cache is highly recommended because it avoids re-downloading large model files on subsequent container restarts. You can give the environment variable that holds the cache path any name you want.

Create the cache directory and export an environment variable:

export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Optionally make the cache world-writable with the sticky bit set, to avoid write failures if the container runs as a different user
chmod -R a+rwxt "$LOCAL_NIM_CACHE"

When you start the NIM container, you must map your host machine’s local cache directory ($LOCAL_NIM_CACHE) to the container’s internal cache path (/opt/nim/.cache) using a Docker volume mount, such as -v "$LOCAL_NIM_CACHE:/opt/nim/.cache". This mapping ensures that the large model weights downloaded by the container are saved to your host machine. Because containers are ephemeral, any data stored only inside the container is lost when it stops. By using a volume mount, subsequent container runs detect the existing model files in your local cache and skip the lengthy download process, allowing the NIM to start up faster.

Cache Directory Permissions#

The NIM container runs as a non-root user with GID 0 (root group). The cache directory on your host must be writable by GID 0:

export NIM_CACHE_PATH=/tmp/nim-cache
mkdir -p "$NIM_CACHE_PATH"
sudo chgrp -R 0 "$NIM_CACHE_PATH"
sudo chmod -R g+rwX "$NIM_CACHE_PATH"

Run the container with the cache mounted:

docker run --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  ...

To run as a custom user (e.g., your host user), pass -u <uid>:0:

docker run --gpus all -u $(id -u):0 \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  ...

Important

When using -u <uid>, you must include :0 to set GID 0 (e.g., -u $(id -u):0). The container’s writable directories are group-owned by GID 0. Without it, the container will fail with PermissionError when writing to cache, config, or log paths.
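The GID 0 requirement can be checked before starting the container. The following is a small sketch, not an official NIM tool, that tests whether a directory is group-owned by GID 0 with group read, write, and search bits set. The function names are hypothetical, and the check looks only at the top-level directory rather than recursing into it.

```python
import os
import stat


def writable_by_gid0(st_gid, st_mode):
    """True when the owner group is GID 0 and the group has rwx on the directory."""
    needed = stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP
    return st_gid == 0 and (st_mode & needed) == needed


def check_cache_dir(path):
    """Apply the check to a real directory on the host."""
    info = os.stat(path)
    return writable_by_gid0(info.st_gid, info.st_mode)


# Mode bits as they would appear after chgrp 0 and chmod g+rwX on a directory.
print(writable_by_gid0(0, 0o40775))
```

This check is stricter than necessary for a world-writable cache (for example, one prepared with chmod a+rwxt), which the container can write to regardless of group ownership.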

Tip

To make the cache path permanent across terminal sessions, add the export line for your cache variable (for example, export LOCAL_NIM_CACHE=~/.cache/nim) to your ~/.bashrc or ~/.zshrc profile.

Installation#

Before running a NIM LLM container, you must authenticate with your deployment source, accept the governing terms, and pull the container image.

Docker Login#

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use the literal string $oauthtoken as the username and the value of NGC_API_KEY as the password. The single quotes in the command keep the shell from expanding $oauthtoken, which is a special username indicating that you authenticate with an API key rather than a username and password.

Accept the Governing Terms#

Before you download a given NIM for the first time, you must accept the governing terms in the browser. Navigate to the NGC Catalog page for the NIM and click the Accept Terms button: Nemotron 3 Ultra 550B-A55B on NGC

Pull the Container Image#

After you have generated your API key and authenticated with your deployment source, you can download the NIM container image to your host machine.

Use the docker pull command to fetch the NIM container image.

docker pull nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant

Storage and Startup Notes#

Nemotron 3 Ultra 550B-A55B is a large model, so make sure the host has enough free disk space for the container image and the model cache. As a rough guide, the container image is approximately 38 GB, and the model cache ranges from approximately 330 GB for an NVFP4 profile to approximately 1.1–1.7 TB for a BF16 profile, depending on precision and GPU configuration. Reserve additional space if you download multiple profiles or keep older container images on the same host.
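To turn the rough sizes above into a pre-flight check, the following sketch compares free disk space with the approximate image and cache sizes quoted on this page. The 10 percent headroom, the use of the upper BF16 bound, and the function names are assumptions for illustration. If your Docker image store and your model cache live on different filesystems, check each one separately.

```python
import shutil

GB = 10**9
IMAGE_GB = 38                             # approximate container image size
CACHE_GB = {"nvfp4": 330, "bf16": 1700}   # approximate cache sizes (BF16 upper bound)


def required_bytes(precision, headroom_pct=10):
    """Image plus cache size for a precision, with an assumed safety margin."""
    base = (IMAGE_GB + CACHE_GB[precision.lower()]) * GB
    return base * (100 + headroom_pct) // 100


def has_room(free_bytes, precision):
    """True when the given free space covers the estimated requirement."""
    return free_bytes >= required_bytes(precision)


def check_path(path, precision):
    """Apply the estimate to the filesystem that holds the given path."""
    return has_room(shutil.disk_usage(path).free, precision)


print(required_bytes("nvfp4") // GB, "GB needed for an NVFP4 profile")
```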

The first launch downloads the model artifacts into the mounted cache directory, which can take a significant amount of time depending on your hardware and network. Subsequent launches reuse the mounted cache and start faster.

Tip

By default, the model download produces little log output and can appear idle. To follow the download progress, add -e NIM_LOG_LEVEL=INFO to the docker run command.

To pre-populate the cache before serving traffic, run download-to-cache with the same image, API key, and cache mount:

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant \
  download-to-cache

Run NIM#

Run the container using your NGC API Key to authenticate and download the model.

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant

Recommended Runtime Settings#

For optimal performance, set NIM_PASSTHROUGH_ARGS to the recommended values for your GPU and precision combination (refer to the next section). NVFP4 checkpoints require additional environment variables and passthrough arguments. For more information about how NIM_PASSTHROUGH_ARGS is processed, refer to Advanced Configuration.

Per-Profile Passthrough Arguments#

The following table lists additional recommended NIM_PASSTHROUGH_ARGS values for profiles that require profile-specific overrides:

GPU	TP	PP	Precision	Recommended `NIM_PASSTHROUGH_ARGS`
B200/GB200	2	1	NVFP4	`--max-num-seqs 64`
H100	8	2	BF16	`--max-num-seqs 512`
H200	8	1	BF16	`--gpu-memory-utilization 0.98 --max-num-seqs 32 --max-num-batched-tokens 4096 --max-model-len 131072`
B300/GB300	4	1	BF16	`--gpu-memory-utilization 0.99 --max-num-seqs 4 --max-num-batched-tokens 2048 --max-model-len 131072`

Profiles not listed in this table start successfully with default settings and do not require additional NIM_PASSTHROUGH_ARGS.

Context Length#

Nemotron 3 Ultra 550B-A55B natively supports a context window of 262,144 tokens (256K). This is the default --max-model-len and the value reported by the /v1/models endpoint, so no additional configuration is required for context lengths up to 262,144 tokens.

To serve a longer context window (up to 1M tokens), set the VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 environment variable and pass the desired --max-model-len value in NIM_PASSTHROUGH_ARGS. For example, to serve a 1,048,576-token (1M) context window:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export NIM_PASSTHROUGH_ARGS="--max-model-len 1048576"

Note

A larger context window requires significantly more KV cache memory, which reduces the number of requests that can be served concurrently. Validate GPU memory headroom for your target hardware before serving production traffic. Context lengths beyond the native 262,144-token window extend the model past its configured positional encoding range, so validate output quality for your workload before relying on context lengths greater than 262,144 tokens.
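Because the served context length is reported by /v1/models, a client can confirm it before sending long prompts. The sketch below assumes each model entry carries a max_model_len field, as upstream vLLM reports it; the field name is an assumption that this page does not confirm, so verify it against your own /v1/models response.

```python
import json
import urllib.request

NATIVE_CONTEXT = 262_144  # native context window stated on this page


def served_context_lengths(models_response):
    """Map model id to the reported max_model_len (None if the field is absent)."""
    return {m["id"]: m.get("max_model_len") for m in models_response.get("data", [])}


def beyond_native(models_response):
    """List the models that are served with a window larger than the native one."""
    lengths = served_context_lengths(models_response)
    return [mid for mid, n in lengths.items() if n is not None and n > NATIVE_CONTEXT]


def fetch_models(base_url="http://localhost:8000/v1"):
    """Fetch the model list from a running server (not called in this sketch)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return json.load(resp)


# Hand-written response for illustration only.
example = {"data": [{"id": "nvidia/nemotron-3-ultra-550b-a55b", "max_model_len": 1048576}]}
print(beyond_native(example))
```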

Additional Settings for NVFP4 Checkpoints#

When running an NVFP4 checkpoint, set the following environment variables before starting the container, in addition to the per-profile NIM_PASSTHROUGH_ARGS listed in the previous table:

export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export NIM_PASSTHROUGH_ARGS="--mamba-ssm-cache-dtype float16 --enable-mamba-cache-stochastic-rounding --mamba-cache-philox-rounds 5"

Note

Do not set --kv-cache-dtype nvfp4. This model is calibrated for an FP8 KV cache, which is the recommended setting and the NIM default. Leave --kv-cache-dtype unset, and the NVFP4 checkpoint automatically uses an FP8 KV cache. An NVFP4 KV cache is not recommended for this model: it has not been validated for quality, it can degrade multi-turn and reasoning quality (for example, excessively long reasoning that exhausts the token budget and returns empty responses), and it can slightly reduce throughput even at large batch sizes. This setting affects only the KV-cache storage dtype; the model weights remain NVFP4.

Append the per-profile values from the previous table to the NIM_PASSTHROUGH_ARGS value above. For example, on a B200 GPU with TP=2 and precision NVFP4:

export NIM_PASSTHROUGH_ARGS="--mamba-ssm-cache-dtype float16 --enable-mamba-cache-stochastic-rounding --mamba-cache-philox-rounds 5 --max-num-seqs 64"
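The append rule above can be expressed as a small lookup. The following sketch copies the values from the per-profile table and the NVFP4 settings on this page. The function and dictionary names are hypothetical, and the helper is only a convenience for assembling the string, not part of NIM.

```python
# Values copied from the Per-Profile Passthrough Arguments table.
PER_PROFILE_ARGS = {
    ("B200/GB200", 2, 1, "NVFP4"): "--max-num-seqs 64",
    ("H100", 8, 2, "BF16"): "--max-num-seqs 512",
    ("H200", 8, 1, "BF16"): "--gpu-memory-utilization 0.98 --max-num-seqs 32 "
                            "--max-num-batched-tokens 4096 --max-model-len 131072",
    ("B300/GB300", 4, 1, "BF16"): "--gpu-memory-utilization 0.99 --max-num-seqs 4 "
                                  "--max-num-batched-tokens 2048 --max-model-len 131072",
}

# Base arguments that every NVFP4 checkpoint needs.
NVFP4_BASE_ARGS = ("--mamba-ssm-cache-dtype float16 "
                   "--enable-mamba-cache-stochastic-rounding "
                   "--mamba-cache-philox-rounds 5")


def passthrough_args(gpu, tp, pp, precision):
    """Join the NVFP4 base arguments (if any) with the per-profile overrides."""
    parts = []
    if precision == "NVFP4":
        parts.append(NVFP4_BASE_ARGS)
    extra = PER_PROFILE_ARGS.get((gpu, tp, pp, precision), "")
    if extra:
        parts.append(extra)
    return " ".join(parts)


print(passthrough_args("B200/GB200", 2, 1, "NVFP4"))
```

Profiles that are not in the table return only the base arguments for their precision, which matches the statement that unlisted profiles need no extra overrides.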

Speculative Decoding with Multi-Token Prediction (MTP)#

Nemotron 3 Ultra 550B-A55B has a built-in Multi-Token Prediction (MTP) module that vLLM uses as a draft model for speculative decoding, reducing per-token latency at low concurrency. Set num_speculative_tokens to the number of tokens to draft per step:

export NIM_PASSTHROUGH_ARGS="--speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":2}'"

Higher values draft more tokens per step but accept fewer at deeper positions (diminishing returns); 1–3 is a good range.

Important

Wrap the JSON in single quotes exactly as shown. NIM_PASSTHROUGH_ARGS is tokenized before it reaches vLLM, so unquoted JSON loses its double quotes and vLLM rejects it (--speculative-config ... cannot be converted).
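The effect of the single quotes can be reproduced with a shell-style tokenizer. The sketch below uses Python's shlex module as a stand-in; this page does not say which tokenizer NIM uses, so treat the result as an illustration of the failure mode rather than a description of NIM internals. The speculative_args helper is hypothetical.

```python
import json
import shlex


def speculative_args(num_speculative_tokens):
    """Build the --speculative-config argument with the JSON safely single-quoted."""
    config = json.dumps(
        {"method": "mtp", "num_speculative_tokens": num_speculative_tokens},
        separators=(",", ":"),
    )
    return "--speculative-config " + shlex.quote(config)


quoted = speculative_args(2)
print(quoted)

# With single quotes, the JSON survives tokenization intact.
print(json.loads(shlex.split(quoted)[1]))

# Without them, the tokenizer consumes the double quotes and the JSON is invalid.
unquoted = '--speculative-config {"method":"mtp","num_speculative_tokens":2}'
print(shlex.split(unquoted)[1])
```

When the value is passed through a process environment from Python (for example, with subprocess), the backslash escapes shown in the shell export are not needed, because no outer double-quoted shell string is involved.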

MTP helps most at low concurrency. For speculative-decoding behavior, metrics, and limitations, refer to the vLLM documentation.

The throughput improvements in the release notes were measured with the following configuration. Set VLLM_SSM_CONV_STATE_LAYOUT=DS, use the align Mamba cache mode (--mamba-cache-mode align), and tune --max-num-seqs for your workload (72 for chat, 40 for software-engineering workloads):

export VLLM_SSM_CONV_STATE_LAYOUT=DS
export NIM_PASSTHROUGH_ARGS="--max-model-len 262144 --enable-prefix-caching --mamba-cache-mode align --max-num-batched-tokens 32768 --block-size 64 --max-num-seqs 72 --speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":1}'"

Example: Pass Settings to `docker run`#

The following example combines the recommended environment variables and passthrough arguments for an NVFP4 checkpoint on a B200 GPU (TP=2) with the base docker run command shown in Run NIM:

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e NIM_PASSTHROUGH_ARGS="--mamba-ssm-cache-dtype float16 --enable-mamba-cache-stochastic-rounding --mamba-cache-philox-rounds 5 --max-num-seqs 64" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant

Interact with the API#

The NIM exposes OpenAI-compatible and Anthropic-compatible inference endpoints. The OpenAI-compatible endpoints are the following:

Chat Completions: /v1/chat/completions
Text Completions: /v1/completions
Responses: /v1/responses

The Anthropic-compatible endpoints are the following:

Messages: /v1/messages
Count Tokens: /v1/messages/count_tokens

Tip

Chat Completions, Text Completions, Responses, and Anthropic Messages support streaming.

Note

For the Responses endpoint (/v1/responses), carry context across turns by sending the prior turns inline in the input array. To enable server-side storage so that previous_response_id and response retrieval work, start the container with VLLM_ENABLE_RESPONSES_API_STORE=1. Enabling the store increases memory usage.

Send a Chat Completion Request#

After the server is running, you can send a request to the chat completion endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "Hello! How are you?"
      }
    ],
    "max_tokens": 128,
    "temperature": 0.0
  }'

The response has the following general format. The exact generated text can vary by sampling settings and runtime configuration. If the container is started with --reasoning-parser nemotron_v3, the reasoning field can contain parsed reasoning text; otherwise it can be null.

{
  "id": "chatcmpl-87d0c4524fb6f1a4",
  "object": "chat.completion",
  "created": 1769635152,
  "model": "nvidia/nemotron-3-ultra-550b-a55b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm ready to help.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "..."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 41,
    "total_tokens": 46,
    "completion_tokens": 5,
    "prompt_tokens_details": null,
    "completion_tokens_details": {
      "reasoning_tokens": 2
    }
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}

Hold a Multi-Turn Conversation#

The Chat Completions endpoint is stateless. To carry context across turns, resend the full conversation in the messages array, alternating user and assistant turns after an optional system message:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "My favorite color is blue."},
      {"role": "assistant", "content": "Noted — your favorite color is blue."},
      {"role": "user", "content": "What did I just tell you?"}
    ],
    "max_tokens": 128,
    "temperature": 0.0
  }'

Note

Append only the assistant content from prior turns to the history. When the reasoning parser (--reasoning-parser nemotron_v3) is enabled, do not feed the reasoning field back into messages.
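The history rule in the note can be captured in a small helper. The following sketch keeps a client-side message list and copies only the role and content of each assistant reply, so a parsed reasoning field never re-enters the conversation. The function names are hypothetical.

```python
MODEL = "nvidia/nemotron-3-ultra-550b-a55b"


def append_assistant_turn(history, message):
    """Return a new history with only the assistant content appended."""
    return history + [{"role": "assistant", "content": message.get("content") or ""}]


def next_request(history, user_text, max_tokens=128):
    """Build the next Chat Completions payload from the full history."""
    messages = history + [{"role": "user", "content": user_text}]
    return {"model": MODEL, "messages": messages,
            "max_tokens": max_tokens, "temperature": 0.0}


history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "My favorite color is blue."},
]
# Shape of choices[0].message when the reasoning parser is enabled.
reply = {"role": "assistant", "content": "Noted.", "reasoning": "The user likes blue."}

history = append_assistant_turn(history, reply)
payload = next_request(history, "What did I just tell you?")
print(len(payload["messages"]))
```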

Use the OpenAI Python SDK#

You can direct the OpenAI Python SDK at the NIM endpoint by setting base_url to the local /v1 API path and providing any non-empty API key:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[{"role": "user", "content": "Summarize GPU computing in one sentence."}],
    max_tokens=128,
    temperature=0.0,
)

print(response.choices[0].message.content)

For Anthropic Python SDK examples, refer to Messages (Anthropic-compatible).

Control Thinking Budget#

To expose parsed reasoning output and use thinking controls, start the container with the Nemotron 3 reasoning parser:

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_PASSTHROUGH_ARGS="--reasoning-parser nemotron_v3" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant

Then enable thinking and set a thinking-token budget in the request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "I have a 3x3 grid of integers. Rows sum to 15, 18, 21. Columns sum to 12, 20, 22. Center is 7, top-left is 2. Find one valid grid."
      }
    ],
    "max_tokens": 2048,
    "temperature": 0,
    "seed": 42,
    "chat_template_kwargs": {"enable_thinking": true},
    "thinking_token_budget": 10
  }'

The thinking_token_budget value limits the thinking portion of generation. The max_tokens value still caps total generated tokens for the request. When the reasoning parser is enabled, the response reports how many of the generated tokens were reasoning tokens in usage.completion_tokens_details.reasoning_tokens (completion_tokens remains the combined total of reasoning tokens and content tokens).
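The token accounting described above is simple arithmetic on the usage object. The following sketch splits completion_tokens into reasoning and content tokens and flags a response that hit max_tokens before producing any answer text. The helper names are hypothetical, and the "length" finish reason follows the usual OpenAI-compatible convention.

```python
def token_breakdown(usage):
    """Split completion_tokens into reasoning tokens and content tokens."""
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    completion = usage["completion_tokens"]
    return {"reasoning": reasoning,
            "content": completion - reasoning,
            "total_generated": completion}


def ran_out_while_thinking(choice):
    """True when generation stopped at max_tokens with no answer text."""
    content = (choice.get("message") or {}).get("content")
    return choice.get("finish_reason") == "length" and not content


# Usage values taken from the sample response earlier on this page.
usage = {"prompt_tokens": 41, "total_tokens": 46, "completion_tokens": 5,
         "completion_tokens_details": {"reasoning_tokens": 2}}
print(token_breakdown(usage))
```

If ran_out_while_thinking returns True for a response, raising max_tokens or lowering thinking_token_budget are the two request-side adjustments this section describes.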

Note

thinking_token_budget is supported on the Chat Completions endpoint (/v1/chat/completions). The Responses endpoint (/v1/responses) does not enforce a thinking-token budget.

Enable Tool Calling and MCP Workflows#

For Nemotron 3 Ultra 550B-A55B, enable OpenAI-compatible tool calling by adding the following arguments to NIM_PASSTHROUGH_ARGS before starting the container:

export NIM_PASSTHROUGH_ARGS="--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser nemotron_v3"

If you already use NIM_PASSTHROUGH_ARGS for profile-specific settings, append these arguments to the same string. Both --enable-auto-tool-choice and --tool-call-parser qwen3_coder are required when using "tool_choice": "auto". The --reasoning-parser nemotron_v3 setting enables the built-in reasoning parser for Nemotron 3 output.

The following request provides multiple tool choices and lets the model choose which one to call:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Santa Clara, CA?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string"}
            },
            "required": ["location"]
          }
        }
      },
      {
        "type": "function",
        "function": {
          "name": "search_docs",
          "description": "Search internal documentation.",
          "parameters": {
            "type": "object",
            "properties": {
              "query": {"type": "string"}
            },
            "required": ["query"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "max_tokens": 256
  }'

A successful tool-calling response includes a tool_calls array under choices[0].message. Your application executes the selected tool and sends the tool result back to the model in a follow-up Chat Completions request.
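The follow-up step can be sketched on the client side as follows. The code assumes the standard OpenAI-compatible tool_calls shape (an id, plus a function name and JSON-encoded arguments). The get_weather implementation, its return value, and the call id are placeholders for illustration, not a real service or a real response.

```python
import json


def get_weather(location):
    # Hypothetical local tool; a real one would call a weather service.
    return {"location": location, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}


def tool_result_messages(assistant_message):
    """Run each requested tool and build the messages for the follow-up request."""
    calls = assistant_message.get("tool_calls") or []
    follow_up = [{"role": "assistant",
                  "content": assistant_message.get("content") or "",
                  "tool_calls": calls}]
    for call in calls:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        follow_up.append({"role": "tool",
                          "tool_call_id": call["id"],
                          "content": json.dumps(fn(**args))})
    return follow_up


# Hand-written example of choices[0].message from a tool-calling response.
message = {"role": "assistant", "content": None, "tool_calls": [{
    "id": "call_1", "type": "function",
    "function": {"name": "get_weather",
                 "arguments": "{\"location\": \"Santa Clara, CA\"}"}}]}

for m in tool_result_messages(message):
    print(m["role"])
```

Append the returned messages to the original conversation and send them, with the same tools list, in the next /v1/chat/completions request so that the model can compose its final answer.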

For MCP, connect to MCP servers in your client application, convert the MCP tool schemas to the OpenAI tools format, and pass them to /v1/chat/completions. The NIM container does not connect to MCP servers directly. For details and LangChain/LangGraph examples, refer to Tool Calling and MCP Integration.

Structured JSON Output#

For structured-output use cases, request JSON mode through the OpenAI-compatible response_format parameter and validate the response with your preferred schema library, such as Pydantic:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "Return JSON with keys name and purpose for NVIDIA NIM."
      }
    ],
    "response_format": {"type": "json_object"},
    "max_tokens": 128,
    "temperature": 0.0
  }'
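JSON mode requests a JSON object but does not enforce a particular schema, so the client still needs to validate the keys. The sketch below does this with the standard library only; a schema library such as Pydantic would do the same job with richer error reporting. The required keys mirror the prompt above, and the function name is hypothetical.

```python
import json

REQUIRED_KEYS = {"name", "purpose"}  # keys requested in the prompt above


def parse_structured(content):
    """Parse the model's message content and check that the expected keys exist."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Hand-written content string for illustration, not real model output.
content = '{"name": "NVIDIA NIM", "purpose": "Serve models behind standard APIs."}'
print(parse_structured(content)["name"])
```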

Frameworks that support OpenAI-compatible chat completions, such as LangChain, LangGraph, LlamaIndex, Pipecat, OpenCode, and similar agent frameworks, can use the local NIM endpoint by setting their base URL to http://localhost:8000/v1 and using the served model name.

Verify Health Endpoints#

You can verify that the NIM container is running and ready to accept requests by checking its health endpoints. By default, these endpoints are served on port 8000. If you set NIM_HEALTH_PORT, use that port instead.

Live Endpoint#

Perform a liveness check to see if the server is running:

curl -v http://localhost:8000/v1/health/live

Example response:

GET /v1/health/live HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 61
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "live",
  "status": "live"
}

Ready Endpoint#

Perform a readiness check to see if the model is fully loaded and ready for inference:

curl -v http://localhost:8000/v1/health/ready

Example response:

GET /v1/health/ready HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 63
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "ready",
  "status": "ready"
}
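Because the first launch can take a long time, a client or deployment script may want to poll the readiness endpoint instead of checking it by hand. The following is a minimal sketch using only the standard library. The timeout and interval defaults are arbitrary assumptions, and the function names are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(status_code, body):
    """True for an HTTP 200 response whose JSON body reports status 'ready'."""
    if status_code != 200:
        return False
    try:
        return json.loads(body).get("status") == "ready"
    except (ValueError, AttributeError):
        return False


def wait_until_ready(base_url="http://localhost:8000", timeout_s=3600, interval_s=10):
    """Poll /v1/health/ready until the model is loaded or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            url = f"{base_url}/v1/health/ready"
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_ready(resp.status, resp.read().decode()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # connection refused or non-2xx status: not ready yet
        time.sleep(interval_s)
    return False


# Body copied from the example response above.
print(is_ready(200, '{"object": "health.response", "message": "ready", "status": "ready"}'))
```

Call wait_until_ready() after docker run starts the container, and send inference requests only once it returns True.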

Streaming#

To receive responses incrementally as they are generated, you can enable streaming by adding "stream": true to your request payload. This is supported across the /v1/chat/completions, /v1/completions, /v1/responses, and /v1/messages endpoints.

When streaming is enabled, the API returns a sequence of Server-Sent Events (SSE).

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "Write a short poem about a robot."
      }
    ],
    "max_tokens": 100,
    "stream": true
  }'

For Chat Completions and Text Completions, the response is streamed back in chunks, with each chunk containing a data JSON object. These streams terminate with a data: [DONE] message:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":"In"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":" cir"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":"cuits"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

...

data: [DONE]

For the Responses API, stream events use typed SSE events such as response.output_text.delta and terminate with response.completed. For Anthropic-compatible Messages, stream events use Anthropic event names such as content_block_delta and terminate with message_stop.
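For the Chat Completions stream format shown above, a client reassembles the reply by concatenating the delta content of each chunk until it sees data: [DONE]. The following sketch does this over an iterable of already-decoded lines. The function name is hypothetical, and the sample chunks are shortened versions of the ones on this page.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from chat.completion.chunk SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"In"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" cir"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"cuits"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))
```

The same loop works on lines read from a streaming HTTP response. The Responses and Anthropic Messages streams use different event names and would need their own handling.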
