Get Started with Nemotron 3 Ultra 550B-A55B#
Nemotron 3 Ultra 550B-A55B, the largest of the Nemotron 3 models, provides state-of-the-art accuracy and reasoning performance. This text-only, reasoning-capable model has 550B total parameters and up to 55B active parameters per token. It uses a hybrid Mamba-Transformer (Nemotron-H) mixture-of-experts architecture. The NIM exposes OpenAI-compatible and Anthropic-compatible APIs, so existing clients can use the same container for Chat Completions, Responses, Anthropic Messages, tool calling, and agentic workflows.
This page is the model-specific Day-0 guide for Nemotron 3 Ultra 550B-A55B. For the generic NIM LLM onboarding path, refer to Quickstart. For full endpoint examples, refer to API Reference. For tool calling and MCP integration, refer to Tool Calling and MCP Integration.
Prerequisites#
Before deploying a NIM LLM container, ensure your environment meets the following requirements:
Hardware Requirements#
The following are the minimum required specifications for supported hardware components:
| Requirement | Specification |
|---|---|
| CPU | AMD64, ARM64 |
| GPU | Refer to Support Matrix for NIM Day 0 |
Software Requirements#
The following are the minimum required versions for supported software components:
| Requirement | Specification |
|---|---|
| Operating System | Ubuntu 22.04 LTS or later recommended |
| Container Toolkit | 1.14.0 or later |
| CUDA SDK | 12.9 or later |
| GPU Driver | 580 or later |
| Docker | 24.0 or later |
Operating System#
While other Linux distributions can be compatible with NIM, they have not been officially validated.
We recommend using Ubuntu 22.04 LTS or later for the best experience.
CUDA SDK#
Install CUDA SDK by following the CUDA installation guide for Linux.
GPU Drivers#
Install the NVIDIA GPU drivers by following the NVIDIA Driver Installation Guide.
Docker#
Docker is required to run the containerized NIM services.
Install Docker Engine for your Linux distribution by following the Docker Engine installation guide.
Verify that the Docker daemon is running and that your user can execute docker commands without sudo. Add your user to the docker group if needed:
sudo groupadd docker
sudo usermod -aG docker $USER
Log out and back in for the group change to take effect.
Container Toolkit#
The NVIDIA Container Toolkit enables Docker containers to access the host GPU.
Install the toolkit by following the NVIDIA Container Toolkit installation guide.
Configure Docker to use the NVIDIA runtime by following the Docker configuration steps.
Restart the Docker daemon after configuration:
sudo systemctl restart docker
NIM Container Access#
To download and deploy NIM containers, you need one of the following:
A free NVIDIA Developer Program membership.
An NVIDIA AI Enterprise license. To request a free 90-day evaluation license, refer to Ways to Get Started With NVIDIA AI Enterprise and Activate Your NVIDIA AI Enterprise License.
Generate Access Credentials#
An NGC Personal API Key is required to access NVIDIA NIM containers and models hosted on NGC.
Generate the Personal API Key on the Setup API Keys page.
When creating the Personal API key, select at least NGC Catalog from the Services Included list. You can also include additional services if you want to use the same key for other purposes.
Warning
Legacy API keys are not supported by NIM. Always use a Personal API Key.
Verify NVIDIA Runtime Access#
To ensure that your setup is correct, run the following command:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
This command should produce output similar to the following, where you can confirm the driver version, the CUDA version, and the available GPUs.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:1B:00.0 Off | 0 |
| N/A 36C P0 112W / 700W | 78489MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
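The version check above can also be scripted. The following is a minimal sketch, assuming you have captured the nvidia-smi output as text; the helper names (parse_version, check_smi_header) are hypothetical, and the thresholds are the driver and CUDA minimums from the Software Requirements table.

```python
import re

# Minimum versions taken from the Software Requirements table in this guide.
MIN_DRIVER = (580,)
MIN_CUDA = (12, 9)


def parse_version(text):
    """Turn a dotted version such as '580.95.05' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def check_smi_header(output):
    """Return (driver_ok, cuda_ok) based on the nvidia-smi banner line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", output)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", output)
    if not driver or not cuda:
        raise ValueError("nvidia-smi banner not found in output")
    return (
        parse_version(driver.group(1)) >= MIN_DRIVER,
        parse_version(cuda.group(1)) >= MIN_CUDA,
    )


# Banner line copied from the sample output above.
sample = "| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 12.9 |"
print(check_smi_header(sample))
```

In practice you would pass the standard output of the docker run command shown above instead of the sample string.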
Configuration#
Use environment variables to control authentication and model caching.
Export NGC API Key#
Export the variable in your shell (temporary), replacing <VALUE> with your actual API key:
export NGC_API_KEY=<VALUE>
Persist the variable (optional):
If using bash:
echo "export NGC_API_KEY=$NGC_API_KEY" >> ~/.bashrc
If using zsh:
echo "export NGC_API_KEY=$NGC_API_KEY" >> ~/.zshrc
Verify the variable is set:
echo "$NGC_API_KEY"
Model Cache#
NIM downloads model weights to a cache on the host that you mount into the container. Artifacts persist across restarts, so you do not pull the full model on every run.
Local Cache#
Set a cache path on the host before you start the container. This directory is mapped from the host machine into the container; assets (for example, model weights) are downloaded to it and persist across container restarts. Configuring a local cache is highly recommended because it avoids re-downloading large model files on subsequent container restarts. You can give the environment variable that holds the cache path any name; the following example uses LOCAL_NIM_CACHE.
Create the cache directory and export an environment variable:
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p $LOCAL_NIM_CACHE
# Optionally add sticky bit to avoid issues writing to the cache if the container is running as a different user
chmod -R a+rwxt $LOCAL_NIM_CACHE
When you start the NIM container, you must map your host machine’s local cache directory ($LOCAL_NIM_CACHE) to the container’s internal cache path (/opt/nim/.cache) using a Docker volume mount, such as -v "$LOCAL_NIM_CACHE:/opt/nim/.cache". This mapping ensures that the large model weights downloaded by the container are saved to your host machine. Because containers are ephemeral, any data stored only inside the container is lost when it stops. By using a volume mount, subsequent container runs detect the existing model files in your local cache and skip the lengthy download process, allowing the NIM to start up faster.
Cache Directory Permissions#
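The NIM container runs as a non-root user with GID 0 (root group). The cache directory on your host must be writable by GID 0. The following example uses a separate variable, NIM_CACHE_PATH; apply the same commands to $LOCAL_NIM_CACHE if that is the directory you mount: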
export NIM_CACHE_PATH=/tmp/nim-cache
mkdir -p "$NIM_CACHE_PATH"
sudo chgrp -R 0 "$NIM_CACHE_PATH"
sudo chmod -R g+rwX "$NIM_CACHE_PATH"
Run the container with the cache mounted:
docker run --gpus all \
-v "$NIM_CACHE_PATH:/opt/nim/.cache" \
...
To run as a custom user (e.g., your host user), pass -u <uid>:0:
docker run --gpus all -u $(id -u):0 \
-v "$NIM_CACHE_PATH:/opt/nim/.cache" \
...
Important
When using -u <uid>, you must include :0 to set GID 0 (e.g., -u $(id -u):0). The container’s writable directories are group-owned by GID 0. Without it, the container will fail with PermissionError when writing to cache, config, or log paths.
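The permission rule can be checked before the first launch. The sketch below is a simplified check under stated assumptions: it looks only at the group and "other" permission bits of the directory itself, ignores ACLs and owner bits, and the function names are hypothetical.

```python
import os
import stat


def writable_by_gid0(gid, mode):
    """Return True if a directory with this group and mode is writable via GID 0.

    Simplified: a directory qualifies when it is group-owned by GID 0 with
    group write+execute bits, or when it is world-writable (as with a+rwxt).
    """
    group_ok = gid == 0 and (mode & stat.S_IWGRP) and (mode & stat.S_IXGRP)
    other_ok = (mode & stat.S_IWOTH) and (mode & stat.S_IXOTH)
    return bool(group_ok or other_ok)


def check_cache_dir(path):
    """Apply the check to a real directory on the host."""
    info = os.stat(path)
    return writable_by_gid0(info.st_gid, info.st_mode)


# Example: a directory prepared with chgrp 0 and chmod g+rwX.
print(writable_by_gid0(0, 0o775))
```

Call check_cache_dir with the same host path that you pass to the -v mount; a False result suggests repeating the chgrp and chmod commands above.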
Tip
To make this setting permanent across terminal sessions, you can add export LOCAL_NIM_CACHE=~/.cache/nim to your ~/.bashrc or ~/.zshrc profile.
Installation#
Before running a NIM LLM container, you must authenticate with your deployment source, accept the governing terms, and pull the container image.
Docker Login#
To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a user name and password.
Accept the Governing Terms#
Before you download a given NIM for the first time, you must accept the governing terms in the browser. Navigate to the NGC Catalog page for the NIM and click the Accept Terms button: Nemotron 3 Ultra 550B-A55B on NGC
Pull the Container Image#
After you have generated your API key and authenticated with your deployment source, you can download the NIM container image to your host machine.
Use the docker pull command to fetch the NIM container image.
docker pull nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant
Storage and Startup Notes#
Nemotron 3 Ultra 550B-A55B is a large model, so make sure the host has enough free disk space for the container image and the model cache. As a rough guide, the container image is approximately 38 GB, and the model cache ranges from approximately 330 GB for an NVFP4 profile to approximately 1.1–1.7 TB for a BF16 profile, depending on precision and GPU configuration. Reserve additional space if you download multiple profiles or keep older container images on the same host.
The first launch downloads the model artifacts into the mounted cache directory, which can take a significant amount of time depending on your hardware and network. Subsequent launches reuse the mounted cache and start faster.
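Because a failed download on a full disk is costly, a quick free-space check can help. This is a rough sketch using the approximate sizes quoted above (the BF16 figure uses the upper end of the range); the function names are hypothetical, and the image size only counts if Docker's storage shares a filesystem with the cache path.

```python
import shutil

# Approximate sizes in GB, taken from the figures quoted in this guide.
IMAGE_GB = 38
CACHE_GB = {"nvfp4": 330, "bf16": 1700}


def required_gb(precision, include_image=True):
    """Estimate the disk space needed for one profile, in GB."""
    return CACHE_GB[precision] + (IMAGE_GB if include_image else 0)


def has_room(path, needed_gb):
    """Return True if the filesystem that holds path has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


print(required_gb("nvfp4"))
```

For example, has_room(os.path.expanduser("~/.cache/nim"), required_gb("nvfp4")) checks the default cache location used in this guide (with os imported).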
Tip
By default, the model download produces little log output and can appear idle.
To follow the download progress, add -e NIM_LOG_LEVEL=INFO to the docker run
command.
To pre-populate the cache before serving traffic, run download-to-cache with
the same image, API key, and cache mount:
docker run --gpus=all \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant \
download-to-cache
Run NIM#
Run the container using your NGC API Key to authenticate and download the model.
docker run --gpus=all \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant
Recommended Runtime Settings#
For optimal performance, set NIM_PASSTHROUGH_ARGS to the recommended values for your GPU and precision combination (refer to the next section). NVFP4 checkpoints require additional environment variables and passthrough arguments. For more information about how NIM_PASSTHROUGH_ARGS is processed, refer to Advanced Configuration.
Per-Profile Passthrough Arguments#
The following table lists additional recommended NIM_PASSTHROUGH_ARGS values for profiles that require profile-specific overrides:
| GPU | TP | PP | Precision | Recommended |
|---|---|---|---|---|
| B200/GB200 | 2 | 1 | NVFP4 | |
| H100 | 8 | 2 | BF16 | |
| H200 | 8 | 1 | BF16 | |
| B300/GB300 | 4 | 1 | BF16 | |
Profiles not listed in this table start successfully with default settings and
do not require additional NIM_PASSTHROUGH_ARGS.
Context Length#
Nemotron 3 Ultra 550B-A55B natively supports a context window of 262,144 tokens (256K).
This is the default --max-model-len and the value reported by the /v1/models
endpoint, so no additional configuration is required for context lengths up to
262,144 tokens.
To serve a longer context window (up to 1M tokens), set the
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 environment variable and pass the desired
--max-model-len value in NIM_PASSTHROUGH_ARGS. For example, to serve a
1,048,576-token (1M) context window:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export NIM_PASSTHROUGH_ARGS="--max-model-len 1048576"
Note
A larger context window requires significantly more KV cache memory, which reduces the number of requests that can be served concurrently. Validate GPU memory headroom for your target hardware before serving production traffic. Context lengths beyond the native 262,144-token window extend the model past its configured positional encoding range, so validate output quality for your workload before relying on context lengths greater than 262,144 tokens.
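The rule described above (defaults at 262,144 tokens, an extra environment variable beyond that, and a 1M ceiling) can be captured in a small helper. This is an illustrative sketch; context_settings is a hypothetical name, and it only produces the two settings named in this section.

```python
NATIVE_CONTEXT = 262_144   # native window, also the default --max-model-len
MAX_CONTEXT = 1_048_576    # upper limit described in this section


def context_settings(max_model_len):
    """Return the environment variables needed for a target context length."""
    if not 0 < max_model_len <= MAX_CONTEXT:
        raise ValueError(f"max_model_len must be between 1 and {MAX_CONTEXT}")
    if max_model_len == NATIVE_CONTEXT:
        return {}  # defaults already serve the native window
    settings = {"NIM_PASSTHROUGH_ARGS": f"--max-model-len {max_model_len}"}
    if max_model_len > NATIVE_CONTEXT:
        # Beyond the native window, vLLM must be told to allow the override.
        settings["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"
    return settings


print(context_settings(1_048_576))
```

If you already set NIM_PASSTHROUGH_ARGS for other reasons, merge the --max-model-len flag into the existing string rather than replacing it.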
Additional Settings for NVFP4 Checkpoints#
When running an NVFP4 checkpoint, set the following environment variables before starting the container, in addition to the per-profile NIM_PASSTHROUGH_ARGS listed in the previous table:
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export NIM_PASSTHROUGH_ARGS="--kv-cache-dtype nvfp4 --mamba-ssm-cache-dtype float16 --enable-mamba-cache-stochastic-rounding --mamba-cache-philox-rounds 5"
Append the per-profile values from the previous table to the NIM_PASSTHROUGH_ARGS value above. For example, on a B200 GPU with TP=2 and precision NVFP4:
export NIM_PASSTHROUGH_ARGS="--kv-cache-dtype nvfp4 --mamba-ssm-cache-dtype float16 --enable-mamba-cache-stochastic-rounding --mamba-cache-philox-rounds 5 --max-num-seqs 64"
Speculative Decoding with Multi-Token Prediction (MTP)#
Nemotron 3 Ultra 550B-A55B has a built-in Multi-Token Prediction (MTP)
module that vLLM uses as a draft model for speculative decoding, reducing
per-token latency at low concurrency. Set num_speculative_tokens to the
number of tokens to draft per step:
export NIM_PASSTHROUGH_ARGS="--speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":2}'"
Higher values draft more tokens per step, but fewer drafted tokens are accepted at deeper positions, so returns diminish; values from 1 to 3 are a reasonable starting range.
Important
Wrap the JSON in single quotes exactly as shown. NIM_PASSTHROUGH_ARGS is
tokenized before it reaches vLLM, so unquoted JSON loses its double quotes and
vLLM rejects it (--speculative-config ... cannot be converted).
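The effect of the quoting can be illustrated with Python's shlex module. This sketch assumes that the tokenization behaves like POSIX shell word splitting; the tokenizer that NIM actually uses is not documented here and may differ in detail.

```python
import json
import shlex

spec = {"method": "mtp", "num_speculative_tokens": 2}
spec_json = json.dumps(spec, separators=(",", ":"))

# Single quotes protect the JSON, so its double quotes survive tokenization.
quoted = shlex.split(f"--speculative-config '{spec_json}'")

# Without them, the double quotes are consumed as shell-style quoting.
unquoted = shlex.split(f"--speculative-config {spec_json}")

print(quoted[1])
print(unquoted[1])
```

The first value still parses as JSON, while the second no longer does, which matches the conversion error described above.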
MTP helps most at low concurrency. For speculative-decoding behavior, metrics, and limitations, refer to the vLLM documentation.
The throughput improvements in the release notes were measured with the
following configuration. Set VLLM_SSM_CONV_STATE_LAYOUT=DS, select the align
Mamba cache mode (--mamba-cache-mode align), and tune --max-num-seqs for your
workload (72 for chat, 40 for software-engineering workloads):
export VLLM_SSM_CONV_STATE_LAYOUT=DS
export NIM_PASSTHROUGH_ARGS="--max-model-len 262144 --enable-prefix-caching --mamba-cache-mode align --max-num-batched-tokens 32768 --block-size 64 --max-num-seqs 72 --speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":1}'"
Example: Pass Settings to docker run#
The following example combines the NVFP4 environment variables and passthrough arguments (B200, TP=2) with the base docker run command from Run NIM:
docker run --gpus=all \
-e NGC_API_KEY=$NGC_API_KEY \
-e VLLM_USE_FLASHINFER_MOE_FP8=1 \
-e VLLM_USE_FLASHINFER_MOE_FP4=1 \
-e NIM_PASSTHROUGH_ARGS="--kv-cache-dtype nvfp4 --mamba-ssm-cache-dtype float16 --enable-mamba-cache-stochastic-rounding --mamba-cache-philox-rounds 5 --max-num-seqs 64" \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant
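If you launch the container from a script, building the argument list programmatically avoids shell-quoting mistakes. The following is a sketch, not an official launcher: docker_run_argv is a hypothetical helper, and passing NGC_API_KEY with a value of None uses Docker's behavior of inheriting the variable from the calling environment, which keeps the key out of the printed command.

```python
import shlex

IMAGE = "nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant"


def docker_run_argv(env, cache_dir, image=IMAGE, port=8000):
    """Build the docker run argument list used in this guide."""
    argv = ["docker", "run", "--gpus=all"]
    for name, value in env.items():
        # A value of None emits "-e NAME", which inherits NAME from the host.
        argv += ["-e", name if value is None else f"{name}={value}"]
    argv += ["-v", f"{cache_dir}:/opt/nim/.cache", "-p", f"{port}:{port}", image]
    return argv


env = {
    "NGC_API_KEY": None,
    "VLLM_USE_FLASHINFER_MOE_FP8": "1",
    "VLLM_USE_FLASHINFER_MOE_FP4": "1",
    "NIM_PASSTHROUGH_ARGS": (
        "--kv-cache-dtype nvfp4 --mamba-ssm-cache-dtype float16 "
        "--enable-mamba-cache-stochastic-rounding "
        "--mamba-cache-philox-rounds 5 --max-num-seqs 64"
    ),
}
argv = docker_run_argv(env, "/path/to/nim-cache")  # placeholder cache path
print(shlex.join(argv))
```

Pass argv to subprocess.run to start the container; because it is a list, each value reaches Docker as a single argument without further quoting.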
Interact with the API#
The NIM exposes OpenAI-compatible and Anthropic-compatible inference endpoints. The OpenAI-compatible endpoints are the following:
Chat Completions: /v1/chat/completions
Text Completions: /v1/completions
Responses: /v1/responses
The Anthropic-compatible endpoints are the following:
Messages: /v1/messages
Count Tokens: /v1/messages/count_tokens
Tip
Chat Completions, Text Completions, Responses, and Anthropic Messages support streaming.
Note
This note applies to the Responses endpoint (/v1/responses). To carry context
across turns, send the prior turns inline in the input array. To enable
server-side storage so that previous_response_id and response retrieval work,
start the container with VLLM_ENABLE_RESPONSES_API_STORE=1. Enabling the store
increases memory usage.
Send a Chat Completion Request#
After the server is running, you can send a request to the chat completion endpoint:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-ultra-550b-a55b",
"messages": [
{
"role": "user",
"content": "Hello! How are you?"
}
],
"max_tokens": 128,
"temperature": 0.0
}'
The response has the following general format. The exact generated text can vary by
sampling settings and runtime configuration. If the container is started with
--reasoning-parser nemotron_v3, the reasoning field can contain parsed
reasoning text; otherwise it can be null.
{
"id": "chatcmpl-87d0c4524fb6f1a4",
"object": "chat.completion",
"created": 1769635152,
"model": "nvidia/nemotron-3-ultra-550b-a55b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! I'm ready to help.",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning": "..."
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null,
"token_ids": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 41,
"total_tokens": 46,
"completion_tokens": 5,
"prompt_tokens_details": null,
"completion_tokens_details": {
"reasoning_tokens": 2
}
},
"prompt_logprobs": null,
"prompt_token_ids": null,
"kv_transfer_params": null
}
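The same request can be sent from Python with only the standard library. The sketch below is illustrative: post_chat and extract_reply are hypothetical helper names, the URL assumes the default port mapping used in this guide, and the field names follow the sample response above.

```python
import json
import urllib.request


def post_chat(payload, base_url="http://localhost:8000/v1"):
    """POST a Chat Completions payload and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_reply(response):
    """Pull the fields most clients need out of a chat completion response."""
    choice = response["choices"][0]
    message = choice["message"]
    return {
        "content": message.get("content"),
        "reasoning": message.get("reasoning"),  # may be absent or null
        "truncated": choice.get("finish_reason") == "length",
    }


# Abbreviated copy of the sample response shown above.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! I'm ready to help.", "reasoning": "..."},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample))
```

With a running server, extract_reply(post_chat(payload)) combines the two steps, where payload is the same JSON object passed to curl above.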
Hold a Multi-Turn Conversation#
The Chat Completions endpoint is stateless. To carry context across turns,
resend the full conversation in the messages array, alternating user and
assistant turns after an optional system message:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-ultra-550b-a55b",
"messages": [
{"role": "system", "content": "You are a concise assistant."},
{"role": "user", "content": "My favorite color is blue."},
{"role": "assistant", "content": "Noted — your favorite color is blue."},
{"role": "user", "content": "What did I just tell you?"}
],
"max_tokens": 128,
"temperature": 0.0
}'
Note
Append only the assistant content from prior turns to the history. When the
reasoning parser (--reasoning-parser nemotron_v3) is enabled, do not feed the
reasoning field back into messages.
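A client-side history can enforce this rule by copying only the role and content of each assistant message. The following is a minimal sketch with hypothetical helper names; it works on plain dictionaries shaped like the request and response examples above.

```python
def append_user_turn(history, text):
    """Return a new history with a user message appended."""
    return history + [{"role": "user", "content": text}]


def append_assistant_turn(history, message):
    """Return a new history with only the assistant's content appended.

    Fields such as reasoning and tool_calls are deliberately dropped.
    """
    return history + [{"role": "assistant", "content": message.get("content") or ""}]


history = [{"role": "system", "content": "You are a concise assistant."}]
history = append_user_turn(history, "My favorite color is blue.")

# Shape of choices[0].message when the reasoning parser is enabled.
assistant_message = {
    "role": "assistant",
    "content": "Noted, your favorite color is blue.",
    "reasoning": "The user stated a preference.",
}
history = append_assistant_turn(history, assistant_message)
history = append_user_turn(history, "What did I just tell you?")
print(history)
```

The resulting list can be sent as the messages array of the next request.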
Use the OpenAI Python SDK#
You can direct the OpenAI Python SDK at the NIM endpoint by setting
base_url to the local /v1 API path and providing any non-empty API key:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-used",
)
response = client.chat.completions.create(
model="nvidia/nemotron-3-ultra-550b-a55b",
messages=[{"role": "user", "content": "Summarize GPU computing in one sentence."}],
max_tokens=128,
temperature=0.0,
)
print(response.choices[0].message.content)
For Anthropic Python SDK examples, refer to Messages (Anthropic-compatible).
Control Thinking Budget#
To expose parsed reasoning output and use thinking controls, start the container with the Nemotron 3 reasoning parser:
docker run --gpus=all \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_PASSTHROUGH_ARGS="--reasoning-parser nemotron_v3" \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant
Then enable thinking and set a thinking-token budget in the request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-ultra-550b-a55b",
"messages": [
{
"role": "user",
"content": "I have a 3x3 grid of integers. Rows sum to 15, 18, 21. Columns sum to 12, 20, 22. Center is 7, top-left is 2. Find one valid grid."
}
],
"max_tokens": 2048,
"temperature": 0,
"seed": 42,
"chat_template_kwargs": {"enable_thinking": true},
"thinking_token_budget": 10
}'
The thinking_token_budget value limits the thinking portion of generation.
The max_tokens value still caps total generated tokens for the request. When
the reasoning parser is enabled, the response reports how many of the generated
tokens were reasoning tokens in usage.completion_tokens_details.reasoning_tokens
(completion_tokens remains the combined total of reasoning tokens and content tokens).
Note
thinking_token_budget is supported on the Chat Completions endpoint
(/v1/chat/completions). The Responses endpoint (/v1/responses) does not
enforce a thinking-token budget.
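The token accounting described above can be sketched in a few lines. Both helper names are hypothetical; the payload fields mirror the curl request in this section, and the usage layout mirrors the sample Chat Completions response earlier on this page.

```python
MODEL = "nvidia/nemotron-3-ultra-550b-a55b"


def thinking_payload(prompt, budget, max_tokens):
    """Build a Chat Completions payload with thinking enabled and budgeted."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # still caps reasoning plus content
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_token_budget": budget,
    }


def split_usage(usage):
    """Return (reasoning_tokens, content_tokens) from a usage object."""
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    return reasoning, usage["completion_tokens"] - reasoning


usage = {
    "prompt_tokens": 41,
    "total_tokens": 46,
    "completion_tokens": 5,
    "completion_tokens_details": {"reasoning_tokens": 2},
}
print(split_usage(usage))
```

With the OpenAI Python SDK, non-standard fields such as chat_template_kwargs and thinking_token_budget are typically passed through the extra_body argument of client.chat.completions.create.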
Enable Tool Calling and MCP Workflows#
For Nemotron 3 Ultra 550B-A55B, enable OpenAI-compatible tool calling by
adding the following arguments to NIM_PASSTHROUGH_ARGS before starting the
container:
export NIM_PASSTHROUGH_ARGS="--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser nemotron_v3"
If you already use NIM_PASSTHROUGH_ARGS for profile-specific settings, append
these arguments to the same string. Both --enable-auto-tool-choice and
--tool-call-parser qwen3_coder are required when using "tool_choice": "auto".
The --reasoning-parser nemotron_v3 setting enables the built-in reasoning
parser for Nemotron 3 output.
The following request provides multiple tool choices and lets the model choose which one to call:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-ultra-550b-a55b",
"messages": [
{
"role": "user",
"content": "What is the weather in Santa Clara, CA?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}
},
{
"type": "function",
"function": {
"name": "search_docs",
"description": "Search internal documentation.",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"}
},
"required": ["query"]
}
}
}
],
"tool_choice": "auto",
"max_tokens": 256
}'
A successful tool-calling response includes a tool_calls array under
choices[0].message. Your application executes the selected tool and sends the
tool result back to the model in a follow-up Chat Completions request.
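The execute-and-reply step can be sketched as follows. This assumes the standard OpenAI tool-call layout (an id plus a function object whose arguments field is a JSON string); get_weather is a hypothetical local stand-in that returns canned data, and the assistant message is rebuilt without its reasoning field.

```python
import json


def get_weather(location):
    """Hypothetical local tool; a real one would call a weather service."""
    return {"location": location, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(messages, assistant_message):
    """Execute each requested tool and return the follow-up messages list."""
    tool_calls = assistant_message.get("tool_calls") or []
    follow_up = messages + [
        {
            "role": "assistant",
            "content": assistant_message.get("content") or "",
            "tool_calls": tool_calls,
        }
    ]
    for call in tool_calls:
        function = call["function"]
        arguments = json.loads(function["arguments"] or "{}")
        result = TOOLS[function["name"]](**arguments)
        follow_up.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
        )
    return follow_up


messages = [{"role": "user", "content": "What is the weather in Santa Clara, CA?"}]
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",  # placeholder id
            "type": "function",
            "function": {"name": "get_weather", "arguments": "{\"location\": \"Santa Clara, CA\"}"},
        }
    ],
}
print(run_tool_calls(messages, assistant_message)[-1])
```

Send the returned list as the messages array of the next request, together with the same tools definition, so the model can compose its final answer from the tool result.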
For MCP, connect to MCP servers in your client application, convert the MCP
tool schemas to the OpenAI tools format, and pass them to
/v1/chat/completions. The NIM container does not connect to MCP servers
directly. For details and LangChain/LangGraph examples, refer to
Tool Calling and MCP Integration.
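The schema conversion is mostly a renaming step. The sketch below assumes MCP tool descriptors with name, description, and inputSchema fields, as returned by an MCP tools listing; mcp_tool_to_openai is a hypothetical helper name, and the example tool is illustrative.

```python
def mcp_tool_to_openai(tool):
    """Convert one MCP tool descriptor to the OpenAI tools format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # MCP's inputSchema is already JSON Schema, so it maps to parameters.
            "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        },
    }


mcp_tool = {
    "name": "search_docs",
    "description": "Search internal documentation.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
print(mcp_tool_to_openai(mcp_tool))
```

Apply the function to every tool that your MCP client reports and pass the resulting list as the tools field of the request.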
Structured JSON Output#
For structured-output use cases, request JSON mode through the
OpenAI-compatible response_format parameter and validate the response with
your preferred schema library, such as Pydantic:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-ultra-550b-a55b",
"messages": [
{
"role": "user",
"content": "Return JSON with keys name and purpose for NVIDIA NIM."
}
],
"response_format": {"type": "json_object"},
"max_tokens": 128,
"temperature": 0.0
}'
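JSON mode yields syntactically valid JSON, but the keys still need checking on the client. The sketch below uses only the standard library in place of Pydantic; NimSummary and parse_summary are hypothetical names matching the name and purpose keys requested in the prompt above.

```python
import json
from dataclasses import dataclass


@dataclass
class NimSummary:
    name: str
    purpose: str


def parse_summary(content):
    """Parse choices[0].message.content and check the expected keys."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"name", "purpose"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return NimSummary(name=str(data["name"]), purpose=str(data["purpose"]))


# Illustrative content string, not actual model output.
example = '{"name": "NVIDIA NIM", "purpose": "Serve models behind standard APIs."}'
print(parse_summary(example))
```

A Pydantic model with the same two fields would serve the same role and add type coercion and richer error messages.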
Frameworks that support OpenAI-compatible chat completions, such as
LangChain, LangGraph, LlamaIndex, Pipecat, OpenCode, and similar agent
frameworks, can use the local NIM endpoint by setting their base URL to
http://localhost:8000/v1 and using the served model name.
Verify Health Endpoints#
You can verify that the NIM container is running and ready to accept requests by checking its health endpoints. By default, these endpoints are served on port 8000. If you set NIM_HEALTH_PORT, use that port instead.
Live Endpoint#
Perform a liveness check to see if the server is running:
curl -v http://localhost:8000/v1/health/live
Example response:
GET /v1/health/live HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 61
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate
{
"object": "health.response",
"message": "live",
"status": "live"
}
Ready Endpoint#
Perform a readiness check to see if the model is fully loaded and ready for inference:
curl -v http://localhost:8000/v1/health/ready
Example response:
GET /v1/health/ready HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 63
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate
{
"object": "health.response",
"message": "ready",
"status": "ready"
}
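Because the first launch can take a long time, scripts usually poll the ready endpoint instead of checking it once. The following is a minimal sketch using the standard library; the function names, timeout, and polling interval are illustrative choices, and the probe and sleep parameters exist only so the loop can be exercised without a server.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # for example, a non-200 status while loading
    except (urllib.error.URLError, OSError):
        return None  # connection refused: the container is not listening yet


def wait_until_ready(
    url="http://localhost:8000/v1/health/ready",
    timeout_s=3600,
    interval_s=10,
    probe=http_status,
    sleep=time.sleep,
):
    """Poll the ready endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Call wait_until_ready() after docker run and send inference requests only when it returns True. If you set NIM_HEALTH_PORT, adjust the URL accordingly.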
Streaming#
To receive responses incrementally as they are generated, you can enable streaming by adding "stream": true to your request payload. This is supported across the /v1/chat/completions, /v1/completions, /v1/responses, and /v1/messages endpoints.
When streaming is enabled, the API returns a sequence of Server-Sent Events (SSE).
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-ultra-550b-a55b",
"messages": [
{
"role": "user",
"content": "Write a short poem about a robot."
}
],
"max_tokens": 100,
"stream": true
}'
For Chat Completions and Text Completions, the response is streamed back in chunks; each chunk is a data: line that carries a JSON object. These streams terminate with a data: [DONE] message:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":"In"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":" cir"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":"cuits"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
...
data: [DONE]
For the Responses API, stream events use typed SSE events such as
response.output_text.delta and terminate with response.completed. For
Anthropic-compatible Messages, stream events use Anthropic event names such as
content_block_delta and terminate with message_stop.
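For the Chat Completions stream format shown above, a client reassembles the text by concatenating the delta.content fragments. The sketch below parses already-decoded text lines; collect_stream_text is a hypothetical helper, and it does not handle the Responses or Anthropic event formats.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from Chat Completions SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


# Abbreviated versions of the chunks shown above.
sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"In"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" cir"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"cuits"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))
```

When reading from an HTTP response object, decode each line from bytes to text before passing it to the function.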