Get Started with Nemotron 3 Ultra 550B-A55B#

Nemotron 3 Ultra 550B-A55B, the largest of the Nemotron 3 models, provides state-of-the-art accuracy and reasoning performance. This text-only, reasoning-capable model has 550B total parameters and up to 55B active parameters per token. It uses a hybrid Mamba-Transformer (Nemotron-H) mixture-of-experts architecture. The NIM exposes OpenAI-compatible and Anthropic-compatible APIs, so existing clients can use the same container for Chat Completions, Responses, Anthropic Messages, tool calling, and agentic workflows.

This page is the model-specific Day-0 guide for Nemotron 3 Ultra 550B-A55B. For the generic NIM LLM onboarding path, refer to Quickstart. For full endpoint examples, refer to API Reference. For tool calling and MCP integration, refer to Tool Calling and MCP Integration.

Prerequisites#

Before deploying a NIM LLM container, ensure your environment meets the following requirements:

Hardware Requirements#

The following are the minimum required specifications for supported hardware components:

Requirement    Specification
CPU            AMD64, ARM64
GPU            Refer to Support Matrix for NIM Day 0

Software Requirements#

The following are the minimum required versions for supported software components:

Requirement          Specification
Operating System     Ubuntu 22.04 LTS or later recommended
Container Toolkit    1.14.0 or later
CUDA SDK             12.9 or later
GPU Driver           580 or later
Docker               24.0 or later

Operating System#

While other Linux distributions can be compatible with NIM, they have not been officially validated.

We recommend using Ubuntu 22.04 LTS or later for the best experience.

CUDA SDK#

Install CUDA SDK by following the CUDA installation guide for Linux.

GPU Drivers#

Install the NVIDIA GPU drivers by following the NVIDIA Driver Installation Guide.

Docker#

Docker is required to run the containerized NIM services.

  1. Install Docker Engine for your Linux distribution by following the Docker Engine installation guide.

  2. Verify that the Docker daemon is running and that your user can execute docker commands without sudo. Add your user to the docker group if needed:

    sudo groupadd docker
    sudo usermod -aG docker $USER
    
  3. Log out and back in for the group change to take effect.

Container Toolkit#

The NVIDIA Container Toolkit enables Docker containers to access the host GPU.

  1. Install the toolkit by following the NVIDIA Container Toolkit installation guide.

  2. Configure Docker to use the NVIDIA runtime by following the Docker configuration steps.

  3. Restart the Docker daemon after configuration:

    sudo systemctl restart docker
    

NIM Container Access#

To download and deploy NIM containers, you need access credentials for NGC.

Generate Access Credentials#

An NGC Personal API Key is required to access NVIDIA NIM containers and models hosted on NGC.

  1. Generate the Personal API Key on the Setup API Keys page.

  2. When creating the Personal API key, select at least NGC Catalog from the Services Included list. You can also include additional services if you want to use the same key for other purposes.

Warning

Legacy API keys are not supported by NIM. Always use a Personal API Key.

Verify NVIDIA Runtime Access#

To ensure that your setup is correct, run the following command:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This command should produce output similar to the following, where you can confirm the driver version, the CUDA version, and the available GPUs.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   36C    P0            112W /  700W |   78489MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
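The version check described above can also be scripted. The following is a minimal sketch, assuming the banner line keeps the `Driver Version:` and `CUDA Version:` labels shown in the sample output. The function names `parse_smi_header` and `meets_minimums` are illustrative, and the thresholds come from the Software Requirements table. Note that the CUDA version in this banner is the one reported by the driver, which is not necessarily the installed CUDA SDK version.

```python
import re


def parse_smi_header(text: str) -> dict:
    """Extract the driver and CUDA versions from nvidia-smi banner text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    if not driver or not cuda:
        raise ValueError("could not find version fields in nvidia-smi output")
    return {"driver": driver.group(1), "cuda": cuda.group(1)}


def meets_minimums(versions: dict, min_driver: int = 580, min_cuda=(12, 9)) -> bool:
    """Compare the driver major version and the CUDA (major, minor) pair to minimums."""
    driver_major = int(versions["driver"].split(".")[0])
    cuda_pair = tuple(int(part) for part in versions["cuda"].split("."))
    return driver_major >= min_driver and cuda_pair >= min_cuda


# Banner line copied from the sample output above.
sample = "| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 12.9     |"
versions = parse_smi_header(sample)
print(versions, meets_minimums(versions))
```

To check a live system, pass the captured output of the `docker run ... nvidia-smi` command to `parse_smi_header` instead of the sample string.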

Configuration#

Use environment variables to control authentication and model caching.

Export NGC API Key#

  1. Export the variable in your shell (temporary), replacing <VALUE> with your actual API key:

    export NGC_API_KEY=<VALUE>
    
  2. Persist the variable (optional):

    If using bash:

    echo "export NGC_API_KEY=$NGC_API_KEY" >> ~/.bashrc
    

    If using zsh:

    echo "export NGC_API_KEY=$NGC_API_KEY" >> ~/.zshrc
    
  3. Verify the variable is set:

    echo "$NGC_API_KEY"
    

Model Cache#

NIM downloads model weights to a cache on the host that you mount into the container. Artifacts persist across restarts, so you do not pull the full model on every run.

Local Cache#

Set a cache directory on the host and mount it into the container. Assets such as model weights are downloaded to this host directory and persist across container restarts, which avoids re-downloading large model files. The environment variable that holds the cache path can have any name; the examples in this section use LOCAL_NIM_CACHE.

Create the cache directory and export an environment variable:

export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Optionally make the cache world-writable with the sticky bit set, so that a container running as a different user can still write to it
chmod -R a+rwxt "$LOCAL_NIM_CACHE"

When you start the NIM container, map the host cache directory ($LOCAL_NIM_CACHE) to the container's internal cache path (/opt/nim/.cache) with a Docker volume mount, such as -v "$LOCAL_NIM_CACHE:/opt/nim/.cache". Because containers are ephemeral, any data stored only inside the container is lost when it stops. With the volume mount, the model weights are saved on the host, so subsequent container runs detect the existing files, skip the download, and start faster.

Cache Directory Permissions#

The NIM container runs as a non-root user with GID 0 (root group), so the cache directory on your host must be writable by GID 0. The following example uses a separate variable, NIM_CACHE_PATH, for the cache path:

export NIM_CACHE_PATH=/tmp/nim-cache
mkdir -p "$NIM_CACHE_PATH"
sudo chgrp -R 0 "$NIM_CACHE_PATH"
sudo chmod -R g+rwX "$NIM_CACHE_PATH"

Run the container with the cache mounted:

docker run --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  ...

To run as a custom user (e.g., your host user), pass -u <uid>:0:

docker run --gpus all -u $(id -u):0 \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  ...

Important

When using -u <uid>, you must include :0 to set GID 0 (e.g., -u $(id -u):0). The container’s writable directories are group-owned by GID 0. Without it, the container will fail with PermissionError when writing to cache, config, or log paths.
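The GID 0 requirement above can be checked before starting the container. The following is a minimal sketch that tests only the condition named in this section: group ownership by GID 0 with group read, write, and execute bits on the top-level directory. The function names are illustrative, and the check does not walk the directory tree or consider world-writable permissions.

```python
import os
import stat

GROUP_RWX = stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP


def writable_by_gid0(st_gid: int, st_mode: int) -> bool:
    """True when the entry is group-owned by GID 0 and grants group rwx."""
    return st_gid == 0 and (st_mode & GROUP_RWX) == GROUP_RWX


def check_cache_dir(path: str) -> bool:
    """Apply the GID 0 check to a directory on the host."""
    info = os.stat(path)
    return stat.S_ISDIR(info.st_mode) and writable_by_gid0(info.st_gid, info.st_mode)


# Example with literal values: a directory with mode 0770 owned by group 0 passes.
print(writable_by_gid0(0, 0o040770))
```

For example, `check_cache_dir("/tmp/nim-cache")` would be expected to return True after running the chgrp and chmod commands shown earlier in this section.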

Tip

To make this setting permanent across terminal sessions, you can add export LOCAL_NIM_CACHE=~/.cache/nim to your ~/.bashrc or ~/.zshrc profile.

Installation#

Before running a NIM LLM container, you must authenticate with your deployment source, accept the governing terms, and pull the container image.

Docker Login#

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

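Use the literal string $oauthtoken as the username and the value of NGC_API_KEY as the password. The single quotes in the command prevent the shell from expanding $oauthtoken as a variable. This special username indicates that you are authenticating with an API key rather than a username and password.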

Accept the Governing Terms#

Before you download a given NIM for the first time, you must accept the governing terms in the browser. Navigate to the NGC Catalog page for the NIM (Nemotron 3 Ultra 550B-A55B on NGC) and click the Accept Terms button.

Pull the Container Image#

After you have generated your API key and authenticated with your deployment source, use the docker pull command to download the NIM container image to your host machine:

docker pull nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant

Storage and Startup Notes#

Nemotron 3 Ultra 550B-A55B is a large model, so make sure the host has enough free disk space for the container image and the model cache. As a rough guide, the container image is approximately 38 GB, and the model cache ranges from approximately 330 GB for an NVFP4 profile to approximately 1.1–1.7 TB for a BF16 profile, depending on precision and GPU configuration. Reserve additional space if you download multiple profiles or keep older container images on the same host.
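The sizing guidance above can be turned into a quick pre-flight check. The following is a rough sketch, assuming decimal gigabytes, the upper bound of the BF16 range, and that the container image and model cache share one filesystem (which makes the estimate conservative when they do not). The profile keys `nvfp4` and `bf16` are labels chosen for this example, not profile identifiers used by NIM.

```python
import os
import shutil

GB = 10**9  # assumption: the sizes quoted above are decimal gigabytes

IMAGE_GB = 38
# Approximate cache sizes from the text; BF16 uses the upper end of 1.1-1.7 TB.
CACHE_GB = {"nvfp4": 330, "bf16": 1700}


def required_bytes(profile: str) -> int:
    """Approximate bytes needed for the image plus one profile's model cache."""
    return (IMAGE_GB + CACHE_GB[profile]) * GB


def has_room(free_bytes: int, profile: str) -> bool:
    """True when the free space covers the approximate requirement."""
    return free_bytes >= required_bytes(profile)


# Report against the filesystem that holds the home directory.
free = shutil.disk_usage(os.path.expanduser("~")).free
for name in CACHE_GB:
    print(name, "fits" if has_room(free, name) else "needs more space")
```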

The first launch downloads the model artifacts into the mounted cache directory, which can take a significant amount of time depending on your hardware and network. Subsequent launches reuse the mounted cache and start faster.

Tip

By default, the model download produces little log output and can appear idle. To follow the download progress, add -e NIM_LOG_LEVEL=INFO to the docker run command.

To pre-populate the cache before serving traffic, run download-to-cache with the same image, API key, and cache mount:

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant \
  download-to-cache

Run NIM#

Run the container using your NGC API Key to authenticate and download the model.

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant

Interact with the API#

The NIM exposes OpenAI-compatible and Anthropic-compatible inference endpoints. The OpenAI-compatible endpoints are the following:

  • Chat Completions: /v1/chat/completions

  • Text Completions: /v1/completions

  • Responses: /v1/responses

The Anthropic-compatible endpoints are the following:

  • Messages: /v1/messages

  • Count Tokens: /v1/messages/count_tokens

Tip

Chat Completions, Text Completions, Responses, and Anthropic Messages support streaming.

Note

The Responses endpoint does not store responses by default. To carry context across turns, send the prior turns inline in the input array. To enable server-side storage so that previous_response_id and response retrieval work, start the container with VLLM_ENABLE_RESPONSES_API_STORE=1. Enabling the store increases memory usage.

Send a Chat Completion Request#

After the server is running, you can send a request to the chat completion endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "Hello! How are you?"
      }
    ],
    "max_tokens": 128,
    "temperature": 0.0
  }'

The response has the following general format. The exact generated text can vary by sampling settings and runtime configuration. If the container is started with --reasoning-parser nemotron_v3 in NIM_PASSTHROUGH_ARGS, the reasoning field can contain parsed reasoning text; otherwise it can be null.

{
  "id": "chatcmpl-87d0c4524fb6f1a4",
  "object": "chat.completion",
  "created": 1769635152,
  "model": "nvidia/nemotron-3-ultra-550b-a55b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm ready to help.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "..."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 41,
    "total_tokens": 46,
    "completion_tokens": 5,
    "prompt_tokens_details": null,
    "completion_tokens_details": {
      "reasoning_tokens": 2
    }
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
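As an illustration of reading this response in client code, the sketch below pulls out the fields most applications need. It assumes the response has already been decoded from JSON into a Python dict with the layout shown above; the function name `summarize_completion` is illustrative, and the sample dict is a trimmed copy of the example response.

```python
def summarize_completion(resp: dict) -> dict:
    """Collect the answer text, parsed reasoning, and token counts from a response."""
    choice = resp["choices"][0]
    message = choice["message"]
    usage = resp.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return {
        "content": message.get("content"),
        "reasoning": message.get("reasoning"),  # may be None without the reasoning parser
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
        "reasoning_tokens": details.get("reasoning_tokens"),
    }


# Trimmed copy of the example response above.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! I'm ready to help.", "reasoning": "..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {
        "prompt_tokens": 41,
        "total_tokens": 46,
        "completion_tokens": 5,
        "completion_tokens_details": {"reasoning_tokens": 2},
    },
}
summary = summarize_completion(sample_response)
print(summary["content"], summary["reasoning_tokens"])
```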

Hold a Multi-Turn Conversation#

The Chat Completions endpoint is stateless. To carry context across turns, resend the full conversation in the messages array, alternating user and assistant turns after an optional system message:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "My favorite color is blue."},
      {"role": "assistant", "content": "Noted — your favorite color is blue."},
      {"role": "user", "content": "What did I just tell you?"}
    ],
    "max_tokens": 128,
    "temperature": 0.0
  }'

Note

Append only the assistant content from prior turns to the history. When the reasoning parser (--reasoning-parser nemotron_v3) is enabled, do not feed the reasoning field back into messages.
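One way to follow this rule in client code is to rebuild the history from the content field alone. The sketch below is a minimal example, assuming the assistant message is the dict found at choices[0].message in the previous response; the helper name `append_turn` is illustrative.

```python
def append_turn(history: list, assistant_message: dict, next_user_text: str) -> list:
    """Return a new history with the assistant content and the next user turn.

    Only the content field is copied, so any reasoning field on the
    assistant message is left out of the resent conversation.
    """
    return history + [
        {"role": "assistant", "content": assistant_message.get("content") or ""},
        {"role": "user", "content": next_user_text},
    ]


history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "My favorite color is blue."},
]
previous = {"role": "assistant", "content": "Noted.", "reasoning": "The user stated a preference."}
history = append_turn(history, previous, "What did I just tell you?")
print(history)
```

The resulting list can be sent as the messages array of the next Chat Completions request.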

Use the OpenAI Python SDK#

You can direct the OpenAI Python SDK at the NIM endpoint by setting base_url to the local /v1 API path and providing any non-empty API key:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[{"role": "user", "content": "Summarize GPU computing in one sentence."}],
    max_tokens=128,
    temperature=0.0,
)

print(response.choices[0].message.content)

For Anthropic Python SDK examples, refer to Messages (Anthropic-compatible).

Control Thinking Budget#

To expose parsed reasoning output and use thinking controls, start the container with the Nemotron 3 reasoning parser:

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_PASSTHROUGH_ARGS="--reasoning-parser nemotron_v3" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant

Then enable thinking and set a thinking-token budget in the request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "I have a 3x3 grid of integers. Rows sum to 15, 18, 21. Columns sum to 12, 20, 22. Center is 7, top-left is 2. Find one valid grid."
      }
    ],
    "max_tokens": 2048,
    "temperature": 0,
    "seed": 42,
    "chat_template_kwargs": {"enable_thinking": true},
    "thinking_token_budget": 10
  }'

The thinking_token_budget value limits the thinking portion of generation. The max_tokens value still caps total generated tokens for the request. When the reasoning parser is enabled, the response reports how many of the generated tokens were reasoning tokens in usage.completion_tokens_details.reasoning_tokens (completion_tokens remains the combined total of reasoning tokens and content tokens).

Note

thinking_token_budget is supported on the Chat Completions endpoint (/v1/chat/completions). The Responses endpoint (/v1/responses) does not enforce a thinking-token budget.
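The same request can be sent through the OpenAI Python SDK. Because chat_template_kwargs and thinking_token_budget are not standard SDK parameters, the sketch below passes them through the SDK's extra_body argument. The helper names `thinking_request` and `run_with_budget` are illustrative, and `client` is assumed to be an OpenAI client configured with the local base_url as shown in the SDK section.

```python
def thinking_request(prompt: str, budget: int, max_tokens: int = 2048) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": "nvidia/nemotron-3-ultra-550b-a55b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # still caps total generated tokens
        "temperature": 0,
        # Non-standard fields are merged into the request body by the SDK.
        "extra_body": {
            "chat_template_kwargs": {"enable_thinking": True},
            "thinking_token_budget": budget,
        },
    }


def run_with_budget(client, prompt: str, budget: int):
    """Send the request and return (content, reasoning_tokens)."""
    response = client.chat.completions.create(**thinking_request(prompt, budget))
    details = response.usage.completion_tokens_details
    reasoning_tokens = details.reasoning_tokens if details else None
    return response.choices[0].message.content, reasoning_tokens


kwargs = thinking_request("Find one valid grid.", budget=10)
print(kwargs["extra_body"])
```

With a running server, `run_with_budget(client, "Find one valid grid.", 10)` would return the answer text along with the reported reasoning-token count.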

Enable Tool Calling and MCP Workflows#

For Nemotron 3 Ultra 550B-A55B, enable OpenAI-compatible tool calling by adding the following arguments to NIM_PASSTHROUGH_ARGS before starting the container:

export NIM_PASSTHROUGH_ARGS="--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser nemotron_v3"

If you already use NIM_PASSTHROUGH_ARGS for profile-specific settings, append these arguments to the same string. Both --enable-auto-tool-choice and --tool-call-parser qwen3_coder are required when using "tool_choice": "auto". The --reasoning-parser nemotron_v3 setting enables the built-in reasoning parser for Nemotron 3 output.

The following request provides multiple tool choices and lets the model choose which one to call:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Santa Clara, CA?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string"}
            },
            "required": ["location"]
          }
        }
      },
      {
        "type": "function",
        "function": {
          "name": "search_docs",
          "description": "Search internal documentation.",
          "parameters": {
            "type": "object",
            "properties": {
              "query": {"type": "string"}
            },
            "required": ["query"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "max_tokens": 256
  }'

A successful tool-calling response includes a tool_calls array under choices[0].message. Your application executes the selected tool and sends the tool result back to the model in a follow-up Chat Completions request.
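The round trip described above can be sketched as follows. This example assumes the OpenAI tool-calling message format, in which each entry of tool_calls carries an id and a function object whose arguments field is a JSON string. The `get_weather` implementation is a hypothetical stub, and the helper name `tool_followup` is illustrative.

```python
import json


def get_weather(location: str) -> dict:
    # Hypothetical stub; a real application would query a weather service here.
    return {"location": location, "forecast": "unknown"}


TOOLS = {"get_weather": get_weather}


def tool_followup(messages: list, assistant_message: dict) -> list:
    """Run each requested tool and build the messages for the follow-up request."""
    followup = list(messages)
    followup.append(
        {
            "role": "assistant",
            "content": assistant_message.get("content"),
            "tool_calls": assistant_message["tool_calls"],
        }
    )
    for call in assistant_message["tool_calls"]:
        function = call["function"]
        arguments = json.loads(function["arguments"] or "{}")
        result = TOOLS[function["name"]](**arguments)
        followup.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
        )
    return followup


# Hand-written stand-in for choices[0].message from a tool-calling response.
assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"location": "Santa Clara, CA"}'},
        }
    ],
}
messages = [{"role": "user", "content": "What is the weather in Santa Clara, CA?"}]
next_messages = tool_followup(messages, assistant)
print(next_messages[-1])
```

Send `next_messages` as the messages array of the follow-up request, together with the same tools list, so that the model can produce its final answer.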

For MCP, connect to MCP servers in your client application, convert the MCP tool schemas to the OpenAI tools format, and pass them to /v1/chat/completions. The NIM container does not connect to MCP servers directly. For details and LangChain/LangGraph examples, refer to Tool Calling and MCP Integration.
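The schema conversion step can be as small as the sketch below. It assumes each MCP tool has been obtained from the server's tool listing as a dict with name, description, and inputSchema fields, which is how the MCP specification describes tools. The function name `mcp_tool_to_openai` is illustrative, and how the client connects to the MCP server is outside the scope of this example.

```python
def mcp_tool_to_openai(tool: dict) -> dict:
    """Map one MCP tool description onto the OpenAI tools format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # MCP input schemas are JSON Schema objects, as are OpenAI parameters.
            "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        },
    }


# Hand-written MCP tool entry used only for illustration.
mcp_tool = {
    "name": "search_docs",
    "description": "Search internal documentation.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
openai_tool = mcp_tool_to_openai(mcp_tool)
print(openai_tool["function"]["name"])
```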

Structured JSON Output#

For structured-output use cases, request JSON mode through the OpenAI-compatible response_format parameter and validate the response with your preferred schema library, such as Pydantic:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "Return JSON with keys name and purpose for NVIDIA NIM."
      }
    ],
    "response_format": {"type": "json_object"},
    "max_tokens": 128,
    "temperature": 0.0
  }'
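JSON mode returns syntactically valid JSON, but it does not by itself guarantee particular keys, so validating the result on the client side is worthwhile. The sketch below uses only the standard library instead of Pydantic to stay dependency-free; the `NimSummary` dataclass and the `parse_summary` helper are names invented for this example, and the sample string is hand-written rather than actual model output.

```python
import json
from dataclasses import dataclass


@dataclass
class NimSummary:
    name: str
    purpose: str


def parse_summary(content: str) -> NimSummary:
    """Decode the message content and check that both keys are strings."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in ("name", "purpose") if not isinstance(data.get(key), str)]
    if missing:
        raise ValueError(f"missing or non-string keys: {missing}")
    return NimSummary(name=data["name"], purpose=data["purpose"])


# Hand-written stand-in for choices[0].message.content.
content = '{"name": "NVIDIA NIM", "purpose": "Serve models behind standard APIs."}'
summary = parse_summary(content)
print(summary)
```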

Frameworks that support OpenAI-compatible chat completions, such as LangChain, LangGraph, LlamaIndex, Pipecat, OpenCode, and similar agent frameworks, can use the local NIM endpoint by setting their base URL to http://localhost:8000/v1 and using the served model name.

Verify Health Endpoints#

You can verify that the NIM container is running and ready to accept requests by checking its health endpoints. By default, these endpoints are served on port 8000. If you set NIM_HEALTH_PORT, use that port instead.

Live Endpoint#

Perform a liveness check to see if the server is running:

curl -v http://localhost:8000/v1/health/live

Example response:

GET /v1/health/live HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 61
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "live",
  "status": "live"
}

Ready Endpoint#

Perform a readiness check to see if the model is fully loaded and ready for inference:

curl -v http://localhost:8000/v1/health/ready

Example response:

GET /v1/health/ready HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 63
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "ready",
  "status": "ready"
}
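Because the first launch can take a long time, a script that waits for readiness is often more convenient than repeating the curl command. The following is a minimal sketch using only the standard library. It assumes that any non-200 response or connection error means the server is not ready yet; the function names, attempt count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request

READY_URL = "http://localhost:8000/v1/health/ready"


def is_ready(url: str = READY_URL, timeout: float = 5.0) -> bool:
    """Return True when the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(probe=is_ready, attempts: int = 60, delay: float = 10.0, sleep=time.sleep) -> bool:
    """Call probe() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` with the defaults polls the local endpoint for up to roughly ten minutes. The `probe` and `sleep` parameters exist so that the loop can be exercised without a running server.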

Streaming#

To receive responses incrementally as they are generated, you can enable streaming by adding "stream": true to your request payload. This is supported across the /v1/chat/completions, /v1/completions, /v1/responses, and /v1/messages endpoints.

When streaming is enabled, the API returns a sequence of Server-Sent Events (SSE).

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "Write a short poem about a robot."
      }
    ],
    "max_tokens": 100,
    "stream": true
  }'

For Chat Completions and Text Completions, the response is streamed back in chunks. Each chunk is a data: line that carries a JSON object, and the stream terminates with a data: [DONE] message:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":"In"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":" cir"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-ultra-550b-a55b","choices":[{"index":0,"delta":{"content":"cuits"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

...

data: [DONE]

For the Responses API, stream events use typed SSE events such as response.output_text.delta and terminate with response.completed. For Anthropic-compatible Messages, stream events use Anthropic event names such as content_block_delta and terminate with message_stop.
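To show how a client might consume the Chat Completions stream format above, the sketch below assembles the generated text from a sequence of SSE lines. It handles only the data: lines and the [DONE] sentinel shown in the sample, not the typed events of the Responses or Anthropic Messages streams. The function name `collect_stream_text` is illustrative, and the input lines are shortened copies of the sample chunks.

```python
import json


def collect_stream_text(lines) -> str:
    """Concatenate delta.content values from Chat Completions SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


# Shortened copies of the sample chunks above.
sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"In"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" cir"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"cuits"}}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))
```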