Quickstart#

How to Choose the Right Model Profile#

NIM VLM automatically selects an optimal model profile based on the detected hardware (for example, the number of GPUs or the GPU architecture). If you need to override this selection manually, set the NIM_MODEL_PROFILE environment variable. For more information, see Profile Selection.

Run NIM#

Before running a NIM VLM container, make sure you have met all prerequisites and completed installation and configuration, including setting your API keys, logging in to Docker, pulling the container image, and configuring LOCAL_NIM_CACHE.

Tip

Mounting a local cache directory lets you avoid re-downloading the model on subsequent restarts. See Local cache for details.

Run the container, using your NGC API key to authenticate and download the model:

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-content-safety:2.0.0

Interact With the API#

There are two main inference endpoints:

  • Chat Completions: /v1/chat/completions

  • Text Completions: /v1/completions

Tip

Both endpoints support streaming.

Find the Model Name#

Replace <model-name> in the examples below with the model name served by your NIM container. To find it, query the models endpoint:

curl -s http://localhost:8000/v1/models

The id field in the response is the model name to use in your requests. For model-specific NIMs, this matches the model identifier (for example, nvidia/nemotron-3-content-safety). For model-free NIMs, the name is derived from the container image.
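The lookup above can be scripted. The following is a minimal sketch in Python that extracts the model name from a /v1/models response body; the response here is a hypothetical abridged example following the OpenAI-compatible list format, and the id value depends on which NIM you are running.

```python
import json

# Abridged /v1/models response body (OpenAI-compatible "list" format).
models_response = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "nvidia/nemotron-3-content-safety", "object": "model"}
  ]
}
""")

# The "id" of the first entry is the model name to use in requests.
model_name = models_response["data"][0]["id"]
print(model_name)
```

In a live setup you would fetch the body from http://localhost:8000/v1/models instead of hard-coding it.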

Note

The model name is user-configurable by setting the NIM_SERVED_MODEL_NAME environment variable. For more information, see Environment Variables.

Send a Chat Completion Request#

Once the server is running, you can send a request to the chat completions endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [
      {
        "role": "user",
        "content": "Hello! How are you?"
      }
    ],
    "max_tokens": 100
  }'

An example response (the id and created values will differ):

{
   "id":"chatcmpl-8d90b13688d853b4",
   "object":"chat.completion",
   "created":1777488981,
   "model":"nvidia/nemotron-3-content-safety",
   "choices":[
      {
         "index":0,
         "message":{
            "role":"assistant",
            "content":"User Safety: safe",
            "refusal":null,
            "annotations":null,
            "audio":null,
            "function_call":null,
            "tool_calls":[],
            "reasoning":null
         },
         "logprobs":null,
         "finish_reason":"stop",
         "stop_reason":106,
         "token_ids":null
      }
   ],
   "service_tier":null,
   "system_fingerprint":null,
   "usage":{
      "prompt_tokens":431,
      "total_tokens":436,
      "completion_tokens":5,
      "prompt_tokens_details":null
   },
   "prompt_logprobs":null,
   "prompt_token_ids":null,
   "kv_transfer_params":null
}
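When consuming the response programmatically, the generated text is in choices[0].message.content. A minimal sketch, using an abridged copy of the example response above:

```python
import json

# Abridged chat completion response (subset of the fields shown above).
response = json.loads("""
{
  "id": "chatcmpl-8d90b13688d853b4",
  "object": "chat.completion",
  "model": "nvidia/nemotron-3-content-safety",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "User Safety: safe"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 431, "completion_tokens": 5, "total_tokens": 436}
}
""")

# Extract the generated text and the token accounting.
content = response["choices"][0]["message"]["content"]
usage = response["usage"]
print(content)

# total_tokens is the sum of prompt and completion tokens.
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
```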

Verify Health Endpoints#

You can verify that the NIM container is running and ready to accept requests by checking its health endpoints. By default, these endpoints are served on port 8000. If you set NIM_HEALTH_PORT, use that port instead.

Live Endpoint#

Perform a liveness check to see if the server is running:

curl http://localhost:8000/v1/health/live

Example request and response:

GET /v1/health/live HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 61
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "live",
  "status": "live"
}

Ready Endpoint#

Perform a readiness check to see if the model is fully loaded and ready for inference:

curl http://localhost:8000/v1/health/ready

Example request and response:

GET /v1/health/ready HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 63
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "ready",
  "status": "ready"
}
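If you poll these endpoints from a startup script or probe, checking the status field of the JSON body is more robust than checking the HTTP status code alone. A minimal sketch, using the example bodies above (the helper name is illustrative, not part of the NIM API):

```python
import json

def is_healthy(body: str, expected: str) -> bool:
    """Return True if a health endpoint body reports the expected status."""
    try:
        return json.loads(body).get("status") == expected
    except json.JSONDecodeError:
        # A non-JSON body (for example, a proxy error page) is not healthy.
        return False

live_body = '{"object": "health.response", "message": "live", "status": "live"}'
ready_body = '{"object": "health.response", "message": "ready", "status": "ready"}'
print(is_healthy(live_body, "live"), is_healthy(ready_body, "ready"))
```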

Streaming#

To receive responses incrementally as they are generated, enable streaming by adding "stream": true to your request payload. This is supported for /v1/chat/completions and /v1/completions.

When streaming is enabled, the API returns a sequence of Server-Sent Events (SSE).

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [
      {
        "role": "user",
        "content": "Write a short poem about a robot."
      }
    ],
    "max_tokens": 100,
    "stream": true
  }'

The response is streamed in chunks, each sent as a data: line containing a JSON object. The stream terminates with a data: [DONE] message:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-content-safety","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-content-safety","choices":[{"index":0,"delta":{"content":"In"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-content-safety","choices":[{"index":0,"delta":{"content":" cir"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-content-safety","choices":[{"index":0,"delta":{"content":"cuits"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

...

data: [DONE]
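A client reassembles the full text by concatenating the delta.content of each chunk until the [DONE] sentinel. A minimal sketch, using abridged copies of the chunks above (only the fields the loop reads are kept):

```python
import json

# SSE lines as they arrive on the wire (abridged from the stream above).
raw_events = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"In"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" cir"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"cuits"}}]}',
    'data: [DONE]',
]

chunks = []
for line in raw_events:
    if not line.startswith("data: "):
        continue                     # skip blank keep-alive lines
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break                        # end-of-stream sentinel
    event = json.loads(payload)
    delta = event["choices"][0]["delta"]
    chunks.append(delta.get("content", ""))

print("".join(chunks))
```

In a live setup you would read these lines from the HTTP response stream instead of a list.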