Quickstart#
How to Choose the Right Model Profile#
NIM VLM automatically selects the optimal model profile based
on the detected hardware (for example, number of GPUs or GPU architecture). If you
need to manually override this selection, you can set the NIM_MODEL_PROFILE
environment variable. For more information, see Profile Selection.
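For example, here is a minimal sketch that overrides the automatic selection by adding the variable to the docker run command shown in Run NIM below; <profile-id> is a placeholder for a profile ID available on your hardware:
docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id> \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-content-safety:2.0.0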
Run NIM#
Before running a NIM VLM container, make sure you have met all
prerequisites and completed
installation and
configuration, including setting your API keys,
logging in to Docker, pulling the container image, and configuring
LOCAL_NIM_CACHE.
Tip
Mounting a local cache directory lets you avoid re-downloading the model on subsequent restarts. See Local cache for details.
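If you have not completed those setup steps yet, here is a minimal sketch; the cache path is only an example location, and $oauthtoken is the literal username expected by nvcr.io:
# Export your NGC API key (replace the placeholder with your actual key).
export NGC_API_KEY=<your-ngc-api-key>

# Log in to the NVIDIA container registry; the username is the literal string $oauthtoken.
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

# Optionally pull the image ahead of time (docker run also pulls it on first use).
docker pull nvcr.io/nim/nvidia/nemotron-3-content-safety:2.0.0

# Create a local cache directory so downloaded model files persist across restarts.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"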
Run the container using your NGC API key to authenticate and download the model:
docker run --gpus=all \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
nvcr.io/nim/nvidia/nemotron-3-content-safety:2.0.0
Interact With the API#
There are two main inference endpoints:

Chat Completions: /v1/chat/completions

Text Completions: /v1/completions
Tip
Both endpoints support streaming.
Find the Model Name#
Replace <model-name> in the examples below with the model name served by your
NIM container. To find it, query the models endpoint:
curl -s http://localhost:8000/v1/models
The id field in the response is the model name to use in your requests. For
model-specific NIMs, this matches the model identifier (for example,
nvidia/nemotron-3-content-safety). For model-free NIMs, the name is derived from the
container image.
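If you have jq installed, you can extract the name directly; this sketch assumes the response uses the OpenAI-style list format with a top-level data array:
curl -s http://localhost:8000/v1/models | jq -r '.data[0].id'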
Note
The model name is user-configurable by setting the
NIM_SERVED_MODEL_NAME environment variable. For more information, see
Environment Variables.
Send a Chat Completion Request#
Once the server is running, you can send a request to the chat completion endpoint:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-name>",
"messages": [
{
"role": "user",
"content": "Hello! How are you?"
}
],
"max_tokens": 100
}'
Example response:
{
"id":"chatcmpl-8d90b13688d853b4",
"object":"chat.completion",
"created":1777488981,
"model":"nvidia/nemotron-3-content-safety",
"choices":[
{
"index":0,
"message":{
"role":"assistant",
"content":"User Safety: safe",
"refusal":null,
"annotations":null,
"audio":null,
"function_call":null,
"tool_calls":[
],
"reasoning":null
},
"logprobs":null,
"finish_reason":"stop",
"stop_reason":106,
"token_ids":null
}
],
"service_tier":null,
"system_fingerprint":null,
"usage":{
"prompt_tokens":431,
"total_tokens":436,
"completion_tokens":5,
"prompt_tokens_details":null
},
"prompt_logprobs":null,
"prompt_token_ids":null,
"kv_transfer_params":null
}
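The /v1/completions endpoint works the same way but takes a plain prompt string instead of a messages array. A minimal sketch, assuming the standard OpenAI-style completions fields:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "prompt": "Hello! How are you?",
    "max_tokens": 100
  }'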
Verify Health Endpoints#
You can verify that the NIM container is running and ready to accept requests by
checking its health endpoints. By default, these endpoints are served on port
8000. If you set NIM_HEALTH_PORT, use that port instead.
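For example, if you start the container with the health port moved to 9000 (an illustrative value) and publish that port, query the health endpoints there instead; this sketch assumes the inference API itself stays on port 8000:
docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_HEALTH_PORT=9000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  -p 9000:9000 \
  nvcr.io/nim/nvidia/nemotron-3-content-safety:2.0.0

curl http://localhost:9000/v1/health/live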
Live Endpoint#
Perform a liveness check to see if the server is running:
curl http://localhost:8000/v1/health/live
Example response:
GET /v1/health/live HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 61
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate
{
"object": "health.response",
"message": "live",
"status": "live"
}
Ready Endpoint#
Perform a readiness check to see if the model is fully loaded and ready for inference:
curl http://localhost:8000/v1/health/ready
Example response:
GET /v1/health/ready HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 63
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate
{
"object": "health.response",
"message": "ready",
"status": "ready"
}
Streaming#
To receive responses incrementally as they are generated, enable streaming by
adding "stream": true to your request payload. This is supported for
/v1/chat/completions and /v1/completions.
When streaming is enabled, the API returns a sequence of Server-Sent Events (SSE).
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-name>",
"messages": [
{
"role": "user",
"content": "Write a short poem about a robot."
}
],
"max_tokens": 100,
"stream": true
}'
The response is streamed in chunks; each chunk is a data: line containing a JSON object. The stream
terminates with a data: [DONE] message:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-content-safety","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-content-safety","choices":[{"index":0,"delta":{"content":"In"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-content-safety","choices":[{"index":0,"delta":{"content":" cir"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"nvidia/nemotron-3-content-safety","choices":[{"index":0,"delta":{"content":"cuits"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
...
data: [DONE]
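Note that curl may buffer its output, so the chunks can appear all at once; adding the -N (--no-buffer) flag prints each event as soon as it arrives. A minimal sketch:
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [
      {
        "role": "user",
        "content": "Write a short poem about a robot."
      }
    ],
    "max_tokens": 100,
    "stream": true
  }'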