Quickstart#

How to Choose the Right Model Profile#

NIM LLM automatically selects the most optimal model profile based on the detected hardware (for example, number of GPUs or GPU architecture). If you need to manually override this selection, you can specify the NIM_MODEL_PROFILE environment variable. For more information, refer to Model Profiles.

Run NIM#

Before running a NIM LLM container, make sure you have fulfilled all prerequisites, and have followed the steps in the installation and configuration steps. This includes setting your API keys, logging in to Docker, pulling the container image, and configuring your LOCAL_NIM_CACHE.

Tip

Mounting a local cache directory lets you avoid re-downloading the model on subsequent restarts. Refer to Local Cache for details.

Note

The exact image tag varies depending on the container type and backend you intend to deploy. Set ${NIM_LLM_IMAGE} and ${NIM_LLM_MODEL_FREE_IMAGE} to the vLLM or SGLang image and tag for your model and GPU. Check the Support Matrix for the image and version to use.

Model-Specific NIM

Run the container using your NGC API Key to authenticate and download the model.

docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE}

Model-Free NIM

Run the container using your Hugging Face token to authenticate and download the model.

docker run --gpus=all \
  -e NIM_MODEL_PATH=$NIM_MODEL_PATH \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}

Note

If you want to serve a pre-downloaded local model or a private cloud model instead of downloading one from Hugging Face, refer to Model Downloads for your workflow.

Interact with the API#

There are three main inference endpoints:

Chat Completions: /v1/chat/completions
Text Completions: /v1/completions
Responses: /v1/responses

Tip

All three endpoints support streaming.

Find the Model Name#

Replace <model-name> in the examples below with the model name served by your NIM container. To find it, query the models endpoint:

curl -s http://localhost:8000/v1/models

The id field in the response is the model name to use in your requests. For model-specific NIMs, this matches the model identifier (for example, meta/llama-3.1-8b-instruct). For model-free NIMs, the name is derived from the container image.

Note

The model name is user-configurable by setting the NIM_SERVED_MODEL_NAME environment variable. For more information, refer to Environment Variables.

Send a Chat Completion Request#

Once the server is running, you can send a request to the chat completion endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [
      {
        "role": "user",
        "content": "Hello! How are you?"
      }
    ],
    "max_tokens": 100
  }'

The expected response:

{
  "id": "chatcmpl-87d0c4524fb6f1a4",
  "object": "chat.completion",
  "created": 1769635152,
  "model": "<model-name>",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm functioning properly,",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 41,
    "total_tokens": 46,
    "completion_tokens": 5,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}

Verify Health Endpoints#

You can verify that the NIM container is running and ready to accept requests by checking its health endpoints. By default, these endpoints are served on port 8000. If you set NIM_HEALTH_PORT, use that port instead.

Live Endpoint#

Perform a liveness check to see if the server is running:

curl http://localhost:8000/v1/health/live

Example response:

GET /v1/health/live HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 61
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "live",
  "status": "live"
}

Ready Endpoint#

Perform a readiness check to see if the model is fully loaded and ready for inference:

curl http://localhost:8000/v1/health/ready

Example response:

GET /v1/health/ready HTTP/1.1
Host: localhost:8000
User-Agent: curl/7.81.0
Accept: */*

HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: application/json
Content-Length: 63
Connection: keep-alive
Cache-Control: no-store, no-cache, must-revalidate

{
  "object": "health.response",
  "message": "ready",
  "status": "ready"
}

Streaming#

To receive responses incrementally as they are generated, you can enable streaming by adding "stream": true to your request payload. This is supported across the /v1/chat/completions, /v1/completions, and /v1/responses endpoints.

When streaming is enabled, the API returns a sequence of Server-Sent Events (SSE).

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [
      {
        "role": "user",
        "content": "Write a short poem about a robot."
      }
    ],
    "max_tokens": 100,
    "stream": true
  }'

The response will be streamed back in chunks, with each chunk containing a data JSON object. The stream terminates with a data: [DONE] message:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"meta/llama-3.1-8b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"meta/llama-3.1-8b-instruct","choices":[{"index":0,"delta":{"content":"In"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"meta/llama-3.1-8b-instruct","choices":[{"index":0,"delta":{"content":" cir"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"meta/llama-3.1-8b-instruct","choices":[{"index":0,"delta":{"content":"cuits"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

...

data: [DONE]