Query the Llama 3.1 Nemotron Nano VL 8B v1 API#

For more information on this model, see the model card on build.nvidia.com.

Launch NIM#

The following command launches a Docker container for this specific model. For information on getting the NGC API key that this code uses, see Get Started with NIM. For information on parameters, see Docker Run Parameters.

# Choose a container name for bookkeeping
export CONTAINER_NAME=nvidia-llama-3.1-nemotron-nano-vl-8b-v1

# The repository name and tag from the previous ngc registry image list command
Repository="llama-3.1-nemotron-nano-vl-8b-v1"
Latest_Tag="1.3.0"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
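
The container can take several minutes to download and load the model on first launch. The following is a minimal sketch that polls the server until it reports ready; it assumes the standard NIM readiness endpoint at /v1/health/ready on the mapped port 8000 (verify the path against your NIM version).

import time
import urllib.error
import urllib.request

# Poll the NIM readiness endpoint (path assumed from other NIM deployments)
# until the model is loaded and the server accepts requests.
url = "http://0.0.0.0:8000/v1/health/ready"
for _ in range(60):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            if resp.status == 200:
                print("NIM is ready")
                break
    except urllib.error.URLError:
        pass
    time.sleep(10)
else:
    print("NIM did not become ready in time")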

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat- or instruction-tuned models that are designed for a conversational approach. With this endpoint, prompts are sent as messages with roles and contents, which provides a natural way to keep track of a multi-turn conversation.

Important

Update the model name according to the model you are running.

For example, for the nvidia/llama-3.1-nemotron-nano-vl-8b-v1 model, you can provide the URL of an image and query the NIM server from the command line.

Important

For better results, make sure to put the image before any text part in the request body.

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
          "messages": [
            {
              "role": "user",
              "content": [
                {
                  "type": "image_url",
                  "image_url":
                    {
                      "url": "https://assets.ngc.nvidia.com/products/api-catalog/llama-cosmos-nemotron-8b-instruct/specifications.png"
                    }
                },
                {
                  "type": "text",
                  "text": "How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
                }
              ]
            }
          ],
          "stream": false,
          "max_tokens": 256,
          "temperature": 0
    }'

to get a response like the following:

{
  "id": "chatcmpl-ee7a9e12322546f58db8be608941310b",
  "object": "chat.completion",
  "created": 1750180162,
  "model": "nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "The H100 SXM has 1,979 FP16 Tensor Cores, while the H100 NVL has 1,671 FP16 Tensor Cores. To find out how many times more FP16 Tensor Cores the H100 SXM has compared to the H100 NVL, we divide the number of FP16 Tensor Cores in the H100 SXM by the number in the H100 NVL:\n\n1,979 / 1,671 ≈ 1.18\n\nTherefore, the H100 SXM has approximately 1.18 times more FP16 Tensor Cores than the H100 NVL.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 3372,
    "total_tokens": 3498,
    "completion_tokens": 126,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

To stream the result, set "stream": true in the request. The response streams back, divided into chunks as follows:

data: {"id":"chatcmpl-7869d4d8b5314504a9820bc99cdc19aa","object":"chat.completion.chunk","created":1750182242,
"model":"nvidia/llama-3.1-nemotron-nano-vl-8b-v1","choices":[{"index":0,"delta":{"role":"assistant","content":""},
"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-7869d4d8b5314504a9820bc99cdc19aa","object":"chat.completion.chunk","created":1750182242,
"model":"nvidia/llama-3.1-nemotron-nano-vl-8b-v1","choices":[{"index":0,"delta":{"content":"The"},"logprobs":null,"finish_reason":null}]}

...

You can also use the OpenAI Python SDK. Install the library:

pip install -U openai

Run the client and query the chat completion API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "https://assets.ngc.nvidia.com/products/api-catalog/llama-cosmos-nemotron-8b-instruct/specifications.png"
        }
      },
      {
        "type": "text",
        "text": "How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
      }
    ]
  }
]
chat_response = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
    messages=messages,
    stream=False,
    max_tokens=256,
    temperature=0,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

# Response:
# ChatCompletionMessage(content='H100 SXM has 1.979 teraFLOPS, while H100 NVL
# has 1.671 teraFLOPS. Therefore, H100 SXM has 1.979/1.671 = 1.18 times more
# FP16 Tensor Core than the H100 NVL.', refusal=None, role='assistant',
# annotations=None, audio=None, function_call=None, tool_calls=None)
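
To consume the streamed response from the Python SDK, set stream=True and iterate over the returned chunks. Below is a minimal sketch using the same server, model, and prompt as the example above:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "https://assets.ngc.nvidia.com/products/api-catalog/llama-cosmos-nemotron-8b-instruct/specifications.png"
        }
      },
      {
        "type": "text",
        "text": "How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
      }
    ]
  }
]
stream = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
    messages=messages,
    stream=True,
    max_tokens=256,
    temperature=0,
)
# Each chunk carries a delta with the next piece of generated text
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()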

Passing Images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Important

Supported image formats are JPG, JPEG and PNG.

Public direct URL

Passing the direct URL of an image will cause the container to download that image at runtime.

{
  "type": "image_url",
  "image_url": {
    "url": "https://assets.ngc.nvidia.com/products/api-catalog/llama-cosmos-nemotron-8b-instruct/specifications.png"
  }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.

{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
  }
}

To convert images to base64, you can use the base64 command-line tool, or in Python:

import base64

# Read the image bytes and encode them as a base64 string
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
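
A minimal end-to-end sketch with the OpenAI SDK, assuming a local file named image.png and a generic prompt, would look like the following:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Encode the local image (hypothetical path) and embed it as a data URL
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
    messages=[
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_b64}"}
          },
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    max_tokens=256,
    temperature=0,
)
print(chat_response.choices[0].message.content)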

Text-only support

Some clients might not support this vision extension of the Chat Completions API. NIM for VLMs also lets you send images through the text-only content field by using HTML <img> tags (make sure to escape the quotes correctly):

{
  "role": "user",
  "content": "<img src=\"https://assets.ngc.nvidia.com/products/api-catalog/llama-cosmos-nemotron-8b-instruct/specifications.png\" /> How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
}

This is also compatible with the base64 representation.

{
  "role": "user",
  "content": "<img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" /> How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
}
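
With the OpenAI SDK, this text-only form can be sent as a plain string in the message content. A minimal sketch, assuming the server and model from the examples above:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# The image is referenced inside the plain-text content via an HTML <img> tag
chat_response = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
    messages=[
      {
        "role": "user",
        "content": (
          "<img src=\"https://assets.ngc.nvidia.com/products/api-catalog/"
          "llama-cosmos-nemotron-8b-instruct/specifications.png\" /> "
          "How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
        )
      }
    ],
    max_tokens=256,
    temperature=0,
)
print(chat_response.choices[0].message.content)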

Text-only Queries#

Many VLMs, such as nvidia/llama-3.1-nemotron-nano-vl-8b-v1, support text-only queries, in which the VLM behaves exactly like a text-only LLM.

Important

Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix to check whether a model supports text-only queries.

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
         "model": "nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
         "messages": [
           {
             "role": "system",
             "content": "You are a helpful assistant"
           },
           {
             "role": "user",
             "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
           }
         ],
         "max_tokens": 256,
         "temperature": 0
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
  {
    "role": "system",
    "content": "You are a helpful assistant"
  },
  {
    "role": "user",
    "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
  }
]
chat_response = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
    messages=messages,
    max_tokens=256,
    temperature=0,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.

Important

Multi-turn capability is not available for all VLMs. Refer to the model cards for information on multi-turn conversation support.

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://assets.ngc.nvidia.com/products/api-catalog/llama-cosmos-nemotron-8b-instruct/specifications.png"
                        }
                    },
                    {
                        "type": "text",
                        "text": "How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "The H100 SXM has 1,979 FP16 Tensor Cores, while the H100 NVL has 1,671 FP16 Tensor Cores. To find out how many times more FP16 Tensor Cores the H100 SXM has compared to the H100 NVL, we divide the number of FP16 Tensor Cores in the H100 SXM by the number in the H100 NVL:\n\n1,979 / 1,671 ≈ 1.18\n\nTherefore, the H100 SXM has approximately 1.18 times more FP16 Tensor Cores than the H100 NVL."
            },
            {
                "role": "user",
                "content": "Please extract the table in the image as HTML"
            }
        ],
        "max_tokens": 1024,
        "temperature": 0
      }'

to get a response like the following:

{
  "id": "chatcmpl-a13eee2a16844486824c36778300b711",
  "object": "chat.completion",
  "created": 1750180700,
  "model": "nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "<table>\n  <tr>\n    <td colspan=\"3\"><b>Technical Specifications</b></td>\n  </tr>\n
          <tr>\n    <td></td>\n    <td><b>H100 SXM</b></td>\n    <td><b>H100 NVL</b></td>\n  </tr>\n
          <tr>\n    <td>FP64</td>\n    <td>34 teraFLOPS</td>\n    <td>30 teraFLOPS</td>\n  </tr>\n
          ...
          </tr>\n</table>",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 3517,
    "total_tokens": 4446,
    "completion_tokens": 929,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "https://assets.ngc.nvidia.com/products/api-catalog/llama-cosmos-nemotron-8b-instruct/specifications.png"
        }
      },
      {
        "type": "text",
        "text": "How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
      }
    ]
  },
  {
    "role": "assistant",
    "content": "H100 SXM has 1.979 teraFLOPS, while H100 NVL has 1.671 teraFLOPS. Therefore, H100 SXM has 1.979/1.671 = 1.18 times more FP16 Tensor Core than the H100 NVL."
  },
  {
    "role": "user",
    "content": "Please extract the table in the image as HTML"
  }
]
chat_response = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
    messages=messages,
    max_tokens=1024,
    temperature=0,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Using LangChain#

NIM for VLMs integrates seamlessly with LangChain, a framework for developing applications powered by large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core 'httpx<=0.27.2'
# For an issue with httpx, see https://community.openai.com/t/error-with-openai-1-56-0-client-init-got-an-unexpected-keyword-argument-proxies/1040332/3.

Query the OpenAI Chat Completions endpoint using LangChain.

Important

Make sure to put the image before the text query in the request body.

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed",
    temperature=0
)

message = HumanMessage(
    content=[
        {
          "type": "image_url",
          "image_url": {
            "url": "https://assets.ngc.nvidia.com/products/api-catalog/llama-cosmos-nemotron-8b-instruct/specifications.png"
          },
        },
        {
          "type": "text",
          "text": "How many times more \"FP16Tensor Core\" does H100 SXM have than H100 NVL?"
        },
    ],
)

print(model.invoke([message]))
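
LangChain can also stream the response instead of returning it in one piece. A minimal sketch, assuming the same model object and message defined above:

# Stream the generated answer chunk by chunk
for chunk in model.stream([message]):
    print(chunk.content, end="", flush=True)
print()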