Llama 3.2 API#
Launch NIM#
The following command launches a Docker container for this specific model:
# Choose a container name for bookkeeping
export CONTAINER_NAME=meta-llama-3-2-11b-vision-instruct
# The repository and latest tag from the previous ngc registry image list command
Repository="llama-3.2-11b-vision-instruct"
Latest_Tag="1.1.0"
# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${Repository}:${Latest_Tag}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
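Once the model weights are downloaded and loaded, the server listens on port 8000. The following is a minimal readiness-check sketch, not part of the launch steps above: it polls the OpenAI-compatible /v1/models route, assuming the default host and port used in the docker run command, until the server answers.
# Readiness check (sketch): poll the OpenAI-compatible /v1/models route until
# the NIM server responds. Host and port match the docker run command above;
# adjust them if you published a different port.
import json
import time
import urllib.error
import urllib.request

URL = "http://0.0.0.0:8000/v1/models"

for attempt in range(60):
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            models = json.load(resp)
            print("Server is up. Available models:")
            for m in models.get("data", []):
                print(" -", m.get("id"))
            break
    except (urllib.error.URLError, OSError):
        time.sleep(10)  # model download and loading can take a while
else:
    print("Server did not become ready in time.")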
OpenAI Chat Completion Request#
The Chat Completions endpoint is typically used with chat- or instruct-tuned models designed for a conversational approach. With this endpoint, prompts are sent as messages with roles and contents, which gives a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.
Important
Update the model name according to the model you are running.
For example, for the meta/llama-3.2-11b-vision-instruct model, you might provide the URL of an image and query the NIM server from the command line:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.2-11b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 256
  }'
Alternatively, you can use the OpenAI Python SDK library. Install it with pip:
pip install -U openai
Run the client and query the chat completion API:
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
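If you set stream=True instead, the SDK returns an iterator of chunks rather than a single response. A minimal sketch of the standard OpenAI Python SDK streaming pattern, reusing the client and messages defined above:
# Streaming variant (sketch): tokens are printed as they arrive.
stream = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=True
)
for chunk in stream:
    # Each chunk carries an incremental delta; skip empty deltas.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()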
Passing Images#
NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.
Important
Supported image formats are JPG, JPEG and PNG.
Public direct URL
Passing the direct URL of an image will cause the container to download that image at runtime.
{
    "type": "image_url",
    "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    }
}
Base64 data
Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.
{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}
To convert images to base64, you can use the base64 command-line tool, or in Python:
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
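The encoded string is then wrapped in a data URL and passed exactly like a public image URL. A minimal sketch of building the message, assuming the image_b64 variable from above and a PNG file (switch the media type to image/jpeg for JPEG images):
# Build an OpenAI-style image_url entry from the base64 string above (sketch).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"}
            }
        ]
    }
]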
Text-only support
Some clients may not support this vision extension of the chat API. NIM for VLMs also lets you send images through the text-only fields by embedding HTML <img> tags (ensure that you correctly escape quotes):
{
    "role": "user",
    "content": "What is in this image? <img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" />"
}
This is also compatible with the base64 representation.
{
    "role": "user",
    "content": "What is in this image? <img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" />"
}
Text-only Queries#
Many VLMs, such as meta/llama-3.2-11b-vision-instruct, support text-only queries, where the VLM behaves exactly like a (text-only) LLM.
Important
Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix to check whether a model supports text-only queries.
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.2-11b-vision-instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant"
      },
      {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
      }
    ],
    "max_tokens": 256
  }'
Or using the OpenAI SDK:
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]

chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
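The response follows the OpenAI schema, so you can pull out just the generated text and the token usage rather than printing the whole message object; a short sketch reusing chat_response and assistant_message from above:
# Extract only the generated text and token accounting (sketch); these
# fields follow the standard OpenAI chat completion response schema.
print(assistant_message.content)
if chat_response.usage is not None:
    print("prompt tokens:", chat_response.usage.prompt_tokens)
    print("completion tokens:", chat_response.usage.completion_tokens)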
Multi-turn Conversation#
Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.
Important
Multi-turn capability is not available for all VLMs. Please refer to the model cards for information on multi-turn conversations.
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.2-11b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      },
      {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ..."
      },
      {
        "role": "user",
        "content": "What would be the best season to visit this place?"
      }
    ],
    "max_tokens": 256
  }'
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ..."
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]

chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
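To keep the conversation going programmatically, append each assistant reply to the messages list before sending the next user turn. A minimal sketch continuing from the example above (the follow-up question is only illustrative):
# Continue the conversation (sketch): append the assistant's reply and the
# next user question, then call the endpoint again with the full history.
messages.append({"role": "assistant", "content": assistant_message.content})
messages.append({"role": "user", "content": "Are there any safety concerns for visitors?"})

follow_up = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
print(follow_up.choices[0].message.content)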
Using LangChain#
NIM for VLMs allows seamless integration with LangChain, a framework for developing applications powered by large language models (LLMs).
Install LangChain using the following command:
pip install -U langchain-openai langchain-core
Query the OpenAI Chat Completions endpoint using LangChain:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="meta/llama-3.2-11b-vision-instruct",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},
        },
    ],
)

print(model.invoke([message]))
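LangChain chat models also support token-by-token streaming through the standard stream method; a minimal sketch reusing the model and message defined above:
# Stream the response (sketch): chunks are message chunks whose .content
# holds the incremental text.
for chunk in model.stream([message]):
    print(chunk.content, end="", flush=True)
print()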
Llama Stack Chat Completion Request#
NIM for VLMs additionally supports the Llama Stack Client inference API for Llama VLMs, such as meta/llama-3.2-11b-vision-instruct. With the Llama Stack API, developers can easily integrate Llama VLMs into their applications. To stream the result, set "stream": true.
Important
Update the model name according to the model you are running.
For example, for the meta/llama-3.2-11b-vision-instruct model, you might provide the URL of an image and query the NIM server from the command line:
curl -X 'POST' \
  'http://0.0.0.0:8000/inference/chat_completion' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.2-11b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": {
              "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          },
          "What is in this image?"
        ]
      }
    ]
  }'
Alternatively, you can use the Llama Stack Client Python library. Install it with pip:
pip install llama-stack-client==0.0.50
Important
The examples below assume llama-stack-client version 0.0.50. Modify the requests accordingly if you choose to install a newer version.
Run the client and query the chat completion API:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            },
            "What is in this image?"
        ]
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)
Passing Images#
NIM for VLMs follows the Llama Stack specification to pass images as part of the HTTP payload in a user message.
Important
Supported image formats are JPG, JPEG and PNG.
Public direct URL
Passing the direct URL of an image will cause the container to download that image at runtime.
{
    "image": {
        "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    }
}
Base64 data
Another option, useful for images not already on the web, is to first base64-encode the image bytes and send them in your payload.
{
    "image": {
        "uri": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}
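If the image is a local file, you can build this data URI in Python and drop it into the Llama Stack message format shown above; a minimal sketch in which the file name and image/png media type are placeholders to match to your actual file:
# Build a Llama Stack user message from a local image file (sketch).
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"image": {"uri": f"data:image/png;base64,{image_b64}"}},
            "What is in this image?"
        ]
    }
]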
Text-only support
As in the OpenAI API case, NIM for VLMs lets you send images through the text-only fields by embedding HTML <img> tags:
{
    "role": "user",
    "content": "What is in this image?<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" />"
}
This is also compatible with the base64 representation.
{
    "role": "user",
    "content": "What is in this image?<img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" />"
}
Text-only Queries#
Many VLMs, such as meta/llama-3.2-11b-vision-instruct, support text-only queries, where the VLM behaves exactly like a (text-only) LLM.
Important
Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix to check whether a model supports text-only queries.
curl -X 'POST' \
  'http://0.0.0.0:8000/inference/chat_completion' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.2-11b-vision-instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant"
      },
      {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
      }
    ]
  }'
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)
Multi-turn Conversation#
Instruction-tuned VLMs may also support multi-turn conversation with repeated interactions between a user and the model.
Important
Multi-turn capability is not available for all VLMs. Refer to the model cards for information on multi-turn conversation support.
curl -X 'POST' \
  'http://0.0.0.0:8000/inference/chat_completion' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.2-11b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": {
              "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          },
          "What is in this image?"
        ]
      },
      {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ...",
        "stop_reason": "end_of_turn"
      },
      {
        "role": "user",
        "content": "What would be the best season to visit this place?"
      }
    ]
  }'
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            },
            "What is in this image?"
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ...",
        "stop_reason": "end_of_turn"
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)
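As with the OpenAI endpoint, you continue the conversation by appending turns to the messages list; in the Llama Stack format, assistant messages also carry a stop_reason field. The following sketch is not taken from the Llama Stack documentation: it re-issues the request above while capturing the streamed reply (assuming, as in the examples above, that each event delta is a plain text fragment), then builds the next turn with an illustrative follow-up question.
# Continue the conversation (sketch): capture the streamed reply while
# printing it, append it as an assistant turn with a stop_reason, then add
# the next user question and send the full history again.
chunks = []
iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)
for chunk in iterator:
    chunks.append(chunk.event.delta)
    print(chunk.event.delta, end="", flush=True)

messages.append({
    "role": "assistant",
    "content": "".join(chunks),
    "stop_reason": "end_of_turn"
})
messages.append({
    "role": "user",
    "content": "Are there any safety concerns for visitors?"
})

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)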