Llama 3.2 API#

Launch NIM#

The following commands launch a Docker container for this specific model:

# Choose a container name for bookkeeping
export CONTAINER_NAME=meta-llama-3-2-11b-vision-instruct

# The repository name and latest tag from the previous ngc registry image list command
Repository="llama-3.2-11b-vision-instruct"
Latest_Tag="1.1.0"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
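
The first launch can take a while because the model weights are downloaded into the cache directory. The snippet below is a minimal readiness check, a sketch only: it assumes the server exposes the standard OpenAI-compatible /v1/models endpoint on port 8000, as mapped in the docker run command above.

# Readiness check (sketch): poll the OpenAI-compatible /v1/models endpoint
# until the NIM server responds, then list the available model IDs.
import json
import time
import urllib.error
import urllib.request

url = "http://0.0.0.0:8000/v1/models"  # assumes the port mapping used above
while True:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            models = json.load(resp)
            print("Server is ready. Available models:")
            for model in models.get("data", []):
                print(" -", model.get("id"))
            break
    except (urllib.error.URLError, ConnectionError):
        print("Waiting for the NIM server to start...")
        time.sleep(10)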

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat- or instruct-tuned models that are designed for a conversational approach. With this endpoint, prompts are sent as messages with roles and contents, which gives a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Important

Update the model name according to the model you are running.

For example, for a meta/llama-3.2-11b-vision-instruct model, you might provide the URL of an image and query the NIM server from the command line:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 256
    }'

Alternatively, you can use the OpenAI Python SDK. Install it with pip:

pip install -U openai

Run the client and query the chat completion API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
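
As noted above, you can stream the response by setting stream=True. The following is a minimal streaming sketch of the same request using the standard OpenAI SDK chunk iteration; the model name and image URL are the same as in the previous example.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Same vision message as in the example above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                },
            },
        ],
    }
]

stream = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; content can be None in some chunks.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()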

Passing Images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Important

Supported image formats are JPG, JPEG and PNG.

Public direct URL

Passing the direct URL of an image will cause the container to download that image at runtime.

{
    "type": "image_url",
    "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert an image to base64, you can use the base64 command-line utility, or in Python:

import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
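
Putting the pieces together, the following sketch encodes a local file, builds a data URL, and sends it through the OpenAI SDK. The path image.png is a placeholder for a local image on your machine.

import base64
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Encode a local image (placeholder path) and build a data URL.
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
data_url = f"data:image/png;base64,{image_b64}"

chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ],
    max_tokens=256,
)
print(chat_response.choices[0].message.content)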

Text-only support

Some clients may not support this vision extension of the chat API. For those cases, NIM for VLMs also accepts images embedded in the text-only content field as HTML <img> tags (ensure that you correctly escape quotes):

{
    "role": "user",
    "content": "What is in this image? <img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" />"
}

This is also compatible with the base64 representation.

{
    "role": "user",
    "content": "What is in this image? <img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" />"
}
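
For example, with the OpenAI SDK the same request can be sent as a plain string. This is a sketch; no quote escaping is needed here because the tag is built inside a Python string rather than a raw JSON document.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

image_url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/"
    "Gfp-wisconsin-madison-the-nature-boardwalk.jpg/"
    "2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
)

# The image is embedded directly in the text content as an HTML <img> tag.
chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": f'What is in this image? <img src="{image_url}" />',
        }
    ],
    max_tokens=256,
)
print(chat_response.choices[0].message.content)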

Text-only Queries#

Many VLMs such as meta/llama-3.2-11b-vision-instruct support text-only queries, where a VLM behaves exactly like a (text-only) LLM.

Important

Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix to check whether a model supports text-only queries.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ],
        "max_tokens": 256
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]
chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.

Important

Multi-turn capability is not available for all VLMs. Please refer to the model cards for information on multi-turn conversations.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows a boardwalk in a field of tall grass. ..."
            },
            {
                "role": "user",
                "content": "What would be the best season to visit this place?"
            }
        ],
        "max_tokens": 256
    }'
Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ..."
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]
chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
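
One way to continue the conversation programmatically is to append the returned assistant message and the next user turn to the same messages list before calling the API again. The sketch below reuses the client, messages, and assistant_message objects from the example above; the follow-up question is only illustrative.

# Append the model's reply and a follow-up question, then call the API again.
messages.append({"role": "assistant", "content": assistant_message.content})
messages.append({"role": "user", "content": "Is it a good spot for bird watching?"})

follow_up = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
)
print(follow_up.choices[0].message.content)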

Using LangChain#

NIM for VLMs allows seamless integration with LangChain, a framework for developing applications powered by large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core

Query the OpenAI Chat Completions endpoint using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="meta/llama-3.2-11b-vision-instruct",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},
        },
    ],
)

print(model.invoke([message]))
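
ChatOpenAI also supports streaming through LangChain's standard interface. A minimal sketch, reusing the model and message objects defined above:

# Stream the response incrementally instead of waiting for the full reply.
for chunk in model.stream([message]):
    print(chunk.content, end="", flush=True)
print()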

Llama Stack Chat Completion Request#

NIM for VLMs additionally supports the Llama Stack Client inference API for Llama VLMs, such as meta/llama-3.2-11b-vision-instruct. With the Llama Stack API, developers can easily integrate Llama VLMs into their applications. To stream the result, set "stream": true.

Important

Update the model name according to the model you are running.

For example, for a meta/llama-3.2-11b-vision-instruct model, you might provide the URL of an image and query the NIM server from the command line:

curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "image":
                            {
                                "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    },
                    "What is in this image?"
                ]
            }
        ]
    }'

Alternatively, you can use the Llama Stack Client Python library. Install it with pip:

pip install llama-stack-client==0.0.50

Important

The examples below assume llama-stack-client version 0.0.50. Modify the requests accordingly if you choose to install a newer version.

Run the client and query the chat completion API:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            },
            "What is in this image?"
        ]
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)

Passing Images#

NIM for VLMs follows the Llama Stack specification to pass images as part of the HTTP payload in a user message.

Important

Supported image formats are JPG, JPEG and PNG.

Public direct URL

Passing the direct URL of an image will cause the container to download that image at runtime.

{
    "image": {
        "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send them in your payload.

{
    "image": {
        "uri": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}
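
For example, a local file can be encoded and passed through the same message format with the Llama Stack client. This is a sketch that mirrors the llama-stack-client 0.0.50 usage shown above; image.png is a placeholder path.

import base64

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

# Encode a local image (placeholder path) and build a data URI.
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"image": {"uri": f"data:image/png;base64,{image_b64}"}},
            "What is in this image?",
        ],
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True,
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)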

Text-only support

As with the OpenAI API, NIM for VLMs also accepts images embedded in the text-only content field as HTML <img> tags (ensure that you correctly escape quotes):

{
    "role": "user",
    "content": "What is in this image?<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" />"
}

This is also compatible with the base64 representation.

{
    "role": "user",
    "content": "What is in this image?<img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" />"
}

Text-only Queries#

Many VLMs such as meta/llama-3.2-11b-vision-instruct support text-only queries, where a VLM behaves exactly like a (text-only) LLM.

Important

Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix to check whether a model supports text-only queries.

curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ]
    }'
Or using the Llama Stack Client:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversation with repeated interactions between a user and the model.

Important

Multi-turn capability is not available for all VLMs. Refer to the model cards to check whether a model supports multi-turn conversations.

curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "image":
                            {
                                "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    },
                    "What is in this image?"
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows a boardwalk in a field of tall grass. ...",
                "stop_reason": "end_of_turn"
            },
            {
                "role": "user",
                "content": "What would be the best season to visit this place?"
            }
        ]
    }'
Or using the Llama Stack Client:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            },
            "What is in this image?"
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ...",
        "stop_reason": "end_of_turn"
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)