Query the Mistral Large 3 675B Instruct 2512 API#

This page shows how to launch the NIM container and call the Chat Completions API with curl, the OpenAI Python SDK, and LangChain. It covers image inputs, text-only queries, multi-turn conversations, and function calling.

For more information on this model, refer to the model card on build.nvidia.com.

Launch NIM#

The following command launches a Docker container for this specific model:

# Choose a container name for bookkeeping
export CONTAINER_NAME=mistralai-mistral-large-3-675b-instruct-2512

# The repository name and tag from the previous ngc registry image list command
Repository="mistral-large-3-675b-instruct-2512"
Latest_Tag="1.6.0"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/mistralai/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=32GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    $IMG_NAME

Note

The -u $(id -u) option in the docker run command above ensures that the UID inside the spawned container matches the UID of the user on the host. This option is recommended when the $LOCAL_NIM_CACHE path on the host has permissions that prevent other users from writing to it.

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat- or instruct-tuned models that are designed for a conversational approach. With this endpoint, prompts are sent in the form of messages with roles and content, which gives a natural way to keep track of a multi-turn conversation. To stream the response, set "stream": true.

Note

Most of the snippets below set a max_tokens value. This is mainly for illustration, where the output is unlikely to be much longer, and to ensure that generation requests terminate at a reasonable length. For reasoning examples, which commonly produce many more output tokens, this upper bound is raised compared to non-reasoning examples.

For example, for a mistralai/mistral-large-3-675b-instruct-2512 model, you might provide the URL of an image and query the NIM server from the command line:

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "mistralai/mistral-large-3-675b-instruct-2512",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 1024
    }'

You can include "stream": true in the preceding request body for streaming responses.

Alternatively, you can use the OpenAI Python SDK. Install it using the following command:

pip install -U openai

Run the client and query the Chat Completions API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="mistralai/mistral-large-3-675b-instruct-2512",
    messages=messages,
    max_tokens=1024,
    stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

The above code snippet can be adapted to handle streaming responses as follows:

# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
    model="mistralai/mistral-large-3-675b-instruct-2512",
    messages=messages,
    max_tokens=1024,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()

Passing images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Important

The supported image formats are GIF, JPG, JPEG, and PNG.

To adjust the maximum number of images allowed per request, set the environment variable NIM_MAX_IMAGES_PER_PROMPT. The default value is 10.

Public direct URL

Pass the direct URL of an image, and the container downloads that image at runtime.

{
    "type": "image_url",
    "image_url": {
        "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
    }
}

Base64 data

Another option, useful for images not already on the web, is to base64-encode the image bytes and send that in your payload.

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert an image to base64, use the base64 command-line tool, or in Python:

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
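
As a minimal sketch, assuming a local file named image.png and the same NIM endpoint as in the earlier examples, you can combine the encoding step with an OpenAI SDK request (the file name and prompt are placeholders):

import base64

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Encode the local image as a base64 data URL.
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

chat_response = client.chat.completions.create(
    model="mistralai/mistral-large-3-675b-instruct-2512",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=1024,
)
print(chat_response.choices[0].message.content)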

Function (Tool) Calling#

You can connect NIM to external tools and services using function calling (also known as tool calling). For more information, refer to Call Functions (Tools).
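
The following is a minimal sketch of a tool-calling request with the OpenAI SDK. The get_weather function and its schema are illustrative assumptions, not part of NIM, and whether the model emits a tool call depends on the prompt and the model; refer to Call Functions (Tools) for the authoritative usage.

import json

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Describe an illustrative tool the model may choose to call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

chat_response = client.chat.completions.create(
    model="mistralai/mistral-large-3-675b-instruct-2512",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=1024,
)

# If the model requested a tool call, inspect the function name and arguments.
tool_calls = chat_response.choices[0].message.tool_calls
if tool_calls:
    call = tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(chat_response.choices[0].message.content)

In a full application, you would execute the requested function, append its result to the conversation as a tool message, and call the endpoint again to obtain the final answer.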

Text-only Queries#

Many VLMs, such as mistralai/mistral-large-3-675b-instruct-2512, support text-only queries, in which the VLM behaves exactly like a text-only LLM.

Important

Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix for information on text-only query support.

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "mistralai/mistral-large-3-675b-instruct-2512",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ],
        "max_tokens": 4096
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]
chat_response = client.chat.completions.create(
    model="mistralai/mistral-large-3-675b-instruct-2512",
    messages=messages,
    max_tokens=4096,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.

Important

Multi-turn capability is not available for all VLMs. Refer to the model cards for information on multi-turn conversations.

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "mistralai/mistral-large-3-675b-instruct-2512",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                            }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows a boardwalk in a field of tall grass. ..."
            },
            {
                "role": "user",
                "content": "What would be the best season to visit this place?"
            }
        ],
        "max_tokens": 4096
    }'
Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ..."
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]
chat_response = client.chat.completions.create(
    model="mistralai/mistral-large-3-675b-instruct-2512",
    messages=messages,
    max_tokens=4096,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Using LangChain#

NIM for VLMs integrates seamlessly with LangChain, a framework for developing applications powered by large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core

Query the OpenAI Chat Completions endpoint using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="mistralai/mistral-large-3-675b-instruct-2512",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"},
        },
    ],
)

print(model.invoke([message]))
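
LangChain chat models also support streaming. A minimal sketch, reusing the model and message objects defined above:

# Print each streamed chunk immediately, without a trailing newline.
for chunk in model.stream([message]):
    print(chunk.content, end="", flush=True)
print()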