Query the Nemotron Parse API#
For more information on this model, see the Nemotron Parse Overview and the model card on build.nvidia.com.
Launch NIM#
The following command launches a Docker container for this specific model:
# Choose a container name for bookkeeping
export CONTAINER_NAME="nvidia-nemotron-parse"
# The repository name and tag from the previous ngc registry image list command
Repository="nemotron-parse"
Latest_Tag="1.5.0"
# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
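The first launch downloads and caches the model, so the container can take several minutes before it is ready to serve requests. The following sketch polls the readiness endpoint (assumed here to be the standard NIM endpoint /v1/health/ready on the published port 8000) until the service reports ready:
# Poll the readiness endpoint until the NIM reports it can serve requests.
until curl -sf http://0.0.0.0:8000/v1/health/ready > /dev/null; do
  echo "Waiting for the NIM to become ready..."
  sleep 10
done
echo "NIM is ready to accept requests."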
Request examples#
To configure how text regions are extracted and transcribed, you will need to
specify a tools parameter in your request. You can only specify one
tool type per request.
Note
Text input is not supported.
cURL#
In a cURL request, you can provide an image URL or send the image data as a Base64-encoded string (see Passing images). The following request uses an image URL:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/nemotron-parse",
"tools": [
{
"type": "function",
"function":
{
"name": "markdown_bbox"
}
}
],
"messages": [
{
"role":"user",
"content": [
{
"type": "image_url",
"image_url":
{
"url": "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
}
}
]
}
],
"temperature": 0.0
}'
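The parsed document is returned as an OpenAI-style tool call, with the results serialized as a JSON string in the tool call's function arguments. On the command line you can extract that string with jq; this is a minimal sketch that assumes jq is installed and that the request body shown above has been saved to a file named request.json:
# Send the request body from request.json and save the raw response.
curl -s -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @request.json > response.json

# Extract the parsed document from the tool call arguments.
jq -r '.choices[0].message.tool_calls[0].function.arguments' response.json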
Note
LangChain and LlamaStack integrations are not supported.
Python#
You can provide an image URL or send the image data as a Base64-encoded string (see Passing images). The following example uses image data encoded as a Base64-encoded string:
import json
from openai import OpenAI
client = OpenAI(
base_url = "http://0.0.0.0:8000/v1",
    # For local deployment, an API key is not required, but the api_key
    # argument must be a non-empty string.
    api_key = "<api-key-placeholder-keep-non-empty>"
)
completion = client.chat.completions.create(
model="nvidia/nemotron-parse",
# See Tool types section for more information.
tools=[{"type": "function", "function": {"name": "markdown_bbox"}}],
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
# You can provide the image URL or send the image
# data as a Base64-encoded string. Replace the
# string in this example with Base64-encoded image
# data. See the Passing images section for more
# information.
"url": "data:image/png;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
},
},
]
}
],
temperature=0.0,
)
tool_call = completion.choices[0].message.tool_calls[0]
results_of_detection = json.loads(tool_call.function.arguments)[0]
print(results_of_detection)
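The parsed result, results_of_detection, is a list of detected regions in the format described under Response formats below. As a quick sanity check, you can print a summary of each region; this sketch assumes the default markdown_bbox tool type, where each entry contains bbox, text, and type fields:
for region in results_of_detection:
    bbox = region["bbox"]
    # Bounding box coordinates are normalized to the range 0.0 to 1.0.
    print(f'{region["type"]}: '
          f'({bbox["xmin"]:.2f}, {bbox["ymin"]:.2f}) -> ({bbox["xmax"]:.2f}, {bbox["ymax"]:.2f})')
    print(region["text"])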
Passing images#
NIM for VLMs follows the OpenAI specification for passing images as part of the HTTP payload in a user message.
Public direct URL
Passing the direct URL of an image will cause the container to download that image at runtime.
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
}
}
Base64 data
Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.
{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
To convert images to base64, you can use the base64 command, or in Python:
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
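On the command line, a sketch using the GNU coreutils base64 command (the -w 0 option disables line wrapping so the output is a single string; the flag may differ on other platforms):
# Encode the image and build the data URL expected by the image_url field.
echo "data:image/png;base64,$(base64 -w 0 image.png)" > image_data_url.txt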
Tool types#
Only one tool type may be specified per request. Choose one of the following options:
- markdown_bbox: Default type. Extracts the bounding boxes of the text regions within the document image, classifies the content type, and outputs the text in markdown format.
- markdown_no_bbox: Outputs the transcribed text in markdown format but without the bounding box information.
- detection_only: Extracts the bounding boxes of the text regions within the document image and classifies the content type, but doesn't perform transcription or output any markdown.
Note
For best output, use markdown_bbox.
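To select a tool type, set its name as the function name in the tools parameter. For example, a sketch of requesting detection-only output, reusing the client from the earlier Python example:
# Same request as the earlier example, but with the detection_only tool type.
completion = client.chat.completions.create(
    model="nvidia/nemotron-parse",
    tools=[{"type": "function", "function": {"name": "detection_only"}}],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
                    },
                },
            ],
        }
    ],
    temperature=0.0,
)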
Response formats#
For tool types markdown_bbox and detection_only, four bounding box
coordinates are output per retrieved object. These coordinates range from 0.0 to
1.0. The top left corner of an input image is (0,0).
- xmin: Horizontal position of the top left corner of a bounding box
- ymin: Vertical position of the top left corner of a bounding box
- xmax: Horizontal position of the bottom right corner of a bounding box
- ymax: Vertical position of the bottom right corner of a bounding box
markdown_bbox mode#
[
{
"bbox": {
"xmin": 0.16633729456384325,
"ymin": 0.0969,
"xmax": 0.3097820480404551,
"ymax": 0.1102
},
"text": "## 1 Introduction",
"type": "Section-header"
},
{
"bbox": {
"xmin": 0.16633729456384325,
"ymin": 0.1336,
"xmax": 0.8316834386852087,
"ymax": 0.1977
},
"text": "Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, ...",
"type": "Text"
},
{
"bbox": {
"xmin": 0.16633729456384325,
"ymin": 0.8086,
"xmax": 0.3697850821744627,
"ymax": 0.8219
},
"text": "## 3 Model Architecture",
"type": "Section-header"
},
{
"bbox": {
"xmin": 0.16633729456384325,
"ymin": 0.8414,
"xmax": 0.8347044247787612,
"ymax": 0.907
},
"text": "Most competitive neural sequence transduction models have an encoder-decoder structure.",
"type": "Text"
},
{
"bbox": {
"xmin": 0.4959372945638433,
"ymin": 0.9344,
"xmax": 0.5040627054361567,
"ymax": 0.9453
},
"text": "2",
"type": "Page-footer"
}
]
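Because each entry carries its own bounding box, the markdown_bbox output is easy to post-process. For example, a sketch that orders regions top to bottom by ymin and joins their text into one Markdown string (a single-column heuristic; regions is assumed to hold the parsed list, such as results_of_detection from the Python example above):
# Order regions from top to bottom and join their Markdown text.
ordered = sorted(regions, key=lambda region: region["bbox"]["ymin"])
markdown = "\n\n".join(region["text"] for region in ordered)
print(markdown)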
markdown_no_bbox mode#
Note
For best output, use markdown_bbox mode instead.
{
  "text": "## 1 Introduction\n\nRecurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.\n\n"
}
detection_only mode#
Note
For best output, use markdown_bbox mode instead.
[
{
"bbox": {
"xmin": 0.16633729456384325,
"ymin": 0.0953,
"xmax": 0.3087403286978508,
"ymax": 0.1086
},
"type": "Section-header"
},
{
"bbox": {
"xmin": 0.166,
"ymin": 0.1305,
"xmax": 0.8306417193426043,
"ymax": 0.1984
},
"type": "Text"
},
...
{
"bbox": {
"xmin": 0.4959372945638433,
"ymin": 0.9344,
"xmax": 0.5040627054361567,
"ymax": 0.9445
},
"type": "Page-footer"
}
]
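Since detection_only returns only bounding boxes and content types, a common follow-up is a quick layout summary. A minimal sketch that tallies the detected region types, assuming the parsed list shown above is held in a variable named regions:
from collections import Counter

# Count how many regions of each content type were detected.
type_counts = Counter(region["type"] for region in regions)
for region_type, count in type_counts.most_common():
    print(f"{region_type}: {count}")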
Bounding box detection script#
You can use the following example Python script to display the predicted bounding boxes on a given input image. Use this script to verify the accuracy of bounding box detection and label classification.
from PIL import Image, ImageDraw, ImageFont
# Copy the previous Python code here to get the results_of_detection
detections = results_of_detection
image = Image.open("example.png")
draw = ImageDraw.Draw(image)
width, height = image.size
colors = ["red", "green", "blue", "yellow", "magenta", "cyan", "orange", "purple"]
try:
    font = ImageFont.truetype("arial.ttf", 16)
except IOError:
    font = ImageFont.load_default()
for i, det in enumerate(detections):
    bbox = det["bbox"]
    # Convert normalized coordinates to pixel values.
    left = bbox["xmin"] * width
    top = bbox["ymin"] * height
    right = bbox["xmax"] * width
    bottom = bbox["ymax"] * height
    # Choose a color for the box.
    color = colors[i % len(colors)]
    # Draw the bounding box with a 3-pixel thick outline.
    draw.rectangle([left, top, right, bottom], outline=color, width=3)
    # Use the 'type' key as the label title.
    label = det.get("type", "")
    # Instead of measuring text size, use a fixed-size background.
    fixed_label_width = 80  # Fixed width for the label background
    fixed_label_height = 20  # Fixed height for the label background
    # Draw a filled rectangle above the bounding box for the label background.
    draw.rectangle([left, top - fixed_label_height, left + fixed_label_width, top], fill=color)
    # Draw the label text (with a small padding).
    draw.text((left + 3, top - fixed_label_height + 3), label, fill="black", font=font)
image.save("example_result.png")