nemoretriever-parse API#
Launch NIM#
The following command launches a Docker container for this specific model:
# Choose a container name for bookkeeping
export CONTAINER_NAME=nemoretriever-parse
# The repository and tag from the previous ngc registry image list command
Repository="nemoretriever-parse"
Latest_Tag="1.2.0"
# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
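The first launch downloads the model weights, which can take several minutes. Before sending requests, you can poll the container until it reports ready. The following is a minimal sketch that assumes the NIM readiness endpoint /v1/health/ready on the published port 8000.
import time
import urllib.request

# Poll the readiness endpoint (assumed path: /v1/health/ready) until the
# model is loaded, or give up after roughly 10 minutes.
READY_URL = "http://0.0.0.0:8000/v1/health/ready"

for _ in range(60):
    try:
        with urllib.request.urlopen(READY_URL, timeout=5) as resp:
            if resp.status == 200:
                print("NIM is ready")
                break
    except OSError:
        # Connection refused, timeout, or non-2xx status while starting up.
        pass
    time.sleep(10)
else:
    print("NIM did not become ready in time")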
Request examples#
To configure how text regions are extracted and transcribed, specify a tools parameter in your request. You can specify only one tool type per request.
Note
Text input is not supported.
cURL#
In a cURL request, you can provide an image URL or send the image data as a Base64-encoded string (see Passing images). The following request uses an image URL:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/nemoretriever-parse",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "markdown_bbox"
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
            }
          }
        ]
      }
    ]
  }'
Note
LangChain and LlamaStack integrations are not supported.
Python#
You can provide an image URL or send the image data as a Base64-encoded string (see Passing images). The following example sends the image data as a Base64-encoded string:
import json

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    # For local deployment, an API key is not needed, but the following string
    # variable should be non-empty.
    api_key="<api-key-placeholder-keep-non-empty>",
)

completion = client.chat.completions.create(
    model="nvidia/nemoretriever-parse",
    # See the Tool types section for more information.
    tools=[{"type": "function", "function": {"name": "markdown_bbox"}}],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        # You can provide the image URL or send the image
                        # data as a Base64-encoded string.
                        "url": "data:image/png;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
                    },
                },
            ],
        }
    ],
)

tool_call = completion.choices[0].message.tool_calls[0]
results_of_detection = json.loads(tool_call.function.arguments)[0]
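With the markdown_bbox tool, each entry in the parsed arguments contains bbox, text, and type fields (see Response formats). As a quick check, you can print what was detected:
# `results_of_detection` comes from the example above (markdown_bbox tool);
# each entry has "bbox", "text", and "type" keys.
for region in results_of_detection:
    print(region["type"], "-", region["text"][:80])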
Passing images#
NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.
Public direct URL
Passing the direct URL of an image will cause the container to download that image at runtime.
{
  "type": "image_url",
  "image_url": {
    "url": "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
  }
}
Base64 data
Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.
{
  "type": "image_url",
  "image_url": {
    "url": "data:image/png;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
  }
}
To convert images to base64, you can use the base64 command, or in Python:
with open("image.png", "rb") as f:
image_b64 = base64.b64encode(f.read()).decode()
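The encoded string can then be placed in a data URL of the form shown above, for example:
# Build the value for the "image_url" field; a PNG image is assumed here.
image_url = f"data:image/png;base64,{image_b64}"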
Tool types#
Only one tool type may be specified per request. Choose one of the following options:
markdown_bbox
: Default type. Extracts the bounding boxes of the text regions within the document image, classifies the content type, and outputs the text in markdown format.
markdown_no_bbox
: Outputs the transcribed text in markdown format, but without the bounding box information.
detection_only
: Extracts the bounding boxes of the text regions within the document image and classifies the content type, but doesn't perform transcription or output any markdown.
Note
For best output, use markdown_bbox.
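To use a different mode, change only the function name in the tools parameter of the requests shown earlier; the rest of the request is unchanged. For example, for detection only:
# Only one tool may be specified per request.
tools = [{"type": "function", "function": {"name": "detection_only"}}]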
Response formats#
For the markdown_bbox and detection_only tool types, four bounding box coordinates are output per retrieved object. These coordinates range from 0.0 to 1.0, and the top left corner of the input image is (0, 0). To map them onto pixel positions, multiply by the image width and height, as in the sketch after this list.
xmin
: Horizontal position of the top left corner of a bounding box
ymin
: Vertical position of the top left corner of a bounding box
xmax
: Horizontal position of the bottom right corner of a bounding box
ymax
: Vertical position of the bottom right corner of a bounding box
markdown_bbox mode#
[
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.0969,
      "xmax": 0.3097820480404551,
      "ymax": 0.1102
    },
    "text": "## 1 Introduction",
    "type": "Section-header"
  },
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.1336,
      "xmax": 0.8316834386852087,
      "ymax": 0.1977
    },
    "text": "Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, ...",
    "type": "Text"
  },
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.8086,
      "xmax": 0.3697850821744627,
      "ymax": 0.8219
    },
    "text": "## 3 Model Architecture",
    "type": "Section-header"
  },
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.8414,
      "xmax": 0.8347044247787612,
      "ymax": 0.907
    },
    "text": "Most competitive neural sequence transduction models have an encoder-decoder structure.",
    "type": "Text"
  },
  {
    "bbox": {
      "xmin": 0.4959372945638433,
      "ymin": 0.9344,
      "xmax": 0.5040627054361567,
      "ymax": 0.9453
    },
    "text": "2",
    "type": "Page-footer"
  }
]
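Because each entry carries both the transcription and its position, you can, for example, rebuild the page markdown in reading order by sorting on ymin (a simple single-column heuristic, not something the service does for you):
# `results_of_detection` is the parsed markdown_bbox output shown above.
ordered = sorted(results_of_detection, key=lambda d: d["bbox"]["ymin"])
page_markdown = "\n\n".join(d["text"] for d in ordered)
print(page_markdown)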
markdown_no_bbox mode#
Note
For best output, use markdown_bbox mode instead.
{
  "text": "## 1 Introduction\n\nRecurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.\n\n"
}
detection_only mode#
Note
For best output, use markdown_bbox mode instead.
[
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.0953,
      "xmax": 0.3087403286978508,
      "ymax": 0.1086
    },
    "type": "Section-header"
  },
  {
    "bbox": {
      "xmin": 0.166,
      "ymin": 0.1305,
      "xmax": 0.8306417193426043,
      "ymax": 0.1984
    },
    "type": "Text"
  },
  ...
  {
    "bbox": {
      "xmin": 0.4959372945638433,
      "ymin": 0.9344,
      "xmax": 0.5040627054361567,
      "ymax": 0.9445
    },
    "type": "Page-footer"
  }
]
Bounding box detection script#
You can use the following example Python script to display the predicted bounding boxes on a given input image. Use this script to verify the accuracy of bounding box detection and label classification.
from PIL import Image, ImageDraw, ImageFont

detections = results_of_detection  # Detection results from the Python openai example

image = Image.open("example.png")
draw = ImageDraw.Draw(image)
width, height = image.size

colors = ["red", "green", "blue", "yellow", "magenta", "cyan", "orange", "purple"]

try:
    font = ImageFont.truetype("arial.ttf", 16)
except IOError:
    font = ImageFont.load_default()

for i, det in enumerate(detections):
    bbox = det["bbox"]

    # Convert normalized coordinates to pixel values.
    left = bbox["xmin"] * width
    top = bbox["ymin"] * height
    right = bbox["xmax"] * width
    bottom = bbox["ymax"] * height

    # Choose a color for the box.
    color = colors[i % len(colors)]

    # Draw the bounding box with a 3-pixel thick outline.
    draw.rectangle([left, top, right, bottom], outline=color, width=3)

    # Use the 'type' key as the label title.
    label = det.get("type", "")

    # Instead of measuring text size, use a fixed-size background.
    fixed_label_width = 80   # Fixed width for the label background
    fixed_label_height = 20  # Fixed height for the label background

    # Draw a filled rectangle above the bounding box for the label background.
    draw.rectangle([left, top - fixed_label_height, left + fixed_label_width, top], fill=color)

    # Draw the label text (with a small padding).
    draw.text((left + 3, top - fixed_label_height + 3), label, fill="black", font=font)

image.save("example_result.png")
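Because the script reads only the bbox and type fields, it also works unchanged with detection_only results. If you want to visualize only certain region types, you can filter the detections first, for example using the labels shown in the response examples above:
# Keep only the region types of interest before drawing.
wanted = {"Section-header", "Text"}
detections = [d for d in results_of_detection if d.get("type") in wanted]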