nemoretriever-parse API#

Launch NIM#

The following command launches a Docker container for this specific model:

# Choose a container name for bookkeeping
export CONTAINER_NAME=nemoretriever-parse

# The repository name and tag from the previous ngc registry image list command
Repository="nemoretriever-parse"
Latest_Tag="1.2.0"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
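
On first launch, the container downloads and caches the model, which can take several minutes. You can wait for the service to report ready before sending requests. The following is a minimal readiness poll sketch; the /v1/health/ready path and port 8000 are assumptions based on the port mapping above, so adjust them to match your deployment:

import time
import urllib.error
import urllib.request

# Poll the NIM readiness endpoint until the model is loaded and serving.
READY_URL = "http://0.0.0.0:8000/v1/health/ready"

for _ in range(60):
    try:
        with urllib.request.urlopen(READY_URL, timeout=5) as resp:
            if resp.status == 200:
                print("NIM is ready")
                break
    except (urllib.error.URLError, ConnectionError):
        pass
    time.sleep(10)
else:
    raise RuntimeError("NIM did not become ready in time")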

Request examples#

To configure how text regions are extracted and transcribed, specify the tools parameter in your request. You can specify only one tool type per request.

Note

Text input is not supported.

cURL#

In a cURL request, you can provide an image URL or send the image data as a Base64-encoded string (see Passing images). The following request uses an image URL:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/nemoretriever-parse",
    "tools": [
      {
        "type": "function",
        "function":
          {
            "name": "markdown_bbox"
          }
      }
    ],
    "messages": [
      {
        "role":"user",
        "content": [
          {
            "type": "image_url",
            "image_url":
              {
                "url": "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
              }
          }
        ]
      }
    ]
  }'

Note

LangChain and LlamaStack integrations are not supported.

Python#

You can provide an image URL or send the image data as a Base64-encoded string (see Passing images). The following example sends the image data as a Base64-encoded string:

import json
from openai import OpenAI

client = OpenAI(
  base_url = "http://0.0.0.0:8000/v1",
  # For local deployment, an API key is not needed, but the following
  # string variable must be non-empty.
  api_key = "<api-key-placeholder-keep-non-empty>"
)

completion = client.chat.completions.create(
    model="nvidia/nemoretriever-parse",
    # See Tool types section for more information.
    tools=[{"type": "function", "function": {"name": "markdown_bbox"}}],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        # You can provide the image URL or send the image
                        # data as a Base64-encoded string.
                        "url": "data:image/png;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
                    },
                },
            ]
        }
    ],
)
tool_call = completion.choices[0].message.tool_calls[0]
results_of_detection = json.loads(tool_call.function.arguments)[0]
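
With the default markdown_bbox tool, the parsed arguments contain a list of detections, each with bbox, text, and type fields (see Response formats). A minimal sketch of inspecting the results:

# Print the classified content type and transcribed text for each detection.
for det in results_of_detection:
    print(det.get("type"), det.get("text", ""))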

Passing images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Public direct URL

Passing the direct URL of an image will cause the container to download that image at runtime.

{
  "type": "image_url",
  "image_url": {
    "url": "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
  }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.

{
  "type": "image_url",
  "image_url": {
    "url": "data:image/png;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
  }
}

To convert an image to Base64, you can use the base64 command-line tool, or do it in Python:

import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
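
To use the encoded string in a request, prepend the data URL prefix and place the result in the image_url entry of a user message. A minimal sketch, assuming a PNG input (match the media type to your file format):

# Build the data URL expected by the image_url field; image_b64 comes from
# the encoding snippet above, and image/png is assumed for this example.
image_content = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
}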

Tool types#

Only one tool type may be specified per request. Choose one of the following options (a request sketch follows the list):

  • markdown_bbox: Default type. Extracts the bounding boxes of the text regions within the document image, classifies the content type, and outputs the text in markdown format.

  • markdown_no_bbox: Outputs the transcribed text in markdown format but without the bounding box information.

  • detection_only: Extracts the bounding boxes of the text regions within the document image and classifies the content type, but doesn’t perform transcription or output any markdown.

Note

For best output, use markdown_bbox.
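
For example, the following sketch reuses the client from the Python example above to request detection_only instead of the default markdown_bbox; only the tool name changes:

# Request layout detection without transcription by selecting the
# detection_only tool type; the message structure is unchanged.
completion = client.chat.completions.create(
    model="nvidia/nemoretriever-parse",
    tools=[{"type": "function", "function": {"name": "detection_only"}}],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
                    },
                },
            ],
        }
    ],
)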

Response formats#

For the markdown_bbox and detection_only tool types, four bounding box coordinates are output per retrieved object. The coordinates are normalized to the range 0.0 to 1.0, and the top left corner of the input image is (0, 0). A pixel-conversion sketch follows the list.

  • xmin: Horizontal position of the top left corner of a bounding box

  • ymin: Vertical position of the top left corner of a bounding box

  • xmax: Horizontal position of the bottom right corner of a bounding box

  • ymax: Vertical position of the bottom right corner of a bounding box
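
Because the coordinates are normalized, multiply them by the image dimensions to convert to pixel positions. A minimal sketch, using a hypothetical 1700 x 2200 pixel page image and the first bounding box from the example below:

# Hypothetical page image size in pixels.
width, height = 1700, 2200

bbox = {"xmin": 0.1663, "ymin": 0.0969, "xmax": 0.3098, "ymax": 0.1102}
left, top = bbox["xmin"] * width, bbox["ymin"] * height      # ~ (283, 213)
right, bottom = bbox["xmax"] * width, bbox["ymax"] * height  # ~ (527, 242)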

markdown_bbox mode#

[
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.0969,
      "xmax": 0.3097820480404551,
      "ymax": 0.1102
    },
    "text": "## 1 Introduction",
    "type": "Section-header"
  },
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.1336,
      "xmax": 0.8316834386852087,
      "ymax": 0.1977
    },
    "text": "Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, ...",
    "type": "Text"
  },
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.8086,
      "xmax": 0.3697850821744627,
      "ymax": 0.8219
    },
    "text": "## 3 Model Architecture",
    "type": "Section-header"
  },
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.8414,
      "xmax": 0.8347044247787612,
      "ymax": 0.907
    },
    "text": "Most competitive neural sequence transduction models have an encoder-decoder structure.",
    "type": "Text"
  },
  {
    "bbox": {
      "xmin": 0.4959372945638433,
      "ymin": 0.9344,
      "xmax": 0.5040627054361567,
      "ymax": 0.9453
    },
    "text": "2",
    "type": "Page-footer"
  }
]

markdown_no_bbox mode#

Note

For best output, use markdown_bbox mode instead.

{
    "text": "## 1 Introduction\n\nRecurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.\n\n"
}

detection_only mode#

Note

For best output, use markdown_bbox mode instead.

[
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.0953,
      "xmax": 0.3087403286978508,
      "ymax": 0.1086
    },
    "type": "Section-header"
  },
  {
    "bbox": {
      "xmin": 0.166,
      "ymin": 0.1305,
      "xmax": 0.8306417193426043,
      "ymax": 0.1984
    },
    "type": "Text"
  },

    ...

  {
    "bbox": {
      "xmin": 0.4959372945638433,
      "ymin": 0.9344,
      "xmax": 0.5040627054361567,
      "ymax": 0.9445
    },
    "type": "Page-footer"
  }
]

Bounding box detection script#

You can use the following example Python script to display the predicted bounding boxes on a given input image. Use this script to verify the accuracy of bounding box detection and label classification.

from PIL import Image, ImageDraw, ImageFont

detections = results_of_detection  # Detection results from the Python OpenAI example above

image = Image.open("example.png")
draw = ImageDraw.Draw(image)
width, height = image.size
colors = ["red", "green", "blue", "yellow", "magenta", "cyan", "orange", "purple"]

try:
    font = ImageFont.truetype("arial.ttf", 16)
except IOError:
    font = ImageFont.load_default()

for i, det in enumerate(detections):
    bbox = det["bbox"]
    # Convert normalized coordinates to pixel values.
    left = bbox["xmin"] * width
    top = bbox["ymin"] * height
    right = bbox["xmax"] * width
    bottom = bbox["ymax"] * height
    # Choose a color for the box.
    color = colors[i % len(colors)]
    # Draw the bounding box with a 3-pixel thick outline.
    draw.rectangle([left, top, right, bottom], outline=color, width=3)

    # Use the 'type' key as the label title.
    label = det.get("type", "")
    # Instead of measuring text size, use a fixed-size background.
    fixed_label_width = 80    # Fixed width for the label background
    fixed_label_height = 20   # Fixed height for the label background

    # Draw a filled rectangle above the bounding box for the label background.
    draw.rectangle([left, top - fixed_label_height, left + fixed_label_width, top], fill=color)
    # Draw the label text (with a small padding).
    draw.text((left + 3, top - fixed_label_height + 3), label, fill="black", font=font)

image.save("example_result.png")