nemoretriever-parse Overview#
nemoretriever-parse is a tiny autoregressive Visual Language Model (VLM) designed for document transcription from images. It outputs text in reading order. The launch guide and request examples can be found in the API examples. nemoretriever-parse leverages Commercial Radio (C-RADIO) [1] for visual feature extraction and mBART [2] as the decoder for generating text outputs. nemoretriever-parse operates in the following three distinct modes, each tailored to specific use cases for document transcription and formatting:
- Bounding box, class, and markdown mode:
This is the default mode. This mode extracts the bounding boxes of the text regions within the document image, classifies the content type (for example, header, paragraph, and image caption), and outputs the text in markdown format. This mode is useful for retaining the spatial layout of the document and adding semantic structure through classification, making the transcription more organized and context-aware. Markdown format helps preserve the structural elements of the document such as headings, lists, and bold/italic text.
- No bounding box, no class, and markdown mode:
This mode outputs the transcribed text in markdown format but without the bounding box information. It focuses solely on the content and structure of the document and disregards the spatial positioning of the text elements. This mode is ideal when the primary concern is capturing the hierarchical and semantic structure of the document (through markdown) without the need to retain the layout details.
- Detection only mode:
This mode detects and classifies the text regions within the document image; it doesn’t perform transcription or output any markdown. This mode returns the bounding boxes and content classifications, which is useful in scenarios where users need to process or annotate document regions without requiring immediate text extraction. This mode is optimal for pre-processing steps or cases where layout analysis takes priority over textual content extraction.
These modes offer flexibility in handling a variety of document transcription tasks, from full document layout understanding (markdown_bbox) to pure text extraction (markdown_no_bbox).
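The prompt tool names above map directly to how a mode is selected in a request. The sketch below is illustrative only: the endpoint URL, model name, and payload schema are assumptions based on an OpenAI-compatible chat interface, so refer to the API examples for the authoritative request format.

import base64
import requests

# Sketch only: the endpoint, model name, and payload fields below are assumptions;
# see the API examples for the exact request schema.
with open("page.png", "rb") as f:  # "page.png" is a placeholder document image
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "nvidia/nemoretriever-parse",
    "messages": [{
        "role": "user",
        "content": [{"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}}],
    }],
    # Select the transcription mode (markdown_bbox, markdown_no_bbox, or detection_only).
    "tools": [{"type": "function", "function": {"name": "markdown_bbox"}}],
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json())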
Architecture Details#
| Component | Details |
|---|---|
| Vision Backbone | C-RADIO (ViT-H) |
| Decoder | mBART |
| Resolution (height x width) | 2048 x 1648 |
| Min Input Normalization | 0.0 |
| Max Input Normalization | 1.0 |
| Tokenizer Vocab | 52326 tokens. This model uses the same tokenizer as mBART, with additional tokens. |
| Output Classes | 13 classes (see Output Format) |
| Prompting Mechanism | Prompt tool type (markdown_bbox, markdown_no_bbox, detection_only) |
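To make the resolution and normalization entries above concrete, a rough equivalent of the input preprocessing is sketched below. This is an assumption-laden illustration (a plain resize with no aspect-ratio-preserving padding); you do not need to preprocess images yourself when calling the hosted endpoint.

import numpy as np
from PIL import Image

# Illustration only: resize to the model resolution and scale pixel values into
# the [0.0, 1.0] normalization range listed above. The actual serving pipeline
# may differ (for example, aspect-ratio-preserving resizing).
TARGET_HEIGHT, TARGET_WIDTH = 2048, 1648

def preprocess(image: Image.Image) -> np.ndarray:
    image = image.convert("RGB").resize((TARGET_WIDTH, TARGET_HEIGHT))
    return np.asarray(image, dtype=np.float32) / 255.0  # shape (2048, 1648, 3)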
Output Format#
The output depends on the prompt tool type:
- markdown_bbox: Outputs bounding box information, text, and class (type). The output is formatted as a list of JSON objects:

[
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.0969,
      "xmax": 0.3097820480404551,
      "ymax": 0.1102
    },
    "text": "## 1 Introduction",
    "type": "Section-header"
  }
]

- markdown_no_bbox: Outputs markdown in reading order but no bounding box information:

{
  "text": "## 1 Introduction\n\nRecurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.\n\n"
}

- detection_only: Outputs bounding box information and class (type). The output is formatted as a list of JSON objects:

[
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.0969,
      "xmax": 0.3097820480404551,
      "ymax": 0.1102
    },
    "type": "Section-header"
  }
]
When predicted, bounding boxes appear in the output with their start x and y coordinates as <xmin><ymin>. The top left corner of an input image is (0,0). The ending x and y coordinates are defined in the output as <xmax><ymax>. When predicted, the class information is encoded in the output as type, where type belongs to one of the 13 defined classes.
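For example, assuming the bounding box coordinates are normalized to the [0, 1] range relative to the page image (consistent with the example values above), a markdown_bbox response can be mapped back to pixel coordinates as sketched below. The variable response_text and the file page.png are placeholders.

import json
from PIL import Image

def bbox_to_pixels(bbox, image_width, image_height):
    # Convert normalized [0, 1] coordinates to pixel coordinates.
    return (
        int(bbox["xmin"] * image_width),
        int(bbox["ymin"] * image_height),
        int(bbox["xmax"] * image_width),
        int(bbox["ymax"] * image_height),
    )

page = Image.open("page.png")        # the page image sent for inference
regions = json.loads(response_text)  # the JSON list returned by the model
for region in regions:
    x0, y0, x1, y1 = bbox_to_pixels(region["bbox"], page.width, page.height)
    print(region["type"], (x0, y0, x1, y1), region.get("text", ""))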
Input PDF Processing#
You can use pdf2image or fitz (PyMuPDF) to convert input PDFs to 300 DPI images. Use a high batch size to reduce the cost of inference.
For pdf2image:

from pdf2image import convert_from_path

# Convert every page of the PDF to a 300 DPI PIL image.
# Replace <path> with the path to your PDF file.
images = convert_from_path("<path>", dpi=300, use_cropbox=True)
For fitz:
import fitz
from PIL import Image

def extract_page_from_pdf_new(pdf_path, page_num, dpi=300):
    """Render a single PDF page to a PIL image at the requested DPI."""
    pdf_document = fitz.open(pdf_path)
    page = pdf_document.load_page(page_num)
    # Calculate the zoom factor based on the desired DPI
    zoom = dpi / 72  # PDF default resolution is 72 DPI
    mat = fitz.Matrix(zoom, zoom)
    # Render the page to an image
    pix = page.get_pixmap(matrix=mat)
    # Convert to a PIL Image
    image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    pdf_document.close()
    return image
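For example, to render every page of a document with this helper (<path> is a placeholder):

pdf_path = "<path>"
num_pages = fitz.open(pdf_path).page_count
images = [extract_page_from_pdf_new(pdf_path, page_num) for page_num in range(num_pages)]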
Troubleshooting#
My network is showing corrupted/repeated output#
This typically indicates that the network is experiencing hallucination, a common issue in generative models where the output becomes incoherent or repetitive. Despite this, all outputs generated before the onset of hallucination are generally still valid and can be used.
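If you want to salvage such an output automatically, one simple heuristic (a sketch only, not part of the model or its API) is to truncate the text at the point where a line starts repeating:

def truncate_repeated_lines(text, max_repeats=3):
    # Keep everything up to the first line that repeats more than max_repeats
    # times in a row, which is a typical signature of repetitive hallucination.
    lines = text.splitlines()
    kept, run = [], 0
    for i, line in enumerate(lines):
        if i > 0 and line.strip() and line == lines[i - 1]:
            run += 1
        else:
            run = 0
        if run >= max_repeats:
            break
        kept.append(line)
    return "\n".join(kept)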
References#