nemoretriever-parse Overview#
nemoretriever-parse is a tiny autoregressive Visual Language Model (VLM) designed for document transcription from images. It outputs text in reading order. The launch guide and request examples can be found in the API examples. nemoretriever-parse leverages Commercial Radio (C-RADIO) [1] for visual feature extraction and mBART [2] as the decoder for generating text outputs. nemoretriever-parse operates in the following three distinct modes, each tailored to specific use cases for document transcription and formatting:
- Bounding box, class, and markdown mode:
This is the default mode. This mode extracts the bounding boxes of the text regions within the document image, classifies the content type (for example, header, paragraph, and image caption), and outputs the text in markdown format. This mode is useful for retaining the spatial layout of the document and adding semantic structure through classification, making the transcription more organized and context-aware. Markdown format helps preserve the structural elements of the document such as headings, lists, and bold/italic text.
- No bounding box, no class, and markdown mode:
This mode outputs the transcribed text in markdown format but without the bounding box information. It focuses solely on the content and structure of the document and disregards the spatial positioning of the text elements. This mode is ideal when the primary concern is capturing the hierarchical and semantic structure of the document (through markdown) without the need to retain the layout details.
- Detection only mode:
This mode detects and classifies the text regions within the document image; it doesn’t perform transcription or output any markdown. This mode returns the bounding boxes and content classifications, which is useful in scenarios where users need to process or annotate document regions without requiring immediate text extraction. This mode is optimal for pre-processing steps or cases where layout analysis takes priority over textual content extraction.
These modes offer flexibility in handling a variety of document transcription tasks, from full document layout understanding (markdown_bbox) to pure text extraction (markdown_no_bbox).
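The prompt tool names above map directly to how a mode is selected in a request. The sketch below is illustrative only: the endpoint URL, model name, and payload schema are assumptions based on an OpenAI-compatible chat interface, so refer to the API examples for the authoritative request format.

import base64
import requests

# Sketch only: the endpoint, model name, and payload fields below are assumptions;
# see the API examples for the exact request schema.
with open("page.png", "rb") as f:  # "page.png" is a placeholder document image
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "nvidia/nemoretriever-parse",
    "messages": [{
        "role": "user",
        "content": [{"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}}],
    }],
    # Select the transcription mode (markdown_bbox, markdown_no_bbox, or detection_only).
    "tools": [{"type": "function", "function": {"name": "markdown_bbox"}}],
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json())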
Architecture Details#
| Component | Details |
|---|---|
| Vision Backbone | C-RADIO (ViT-H) |
| Decoder | mBART |
| Resolution (height x width) | 2048 x 1648 |
| Min Input Normalization | 0.0 |
| Max Input Normalization | 1.0 |
| Tokenizer Vocab | 52326 tokens. This model uses the same tokenizer as mBART, with additional tokens. |
| Output Classes | 13 classes (see Output Format) |
| Prompting Mechanism | Prompt tool type (markdown_bbox, markdown_no_bbox, detection_only) |
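To make the resolution and normalization entries above concrete, a rough equivalent of the input preprocessing is sketched below. This is an assumption-laden illustration (a plain resize with no aspect-ratio-preserving padding); you do not need to preprocess images yourself when calling the hosted endpoint.

import numpy as np
from PIL import Image

# Illustration only: resize to the model resolution and scale pixel values into
# the [0.0, 1.0] normalization range listed above. The actual serving pipeline
# may differ (for example, aspect-ratio-preserving resizing).
TARGET_HEIGHT, TARGET_WIDTH = 2048, 1648

def preprocess(image: Image.Image) -> np.ndarray:
    image = image.convert("RGB").resize((TARGET_WIDTH, TARGET_HEIGHT))
    return np.asarray(image, dtype=np.float32) / 255.0  # shape (2048, 1648, 3)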
Output Format#
The output depends on the prompt tool type:
- markdown_bbox: Outputs bounding box information, text, and class (type). The output is formatted as a list of JSON objects:

[
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.0969,
      "xmax": 0.3097820480404551,
      "ymax": 0.1102
    },
    "text": "## 1 Introduction",
    "type": "Section-header"
  }
]

- markdown_no_bbox: Outputs markdown in reading order but no bounding box information:

{
  "text": "## 1 Introduction\n\nRecurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.\n\n"
}

- detection_only: Outputs bounding box information and class (type). The output is formatted as a list of JSON objects:

[
  {
    "bbox": {
      "xmin": 0.16633729456384325,
      "ymin": 0.0969,
      "xmax": 0.3097820480404551,
      "ymax": 0.1102
    },
    "type": "Section-header"
  }
]
When predicted, bounding boxes appear in the output with their start x and y coordinates as <xmin><ymin>. The top left corner of an input image is (0,0). The ending x and y coordinates are defined in the output as <xmax><ymax>. When predicted, the class information is encoded in the output as type, where type belongs to one of the 13 defined classes.
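For example, assuming the bounding box coordinates are normalized to the [0, 1] range relative to the page image (consistent with the example values above), a markdown_bbox response can be mapped back to pixel coordinates as sketched below. The variable response_text and the file page.png are placeholders.

import json
from PIL import Image

def bbox_to_pixels(bbox, image_width, image_height):
    # Convert normalized [0, 1] coordinates to pixel coordinates.
    return (
        int(bbox["xmin"] * image_width),
        int(bbox["ymin"] * image_height),
        int(bbox["xmax"] * image_width),
        int(bbox["ymax"] * image_height),
    )

page = Image.open("page.png")        # the page image sent for inference
regions = json.loads(response_text)  # the JSON list returned by the model
for region in regions:
    x0, y0, x1, y1 = bbox_to_pixels(region["bbox"], page.width, page.height)
    print(region["type"], (x0, y0, x1, y1), region.get("text", ""))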
Input PDF Processing#
You can use pdf2image or fitz (PyMuPDF) to convert input PDFs to 300 DPI images. Use a high batch size to reduce the cost of inference.
For pdf2image:

from pdf2image import convert_from_path

# Convert every page of the PDF to a 300 DPI PIL image.
# Replace <path> with the path to your PDF file.
images = convert_from_path("<path>", dpi=300, use_cropbox=True)
For fitz:
import fitz
from PIL import Image

def extract_page_from_pdf_new(pdf_path, page_num, dpi=300):
    """Render a single PDF page to a PIL image at the requested DPI."""
    pdf_document = fitz.open(pdf_path)
    page = pdf_document.load_page(page_num)
    # Calculate the zoom factor based on the desired DPI
    zoom = dpi / 72  # PDF default resolution is 72 DPI
    mat = fitz.Matrix(zoom, zoom)
    # Render the page to an image
    pix = page.get_pixmap(matrix=mat)
    # Convert to a PIL Image
    image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    pdf_document.close()
    return image
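For example, to render every page of a document with this helper (<path> is a placeholder):

pdf_path = "<path>"
num_pages = fitz.open(pdf_path).page_count
images = [extract_page_from_pdf_new(pdf_path, page_num) for page_num in range(num_pages)]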
Troubleshooting#
My network is showing corrupted/repeated output#
This typically indicates that the network is experiencing hallucination, a common issue in generative models where the output becomes incoherent or repetitive. Despite this, all outputs generated before the onset of hallucination are generally still valid and can be used.
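If you want to salvage such an output automatically, one simple heuristic (a sketch only, not part of the model or its API) is to truncate the text at the point where a line starts repeating:

def truncate_repeated_lines(text, max_repeats=3):
    # Keep everything up to the first line that repeats more than max_repeats
    # times in a row, which is a typical signature of repetitive hallucination.
    lines = text.splitlines()
    kept, run = [], 0
    for i, line in enumerate(lines):
        if i > 0 and line.strip() and line == lines[i - 1]:
            run += 1
        else:
            run = 0
        if run >= max_repeats:
            break
        kept.append(line)
    return "\n".join(kept)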
References#