NVIDIA NIM for Vision Language Models (VLMs)

Vision Backbone	cRadio-VIT-H
Decoder	mBART
Resolution (height x width)	2048x1648
Min Input Normalization	0.0
Max Input Normalization	1.0
Tokenizer Vocab	52326 tokens. This model uses the same tokenizer as mBART, with the following added tokens: Prompts: 5; Object location: 1648 for width and 2048 for height; Class: 13
Output Classes	Text (regular paragraph text); Title; Section-header; List-item (any list item: numbered, alphanumeric, or bullet point); TOC (table of contents); Bibliography; Footnote; Page-header; Page-footer; Picture; Formula; Table; Caption (table or picture)
Prompting Mechanism	Bounding box, class, and markdown mode: `bos_token_id`, `output_markdown_index`, `predict_m_classes_index`, `predict_bbox_index`. No bounding box, no class, and markdown mode: `bos_token_id`, `output_markdown_index`.
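
The table values above can be captured as a small set of constants, which makes the two prompt modes and the added-token arithmetic easier to check. This is a minimal sketch, not the NIM's actual API: the names `PROMPT_MODES`, `build_prompt`, `normalize_pixel`, and `added_token_count` are hypothetical, the token IDs are placeholders (the table gives only the token names), and the 8-bit pixel range is an assumption the table does not state.

```python
from typing import Dict, List, Tuple

# Values copied from the table above.
INPUT_RESOLUTION: Tuple[int, int] = (2048, 1648)  # (height, width)
NORM_MIN, NORM_MAX = 0.0, 1.0

OUTPUT_CLASSES: List[str] = [
    "Text", "Title", "Section-header", "List-item", "TOC",
    "Bibliography", "Footnote", "Page-header", "Page-footer",
    "Picture", "Formula", "Table", "Caption",
]

# Tokens added on top of the mBART tokenizer, per the table.
ADDED_TOKENS: Dict[str, int] = {
    "prompts": 5,
    "location_width": 1648,
    "location_height": 2048,
    "class": 13,
}

# Prompt token names in the order the table lists them.
# The mode keys are hypothetical labels chosen for this sketch.
PROMPT_MODES: Dict[str, List[str]] = {
    "bbox_class_markdown": [
        "bos_token_id",
        "output_markdown_index",
        "predict_m_classes_index",
        "predict_bbox_index",
    ],
    "markdown_only": [
        "bos_token_id",
        "output_markdown_index",
    ],
}


def added_token_count() -> int:
    """Total number of tokens added to the base tokenizer."""
    return sum(ADDED_TOKENS.values())


def normalize_pixel(value: int, max_value: int = 255) -> float:
    """Map a pixel to [NORM_MIN, NORM_MAX]. Assumes 8-bit input by default."""
    if not 0 <= value <= max_value:
        raise ValueError(f"pixel value {value} outside 0..{max_value}")
    return NORM_MIN + (NORM_MAX - NORM_MIN) * value / max_value


def build_prompt(mode: str, token_ids: Dict[str, int]) -> List[int]:
    """Translate a mode's token names into IDs using a caller-supplied mapping."""
    return [token_ids[name] for name in PROMPT_MODES[mode]]


if __name__ == "__main__":
    # Placeholder IDs for illustration only; real IDs come from the tokenizer.
    placeholder_ids = {
        "bos_token_id": 0,
        "output_markdown_index": 1,
        "predict_m_classes_index": 2,
        "predict_bbox_index": 3,
    }
    print(build_prompt("bbox_class_markdown", placeholder_ids))
    print(build_prompt("markdown_only", placeholder_ids))
    print(added_token_count())
```

The 13 entries in `OUTPUT_CLASSES` match the 13 added class tokens in count, and the location token counts match the input width and height, which is consistent with one token per pixel coordinate along each axis.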