Vision Backbone |
cRadio-ViT-H |
Decoder |
mBART |
Resolution (height x width) |
2048x1648 |
Min Input Normalization |
0.0 |
Max Input Normalization |
1.0 |
Tokenizer Vocab |
52326 tokens. This model uses the same tokenizer as mBART but with added tokens as follows:
|
Output Classes |
Text: Regular paragraph text
Title
Section-header
List-item: Any list item (numbered, alphanumeric or bullet point)
TOC: Table of Contents
Bibliography
Footnote
Page-header
Page-footer
Picture
Formula
Table
Caption (table or picture)
|
Prompting Mechanism |
Bounding box, class, and markdown mode:
bos_token_id
output_markdown_index
predict_m_classes_index
predict_bbox_index
Markdown mode without bounding boxes or classes:
bos_token_id
output_markdown_index
|
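
As a rough illustration of the input normalization and prompting rows above, the sketch below scales pixel values into the listed 0.0 to 1.0 range and assembles the two prompt sequences in the order the table gives. The names `PromptTokenIds`, `build_prompt`, and `scale_to_unit_range` are hypothetical helpers, not part of the model's released code. The integer IDs in the usage note are placeholders, since the real IDs come from the model's tokenizer or config. The 8-bit input assumption is also ours.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromptTokenIds:
    """Special-token IDs named in the table (values must come from the real tokenizer)."""
    bos_token_id: int
    output_markdown_index: int
    predict_m_classes_index: int
    predict_bbox_index: int


def build_prompt(ids: PromptTokenIds, with_bbox_and_class: bool) -> List[int]:
    """Return the decoder prompt IDs in the order listed in the table."""
    # Both modes start with BOS followed by the markdown-output token.
    prompt = [ids.bos_token_id, ids.output_markdown_index]
    if with_bbox_and_class:
        # The bounding box + class mode appends the class and bbox tokens.
        prompt += [ids.predict_m_classes_index, ids.predict_bbox_index]
    return prompt


def scale_to_unit_range(pixels: List[int]) -> List[float]:
    """Map 8-bit pixel values (assumed 0..255) into [0.0, 1.0]."""
    return [p / 255 for p in pixels]
```

With placeholder IDs such as `PromptTokenIds(0, 1, 2, 3)`, `build_prompt(ids, True)` yields four tokens and `build_prompt(ids, False)` yields only the first two, mirroring the two modes in the table.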