Vision Backbone

cRadio - VIT - H

Decoder

mBART

Resolution (height x width)

2048x1648

Min Input Normalization

0.0

Max Input Normalization

1.0

Tokenizer Vocab

52326 tokens. This model uses the same tokenizer as mBART but with added tokens as follows:

  • Prompts: 5

  • Object location: 1648 for width and 2048 for height

  • Class: 13

Output Classes

  • Text: Regular paragraph text

  • Title

  • Section-header

  • List-item: Any list item (numbered, alphanumeric or bullet point)

  • TOC: Table of Contents

  • Bibliography

  • Footnote

  • Page-header

  • Page-footer

  • Picture

  • Formula

  • Table

  • Caption (table or picture)

Prompting Mechanism

  • Bounding box, class, and markdown mode:

    • bos_token_id

    • output_markdown_index

    • predict_m_classes_index

    • predict_bbox_index

  • No bounding box, no class, and markdown mode:

    • bos_token_id

    • output_markdown_index