Source and Content Metadata Reference for NV-Ingest

This page is the reference for the source and content metadata that NV-Ingest produces. It uses the following definitions:

  • Source: The file that is ingested, and from which content and metadata are extracted.
  • Content: Data extracted from a source, such as text or an image.

Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.

Source File Metadata

The following is the metadata for source files.

| Field | Description | Method |
|---|---|---|
| Source Name | The name of the source file. | Extracted |
| Source ID | The ID of the source file. | Extracted |
| Source location | The URL, URI, or pointer to the storage location of the source file. | |
| Source Type | The type of the source file, such as pdf, docx, pptx, or txt. | Extracted |
| Collection ID | The ID of the collection in which the source is contained. | |
| Date Created | The date the source was created. | Extracted |
| Last Modified | The date the source was last modified. | Extracted |
| Partition ID | The offset of this data fragment within a larger set of fragments. | Generated |
| Access Level | The role-based access control for the source. | |
| Summary | A summary of the source. (Not yet implemented.) | Generated |
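
As a rough illustration of how these fields might be consumed, the following sketch reads a `source_metadata` dictionary shaped like the one in the example JSON on this page and parses its timestamps. The key names come from that example. The helper name `summarize_source` is hypothetical and not part of NV-Ingest, and the assumption that timestamps are ISO 8601 strings rests only on the example values.

```python
from datetime import datetime


def summarize_source(source_metadata: dict) -> dict:
    """Return a small view of a source_metadata dictionary.

    Hypothetical helper: key names follow the example JSON on this page.
    """
    def parse_ts(value):
        # Assumption: timestamps are ISO 8601 strings; empty or missing -> None.
        return datetime.fromisoformat(value) if value else None

    return {
        "name": source_metadata.get("source_name", ""),
        # The example stores the type in upper case ("PDF"); normalize it.
        "type": source_metadata.get("source_type", "").lower(),
        "created": parse_ts(source_metadata.get("date_created")),
        "modified": parse_ts(source_metadata.get("last_modified")),
    }


example = {
    "source_name": "data/multimodal_test.pdf",
    "source_type": "PDF",
    "date_created": "2025-03-13T18:37:14.715892",
    "last_modified": "2025-03-13T18:37:14.715534",
}
summary = summarize_source(example)
print(summary["type"], summary["created"].date())
```

Fields that have no value in a given record, such as an empty `source_location`, are simply passed through or mapped to `None` in this sketch.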

Content Metadata

The following is the metadata for content. These fields apply to all content types, including text, images, and tables.

| Field | Description | Method |
|---|---|---|
| Type | The type of the content: Text, Image, Structured, Table, or Chart. | Generated |
| Subtype | The type of the content for structured data types, such as table or chart. | |
| Content | Content extracted from the source. | Extracted |
| Description | A text description of the content object. | Generated |
| Page # | The page number of the content in the source. | Extracted |
| Hierarchy | The location or order of the content within the source. | Extracted |
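
To show how the Type and Subtype fields could be used together, here is a minimal sketch that filters records by `content_metadata.type` and, optionally, `content_metadata.subtype`. It assumes each record is a dictionary shaped like the example JSON on this page, with lower-case values such as `"structured"` and `"chart"`. The function `select_content` and the sample records are hypothetical and for illustration only.

```python
def select_content(records, content_type, subtype=None):
    """Return records whose content_metadata matches the type and optional subtype.

    Hypothetical helper; assumes records follow the example JSON layout.
    """
    selected = []
    for record in records:
        # content_metadata may be missing or null, so fall back to an empty dict.
        content_meta = record.get("metadata", {}).get("content_metadata") or {}
        if content_meta.get("type") != content_type:
            continue
        if subtype is not None and content_meta.get("subtype") != subtype:
            continue
        selected.append(record)
    return selected


# Illustrative records only; real records carry many more fields.
records = [
    {"metadata": {"content_metadata": {"type": "structured", "subtype": "chart"}}},
    {"metadata": {"content_metadata": {"type": "structured", "subtype": "table"}}},
    {"metadata": {"content_metadata": {"type": "text", "subtype": ""}}},
]
charts = select_content(records, "structured", subtype="chart")
print(len(charts))
```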

Text Metadata

The following is the metadata for text.

| Field | Description | Method |
|---|---|---|
| Text Type | The type of the text, such as header or body. | Extracted |
| Keywords | Keywords, named entities, or other phrases. | Extracted |
| Language | The language of the content. | Generated |
| Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated |

Image Metadata

The following is the metadata for images.

| Field | Description | Method |
|---|---|---|
| Image Type | The type of the image, for example structured, natural, or hybrid. | Generated (Classifier) |
| Structured Image Type | The type of the content for structured data types, for example bar chart or pie chart. | Generated (Classifier) |
| Caption | Any caption or subheading associated with the image. | Extracted |
| Text | Text extracted from a structured chart. | Extracted |
| Image location | The location (x,y) of the chart within an image. | Extracted |
| Image location max dimensions | The max dimensions (x_max,y_max) of the location (x,y). | Extracted |
| uploaded_image_uri | A mirror of source_metadata.source_location. | |

Table Metadata

The following is the metadata for tables within documents.

Warning

Tables should not be chunked.

| Field | Description | Method |
|---|---|---|
| Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, or simple (cells separated by spaces). | Extracted |
| Table content | Extracted text content, formatted according to table_metadata.table_format. | Extracted |
| Table location | The bounding box of the table. | Extracted |
| Table location max dimensions | The max dimensions (x_max,y_max) of the bounding box of the table. | Extracted |
| Caption | The caption for the table or chart. | Extracted |
| Title | The title of the table. | Extracted |
| Subtitle | The subtitle of the table. | Extracted |
| Axis | Axis information for the table. | Extracted |
| uploaded_image_uri | A mirror of source_metadata.source_location. | Generated |
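
One plausible use of the table location together with its max dimensions is to scale the bounding box into a resolution-independent 0 to 1 range. The sketch below does that with the values from the example JSON on this page. It assumes the bounding box is ordered as `[x0, y0, x1, y1]` and the max dimensions as `[x_max, y_max]`; this page does not state that ordering explicitly, so treat it as an assumption. The function name `normalize_bbox` is hypothetical.

```python
def normalize_bbox(location, max_dimensions):
    """Scale a bounding box into the 0..1 range.

    Assumption: location is [x0, y0, x1, y1] and
    max_dimensions is [x_max, y_max].
    """
    x0, y0, x1, y1 = location
    x_max, y_max = max_dimensions
    if x_max <= 0 or y_max <= 0:
        raise ValueError("max dimensions must be positive")
    # x coordinates are divided by the width, y coordinates by the height.
    return [x0 / x_max, y0 / y_max, x1 / x_max, y1 / y_max]


# Values copied from the example JSON on this page.
table_metadata = {
    "table_location": [74, 614, 728, 920],
    "table_location_max_dimensions": [792, 1024],
}
normalized = normalize_bbox(
    table_metadata["table_location"],
    table_metadata["table_location_max_dimensions"],
)
print([round(value, 3) for value in normalized])
```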

Example Metadata

The following is an example JSON representation of metadata. This is an example only, and does not contain the full metadata. For the full file, refer to the data folder.

{
    "document_type": "text",
    "metadata": 
    {
        "content": "TestingDocument...",
        "content_url": "",
        "source_metadata": 
        {
            "source_name": "data/multimodal_test.pdf",
            "source_id": "data/multimodal_test.pdf",
            "source_location": "",
            "source_type": "PDF",
            "collection_id": "",
            "date_created": "2025-03-13T18:37:14.715892",
            "last_modified": "2025-03-13T18:37:14.715534",
            "summary": "",
            "partition_id": -1,
            "access_level": 1
        },
        "content_metadata": 
        {
            "type": "structured",
            "description": "Structured chart extracted from PDF document.",
            "page_number": 1,
            "hierarchy": 
            {
                "page_count": 3,
                "page": 1,
                "block": -1,
                "line": -1,
                "span": -1,
                "nearby_objects": 
                {
                    "text": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    },
                    "images": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    },
                    "structured": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    }
                }
            },
            "subtype": "chart"
        },
        "audio_metadata": null,
        "text_metadata": null,
        "image_metadata": null,
        "table_metadata": 
        {
            "caption": "",
            "table_format": "image",
            "table_content": "Below,is a high-quality picture of some shapes          Picture",
            "table_content_format": "",
            "table_location": 
            [
                74,
                614,
                728,
                920
            ],
            "table_location_max_dimensions": 
            [
                792,
                1024
            ],
            "uploaded_image_uri": ""
        },
        "chart_metadata": null,
        "error_metadata": null,
        "info_message_metadata": null,
        "debug_metadata": null,
        "raise_on_failure": false
    }
}
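
In the example above, most of the per-modality blocks are `null` and only `table_metadata` is populated. A small sketch of how a consumer might find the populated blocks follows. The key names are taken from the example; the helper `populated_blocks` and the abbreviated record are hypothetical, and the record omits most fields for brevity.

```python
import json

# Per-modality keys as they appear in the example record.
MODALITY_KEYS = (
    "audio_metadata",
    "text_metadata",
    "image_metadata",
    "table_metadata",
    "chart_metadata",
)


def populated_blocks(record):
    """Return the per-modality metadata blocks that are not null."""
    metadata = record.get("metadata", {})
    return {
        key: metadata[key]
        for key in MODALITY_KEYS
        if metadata.get(key) is not None
    }


# Abbreviated version of the example record; not the full metadata.
raw = """
{
    "document_type": "text",
    "metadata": {
        "content_metadata": {"type": "structured", "subtype": "chart"},
        "audio_metadata": null,
        "text_metadata": null,
        "image_metadata": null,
        "table_metadata": {"table_format": "image"},
        "chart_metadata": null
    }
}
"""
record = json.loads(raw)
blocks = populated_blocks(record)
print(sorted(blocks))
```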