Source and Content Metadata Reference for NV-Ingest
This page is the reference for source and content metadata in NV-Ingest. It uses the following definitions:
- Source — The file that is ingested, and from which content and metadata are extracted.
- Content — Data extracted from a source, such as text or an image.
Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.
Source File Metadata
The following is the metadata for source files.
Field | Description | Method |
---|---|---|
Source Name | The name of the source file. | Extracted |
Source ID | The ID of the source file. | Extracted |
Source location | The URL, URI, or pointer to the storage location of the source file. | — |
Source Type | The type of the source file, such as pdf, docx, pptx, or txt. | Extracted |
Collection ID | The ID of the collection in which the source is contained. | — |
Date Created | The date the source was created. | Extracted |
Last Modified | The date the source was last modified. | Extracted |
Partition ID | The offset of this data fragment within a larger set of fragments. | Generated |
Access Level | The role-based access control for the source. | — |
Summary | A summary of the source. (Not yet implemented.) | Generated |
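
As an illustration of consuming these fields, the following is a minimal sketch that parses the two date fields into `datetime` objects. It assumes the JSON key names and ISO 8601 timestamp format shown in the example record at the end of this reference; `parse_timestamps` is a hypothetical helper, not part of NV-Ingest.

```python
from datetime import datetime

# Values copied from the example record at the end of this reference.
source_metadata = {
    "source_name": "data/multimodal_test.pdf",
    "source_type": "PDF",
    "date_created": "2025-03-13T18:37:14.715892",
    "last_modified": "2025-03-13T18:37:14.715534",
}


def parse_timestamps(meta):
    """Hypothetical helper: return a copy with the date fields parsed.

    Assumes the dates are ISO 8601 strings, as in the example record.
    Empty or missing values are left untouched.
    """
    parsed = dict(meta)
    for key in ("date_created", "last_modified"):
        value = parsed.get(key)
        if value:
            parsed[key] = datetime.fromisoformat(value)
    return parsed


parsed = parse_timestamps(source_metadata)
print(parsed["date_created"].date())
```

Parsing into `datetime` objects makes it straightforward to sort or filter sources by creation or modification time.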
Content Metadata
The following is the metadata for content. These fields apply to all content types including text, images, and tables.
Field | Description | Method |
---|---|---|
Type | The type of the content: Text, Image, Structured, Table, or Chart. | Generated |
Subtype | The type of the content for structured data types, such as table or chart. | — |
Content | Content extracted from the source. | Extracted |
Description | A text description of the content object. | Generated |
Page # | The page number of the content in the source. | Extracted |
Hierarchy | The location or order of the content within the source. | Extracted |
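
The Type and Subtype fields lend themselves to filtering extracted records. The following is a minimal sketch, assuming records shaped like the example at the end of this reference (lowercase `type` and `subtype` values under `metadata.content_metadata`). The `select_content` function and the second sample record are hypothetical and exist only for illustration.

```python
def select_content(records, content_type, subtype=None):
    """Hypothetical helper: keep records whose content_metadata matches.

    `subtype` is only checked when it is provided.
    """
    selected = []
    for record in records:
        # content_metadata may be missing or null, so fall back to {}.
        content_meta = record.get("metadata", {}).get("content_metadata") or {}
        if content_meta.get("type") != content_type:
            continue
        if subtype is not None and content_meta.get("subtype") != subtype:
            continue
        selected.append(record)
    return selected


records = [
    # Trimmed from the example record in this reference.
    {"metadata": {"content_metadata": {
        "type": "structured", "subtype": "chart", "page_number": 1}}},
    # Hypothetical text record, invented for this sketch.
    {"metadata": {"content_metadata": {
        "type": "text", "subtype": "", "page_number": 2}}},
]

charts = select_content(records, "structured", subtype="chart")
print(len(charts))
```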
Text Metadata
The following is the metadata for text.
Field | Description | Method |
---|---|---|
Text Type | The type of the text, such as header or body. | Extracted |
Keywords | Keywords, named entities, or other phrases. | Extracted |
Language | The language of the content. | Generated |
Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated |
Image Metadata
The following is the metadata for images.
Field | Description | Method |
---|---|---|
Image Type | The type of the image, such as structured, natural, or hybrid. | Generated (Classifier) |
Structured Image Type | The type of the content for structured data types, such as bar chart or pie chart. | Generated (Classifier) |
Caption | Any caption or subheading associated with the image. | Extracted |
Text | Text extracted from a structured chart. | Extracted |
Image location | The location (x,y) of the chart within the image. | Extracted |
Image location max dimensions | The max dimensions (x_max,y_max) of the location (x,y). | Extracted |
uploaded_image_uri | Mirrors source_metadata.source_location | — |
Table Metadata
The following is the metadata for tables within documents.
Warning
Tables should not be chunked.
Field | Description | Method |
---|---|---|
Table format | The format of the table content: structured (dataframe or lists of rows and columns), or serialized as markdown, html, latex, or simple (cells separated by spaces). | Extracted |
Table content | Extracted text content, formatted according to table_metadata.table_format. | Extracted |
Table location | The bounding box of the table. | Extracted |
Table location max dimensions | The max dimensions (x_max,y_max) of the bounding box of the table. | Extracted |
Caption | The caption for the table or chart. | Extracted |
Title | The title of the table. | Extracted |
Subtitle | The subtitle of the table. | Extracted |
Axis | Axis information for the table. | Extracted |
uploaded_image_uri | A mirror of source_metadata.source_location. | Generated |
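
The table location and its max dimensions can be combined to express the bounding box in page-relative coordinates. The following is a minimal sketch using the values from the example record at the end of this reference. It assumes the bounding box is ordered `[x0, y0, x1, y1]` and the max dimensions are `[x_max, y_max]`; that ordering is an assumption inferred from the example values, and `normalize_bbox` is a hypothetical helper.

```python
def normalize_bbox(location, max_dimensions):
    """Hypothetical helper: scale pixel coordinates to the 0..1 range.

    Assumes location is [x0, y0, x1, y1] and max_dimensions is
    [x_max, y_max]; this ordering is an assumption, not a documented fact.
    """
    x0, y0, x1, y1 = location
    x_max, y_max = max_dimensions
    if x_max <= 0 or y_max <= 0:
        raise ValueError("max dimensions must be positive")
    return [x0 / x_max, y0 / y_max, x1 / x_max, y1 / y_max]


# Values copied from the example record in this reference.
table_metadata = {
    "table_location": [74, 614, 728, 920],
    "table_location_max_dimensions": [792, 1024],
}

bbox = normalize_bbox(
    table_metadata["table_location"],
    table_metadata["table_location_max_dimensions"],
)
print([round(v, 3) for v in bbox])
```

Normalized coordinates stay comparable across pages rendered at different resolutions.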
Example Metadata
The following is an example JSON representation of metadata. This is an example only, and does not contain the full metadata. For the full file, refer to the data folder.
{
  "document_type": "text",
  "metadata": {
    "content": "TestingDocument...",
    "content_url": "",
    "source_metadata": {
      "source_name": "data/multimodal_test.pdf",
      "source_id": "data/multimodal_test.pdf",
      "source_location": "",
      "source_type": "PDF",
      "collection_id": "",
      "date_created": "2025-03-13T18:37:14.715892",
      "last_modified": "2025-03-13T18:37:14.715534",
      "summary": "",
      "partition_id": -1,
      "access_level": 1
    },
    "content_metadata": {
      "type": "structured",
      "description": "Structured chart extracted from PDF document.",
      "page_number": 1,
      "hierarchy": {
        "page_count": 3,
        "page": 1,
        "block": -1,
        "line": -1,
        "span": -1,
        "nearby_objects": {
          "text": {
            "content": [],
            "bbox": [],
            "type": []
          },
          "images": {
            "content": [],
            "bbox": [],
            "type": []
          },
          "structured": {
            "content": [],
            "bbox": [],
            "type": []
          }
        }
      },
      "subtype": "chart"
    },
    "audio_metadata": null,
    "text_metadata": null,
    "image_metadata": null,
    "table_metadata": {
      "caption": "",
      "table_format": "image",
      "table_content": "Below,is a high-quality picture of some shapes Picture",
      "table_content_format": "",
      "table_location": [74, 614, 728, 920],
      "table_location_max_dimensions": [792, 1024],
      "uploaded_image_uri": ""
    },
    "chart_metadata": null,
    "error_metadata": null,
    "info_message_metadata": null,
    "debug_metadata": null,
    "raise_on_failure": false
  }
}
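
In the example above, only some of the `*_metadata` blocks are populated and the rest are `null`. The following is a minimal sketch of loading such a record and listing the populated blocks. The record is trimmed from the example, and `populated_metadata` is a hypothetical helper rather than an NV-Ingest API.

```python
import json

# Trimmed copy of the example record above; most fields are omitted.
record = json.loads("""
{
  "document_type": "text",
  "metadata": {
    "content": "TestingDocument...",
    "source_metadata": {"source_name": "data/multimodal_test.pdf"},
    "content_metadata": {"type": "structured", "subtype": "chart", "page_number": 1},
    "audio_metadata": null,
    "text_metadata": null,
    "image_metadata": null,
    "table_metadata": {"table_format": "image"},
    "chart_metadata": null,
    "raise_on_failure": false
  }
}
""")


def populated_metadata(rec):
    """Hypothetical helper: names of the *_metadata blocks that are not null."""
    metadata = rec.get("metadata", {})
    return [
        key
        for key, value in metadata.items()
        if key.endswith("_metadata") and value is not None
    ]


blocks = populated_metadata(record)
print(blocks)
```

Checking which blocks are populated before reading them avoids errors on `null` entries such as `image_metadata` in this record.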