Metadata Reference for NeMo Retriever Extraction

This reference describes the metadata used in NeMo Retriever extraction. It uses the following definitions:

  • Source: The file that is ingested, and from which content and metadata are extracted.
  • Content: Data extracted from a source, such as text or an image.

Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.

Note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.

Source File Metadata

The following is the metadata for source files.

| Field | Description | Method |
| --- | --- | --- |
| Source Name | The name of the source file. | Extracted |
| Source ID | The ID of the source file. | Extracted |
| Source location | The URL, URI, or pointer to the storage location of the source file. | |
| Source Type | The type of the source file, such as pdf, docx, pptx, or txt. | Extracted |
| Collection ID | The ID of the collection in which the source is contained. | |
| Date Created | The date the source was created. | Extracted |
| Last Modified | The date the source was last modified. | Extracted |
| Partition ID | The offset of this data fragment within a larger set of fragments. | Generated |
| Access Level | The role-based access control for the source. | |
| Summary | A summary of the source. (Not yet implemented.) | Generated |

Content Metadata

The following is the metadata for content. These fields apply to all content types, including text, images, and tables.

| Field | Description | Method |
| --- | --- | --- |
| Type | The type of the content: Text, Image, Structured, Table, or Chart. | Generated |
| Subtype | The type of the content for structured data types, such as table or chart. | |
| Content | Content extracted from the source. | Extracted |
| Description | A text description of the content object. | Generated |
| Page # | The page number of the content in the source. Prior to 26.1.1, this field was 0-indexed. Beginning with 26.1.1, this field is 1-indexed. | Extracted |
| Hierarchy | The location or order of the content within the source. | Extracted |
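
Because the Page # indexing changed in 26.1.1, a consumer that mixes records from older and newer releases may want to normalize page numbers first. The following is a minimal sketch, assuming the caller already knows which convention a record uses. The helper name to_one_indexed and the zero_indexed flag are hypothetical and are not part of NeMo Retriever extraction. Negative values are passed through on the assumption that they mean "not set" (the schema default for page_number is -1).

```python
def to_one_indexed(page_number: int, zero_indexed: bool) -> int:
    """Return a 1-indexed page number.

    zero_indexed=True means the record was produced before 26.1.1.
    Negative values are returned unchanged (assumed to mean "not set").
    """
    if page_number < 0:
        return page_number
    return page_number + 1 if zero_indexed else page_number


print(to_one_indexed(0, zero_indexed=True))   # first page, old convention
print(to_one_indexed(1, zero_indexed=False))  # first page, new convention
```

Both calls describe the first page of a document and yield the same 1-indexed value.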

Text Metadata

The following is the metadata for text.

| Field | Description | Method |
| --- | --- | --- |
| Text Type | The type of the text, such as header or body. | Extracted |
| Keywords | Keywords, Named Entities, or other phrases. | Extracted |
| Language | The language of the content. | Generated |
| Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated |

Image Metadata

The following is the metadata for images.

| Field | Description | Method |
| --- | --- | --- |
| Image Type | The type of the image, such as structured, natural, hybrid, and others. | Generated (Classifier) |
| Structured Image Type | The type of the content for structured data types, such as bar chart, pie chart, and others. | Generated (Classifier) |
| Caption | Any caption or subheading associated with the image. | Extracted |
| Text | Text extracted from a structured chart. | Extracted |
| Image location | The location (x,y) of the chart within an image. | Extracted |
| Image location max dimensions | The max dimensions (x_max,y_max) of the location (x,y). | Extracted |
| uploaded_image_uri | Mirrors source_metadata.source_location. | |

Table Metadata

The following is the metadata for tables within documents.

Warning

Tables should not be chunked.

| Field | Description | Method |
| --- | --- | --- |
| Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, or simple (cells separated by spaces). | Extracted |
| Table content | Extracted text content, formatted according to table_metadata.table_format. | Extracted |
| Table location | The bounding box of the table. | Extracted |
| Table location max dimensions | The max dimensions (x_max,y_max) of the bounding box of the table. | Extracted |
| Caption | The caption for the table or chart. | Extracted |
| Title | The title of the table. | Extracted |
| Subtitle | The subtitle of the table. | Extracted |
| Axis | Axis information for the table. | Extracted |
| uploaded_image_uri | A mirror of source_metadata.source_location. | Generated |

Metadata Schema Documentation

The following is a detailed explanation of the MetadataSchema and its constituent sub-schemas used within the NVIDIA Ingest Framework. This schema defines the structure for metadata associated with ingested content.

MetadataSchema

The MetadataSchema is the primary container for all metadata. It includes the core content, its URL, embedding, and various specialized metadata blocks.

| Field | Type | Default Value/Behavior | Description |
| --- | --- | --- | --- |
| content | str | "" | The actual textual content extracted from the source. |
| content_url | str | "" | URL pointing to the location of the content, if applicable. |
| embedding | Optional[List[float]] | None | Optional numerical vector representation (embedding) of the content. |
| source_metadata | Optional[SourceMetadataSchema] | None | Metadata about the original source of the content. See SourceMetadataSchema. |
| content_metadata | Optional[ContentMetadataSchema] | None | General metadata about the extracted content itself. See ContentMetadataSchema. |
| audio_metadata | Optional[AudioMetadataSchema] | None | Specific metadata for audio content. Automatically set to None if content_metadata.type is not AUDIO. See AudioMetadataSchema. |
| text_metadata | Optional[TextMetadataSchema] | None | Specific metadata for text content. Automatically set to None if content_metadata.type is not TEXT. See TextMetadataSchema. |
| image_metadata | Optional[ImageMetadataSchema] | None | Specific metadata for image content. Automatically set to None if content_metadata.type is not IMAGE. See ImageMetadataSchema. |
| table_metadata | Optional[TableMetadataSchema] | None | Specific metadata for tabular content. Automatically set to None if content_metadata.type is not STRUCTURED. See TableMetadataSchema. |
| chart_metadata | Optional[ChartMetadataSchema] | None | Specific metadata for chart content. See ChartMetadataSchema. |
| error_metadata | Optional[ErrorMetadataSchema] | None | Metadata describing any errors encountered during processing. See ErrorMetadataSchema. |
| info_message_metadata | Optional[InfoMessageMetadataSchema] | None | Informational messages related to the processing. See InfoMessageMetadataSchema. |
| debug_metadata | Optional[Dict[str, Any]] | None | A dictionary for storing any arbitrary debug information. |
| raise_on_failure | bool | False | If True, indicates that processing should halt on failure. |

Note: A model_validator ensures that type-specific metadata fields (audio_metadata, image_metadata, text_metadata, table_metadata) are set to None if the content_metadata.type does not match the respective content type.
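
The effect of that validator can be illustrated on plain dictionaries. The following is a sketch only, not the library's implementation: the function name clear_mismatched_metadata, the mapping TYPE_SPECIFIC_FIELDS, and the lowercase type strings are assumptions made for illustration.

```python
from typing import Any, Dict

# Assumption: each type-specific block is paired with one content type,
# as described in the MetadataSchema table. The lowercase strings are
# illustrative stand-ins for ContentTypeEnum values.
TYPE_SPECIFIC_FIELDS = {
    "audio_metadata": "audio",
    "text_metadata": "text",
    "image_metadata": "image",
    "table_metadata": "structured",
}


def clear_mismatched_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy with non-matching type-specific blocks set to None."""
    content_metadata = metadata.get("content_metadata") or {}
    content_type = str(content_metadata.get("type", "")).lower()
    cleaned = dict(metadata)
    for field_name, expected_type in TYPE_SPECIFIC_FIELDS.items():
        if content_type != expected_type:
            cleaned[field_name] = None
    return cleaned


record = {
    "content_metadata": {"type": "structured"},
    "text_metadata": {"text_type": "document"},
    "table_metadata": {"table_format": "image"},
}
cleaned = clear_mismatched_metadata(record)
print(cleaned["text_metadata"])   # None, because the type is not text
print(cleaned["table_metadata"])  # kept, because the type is structured
```

In this sketch chart_metadata is left untouched, because the note does not list it among the fields that the validator clears.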

SourceMetadataSchema

Describes the origin of the ingested content.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| source_name | str | Required | Name of the source (e.g., filename, URL). |
| source_id | str | Required | Unique identifier for the source. |
| source_location | str | "" | Physical or logical location of the source (e.g., path, database table). |
| source_type | Union[DocumentTypeEnum, str] | Required | Type of the source document (e.g., pdf, docx, url). Uses DocumentTypeEnum. |
| collection_id | str | "" | Identifier for any collection this source belongs to. |
| date_created | str | datetime.now().isoformat() | ISO 8601 timestamp of when the source was created. Validated to be in ISO 8601 format. |
| last_modified | str | datetime.now().isoformat() | ISO 8601 timestamp of when the source was last modified. Validated to be in ISO 8601 format. |
| summary | str | "" | A brief summary of the source content. |
| partition_id | int | -1 | Identifier for a partition if the source is part of a larger, partitioned dataset. |
| access_level | Union[AccessLevelEnum, int] | AccessLevelEnum.UNKNOWN | Access level associated with the source. Uses AccessLevelEnum. |
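
Because date_created and last_modified are validated as ISO 8601 strings, it can be useful to check a timestamp before building a record. The following is a rough sketch that uses only the Python standard library. The helper is_iso8601 is hypothetical, and datetime.fromisoformat may accept a different subset of ISO 8601 than the schema's own validator does.

```python
from datetime import datetime


def is_iso8601(value: str) -> bool:
    """Return True if datetime.fromisoformat accepts the string."""
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


# Same expression as the documented default for date_created and last_modified.
default_timestamp = datetime.now().isoformat()

print(is_iso8601(default_timestamp))
print(is_iso8601("2025-03-13T18:37:14.715892"))
print(is_iso8601("13/03/2025"))
```

The first two strings are accepted. The third uses a day/month/year layout and is rejected.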

ContentMetadataSchema

General metadata about the extracted content.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| type | ContentTypeEnum | Required | The type of the extracted content (e.g., TEXT, IMAGE, AUDIO). Uses ContentTypeEnum. |
| description | str | "" | A description of the extracted content. |
| page_number | int | -1 | Page number from which the content was extracted, if applicable (e.g., for PDFs). |
| hierarchy | ContentHierarchySchema | ContentHierarchySchema() | Hierarchical information about the content's location within the source. See ContentHierarchySchema. |
| subtype | Union[ContentTypeEnum, str] | "" | A more specific subtype for the content (e.g., if type is IMAGE, subtype could be diagram). |
| start_time | int | -1 | Start time in milliseconds for time-based media (e.g., audio, video). |
| end_time | int | -1 | End time in milliseconds for time-based media. |
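
Because start_time and end_time are expressed in milliseconds and default to -1, code that computes a segment length needs to handle the unset case. The following is a small sketch. The helper duration_seconds is hypothetical, and treating any negative value as "not set" is an assumption based on the -1 default.

```python
from typing import Optional


def duration_seconds(start_time: int, end_time: int) -> Optional[float]:
    """Return the segment length in seconds, or None if either bound is unset."""
    if start_time < 0 or end_time < 0:
        return None
    return (end_time - start_time) / 1000.0


print(duration_seconds(1500, 4000))  # 2.5
print(duration_seconds(-1, -1))      # None
```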

ContentHierarchySchema

Describes the structural location of content within a document.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| page_count | int | -1 | Total number of pages in the document, if applicable. |
| page | int | -1 | The specific page number where the content resides. |
| block | int | -1 | Identifier for a block of content (e.g., paragraph, section). |
| line | int | -1 | Line number within a block, if applicable. |
| span | int | -1 | Span identifier within a line, for finer granularity. |
| nearby_objects | NearbyObjectsSchema | NearbyObjectsSchema() | Information about objects (text, images, structured data) near the current content. See NearbyObjectsSchema. |

NearbyObjectsSchema (Currently Unused)

Container for different types of nearby objects.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| text | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby textual objects. See NearbyObjectsSubSchema. |
| images | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby image objects. |
| structured | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby structured data objects (e.g., tables). |

NearbyObjectsSubSchema

Describes a list of nearby objects of a specific type.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| content | List[str] | default_factory=list | List of content strings for the nearby objects. |
| bbox | List[tuple] | default_factory=list | List of bounding boxes (e.g., coordinates) for the nearby objects. |
| type | List[str] | default_factory=list | List of types for the nearby objects. |
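
A default of default_factory=list means that every instance starts with its own empty list instead of sharing one list object. The following sketch shows the same idea with a standard-library dataclass. NearbyObjectsSketch is a hypothetical stand-in written for illustration and is not the NearbyObjectsSubSchema class itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NearbyObjectsSketch:
    """Hypothetical stand-in for NearbyObjectsSubSchema, stdlib only."""

    content: List[str] = field(default_factory=list)
    bbox: List[tuple] = field(default_factory=list)
    type: List[str] = field(default_factory=list)


first = NearbyObjectsSketch()
second = NearbyObjectsSketch()
first.content.append("Figure 1")

print(first.content)   # ['Figure 1']
print(second.content)  # [] (each instance gets its own list)
```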

TextMetadataSchema

Specific metadata for textual content.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| text_type | TextTypeEnum | Required | Type of text (e.g., document, title, ocr). Uses TextTypeEnum. |
| summary | str | "" | A summary of this specific text segment. |
| keywords | Union[str, List[str], Dict] | "" | Keywords extracted from or associated with the text. Can be a single string, list of strings, or a dictionary. |
| language | LanguageEnum | "en" | Detected or specified language of the text. Uses LanguageEnum. Defaults to English. |
| text_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the text within its source (e.g., on a page). |
| text_location_max_dimensions | tuple | (0, 0, 0, 0) | Maximum dimensions of the space where text_location is defined (e.g., page width/height). |

ImageMetadataSchema

Specific metadata for image content.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| image_type | Union[DocumentTypeEnum, str] | Required | Type of the image document (e.g., png, jpeg). Uses DocumentTypeEnum or a string. |
| structured_image_type | ContentTypeEnum | ContentTypeEnum.NONE | If the image represents structured data (e.g., a table or chart), its ContentTypeEnum. |
| caption | str | "" | Caption associated with the image. |
| text | str | "" | Text extracted from the image (e.g., via OCR). |
| image_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the image within its source. |
| image_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where image_location is defined. |
| uploaded_image_url | str | "" | URL of the image if it has been uploaded to a separate storage location. |
| width | int | 0 | Width of the image in pixels. Clamped to be non-negative. |
| height | int | 0 | Height of the image in pixels. Clamped to be non-negative. |

TableMetadataSchema

Specific metadata for tabular content.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| caption | str | "" | Caption associated with the table. |
| table_format | TableFormatEnum | Required | Format of the table (e.g., csv, html). Uses TableFormatEnum. |
| table_content | str | "" | String representation of the table's content (e.g., CSV string, HTML markup). |
| table_content_format | Union[TableFormatEnum, str] | "" | Specific format of table_content. |
| table_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the table within its source. |
| table_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where table_location is defined. |
| uploaded_image_uri | str | "" | URI of an image representation of the table, if applicable. |
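
Together, table_location and table_location_max_dimensions allow a bounding box to be expressed relative to the page size. The following is a minimal sketch. The helper normalize_bbox is hypothetical, and it assumes that the location is ordered (x_min, y_min, x_max, y_max) and that the max dimensions are (width, height). This reference does not state either ordering explicitly.

```python
from typing import Sequence, Tuple


def normalize_bbox(
    location: Sequence[float], max_dimensions: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Scale a bounding box to the 0..1 range.

    Assumes location is (x_min, y_min, x_max, y_max) and
    max_dimensions is (width, height); both orderings are assumptions.
    """
    x_min, y_min, x_max, y_max = location
    width, height = max_dimensions
    if width <= 0 or height <= 0:
        # The schema default of (0, 0) cannot be used as a divisor.
        raise ValueError("max_dimensions must be positive")
    return (x_min / width, y_min / height, x_max / width, y_max / height)


# Values taken from the example metadata in this reference.
print(normalize_bbox([74, 614, 728, 920], [792, 1024]))
```

The sketch raises an error for the default (0, 0) dimensions rather than guessing a page size.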

ChartMetadataSchema

Metadata for table content extracted from charts.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| caption | str | "" | Caption associated with the chart. |
| table_format | TableFormatEnum | Required | Underlying data format of the chart (e.g., data might be in csv format). Uses TableFormatEnum. |
| table_content | str | "" | String representation of the chart's underlying data. |
| table_content_format | Union[TableFormatEnum, str] | "" | Specific format of table_content. |
| table_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the chart within its source. |
| table_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where table_location is defined. |
| uploaded_image_uri | str | "" | URI of an image representation of the chart, if applicable. |

AudioMetadataSchema

Specific metadata for audio content.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| audio_transcript | str | "" | Transcript of the audio content. |
| audio_type | str | "" | Type or format of the audio (e.g., mp3, wav). |

ErrorMetadataSchema (Currently Unused)

Metadata describing errors encountered during processing.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| task | TaskTypeEnum | Required | The task that was being performed when the error occurred. Uses TaskTypeEnum. |
| status | StatusEnum | Required | The status indicating failure. Uses StatusEnum. |
| source_id | str | "" | Identifier of the source item that caused the error, if applicable. |
| error_msg | str | Required | The error message. |

InfoMessageMetadataSchema (Currently Unused)

Informational messages related to processing.

| Field | Type | Default Value | Description |
| --- | --- | --- | --- |
| task | TaskTypeEnum | Required | The task associated with this informational message. Uses TaskTypeEnum. |
| status | StatusEnum | Required | The status related to this message (e.g., INFO, WARNING). Uses StatusEnum. |
| message | str | Required | The informational message content. |
| filter | bool | Required | A flag indicating if this message should be used for filtering purposes. |

Enums

The following enums are used by this schema:

  • AccessLevelEnum – Defines access levels (e.g., PUBLIC, CONFIDENTIAL, UNKNOWN).
  • ContentTypeEnum – Defines types of content (e.g., TEXT, IMAGE, AUDIO, STRUCTURED, NONE).
  • TextTypeEnum – Defines types of text (e.g., DOCUMENT, TITLE, OCR, CAPTION).
  • LanguageEnum – Defines languages (e.g., ENGLISH (en), SPANISH (es)).
  • TableFormatEnum – Defines table formats (e.g., CSV, HTML, TEXT).
  • StatusEnum – Defines processing statuses (e.g., SUCCESS, FAILURE, PROCESSING, INFO, WARNING).
  • DocumentTypeEnum – Defines types of source documents (e.g., PDF, DOCX, TXT, URL, PNG, MP3).
  • TaskTypeEnum – Defines types of processing tasks (e.g., EXTRACTION, EMBEDDING, STORAGE).

Example Metadata

The following is an example JSON representation of metadata. This is an example only, and does not contain the full metadata. For the full file, refer to the data folder.

{
    "document_type": "text",
    "metadata": 
    {
        "content": "TestingDocument...",
        "content_url": "",
        "source_metadata": 
        {
            "source_name": "data/multimodal_test.pdf",
            "source_id": "data/multimodal_test.pdf",
            "source_location": "",
            "source_type": "PDF",
            "collection_id": "",
            "date_created": "2025-03-13T18:37:14.715892",
            "last_modified": "2025-03-13T18:37:14.715534",
            "summary": "",
            "partition_id": -1,
            "access_level": 1
        },
        "content_metadata": 
        {
            "type": "structured",
            "description": "Structured chart extracted from PDF document.",
            "page_number": 1,
            "hierarchy": 
            {
                "page_count": 3,
                "page": 1,
                "block": -1,
                "line": -1,
                "span": -1,
                "nearby_objects": 
                {
                    "text": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    },
                    "images": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    },
                    "structured": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    }
                }
            },
            "subtype": "chart"
        },
        "audio_metadata": null,
        "text_metadata": null,
        "image_metadata": null,
        "table_metadata": 
        {
            "caption": "",
            "table_format": "image",
            "table_content": "Below,is a high-quality picture of some shapes          Picture",
            "table_content_format": "",
            "table_location": 
            [
                74,
                614,
                728,
                920
            ],
            "table_location_max_dimensions": 
            [
                792,
                1024
            ],
            "uploaded_image_uri": ""
        },
        "chart_metadata": null,
        "error_metadata": null,
        "info_message_metadata": null,
        "debug_metadata": null,
        "raise_on_failure": false
    }
}
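
A record shaped like the example above can be read with the standard json module. The following sketch parses an abbreviated copy of the example and lists which type-specific blocks are populated. It assumes only the field names shown in this reference, and most fields are omitted for brevity.

```python
import json

# Abbreviated copy of the example record; most fields are omitted.
raw = """
{
    "document_type": "text",
    "metadata": {
        "content": "TestingDocument...",
        "content_metadata": {"type": "structured", "page_number": 1, "subtype": "chart"},
        "audio_metadata": null,
        "text_metadata": null,
        "image_metadata": null,
        "table_metadata": {"table_format": "image", "table_location": [74, 614, 728, 920]},
        "chart_metadata": null
    }
}
"""

record = json.loads(raw)
metadata = record["metadata"]

# Collect the type-specific blocks that are not null.
populated = [
    name
    for name in ("audio_metadata", "text_metadata", "image_metadata",
                 "table_metadata", "chart_metadata")
    if metadata.get(name) is not None
]

print(metadata["content_metadata"]["type"])     # structured
print(metadata["content_metadata"]["subtype"])  # chart
print(populated)                                # ['table_metadata']
```

Only table_metadata is populated here, which matches the structured content type of the record.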