Metadata Reference for NeMo Retriever Extraction

This page is the reference for the metadata used in NeMo Retriever extraction. It uses the following definitions:

Source — The file that is ingested, and from which content and metadata are extracted.
Content — Data extracted from a source, such as text or an image.

Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.

Note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.

Source File Metadata

The following is the metadata for source files.

Field	Description	Method
Source Name	The name of the source file.	Extracted
Source ID	The ID of the source file.	Extracted
Source location	The URL, URI, or pointer to the storage location of the source file.	—
Source Type	The type of the source file, such as pdf, docx, pptx, or txt.	Extracted
Collection ID	The ID of the collection in which the source is contained.	—
Date Created	The date the source was created.	Extracted
Last Modified	The date the source was last modified.	Extracted
Partition ID	The offset of this data fragment within a larger set of fragments.	Generated
Access Level	The role-based access control for the source.	—
Summary	A summary of the source. (Not yet implemented.)	Generated

Content Metadata

The following is the metadata for content. These fields apply to all content types including text, images, and tables.

Field	Description	Method
Type	The type of the content: Text, Image, Structured, Table, or Chart.	Generated
Subtype	The type of the content for structured data types, such as table or chart.	—
Content	Content extracted from the source.	Extracted
Description	A text description of the content object.	Generated
Page #	The page number of the content in the source. Prior to 26.1.1, this field was 0-indexed. Beginning with 26.1.1, this field is 1-indexed.	Extracted
Hierarchy	The location or order of the content within the source.	Extracted
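The indexing change for the page number field matters when records written by different releases are mixed. The following is a minimal sketch of one way to normalize them, not part of the library. The function name and the `zero_indexed` flag are hypothetical; the caller is assumed to know whether a record was produced before 26.1.1.

```python
def to_one_indexed(page_number: int, zero_indexed: bool) -> int:
    """Return a 1-indexed page number.

    `zero_indexed` is a hypothetical flag supplied by the caller:
    True for records written before 26.1.1, False for 26.1.1 and later.
    Negative values (for example, an unset page number of -1) are
    returned unchanged.
    """
    if page_number < 0:
        return page_number
    return page_number + 1 if zero_indexed else page_number
```

With this helper, page `0` from an older record and page `1` from a newer record both map to page `1`.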

Text Metadata

The following is the metadata for text.

Field	Description	Method
Text Type	The type of the text, such as header or body.	Extracted
Keywords	Keywords, Named Entities, or other phrases.	Extracted
Language	The language of the content.	Generated
Summary	An abbreviated summary of the content. (Not yet implemented.)	Generated

Image Metadata

The following is the metadata for images.

Field	Description	Method
Image Type	The type of the image, such as structured, natural, or hybrid.	Generated (Classifier)
Structured Image Type	The type of the content for structured data types, such as bar chart or pie chart.	Generated (Classifier)
Caption	Any caption or subheading associated with the image.	Extracted
Text	Text extracted from a structured chart.	Extracted
Image location	The location (x,y) of the chart within an image.	Extracted
Image location max dimensions	The max dimensions (x_max,y_max) of the location (x,y).	Extracted
uploaded_image_uri	A mirror of source_metadata.source_location.	—

Table Metadata

The following is the metadata for tables within documents.

Warning

Tables should not be chunked.

Field	Description	Method
Table format	Structured (dataframe or lists of rows and columns), or serialized as markdown, html, latex, or simple (cells separated by spaces).	Extracted
Table content	Extracted text content, formatted according to table_metadata.table_format.	Extracted
Table location	The bounding box of the table.	Extracted
Table location max dimensions	The max dimensions (x_max,y_max) of the bounding box of the table.	Extracted
Caption	The caption for the table or chart.	Extracted
Title	The title of the table.	Extracted
Subtitle	The subtitle of the table.	Extracted
Axis	Axis information for the table.	Extracted
uploaded_image_uri	A mirror of source_metadata.source_location.	Generated

Metadata Schema Documentation

The following is a detailed explanation of the MetadataSchema and its constituent sub-schemas used within the NVIDIA Ingest Framework. This schema defines the structure for metadata associated with ingested content.

MetadataSchema

The MetadataSchema is the primary container for all metadata. It includes the core content, its URL, embedding, and various specialized metadata blocks.

Field	Type	Default Value/Behavior	Description
`content`	`str`	`""`	The actual textual content extracted from the source.
`content_url`	`str`	`""`	URL pointing to the location of the content, if applicable.
`embedding`	`Optional[List[float]]`	`None`	Optional numerical vector representation (embedding) of the content.
`source_metadata`	`Optional[SourceMetadataSchema]`	`None`	Metadata about the original source of the content. See SourceMetadataSchema.
`content_metadata`	`Optional[ContentMetadataSchema]`	`None`	General metadata about the extracted content itself. See ContentMetadataSchema.
`audio_metadata`	`Optional[AudioMetadataSchema]`	`None`	Specific metadata for audio content. Automatically set to `None` if `content_metadata.type` is not `AUDIO`. See AudioMetadataSchema.
`text_metadata`	`Optional[TextMetadataSchema]`	`None`	Specific metadata for text content. Automatically set to `None` if `content_metadata.type` is not `TEXT`. See TextMetadataSchema.
`image_metadata`	`Optional[ImageMetadataSchema]`	`None`	Specific metadata for image content. Automatically set to `None` if `content_metadata.type` is not `IMAGE`. See ImageMetadataSchema.
`table_metadata`	`Optional[TableMetadataSchema]`	`None`	Specific metadata for tabular content. Automatically set to `None` if `content_metadata.type` is not `STRUCTURED`. See TableMetadataSchema.
`chart_metadata`	`Optional[ChartMetadataSchema]`	`None`	Specific metadata for chart content. See ChartMetadataSchema.
`error_metadata`	`Optional[ErrorMetadataSchema]`	`None`	Metadata describing any errors encountered during processing. See ErrorMetadataSchema.
`info_message_metadata`	`Optional[InfoMessageMetadataSchema]`	`None`	Informational messages related to the processing. See InfoMessageMetadataSchema.
`debug_metadata`	`Optional[Dict[str, Any]]`	`None`	A dictionary for storing any arbitrary debug information.
`raise_on_failure`	`bool`	`False`	If `True`, indicates that processing should halt on failure.

Note: A model_validator ensures that type-specific metadata fields (audio_metadata, image_metadata, text_metadata, table_metadata) are set to None if the content_metadata.type does not match the respective content type.
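The behavior described in the note above can be illustrated without the library. The following is a rough sketch that uses plain dictionaries and only the standard library; it is not the actual validator. The function name, the `REQUIRED_TYPE` mapping, and the lowercase type strings (taken from the example record later on this page) are assumptions for illustration.

```python
from typing import Any, Dict

# Type-specific field -> content type it requires, following the table above.
# Lowercase string values are an assumption based on the example record.
REQUIRED_TYPE = {
    "audio_metadata": "audio",
    "text_metadata": "text",
    "image_metadata": "image",
    "table_metadata": "structured",
}


def clear_mismatched_metadata(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with non-matching type-specific blocks set to None."""
    content_metadata = record.get("content_metadata") or {}
    content_type = str(content_metadata.get("type", "")).lower()

    cleaned = dict(record)  # shallow copy, so the input is not modified
    for field_name, required in REQUIRED_TYPE.items():
        if content_type != required:
            cleaned[field_name] = None
    return cleaned
```

For a record whose `content_metadata.type` is `structured`, only `table_metadata` survives; `audio_metadata`, `text_metadata`, and `image_metadata` come back as `None`.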

`SourceMetadataSchema`

Describes the origin of the ingested content.

Field	Type	Default Value	Description
`source_name`	`str`	Required	Name of the source (e.g., filename, URL).
`source_id`	`str`	Required	Unique identifier for the source.
`source_location`	`str`	`""`	Physical or logical location of the source (e.g., path, database table).
`source_type`	`Union[DocumentTypeEnum, str]`	Required	Type of the source document (e.g., `pdf`, `docx`, `url`). Uses `DocumentTypeEnum`.
`collection_id`	`str`	`""`	Identifier for any collection this source belongs to.
`date_created`	`str`	`datetime.now().isoformat()`	ISO 8601 timestamp of when the source was created. Validated to be in ISO 8601 format.
`last_modified`	`str`	`datetime.now().isoformat()`	ISO 8601 timestamp of when the source was last modified. Validated to be in ISO 8601 format.
`summary`	`str`	`""`	A brief summary of the source content.
`partition_id`	`int`	`-1`	Identifier for a partition if the source is part of a larger, partitioned dataset.
`access_level`	`Union[AccessLevelEnum, int]`	`AccessLevelEnum.UNKNOWN`	Access level associated with the source. Uses `AccessLevelEnum`.
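The table above states that `date_created` and `last_modified` are validated as ISO 8601 strings. The following sketch shows one way such a check could be written with the standard library. It is an assumption, not the library's validator: the function name is hypothetical, and `datetime.fromisoformat` accepts only a subset of ISO 8601 on older Python versions.

```python
from datetime import datetime


def validate_iso_8601(value: str) -> str:
    """Return `value` unchanged if it parses as an ISO 8601 timestamp."""
    try:
        datetime.fromisoformat(value)
    except ValueError as exc:
        raise ValueError(f"not an ISO 8601 timestamp: {value!r}") from exc
    return value


# The default described in the table, datetime.now().isoformat(), passes the check.
default_timestamp = validate_iso_8601(datetime.now().isoformat())
```

A value such as `2025-03-13T18:37:14.715892` passes, while a value such as `13/03/2025` raises `ValueError`.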

`ContentMetadataSchema`

General metadata about the extracted content.

Field	Type	Default Value	Description
`type`	`ContentTypeEnum`	Required	The type of the extracted content (e.g., `TEXT`, `IMAGE`, `AUDIO`). Uses `ContentTypeEnum`.
`description`	`str`	`""`	A description of the extracted content.
`page_number`	`int`	`-1`	Page number from which the content was extracted, if applicable (e.g., for PDFs).
`hierarchy`	`ContentHierarchySchema`	`ContentHierarchySchema()`	Hierarchical information about the content's location within the source. See ContentHierarchySchema.
`subtype`	`Union[ContentTypeEnum, str]`	`""`	A more specific subtype for the content (e.g., if `type` is `IMAGE`, `subtype` could be `diagram`).
`start_time`	`int`	`-1`	Start time in milliseconds for time-based media (e.g., audio, video).
`end_time`	`int`	`-1`	End time in milliseconds for time-based media.
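Because `start_time` and `end_time` are integers in milliseconds with a default of `-1`, consumers may want to guard against unset values before computing a duration. The helper below is a hypothetical sketch; treating any negative value as "not set" is an assumption based on the defaults in the table above.

```python
from typing import Optional


def duration_ms(start_time: int, end_time: int) -> Optional[int]:
    """Return the segment length in milliseconds, or None if it cannot be computed."""
    # Negative values are assumed to mean "not set" (the default is -1).
    if start_time < 0 or end_time < 0 or end_time < start_time:
        return None
    return end_time - start_time
```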

`ContentHierarchySchema`

Describes the structural location of content within a document.

Field	Type	Default Value	Description
`page_count`	`int`	`-1`	Total number of pages in the document, if applicable.
`page`	`int`	`-1`	The specific page number where the content resides.
`block`	`int`	`-1`	Identifier for a block of content (e.g., paragraph, section).
`line`	`int`	`-1`	Line number within a block, if applicable.
`span`	`int`	`-1`	Span identifier within a line, for finer granularity.
`nearby_objects`	`NearbyObjectsSchema`	`NearbyObjectsSchema()`	Information about objects (text, images, structured data) near the current content. See NearbyObjectsSchema.

`NearbyObjectsSchema` (Currently Unused)

Container for different types of nearby objects.

Field	Type	Default Value	Description
`text`	`NearbyObjectsSubSchema`	`NearbyObjectsSubSchema()`	Nearby textual objects. See NearbyObjectsSubSchema.
`images`	`NearbyObjectsSubSchema`	`NearbyObjectsSubSchema()`	Nearby image objects.
`structured`	`NearbyObjectsSubSchema`	`NearbyObjectsSubSchema()`	Nearby structured data objects (e.g., tables).

`NearbyObjectsSubSchema`

Describes a list of nearby objects of a specific type.

Field	Type	Default Value	Description
`content`	`List[str]`	`default_factory=list`	List of content strings for the nearby objects.
`bbox`	`List[tuple]`	`default_factory=list`	List of bounding boxes (e.g., coordinates) for the nearby objects.
`type`	`List[str]`	`default_factory=list`	List of types for the nearby objects.
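The `default_factory=list` default in the table above means that each instance gets its own empty list rather than sharing one mutable default. The sketch below demonstrates the same idea with a standard-library dataclass; the class name is hypothetical and stands in for the real schema, which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NearbyObjectsSub:
    # Each field gets a fresh list per instance via default_factory.
    content: List[str] = field(default_factory=list)
    bbox: List[tuple] = field(default_factory=list)
    type: List[str] = field(default_factory=list)


a = NearbyObjectsSub()
b = NearbyObjectsSub()
a.content.append("Figure 1")  # only affects `a`; `b.content` stays empty
```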

`TextMetadataSchema`

Specific metadata for textual content.

Field	Type	Default Value	Description
`text_type`	`TextTypeEnum`	Required	Type of text (e.g., `document`, `title`, `ocr`). Uses `TextTypeEnum`.
`summary`	`str`	`""`	A summary of this specific text segment.
`keywords`	`Union[str, List[str], Dict]`	`""`	Keywords extracted from or associated with the text. Can be a single string, list of strings, or a dictionary.
`language`	`LanguageEnum`	`"en"`	Detected or specified language of the text. Uses `LanguageEnum`. Defaults to English.
`text_location`	`tuple`	`(0, 0, 0, 0)`	Bounding box or coordinates of the text within its source (e.g., on a page).
`text_location_max_dimensions`	`tuple`	`(0, 0, 0, 0)`	Maximum dimensions of the space where `text_location` is defined (e.g., page width/height).

`ImageMetadataSchema`

Specific metadata for image content.

Field	Type	Default Value	Description
`image_type`	`Union[DocumentTypeEnum, str]`	Required	Type of the image document (e.g., `png`, `jpeg`). Uses `DocumentTypeEnum` or a string.
`structured_image_type`	`ContentTypeEnum`	`ContentTypeEnum.NONE`	If the image represents structured data (e.g., a table or chart), its `ContentTypeEnum`.
`caption`	`str`	`""`	Caption associated with the image.
`text`	`str`	`""`	Text extracted from the image (e.g., via OCR).
`image_location`	`tuple`	`(0, 0, 0, 0)`	Bounding box or coordinates of the image within its source.
`image_location_max_dimensions`	`tuple`	`(0, 0)`	Maximum dimensions of the space where `image_location` is defined.
`uploaded_image_url`	`str`	`""`	URL of the image if it has been uploaded to a separate storage location.
`width`	`int`	`0`	Width of the image in pixels. Clamped to be non-negative.
`height`	`int`	`0`	Height of the image in pixels. Clamped to be non-negative.
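The "clamped to be non-negative" behavior for `width` and `height` can be pictured as follows. This is a standard-library sketch under the assumption that negative inputs are replaced with `0`; the class name is hypothetical and the real schema may implement the clamp differently.

```python
from dataclasses import dataclass


@dataclass
class ImageSize:
    width: int = 0
    height: int = 0

    def __post_init__(self) -> None:
        # Replace negative dimensions with 0 instead of raising an error.
        self.width = max(0, self.width)
        self.height = max(0, self.height)


size = ImageSize(width=-20, height=480)
```

Here `size.width` becomes `0` and `size.height` stays `480`.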

`TableMetadataSchema`

Specific metadata for tabular content.

Field	Type	Default Value	Description
`caption`	`str`	`""`	Caption associated with the table.
`table_format`	`TableFormatEnum`	Required	Format of the table (e.g., `csv`, `html`). Uses `TableFormatEnum`.
`table_content`	`str`	`""`	String representation of the table's content (e.g., CSV string, HTML markup).
`table_content_format`	`Union[TableFormatEnum, str]`	`""`	Specific format of `table_content`.
`table_location`	`tuple`	`(0, 0, 0, 0)`	Bounding box or coordinates of the table within its source.
`table_location_max_dimensions`	`tuple`	`(0, 0)`	Maximum dimensions of the space where `table_location` is defined.
`uploaded_image_uri`	`str`	`""`	URI of an image representation of the table, if applicable.

`ChartMetadataSchema`

Metadata for table content extracted from charts.

Field	Type	Default Value	Description
`caption`	`str`	`""`	Caption associated with the chart.
`table_format`	`TableFormatEnum`	Required	Underlying data format of the chart (e.g., data might be in `csv` format). Uses `TableFormatEnum`.
`table_content`	`str`	`""`	String representation of the chart's underlying data.
`table_content_format`	`Union[TableFormatEnum, str]`	`""`	Specific format of `table_content`.
`table_location`	`tuple`	`(0, 0, 0, 0)`	Bounding box or coordinates of the chart within its source.
`table_location_max_dimensions`	`tuple`	`(0, 0)`	Maximum dimensions of the space where `table_location` is defined.
`uploaded_image_uri`	`str`	`""`	URI of an image representation of the chart, if applicable.

`AudioMetadataSchema`

Specific metadata for audio content.

Field	Type	Default Value	Description
`audio_transcript`	`str`	`""`	Transcript of the audio content.
`audio_type`	`str`	`""`	Type or format of the audio (e.g., `mp3`, `wav`).

`ErrorMetadataSchema` (Currently Unused)

Metadata describing errors encountered during processing.

Field	Type	Default Value	Description
`task`	`TaskTypeEnum`	Required	The task that was being performed when the error occurred. Uses `TaskTypeEnum`.
`status`	`StatusEnum`	Required	The status indicating failure. Uses `StatusEnum`.
`source_id`	`str`	`""`	Identifier of the source item that caused the error, if applicable.
`error_msg`	`str`	Required	The error message.

`InfoMessageMetadataSchema` (Currently Unused)

Informational messages related to processing.

Field	Type	Default Value	Description
`task`	`TaskTypeEnum`	Required	The task associated with this informational message. Uses `TaskTypeEnum`.
`status`	`StatusEnum`	Required	The status related to this message (e.g., `INFO`, `WARNING`). Uses `StatusEnum`.
`message`	`str`	Required	The informational message content.
`filter`	`bool`	Required	A flag indicating if this message should be used for filtering purposes.

Enums

The following enums are used by this schema:

AccessLevelEnum – Defines access levels (e.g., PUBLIC, CONFIDENTIAL, UNKNOWN).
ContentTypeEnum – Defines types of content (e.g., TEXT, IMAGE, AUDIO, STRUCTURED, NONE).
TextTypeEnum – Defines types of text (e.g., DOCUMENT, TITLE, OCR, CAPTION).
LanguageEnum – Defines languages (e.g., ENGLISH (en), SPANISH (es)).
TableFormatEnum – Defines table formats (e.g., CSV, HTML, TEXT).
StatusEnum – Defines processing statuses (e.g., SUCCESS, FAILURE, PROCESSING, INFO, WARNING).
DocumentTypeEnum – Defines types of source documents (e.g., PDF, DOCX, TXT, URL, PNG, MP3).
TaskTypeEnum – Defines types of processing tasks (e.g., EXTRACTION, EMBEDDING, STORAGE).

Example Metadata

The following is an example JSON representation of metadata. This is an example only, and does not contain the full metadata. For the full file, refer to the data folder.

{
    "document_type": "text",
    "metadata": 
    {
        "content": "TestingDocument...",
        "content_url": "",
        "source_metadata": 
        {
            "source_name": "data/multimodal_test.pdf",
            "source_id": "data/multimodal_test.pdf",
            "source_location": "",
            "source_type": "PDF",
            "collection_id": "",
            "date_created": "2025-03-13T18:37:14.715892",
            "last_modified": "2025-03-13T18:37:14.715534",
            "summary": "",
            "partition_id": -1,
            "access_level": 1
        },
        "content_metadata": 
        {
            "type": "structured",
            "description": "Structured chart extracted from PDF document.",
            "page_number": 1,
            "hierarchy": 
            {
                "page_count": 3,
                "page": 1,
                "block": -1,
                "line": -1,
                "span": -1,
                "nearby_objects": 
                {
                    "text": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    },
                    "images": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    },
                    "structured": 
                    {
                        "content": [],
                        "bbox": [],
                        "type": []
                    }
                }
            },
            "subtype": "chart"
        },
        "audio_metadata": null,
        "text_metadata": null,
        "image_metadata": null,
        "table_metadata": 
        {
            "caption": "",
            "table_format": "image",
            "table_content": "Below,is a high-quality picture of some shapes          Picture",
            "table_content_format": "",
            "table_location": 
            [
                74,
                614,
                728,
                920
            ],
            "table_location_max_dimensions": 
            [
                792,
                1024
            ],
            "uploaded_image_uri": ""
        },
        "chart_metadata": null,
        "error_metadata": null,
        "info_message_metadata": null,
        "debug_metadata": null,
        "raise_on_failure": false
    }
}
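The location fields in the example above can be combined to express a bounding box as fractions of the page. The following is a sketch under stated assumptions: the coordinate order `(x0, y0, x1, y1)` for `table_location` and `(x_max, y_max)` for `table_location_max_dimensions` is assumed, and the function name is hypothetical.

```python
import json
from typing import Optional, Sequence, Tuple


def normalize_bbox(
    location: Sequence[float], max_dimensions: Sequence[float]
) -> Optional[Tuple[float, float, float, float]]:
    """Scale a bounding box to the 0..1 range, or return None if dimensions are unset."""
    x_max, y_max = max_dimensions
    if x_max <= 0 or y_max <= 0:
        # The schema default of (0, 0) carries no usable page size.
        return None
    x0, y0, x1, y1 = location  # assumed order: left, top, right, bottom
    return (x0 / x_max, y0 / y_max, x1 / x_max, y1 / y_max)


# Values copied from the table_metadata block of the example record.
table_metadata = json.loads(
    '{"table_location": [74, 614, 728, 920],'
    ' "table_location_max_dimensions": [792, 1024]}'
)
normalized = normalize_bbox(
    table_metadata["table_location"],
    table_metadata["table_location_max_dimensions"],
)
```

Normalized coordinates make boxes comparable across pages of different sizes, which is why the maximum dimensions are stored alongside each location.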
