Metadata Schema Documentation

This document describes `MetadataSchema` and its constituent sub-schemas, as used in the NVIDIA Ingest Framework. Together they define the structure of the metadata attached to ingested content.

Main Schema: `MetadataSchema`

`MetadataSchema` is the primary container for all metadata. It holds the extracted content, its URL, an optional embedding, and a set of specialized metadata blocks.

Field	Type	Default Value/Behavior	Description
`content`	`str`	`""`	The actual textual content extracted from the source.
`content_url`	`str`	`""`	URL pointing to the location of the content, if applicable.
`embedding`	`Optional[List[float]]`	`None`	Optional numerical vector representation (embedding) of the content.
`source_metadata`	`Optional[SourceMetadataSchema]`	`None`	Metadata about the original source of the content. See SourceMetadataSchema.
`content_metadata`	`Optional[ContentMetadataSchema]`	`None`	General metadata about the extracted content itself. See ContentMetadataSchema.
`audio_metadata`	`Optional[AudioMetadataSchema]`	`None`	Specific metadata for audio content. Automatically set to `None` if `content_metadata.type` is not `AUDIO`. See AudioMetadataSchema.
`text_metadata`	`Optional[TextMetadataSchema]`	`None`	Specific metadata for text content. Automatically set to `None` if `content_metadata.type` is not `TEXT`. See TextMetadataSchema.
`image_metadata`	`Optional[ImageMetadataSchema]`	`None`	Specific metadata for image content. Automatically set to `None` if `content_metadata.type` is not `IMAGE`. See ImageMetadataSchema.
`table_metadata`	`Optional[TableMetadataSchema]`	`None`	Specific metadata for tabular content. Automatically set to `None` if `content_metadata.type` is not `STRUCTURED`. See TableMetadataSchema.
`chart_metadata`	`Optional[ChartMetadataSchema]`	`None`	Specific metadata for chart content. See ChartMetadataSchema.
`error_metadata`	`Optional[ErrorMetadataSchema]`	`None`	Metadata describing any errors encountered during processing. See ErrorMetadataSchema.
`info_message_metadata`	`Optional[InfoMessageMetadataSchema]`	`None`	Informational messages related to the processing. See InfoMessageMetadataSchema.
`debug_metadata`	`Optional[Dict[str, Any]]`	`None`	A dictionary for storing any arbitrary debug information.
`raise_on_failure`	`bool`	`False`	If `True`, indicates that processing should halt on failure.

Note: A `model_validator` ensures that the type-specific metadata fields (`audio_metadata`, `image_metadata`, `text_metadata`, `table_metadata`) are set to `None` when `content_metadata.type` does not match the respective content type.
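The clearing rule in the note above can be sketched with the standard library alone. This is an illustration, not the framework's implementation: `MetadataSketch`, `ContentType`, and `KEPT_FOR` are hypothetical names, the enum values are assumed, the metadata blocks are modeled as plain dictionaries, and the behavior when no content type is given (everything cleared) is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class ContentType(Enum):
    """Hypothetical stand-in for ContentTypeEnum (values are assumed)."""
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    STRUCTURED = "structured"


# Type-specific field -> the content type for which it is kept.
KEPT_FOR = {
    "audio_metadata": ContentType.AUDIO,
    "text_metadata": ContentType.TEXT,
    "image_metadata": ContentType.IMAGE,
    "table_metadata": ContentType.STRUCTURED,
}


@dataclass
class MetadataSketch:
    content_type: Optional[ContentType] = None
    audio_metadata: Optional[Dict[str, Any]] = None
    text_metadata: Optional[Dict[str, Any]] = None
    image_metadata: Optional[Dict[str, Any]] = None
    table_metadata: Optional[Dict[str, Any]] = None

    def __post_init__(self) -> None:
        # Clear every type-specific block whose type does not match.
        for field_name, kept_type in KEPT_FOR.items():
            if self.content_type is not kept_type:
                setattr(self, field_name, None)


record = MetadataSketch(
    content_type=ContentType.IMAGE,
    text_metadata={"summary": "stale"},
    image_metadata={"caption": "Figure 1"},
)
print(record.text_metadata)   # None
print(record.image_metadata)  # {'caption': 'Figure 1'}
```

The practical consequence is that a mismatched block is dropped silently rather than rejected, so callers should set `content_metadata.type` before relying on any type-specific field.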

Sub-Schemas

`SourceMetadataSchema`

Describes the origin of the ingested content.

Field	Type	Default Value	Description
`source_name`	`str`	Required	Name of the source (e.g., filename, URL).
`source_id`	`str`	Required	Unique identifier for the source.
`source_location`	`str`	`""`	Physical or logical location of the source (e.g., path, database table).
`source_type`	`Union[DocumentTypeEnum, str]`	Required	Type of the source document (e.g., `pdf`, `docx`, `url`). Uses `DocumentTypeEnum` or a string.
`collection_id`	`str`	`""`	Identifier for any collection this source belongs to.
`date_created`	`str`	`datetime.now().isoformat()`	ISO 8601 timestamp of when the source was created. Validated to be in ISO 8601 format.
`last_modified`	`str`	`datetime.now().isoformat()`	ISO 8601 timestamp of when the source was last modified. Validated to be in ISO 8601 format.
`summary`	`str`	`""`	A brief summary of the source content.
`partition_id`	`int`	`-1`	Identifier for a partition if the source is part of a larger, partitioned dataset.
`access_level`	`Union[AccessLevelEnum, int]`	`AccessLevelEnum.UNKNOWN`	Access level associated with the source. Uses `AccessLevelEnum` or an integer.
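The ISO 8601 check on `date_created` and `last_modified` could look roughly like the following. This is a minimal sketch, assuming the check amounts to "the string must parse as a timestamp"; `validate_iso8601` is a hypothetical helper name. Note that `datetime.fromisoformat` always accepts the output of `isoformat()`, but how many other ISO 8601 variants it accepts depends on the Python version.

```python
from datetime import datetime


def validate_iso8601(value: str) -> str:
    """Return the value unchanged if it parses as a timestamp, else raise ValueError."""
    try:
        datetime.fromisoformat(value)
    except ValueError as exc:
        raise ValueError(f"not an ISO 8601 timestamp: {value!r}") from exc
    return value


# The documented default, datetime.now().isoformat(), passes the check.
print(validate_iso8601(datetime.now().isoformat()))

# A free-form date string is rejected.
try:
    validate_iso8601("January 5th, 2024")
except ValueError as err:
    print("rejected:", err)
```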

`ContentMetadataSchema`

General metadata about the extracted content.

Field	Type	Default Value	Description
`type`	`ContentTypeEnum`	Required	The type of the extracted content (e.g., `TEXT`, `IMAGE`, `AUDIO`). Uses `ContentTypeEnum`.
`description`	`str`	`""`	A description of the extracted content.
`page_number`	`int`	`-1`	Page number from which the content was extracted, if applicable (e.g., for PDFs).
`hierarchy`	`ContentHierarchySchema`	`ContentHierarchySchema()`	Hierarchical information about the content's location within the source. See ContentHierarchySchema.
`subtype`	`Union[ContentTypeEnum, str]`	`""`	A more specific subtype for the content (e.g., if `type` is `IMAGE`, `subtype` could be `diagram`).
`start_time`	`int`	`-1`	Start time in milliseconds for time-based media (e.g., audio, video).
`end_time`	`int`	`-1`	End time in milliseconds for time-based media.
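Several fields in this table default to `-1`. The sketch below assumes that `-1` acts as a "not set" sentinel for `start_time` and `end_time`, which the table implies but does not state; `segment_duration_ms` and `UNSET` are hypothetical names used only for illustration.

```python
from typing import Optional

UNSET = -1  # assumed meaning of the documented default


def segment_duration_ms(start_time: int, end_time: int) -> Optional[int]:
    """Return the segment length in milliseconds, or None if either bound is unset."""
    if start_time == UNSET or end_time == UNSET:
        return None
    return end_time - start_time


print(segment_duration_ms(1500, 4000))  # 2500
print(segment_duration_ms(-1, -1))      # None
```

Checking for the sentinel before doing arithmetic avoids treating the defaults as a real zero-length segment.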

`ContentHierarchySchema`

Describes the structural location of content within a document.

Field	Type	Default Value	Description
`page_count`	`int`	`-1`	Total number of pages in the document, if applicable.
`page`	`int`	`-1`	The specific page number where the content resides.
`block`	`int`	`-1`	Identifier for a block of content (e.g., paragraph, section).
`line`	`int`	`-1`	Line number within a block, if applicable.
`span`	`int`	`-1`	Span identifier within a line, for finer granularity.
`nearby_objects`	`NearbyObjectsSchema`	`NearbyObjectsSchema()`	Information about objects (text, images, structured data) near the current content. See NearbyObjectsSchema.

`NearbyObjectsSchema` (Currently Unused)

Container for different types of nearby objects.

Field	Type	Default Value	Description
`text`	`NearbyObjectsSubSchema`	`NearbyObjectsSubSchema()`	Nearby textual objects. See NearbyObjectsSubSchema.
`images`	`NearbyObjectsSubSchema`	`NearbyObjectsSubSchema()`	Nearby image objects.
`structured`	`NearbyObjectsSubSchema`	`NearbyObjectsSubSchema()`	Nearby structured data objects (e.g., tables).

`NearbyObjectsSubSchema`

Describes a list of nearby objects of a specific type.

Field	Type	Default Value	Description
`content`	`List[str]`	`default_factory=list`	List of content strings for the nearby objects.
`bbox`	`List[tuple]`	`default_factory=list`	List of bounding boxes (e.g., coordinates) for the nearby objects.
`type`	`List[str]`	`default_factory=list`	List of types for the nearby objects.
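The `default_factory=list` defaults mean that each instance gets its own empty list instead of sharing one mutable default. The sketch below shows the same idea with a standard-library dataclass; `NearbyObjectsSubSketch` is a hypothetical stand-in, not the framework's class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NearbyObjectsSubSketch:
    # Each field gets a fresh list per instance.
    content: List[str] = field(default_factory=list)
    bbox: List[tuple] = field(default_factory=list)
    type: List[str] = field(default_factory=list)


first = NearbyObjectsSubSketch()
second = NearbyObjectsSubSketch()

# Mutating one instance leaves the other untouched.
first.content.append("Figure 1 caption")
first.bbox.append((10, 20, 110, 60))
first.type.append("text")

print(first.content)   # ['Figure 1 caption']
print(second.content)  # []
```

The three lists appear to be parallel, with the entry at a given index in `content`, `bbox`, and `type` describing the same nearby object; that reading is an inference from the field descriptions.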

`TextMetadataSchema`

Specific metadata for textual content.

Field	Type	Default Value	Description
`text_type`	`TextTypeEnum`	Required	Type of text (e.g., `document`, `title`, `ocr`). Uses `TextTypeEnum`.
`summary`	`str`	`""`	A summary of this specific text segment.
`keywords`	`Union[str, List[str], Dict]`	`""`	Keywords extracted from or associated with the text. Can be a single string, list of strings, or a dictionary.
`language`	`LanguageEnum`	`"en"`	Detected or specified language of the text. Uses `LanguageEnum`. Defaults to English.
`text_location`	`tuple`	`(0, 0, 0, 0)`	Bounding box or coordinates of the text within its source (e.g., on a page).
`text_location_max_dimensions`	`tuple`	`(0, 0, 0, 0)`	Maximum dimensions of the space where `text_location` is defined (e.g., page width/height).

`ImageMetadataSchema`

Specific metadata for image content.

Field	Type	Default Value	Description
`image_type`	`Union[DocumentTypeEnum, str]`	Required	Type of the image document (e.g., `png`, `jpeg`). Uses `DocumentTypeEnum` or a string.
`structured_image_type`	`ContentTypeEnum`	`ContentTypeEnum.NONE`	If the image represents structured data (e.g., a table or chart), its `ContentTypeEnum`.
`caption`	`str`	`""`	Caption associated with the image.
`text`	`str`	`""`	Text extracted from the image (e.g., via OCR).
`image_location`	`tuple`	`(0, 0, 0, 0)`	Bounding box or coordinates of the image within its source.
`image_location_max_dimensions`	`tuple`	`(0, 0)`	Maximum dimensions of the space where `image_location` is defined.
`uploaded_image_url`	`str`	`""`	URL of the image if it has been uploaded to a separate storage location.
`width`	`int`	`0`	Width of the image in pixels. Clamped to be non-negative.
`height`	`int`	`0`	Height of the image in pixels. Clamped to be non-negative.
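"Clamped to be non-negative" can be read as replacing any negative `width` or `height` with `0` rather than raising an error. The sketch below illustrates that reading with a standard-library dataclass; `ImageSizeSketch` is a hypothetical name and the exact mechanism in the framework may differ.

```python
from dataclasses import dataclass


@dataclass
class ImageSizeSketch:
    width: int = 0
    height: int = 0

    def __post_init__(self) -> None:
        # Negative dimensions are raised to zero; valid values pass through.
        self.width = max(0, self.width)
        self.height = max(0, self.height)


size = ImageSizeSketch(width=-5, height=480)
print(size.width, size.height)  # 0 480
```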

`TableMetadataSchema`

Specific metadata for tabular content.

Field	Type	Default Value	Description
`caption`	`str`	`""`	Caption associated with the table.
`table_format`	`TableFormatEnum`	Required	Format of the table (e.g., `csv`, `html`). Uses `TableFormatEnum`.
`table_content`	`str`	`""`	String representation of the table's content (e.g., CSV string, HTML markup).
`table_content_format`	`Union[TableFormatEnum, str]`	`""`	Specific format of `table_content`.
`table_location`	`tuple`	`(0, 0, 0, 0)`	Bounding box or coordinates of the table within its source.
`table_location_max_dimensions`	`tuple`	`(0, 0)`	Maximum dimensions of the space where `table_location` is defined.
`uploaded_image_uri`	`str`	`""`	URI of an image representation of the table, if applicable.

`ChartMetadataSchema`

Specific metadata for chart content. Currently identical in structure to `TableMetadataSchema`, but semantically distinct.

Field	Type	Default Value	Description
`caption`	`str`	`""`	Caption associated with the chart.
`table_format`	`TableFormatEnum`	Required	Underlying data format of the chart (e.g., data might be in `csv` format). Uses `TableFormatEnum`.
`table_content`	`str`	`""`	String representation of the chart's underlying data.
`table_content_format`	`Union[TableFormatEnum, str]`	`""`	Specific format of `table_content`.
`table_location`	`tuple`	`(0, 0, 0, 0)`	Bounding box or coordinates of the chart within its source.
`table_location_max_dimensions`	`tuple`	`(0, 0)`	Maximum dimensions of the space where `table_location` is defined.
`uploaded_image_uri`	`str`	`""`	URI of an image representation of the chart, if applicable.

`AudioMetadataSchema`

Specific metadata for audio content.

Field	Type	Default Value	Description
`audio_transcript`	`str`	`""`	Transcript of the audio content.
`audio_type`	`str`	`""`	Type or format of the audio (e.g., `mp3`, `wav`).

`ErrorMetadataSchema` (Currently Unused)

Metadata describing errors encountered during processing.

Field	Type	Default Value	Description
`task`	`TaskTypeEnum`	Required	The task that was being performed when the error occurred. Uses `TaskTypeEnum`.
`status`	`StatusEnum`	Required	The status indicating failure. Uses `StatusEnum`.
`source_id`	`str`	`""`	Identifier of the source item that caused the error, if applicable.
`error_msg`	`str`	Required	The error message.

`InfoMessageMetadataSchema` (Currently Unused)

Informational messages related to processing.

Field	Type	Default Value	Description
`task`	`TaskTypeEnum`	Required	The task associated with this informational message. Uses `TaskTypeEnum`.
`status`	`StatusEnum`	Required	The status related to this message (e.g., `INFO`, `WARNING`). Uses `StatusEnum`.
`message`	`str`	Required	The informational message content.
`filter`	`bool`	Required	A flag indicating if this message should be used for filtering purposes.

Enums Used

This schema relies on several enums defined in `nv_ingest_api.internal.enums.common`:

`AccessLevelEnum`: Defines access levels (e.g., `PUBLIC`, `CONFIDENTIAL`, `UNKNOWN`).
`ContentTypeEnum`: Defines types of content (e.g., `TEXT`, `IMAGE`, `AUDIO`, `STRUCTURED`, `NONE`).
`TextTypeEnum`: Defines types of text (e.g., `DOCUMENT`, `TITLE`, `OCR`, `CAPTION`).
`LanguageEnum`: Defines languages (e.g., `ENGLISH` (`en`), `SPANISH` (`es`)).
`TableFormatEnum`: Defines table formats (e.g., `CSV`, `HTML`, `TEXT`).
`StatusEnum`: Defines processing statuses (e.g., `SUCCESS`, `FAILURE`, `PROCESSING`, `INFO`, `WARNING`).
`DocumentTypeEnum`: Defines types of source documents (e.g., `PDF`, `DOCX`, `TXT`, `URL`, `PNG`, `MP3`).
`TaskTypeEnum`: Defines types of processing tasks (e.g., `EXTRACTION`, `EMBEDDING`, `STORAGE`).
