Metadata Schema Documentation
This document provides a detailed explanation of the MetadataSchema and its constituent sub-schemas used within the NVIDIA Ingest Framework. This schema defines the structure for metadata associated with ingested content.
Main Schema: MetadataSchema
The MetadataSchema is the primary container for all metadata. It includes the core content, its URL, embedding, and various specialized metadata blocks.
| Field | Type | Default Value/Behavior | Description |
|---|---|---|---|
| content | str | "" | The actual textual content extracted from the source. |
| content_url | str | "" | URL pointing to the location of the content, if applicable. |
| embedding | Optional[List[float]] | None | Optional numerical vector representation (embedding) of the content. |
| source_metadata | Optional[SourceMetadataSchema] | None | Metadata about the original source of the content. See SourceMetadataSchema. |
| content_metadata | Optional[ContentMetadataSchema] | None | General metadata about the extracted content itself. See ContentMetadataSchema. |
| audio_metadata | Optional[AudioMetadataSchema] | None | Specific metadata for audio content. Automatically set to None if content_metadata.type is not AUDIO. See AudioMetadataSchema. |
| text_metadata | Optional[TextMetadataSchema] | None | Specific metadata for text content. Automatically set to None if content_metadata.type is not TEXT. See TextMetadataSchema. |
| image_metadata | Optional[ImageMetadataSchema] | None | Specific metadata for image content. Automatically set to None if content_metadata.type is not IMAGE. See ImageMetadataSchema. |
| table_metadata | Optional[TableMetadataSchema] | None | Specific metadata for tabular content. Automatically set to None if content_metadata.type is not STRUCTURED. See TableMetadataSchema. |
| chart_metadata | Optional[ChartMetadataSchema] | None | Specific metadata for chart content. See ChartMetadataSchema. |
| error_metadata | Optional[ErrorMetadataSchema] | None | Metadata describing any errors encountered during processing. See ErrorMetadataSchema. |
| info_message_metadata | Optional[InfoMessageMetadataSchema] | None | Informational messages related to the processing. See InfoMessageMetadataSchema. |
| debug_metadata | Optional[Dict[str, Any]] | None | A dictionary for storing any arbitrary debug information. |
| raise_on_failure | bool | False | If True, indicates that processing should halt on failure. |
Note: A model_validator ensures that type-specific metadata fields (audio_metadata, image_metadata, text_metadata, table_metadata) are set to None if the content_metadata.type does not match the respective content type.
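
The clearing rule described in this note can be illustrated without the framework. The following is a minimal stdlib-only sketch that operates on plain dictionaries. The function name `clear_mismatched_metadata`, the `FIELD_TO_TYPE` mapping, and the lowercase type strings are assumptions made for illustration; the actual schema applies the rule through a Pydantic model_validator and compares against ContentTypeEnum members.

```python
from typing import Any, Dict

# Each type-specific field and the content type it requires, per the note above.
# The lowercase strings are assumed stand-ins for ContentTypeEnum members.
FIELD_TO_TYPE = {
    "audio_metadata": "audio",
    "text_metadata": "text",
    "image_metadata": "image",
    "table_metadata": "structured",
}


def clear_mismatched_metadata(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of record with non-matching type-specific blocks set to None."""
    # In this sketch, a missing content_metadata block yields type None,
    # so every type-specific block is cleared.
    content_type = (record.get("content_metadata") or {}).get("type")
    cleaned = dict(record)
    for field_name, required_type in FIELD_TO_TYPE.items():
        if content_type != required_type:
            cleaned[field_name] = None
    return cleaned


record = {
    "content": "Quarterly revenue rose.",
    "content_metadata": {"type": "text"},
    "text_metadata": {"text_type": "document"},
    "image_metadata": {"image_type": "png"},  # does not match type "text"
}
cleaned = clear_mismatched_metadata(record)
print(cleaned["text_metadata"])   # kept, because the type is "text"
print(cleaned["image_metadata"])  # cleared
```

Note that chart_metadata is deliberately absent from the mapping, since the note does not list it among the fields that are cleared.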
Sub-Schemas
SourceMetadataSchema
Describes the origin of the ingested content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| source_name | str | Required | Name of the source (e.g., filename, URL). |
| source_id | str | Required | Unique identifier for the source. |
| source_location | str | "" | Physical or logical location of the source (e.g., path, database table). |
| source_type | Union[DocumentTypeEnum, str] | Required | Type of the source document (e.g., pdf, docx, url). Uses DocumentTypeEnum. |
| collection_id | str | "" | Identifier for any collection this source belongs to. |
| date_created | str | datetime.now().isoformat() | ISO 8601 timestamp of when the source was created. Validated to be in ISO 8601 format. |
| last_modified | str | datetime.now().isoformat() | ISO 8601 timestamp of when the source was last modified. Validated to be in ISO 8601 format. |
| summary | str | "" | A brief summary of the source content. |
| partition_id | int | -1 | Identifier for a partition if the source is part of a larger, partitioned dataset. |
| access_level | Union[AccessLevelEnum, int] | AccessLevelEnum.UNKNOWN | Access level associated with the source. Uses AccessLevelEnum. |
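
The ISO 8601 check on date_created and last_modified can be approximated with the standard library. This is a sketch under assumptions: the helper name `validate_iso8601` is hypothetical here, and the framework's own validator may accept a different set of ISO 8601 forms than `datetime.fromisoformat` does.

```python
from datetime import datetime


def validate_iso8601(value: str) -> str:
    """Return value unchanged if it parses as ISO 8601, else raise ValueError."""
    try:
        datetime.fromisoformat(value)
    except ValueError as exc:
        raise ValueError(f"Not an ISO 8601 timestamp: {value!r}") from exc
    return value


# The documented default is produced by datetime.now().isoformat(),
# which always round-trips through fromisoformat.
default_timestamp = datetime.now().isoformat()
validate_iso8601(default_timestamp)
validate_iso8601("2024-05-01T12:30:00")
```

Before Python 3.11, `datetime.fromisoformat` only parses a subset of ISO 8601 (for example, it rejects a trailing `Z`), so this sketch is stricter than the standard on older interpreters.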
ContentMetadataSchema
General metadata about the extracted content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| type | ContentTypeEnum | Required | The type of the extracted content (e.g., TEXT, IMAGE, AUDIO). Uses ContentTypeEnum. |
| description | str | "" | A description of the extracted content. |
| page_number | int | -1 | Page number from which the content was extracted, if applicable (e.g., for PDFs). |
| hierarchy | ContentHierarchySchema | ContentHierarchySchema() | Hierarchical information about the content's location within the source. See ContentHierarchySchema. |
| subtype | Union[ContentTypeEnum, str] | "" | A more specific subtype for the content (e.g., if type is IMAGE, subtype could be diagram). |
| start_time | int | -1 | Start time in milliseconds for time-based media (e.g., audio, video). |
| end_time | int | -1 | End time in milliseconds for time-based media. |
ContentHierarchySchema
Describes the structural location of content within a document.
| Field | Type | Default Value | Description |
|---|---|---|---|
| page_count | int | -1 | Total number of pages in the document, if applicable. |
| page | int | -1 | The specific page number where the content resides. |
| block | int | -1 | Identifier for a block of content (e.g., paragraph, section). |
| line | int | -1 | Line number within a block, if applicable. |
| span | int | -1 | Span identifier within a line, for finer granularity. |
| nearby_objects | NearbyObjectsSchema | NearbyObjectsSchema() | Information about objects (text, images, structured data) near the current content. See NearbyObjectsSchema. |
NearbyObjectsSchema (Currently Unused)
Container for different types of nearby objects.
| Field | Type | Default Value | Description |
|---|---|---|---|
| text | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby textual objects. See NearbyObjectsSubSchema. |
| images | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby image objects. |
| structured | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby structured data objects (e.g., tables). |
NearbyObjectsSubSchema
Describes a list of nearby objects of a specific type.
| Field | Type | Default Value | Description |
|---|---|---|---|
| content | List[str] | default_factory=list | List of content strings for the nearby objects. |
| bbox | List[tuple] | default_factory=list | List of bounding boxes (e.g., coordinates) for the nearby objects. |
| type | List[str] | default_factory=list | List of types for the nearby objects. |
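
The default_factory=list defaults mean that every instance gets its own empty lists rather than sharing one mutable object. The sketch below shows the same idea with a stdlib dataclass. The class name `NearbyObjectsSketch` is a hypothetical stand-in and not the framework's Pydantic class, and treating the three lists as parallel (index i describes the same object in each) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NearbyObjectsSketch:
    """Stdlib stand-in for NearbyObjectsSubSchema, for illustration only."""

    content: List[str] = field(default_factory=list)
    bbox: List[Tuple[float, ...]] = field(default_factory=list)
    type: List[str] = field(default_factory=list)


first = NearbyObjectsSketch()
second = NearbyObjectsSketch()

# Record one nearby object on the first instance only.
first.content.append("Figure 1 caption")
first.bbox.append((10, 20, 110, 80))
first.type.append("text")

# The second instance is unaffected because each list was created separately.
print(second.content)
```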
TextMetadataSchema
Specific metadata for textual content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| text_type | TextTypeEnum | Required | Type of text (e.g., document, title, ocr). Uses TextTypeEnum. |
| summary | str | "" | A summary of this specific text segment. |
| keywords | Union[str, List[str], Dict] | "" | Keywords extracted from or associated with the text. Can be a single string, list of strings, or a dictionary. |
| language | LanguageEnum | "en" | Detected or specified language of the text. Uses LanguageEnum. Defaults to English. |
| text_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the text within its source (e.g., on a page). |
| text_location_max_dimensions | tuple | (0, 0, 0, 0) | Maximum dimensions of the space where text_location is defined (e.g., page width/height). |
ImageMetadataSchema
Specific metadata for image content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| image_type | Union[DocumentTypeEnum, str] | Required | Type of the image document (e.g., png, jpeg). Uses DocumentTypeEnum or a string. |
| structured_image_type | ContentTypeEnum | ContentTypeEnum.NONE | If the image represents structured data (e.g., a table or chart), its ContentTypeEnum. |
| caption | str | "" | Caption associated with the image. |
| text | str | "" | Text extracted from the image (e.g., via OCR). |
| image_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the image within its source. |
| image_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where image_location is defined. |
| uploaded_image_url | str | "" | URL of the image if it has been uploaded to a separate storage location. |
| width | int | 0 | Width of the image in pixels. Clamped to be non-negative. |
| height | int | 0 | Height of the image in pixels. Clamped to be non-negative. |
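
Clamping width and height means a negative value is replaced with 0 instead of being rejected. A minimal stdlib sketch of that behavior follows; `ImageSizeSketch` is a hypothetical class for illustration, and the framework presumably implements the clamp with Pydantic field validators rather than `__post_init__`.

```python
from dataclasses import dataclass


@dataclass
class ImageSizeSketch:
    """Illustrative stand-in for the width/height fields of ImageMetadataSchema."""

    width: int = 0
    height: int = 0

    def __post_init__(self) -> None:
        # Negative sizes are replaced with 0 rather than raising an error.
        self.width = max(0, self.width)
        self.height = max(0, self.height)


size = ImageSizeSketch(width=-5, height=480)
print(size.width, size.height)
```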
TableMetadataSchema
Specific metadata for tabular content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| caption | str | "" | Caption associated with the table. |
| table_format | TableFormatEnum | Required | Format of the table (e.g., csv, html). Uses TableFormatEnum. |
| table_content | str | "" | String representation of the table's content (e.g., CSV string, HTML markup). |
| table_content_format | Union[TableFormatEnum, str] | "" | Specific format of table_content. |
| table_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the table within its source. |
| table_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where table_location is defined. |
| uploaded_image_uri | str | "" | URI of an image representation of the table, if applicable. |
ChartMetadataSchema
Specific metadata for chart content. This schema is currently identical in structure to TableMetadataSchema, but semantically distinct.
| Field | Type | Default Value | Description |
|---|---|---|---|
| caption | str | "" | Caption associated with the chart. |
| table_format | TableFormatEnum | Required | Underlying data format of the chart (e.g., data might be in csv format). Uses TableFormatEnum. |
| table_content | str | "" | String representation of the chart's underlying data. |
| table_content_format | Union[TableFormatEnum, str] | "" | Specific format of table_content. |
| table_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the chart within its source. |
| table_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where table_location is defined. |
| uploaded_image_uri | str | "" | URI of an image representation of the chart, if applicable. |
AudioMetadataSchema
Specific metadata for audio content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| audio_transcript | str | "" | Transcript of the audio content. |
| audio_type | str | "" | Type or format of the audio (e.g., mp3, wav). |
ErrorMetadataSchema (Currently Unused)
Metadata describing errors encountered during processing.
| Field | Type | Default Value | Description |
|---|---|---|---|
| task | TaskTypeEnum | Required | The task that was being performed when the error occurred. Uses TaskTypeEnum. |
| status | StatusEnum | Required | The status indicating failure. Uses StatusEnum. |
| source_id | str | "" | Identifier of the source item that caused the error, if applicable. |
| error_msg | str | Required | The error message. |
InfoMessageMetadataSchema (Currently Unused)
Informational messages related to processing.
| Field | Type | Default Value | Description |
|---|---|---|---|
| task | TaskTypeEnum | Required | The task associated with this informational message. Uses TaskTypeEnum. |
| status | StatusEnum | Required | The status related to this message (e.g., INFO, WARNING). Uses StatusEnum. |
| message | str | Required | The informational message content. |
| filter | bool | Required | A flag indicating if this message should be used for filtering purposes. |
Enums Used
This schema relies on several enums defined in nv_ingest_api.internal.enums.common:
- AccessLevelEnum: Defines access levels (e.g., PUBLIC, CONFIDENTIAL, UNKNOWN).
- ContentTypeEnum: Defines types of content (e.g., TEXT, IMAGE, AUDIO, STRUCTURED, NONE).
- TextTypeEnum: Defines types of text (e.g., DOCUMENT, TITLE, OCR, CAPTION).
- LanguageEnum: Defines languages (e.g., ENGLISH (en), SPANISH (es)).
- TableFormatEnum: Defines table formats (e.g., CSV, HTML, TEXT).
- StatusEnum: Defines processing statuses (e.g., SUCCESS, FAILURE, PROCESSING, INFO, WARNING).
- DocumentTypeEnum: Defines types of source documents (e.g., PDF, DOCX, TXT, URL, PNG, MP3).
- TaskTypeEnum: Defines types of processing tasks (e.g., EXTRACTION, EMBEDDING, STORAGE).
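
Several fields above take a plain string default for an enum-typed field, such as language defaulting to "en". One common way this works is a string-valued enum, where the raw value resolves to the matching member. The sketch below illustrates the pattern with the two language codes named in this document. `LanguageSketch` is a hypothetical stand-in, and whether the real LanguageEnum mixes in `str` is an assumption.

```python
from enum import Enum


class LanguageSketch(str, Enum):
    """Illustrative stand-in for LanguageEnum with only the values named above."""

    ENGLISH = "en"
    SPANISH = "es"


# Looking up by value returns the member, so a raw "en" maps to ENGLISH.
lang = LanguageSketch("en")
print(lang is LanguageSketch.ENGLISH)

# An unknown value raises ValueError instead of being silently accepted.
try:
    LanguageSketch("xx")
except ValueError as exc:
    print(f"rejected: {exc}")
```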