Metadata Reference for NeMo Retriever Extraction
This page is the reference for the metadata used in NeMo Retriever extraction. It uses the following definitions:
- Source — The file that is ingested, and from which content and metadata are extracted.
- Content — Data extracted from a source, such as text or an image.
Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.
Note
NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
Source File Metadata
The following is the metadata for source files.
| Field | Description | Method |
|---|---|---|
| Source Name | The name of the source file. | Extracted |
| Source ID | The ID of the source file. | Extracted |
| Source location | The URL, URI, or pointer to the storage location of the source file. | — |
| Source Type | The type of the source file, such as pdf, docx, pptx, or txt. | Extracted |
| Collection ID | The ID of the collection in which the source is contained. | — |
| Date Created | The date the source was created. | Extracted |
| Last Modified | The date the source was last modified. | Extracted |
| Partition ID | The offset of this data fragment within a larger set of fragments. | Generated |
| Access Level | The role-based access control for the source. | — |
| Summary | A summary of the source. (Not yet implemented.) | Generated |
Content Metadata
The following is the metadata for content. These fields apply to all content types including text, images, and tables.
| Field | Description | Method |
|---|---|---|
| Type | The type of the content: Text, Image, Structured, Table, or Chart. | Generated |
| Subtype | The type of the content for structured data types, such as table or chart. | — |
| Content | Content extracted from the source. | Extracted |
| Description | A text description of the content object. | Generated |
| Page # | The page # of the content in the source. Prior to 26.1.1, this field was 0-indexed. Beginning with 26.1.1, this field is 1-indexed. | Extracted |
| Hierarchy | The location or order of the content within the source. | Extracted |
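Because the Page # field changed from 0-indexed to 1-indexed in 26.1.1, consumers that read records from both older and newer releases may want to normalize it. The following is a minimal sketch, not part of the library: the helper names are hypothetical, and it assumes the caller already knows which release produced a given record.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as "26.1.1" into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# First release in which the page number is 1-indexed, per the table above.
ONE_INDEXED_SINCE = parse_version("26.1.1")


def to_one_indexed_page(page_number: int, producer_version: str) -> int:
    """Return a 1-indexed page number regardless of the producing release.

    Negative values are treated as "not set" and pass through unchanged.
    """
    if page_number < 0:
        return page_number
    if parse_version(producer_version) < ONE_INDEXED_SINCE:
        return page_number + 1  # older releases counted pages from 0
    return page_number


first_page_old = to_one_indexed_page(0, "25.9.0")
first_page_new = to_one_indexed_page(1, "26.1.1")
```

Both calls describe the first page of a document, so both variables hold the same 1-indexed value.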
Text Metadata
The following is the metadata for text.
| Field | Description | Method |
|---|---|---|
| Text Type | The type of the text, such as header or body. | Extracted |
| Keywords | Keywords, Named Entities, or other phrases. | Extracted |
| Language | The language of the content. | Generated |
| Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated |
Image Metadata
The following is the metadata for images.
| Field | Description | Method |
|---|---|---|
| Image Type | The type of the image, such as structured, natural, hybrid, and others. | Generated (Classifier) |
| Structured Image Type | The type of the content for structured data types, such as bar chart, pie chart, and others. | Generated (Classifier) |
| Caption | Any caption or subheading associated with the image. | Extracted |
| Text | Text extracted from a structured chart. | Extracted |
| Image location | The location (x,y) of the chart within an image. | Extracted |
| Image location max dimensions | The max dimensions (x_max,y_max) of the location (x,y). | Extracted |
| uploaded_image_uri | Mirrors source_metadata.source_location | — |
Table Metadata
The following is the metadata for tables within documents.
Warning
Tables should not be chunked.
| Field | Description | Method |
|---|---|---|
| Table format | Structured (dataframe or lists of rows and columns), or serialized as markdown, html, latex, or simple (cells separated by spaces). | Extracted |
| Table content | Extracted text content, formatted according to table_metadata.table_format. | Extracted |
| Table location | The bounding box of the table. | Extracted |
| Table location max dimensions | The max dimensions (x_max,y_max) of the bounding box of the table. | Extracted |
| Caption | The caption for the table or chart. | Extracted |
| Title | The title of the table. | Extracted |
| Subtitle | The subtitle of the table. | Extracted |
| Axis | Axis information for the table. | Extracted |
| uploaded_image_uri | A mirror of source_metadata.source_location. | Generated |
Metadata Schema Documentation
The following is a detailed explanation of the MetadataSchema and its constituent sub-schemas used within the NVIDIA Ingest Framework. This schema defines the structure for metadata associated with ingested content.
MetadataSchema
The MetadataSchema is the primary container for all metadata. It includes the core content, its URL, embedding, and various specialized metadata blocks.
| Field | Type | Default Value/Behavior | Description |
|---|---|---|---|
| content | str | "" | The actual textual content extracted from the source. |
| content_url | str | "" | URL pointing to the location of the content, if applicable. |
| embedding | Optional[List[float]] | None | Optional numerical vector representation (embedding) of the content. |
| source_metadata | Optional[SourceMetadataSchema] | None | Metadata about the original source of the content. See SourceMetadataSchema. |
| content_metadata | Optional[ContentMetadataSchema] | None | General metadata about the extracted content itself. See ContentMetadataSchema. |
| audio_metadata | Optional[AudioMetadataSchema] | None | Specific metadata for audio content. Automatically set to None if content_metadata.type is not AUDIO. See AudioMetadataSchema. |
| text_metadata | Optional[TextMetadataSchema] | None | Specific metadata for text content. Automatically set to None if content_metadata.type is not TEXT. See TextMetadataSchema. |
| image_metadata | Optional[ImageMetadataSchema] | None | Specific metadata for image content. Automatically set to None if content_metadata.type is not IMAGE. See ImageMetadataSchema. |
| table_metadata | Optional[TableMetadataSchema] | None | Specific metadata for tabular content. Automatically set to None if content_metadata.type is not STRUCTURED. See TableMetadataSchema. |
| chart_metadata | Optional[ChartMetadataSchema] | None | Specific metadata for chart content. See ChartMetadataSchema. |
| error_metadata | Optional[ErrorMetadataSchema] | None | Metadata describing any errors encountered during processing. See ErrorMetadataSchema. |
| info_message_metadata | Optional[InfoMessageMetadataSchema] | None | Informational messages related to the processing. See InfoMessageMetadataSchema. |
| debug_metadata | Optional[Dict[str, Any]] | None | A dictionary for storing any arbitrary debug information. |
| raise_on_failure | bool | False | If True, indicates that processing should halt on failure. |
Note: A model_validator ensures that type-specific metadata fields (audio_metadata, image_metadata, text_metadata, table_metadata) are set to None if the content_metadata.type does not match the respective content type.
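The effect of the validator described in the note can be illustrated in plain Python on a dictionary-shaped record. This is a sketch of the documented behavior only, not the library's own validator: the function name and the mapping constant are hypothetical, and the lowercase type strings are an assumption based on the serialized example later on this page.

```python
from typing import Any, Dict

# Content type -> the one type-specific block that may stay populated.
# The pairing follows the field descriptions in the MetadataSchema table.
TYPE_TO_BLOCK = {
    "audio": "audio_metadata",
    "text": "text_metadata",
    "image": "image_metadata",
    "structured": "table_metadata",
}


def clear_mismatched_blocks(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy in which type-specific blocks that do not match the content type are None."""
    content_metadata = metadata.get("content_metadata") or {}
    content_type = str(content_metadata.get("type", "")).lower()
    cleaned = dict(metadata)
    for block_type, block_name in TYPE_TO_BLOCK.items():
        if block_type != content_type:
            cleaned[block_name] = None
    return cleaned


record = {
    "content_metadata": {"type": "text"},
    "text_metadata": {"text_type": "body"},
    "image_metadata": {"caption": "stale value"},
}
cleaned = clear_mismatched_blocks(record)
```

After the call, `cleaned["text_metadata"]` is preserved, while `image_metadata`, `audio_metadata`, and `table_metadata` are all `None`.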
SourceMetadataSchema
Describes the origin of the ingested content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| source_name | str | Required | Name of the source (e.g., filename, URL). |
| source_id | str | Required | Unique identifier for the source. |
| source_location | str | "" | Physical or logical location of the source (e.g., path, database table). |
| source_type | Union[DocumentTypeEnum, str] | Required | Type of the source document (e.g., pdf, docx, url). Uses DocumentTypeEnum. |
| collection_id | str | "" | Identifier for any collection this source belongs to. |
| date_created | str | datetime.now().isoformat() | ISO 8601 timestamp of when the source was created. Validated to be in ISO 8601 format. |
| last_modified | str | datetime.now().isoformat() | ISO 8601 timestamp of when the source was last modified. Validated to be in ISO 8601 format. |
| summary | str | "" | A brief summary of the source content. |
| partition_id | int | -1 | Identifier for a partition if the source is part of a larger, partitioned dataset. |
| access_level | Union[AccessLevelEnum, int] | AccessLevelEnum.UNKNOWN | Access level associated with the source. Uses AccessLevelEnum. |
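The timestamp handling for date_created and last_modified can be approximated before a record is submitted. The sketch below is an assumption about how a caller might pre-check values, not the schema's own validation code; the function name is hypothetical. Note that `datetime.fromisoformat` accepts only a subset of ISO 8601 on Python versions before 3.11.

```python
from datetime import datetime


def check_source_timestamps(source_metadata: dict) -> dict:
    """Return a copy whose date_created and last_modified parse as ISO 8601.

    Missing fields are filled with the current time, mirroring the documented
    default. A supplied value that cannot be parsed raises ValueError.
    """
    checked = dict(source_metadata)
    for field in ("date_created", "last_modified"):
        if field not in checked:
            checked[field] = datetime.now().isoformat()
        else:
            # fromisoformat raises ValueError on strings it cannot parse.
            datetime.fromisoformat(checked[field])
    return checked


example = check_source_timestamps({"date_created": "2025-03-13T18:37:14.715892"})
```

Here `example` keeps the supplied date_created and gains a freshly generated last_modified; a value such as "13/03/2025" would raise `ValueError` instead.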
ContentMetadataSchema
General metadata about the extracted content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| type | ContentTypeEnum | Required | The type of the extracted content (e.g., TEXT, IMAGE, AUDIO). Uses ContentTypeEnum. |
| description | str | "" | A description of the extracted content. |
| page_number | int | -1 | Page number from which the content was extracted, if applicable (e.g., for PDFs). |
| hierarchy | ContentHierarchySchema | ContentHierarchySchema() | Hierarchical information about the content's location within the source. See ContentHierarchySchema. |
| subtype | Union[ContentTypeEnum, str] | "" | A more specific subtype for the content (e.g., if type is IMAGE, subtype could be diagram). |
| start_time | int | -1 | Start time in milliseconds for time-based media (e.g., audio, video). |
| end_time | int | -1 | End time in milliseconds for time-based media. |
ContentHierarchySchema
Describes the structural location of content within a document.
| Field | Type | Default Value | Description |
|---|---|---|---|
| page_count | int | -1 | Total number of pages in the document, if applicable. |
| page | int | -1 | The specific page number where the content resides. |
| block | int | -1 | Identifier for a block of content (e.g., paragraph, section). |
| line | int | -1 | Line number within a block, if applicable. |
| span | int | -1 | Span identifier within a line, for finer granularity. |
| nearby_objects | NearbyObjectsSchema | NearbyObjectsSchema() | Information about objects (text, images, structured data) near the current content. See NearbyObjectsSchema. |
NearbyObjectsSchema (Currently Unused)
Container for different types of nearby objects.
| Field | Type | Default Value | Description |
|---|---|---|---|
| text | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby textual objects. See NearbyObjectsSubSchema. |
| images | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby image objects. |
| structured | NearbyObjectsSubSchema | NearbyObjectsSubSchema() | Nearby structured data objects (e.g., tables). |
NearbyObjectsSubSchema
Describes a list of nearby objects of a specific type.
| Field | Type | Default Value | Description |
|---|---|---|---|
| content | List[str] | default_factory=list | List of content strings for the nearby objects. |
| bbox | List[tuple] | default_factory=list | List of bounding boxes (e.g., coordinates) for the nearby objects. |
| type | List[str] | default_factory=list | List of types for the nearby objects. |
TextMetadataSchema
Specific metadata for textual content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| text_type | TextTypeEnum | Required | Type of text (e.g., document, title, ocr). Uses TextTypeEnum. |
| summary | str | "" | A summary of this specific text segment. |
| keywords | Union[str, List[str], Dict] | "" | Keywords extracted from or associated with the text. Can be a single string, list of strings, or a dictionary. |
| language | LanguageEnum | "en" | Detected or specified language of the text. Uses LanguageEnum. Defaults to English. |
| text_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the text within its source (e.g., on a page). |
| text_location_max_dimensions | tuple | (0, 0, 0, 0) | Maximum dimensions of the space where text_location is defined (e.g., page width/height). |
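Because keywords can arrive as a string, a list of strings, or a dictionary, downstream code often benefits from a single shape. The following hypothetical helper is a sketch under one loud assumption: when keywords is a dictionary, its keys are taken to be the keywords. The source does not specify the dictionary layout, so adjust this to match your data.

```python
from typing import Dict, List, Union


def keywords_as_list(keywords: Union[str, List[str], Dict]) -> List[str]:
    """Normalize the keywords field to a list of strings."""
    if isinstance(keywords, str):
        # The documented default is an empty string, which means "no keywords".
        return [keywords] if keywords else []
    if isinstance(keywords, dict):
        # Assumption: dictionary keys are the keywords themselves.
        return [str(key) for key in keywords]
    return [str(item) for item in keywords]


from_default = keywords_as_list("")
from_string = keywords_as_list("gpu")
from_list = keywords_as_list(["gpu", "retrieval"])
from_dict = keywords_as_list({"gpu": 0.9, "retrieval": 0.7})
```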
ImageMetadataSchema
Specific metadata for image content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| image_type | Union[DocumentTypeEnum, str] | Required | Type of the image document (e.g., png, jpeg). Uses DocumentTypeEnum or a string. |
| structured_image_type | ContentTypeEnum | ContentTypeEnum.NONE | If the image represents structured data (e.g., a table or chart), its ContentTypeEnum. |
| caption | str | "" | Caption associated with the image. |
| text | str | "" | Text extracted from the image (e.g., via OCR). |
| image_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the image within its source. |
| image_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where image_location is defined. |
| uploaded_image_url | str | "" | URL of the image if it has been uploaded to a separate storage location. |
| width | int | 0 | Width of the image in pixels. Clamped to be non-negative. |
| height | int | 0 | Height of the image in pixels. Clamped to be non-negative. |
TableMetadataSchema
Specific metadata for tabular content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| caption | str | "" | Caption associated with the table. |
| table_format | TableFormatEnum | Required | Format of the table (e.g., csv, html). Uses TableFormatEnum. |
| table_content | str | "" | String representation of the table's content (e.g., CSV string, HTML markup). |
| table_content_format | Union[TableFormatEnum, str] | "" | Specific format of table_content. |
| table_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the table within its source. |
| table_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where table_location is defined. |
| uploaded_image_uri | str | "" | URI of an image representation of the table, if applicable. |
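One common use of table_location together with table_location_max_dimensions is to express the bounding box as fractions of the page, which makes boxes comparable across pages of different sizes. The sketch below assumes the box is ordered (x0, y0, x1, y1) and the max dimensions are (x_max, y_max); the source does not state the coordinate order, so treat that as an assumption. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def normalize_bbox(
    location: Sequence[float], max_dimensions: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Scale an (x0, y0, x1, y1) box into the 0..1 range using (x_max, y_max)."""
    x0, y0, x1, y1 = location
    x_max, y_max = max_dimensions
    if x_max <= 0 or y_max <= 0:
        # (0, 0) is the documented default, meaning the dimensions were never set.
        raise ValueError("max dimensions must be positive")
    return (x0 / x_max, y0 / y_max, x1 / x_max, y1 / y_max)


# Values in the same shape as table_location and table_location_max_dimensions.
relative_box = normalize_bbox([74, 614, 728, 920], [792, 1024])
```

With these inputs, every component of `relative_box` falls between 0 and 1.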
ChartMetadataSchema
Metadata for table content extracted from charts.
| Field | Type | Default Value | Description |
|---|---|---|---|
| caption | str | "" | Caption associated with the chart. |
| table_format | TableFormatEnum | Required | Underlying data format of the chart (e.g., data might be in csv format). Uses TableFormatEnum. |
| table_content | str | "" | String representation of the chart's underlying data. |
| table_content_format | Union[TableFormatEnum, str] | "" | Specific format of table_content. |
| table_location | tuple | (0, 0, 0, 0) | Bounding box or coordinates of the chart within its source. |
| table_location_max_dimensions | tuple | (0, 0) | Maximum dimensions of the space where table_location is defined. |
| uploaded_image_uri | str | "" | URI of an image representation of the chart, if applicable. |
AudioMetadataSchema
Specific metadata for audio content.
| Field | Type | Default Value | Description |
|---|---|---|---|
| audio_transcript | str | "" | Transcript of the audio content. |
| audio_type | str | "" | Type or format of the audio (e.g., mp3, wav). |
ErrorMetadataSchema (Currently Unused)
Metadata describing errors encountered during processing.
| Field | Type | Default Value | Description |
|---|---|---|---|
| task | TaskTypeEnum | Required | The task that was being performed when the error occurred. Uses TaskTypeEnum. |
| status | StatusEnum | Required | The status indicating failure. Uses StatusEnum. |
| source_id | str | "" | Identifier of the source item that caused the error, if applicable. |
| error_msg | str | Required | The error message. |
InfoMessageMetadataSchema (Currently Unused)
Informational messages related to processing.
| Field | Type | Default Value | Description |
|---|---|---|---|
| task | TaskTypeEnum | Required | The task associated with this informational message. Uses TaskTypeEnum. |
| status | StatusEnum | Required | The status related to this message (e.g., INFO, WARNING). Uses StatusEnum. |
| message | str | Required | The informational message content. |
| filter | bool | Required | A flag indicating if this message should be used for filtering purposes. |
Enums
The following enums are used by this schema:
- AccessLevelEnum – Defines access levels (e.g., PUBLIC, CONFIDENTIAL, UNKNOWN).
- ContentTypeEnum – Defines types of content (e.g., TEXT, IMAGE, AUDIO, STRUCTURED, NONE).
- TextTypeEnum – Defines types of text (e.g., DOCUMENT, TITLE, OCR, CAPTION).
- LanguageEnum – Defines languages (e.g., ENGLISH (en), SPANISH (es)).
- TableFormatEnum – Defines table formats (e.g., CSV, HTML, TEXT).
- StatusEnum – Defines processing statuses (e.g., SUCCESS, FAILURE, PROCESSING, INFO, WARNING).
- DocumentTypeEnum – Defines types of source documents (e.g., PDF, DOCX, TXT, URL, PNG, MP3).
- TaskTypeEnum – Defines types of processing tasks (e.g., EXTRACTION, EMBEDDING, STORAGE).
Example Metadata
The following is an example JSON representation of metadata. This is an example only, and does not contain the full metadata. For the full file, refer to the data folder.
{
    "document_type": "text",
    "metadata": {
        "content": "TestingDocument...",
        "content_url": "",
        "source_metadata": {
            "source_name": "data/multimodal_test.pdf",
            "source_id": "data/multimodal_test.pdf",
            "source_location": "",
            "source_type": "PDF",
            "collection_id": "",
            "date_created": "2025-03-13T18:37:14.715892",
            "last_modified": "2025-03-13T18:37:14.715534",
            "summary": "",
            "partition_id": -1,
            "access_level": 1
        },
        "content_metadata": {
            "type": "structured",
            "description": "Structured chart extracted from PDF document.",
            "page_number": 1,
            "hierarchy": {
                "page_count": 3,
                "page": 1,
                "block": -1,
                "line": -1,
                "span": -1,
                "nearby_objects": {
                    "text": {
                        "content": [],
                        "bbox": [],
                        "type": []
                    },
                    "images": {
                        "content": [],
                        "bbox": [],
                        "type": []
                    },
                    "structured": {
                        "content": [],
                        "bbox": [],
                        "type": []
                    }
                }
            },
            "subtype": "chart"
        },
        "audio_metadata": null,
        "text_metadata": null,
        "image_metadata": null,
        "table_metadata": {
            "caption": "",
            "table_format": "image",
            "table_content": "Below,is a high-quality picture of some shapes Picture",
            "table_content_format": "",
            "table_location": [74, 614, 728, 920],
            "table_location_max_dimensions": [792, 1024],
            "uploaded_image_uri": ""
        },
        "chart_metadata": null,
        "error_metadata": null,
        "info_message_metadata": null,
        "debug_metadata": null,
        "raise_on_failure": false
    }
}
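A record shaped like the example above can be inspected with the standard `json` module. The following sketch finds whichever type-specific block is populated; it uses a trimmed copy of the example, and the function and constant names are hypothetical rather than part of the library.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Type-specific blocks defined by MetadataSchema, in the order they are checked.
BLOCK_NAMES = (
    "audio_metadata",
    "text_metadata",
    "image_metadata",
    "table_metadata",
    "chart_metadata",
)

# Trimmed copy of the example record; most fields are omitted for brevity.
RECORD_JSON = """
{
    "document_type": "text",
    "metadata": {
        "content_metadata": {"type": "structured", "page_number": 1, "subtype": "chart"},
        "audio_metadata": null,
        "text_metadata": null,
        "image_metadata": null,
        "table_metadata": {
            "table_format": "image",
            "table_location": [74, 614, 728, 920],
            "table_location_max_dimensions": [792, 1024]
        },
        "chart_metadata": null
    }
}
"""


def populated_block(record: Dict[str, Any]) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (name, block) for the first non-null type-specific block, or None."""
    metadata = record["metadata"]
    for name in BLOCK_NAMES:
        block = metadata.get(name)
        if block is not None:
            return name, block
    return None


record = json.loads(RECORD_JSON)
found = populated_block(record)
```

For this record, `found` is a pair whose first element is `"table_metadata"`, matching the structured content type.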