nv_ingest_api.util.metadata package#

Submodules#

nv_ingest_api.util.metadata.aggregators module#

class nv_ingest_api.util.metadata.aggregators.Base64Image( image: str, bbox: Tuple[int, int, int, int], width: int, height: int, max_width: int, max_height: int, )[source]#

Bases: object

bbox: Tuple[int, int, int, int]#

height: int#

image: str#

max_height: int#

max_width: int#

width: int#

class nv_ingest_api.util.metadata.aggregators.CroppedImageWithContent( content: str, image: str, bbox: Tuple[int, int, int, int], max_width: int, max_height: int, type_string: str, content_format: str = '', )[source]#

Bases: object

bbox: Tuple[int, int, int, int]#

content: str#

content_format: str = ''#

image: str#

max_height: int#

max_width: int#

type_string: str#

class nv_ingest_api.util.metadata.aggregators.LatexTable( latex: pandas.core.frame.DataFrame, bbox: Tuple[int, int, int, int], max_width: int, max_height: int, )[source]#

Bases: object

bbox: Tuple[int, int, int, int]#

latex: DataFrame#

max_height: int#

max_width: int#

class nv_ingest_api.util.metadata.aggregators.PDFMetadata( page_count: int, filename: str, last_modified: str, date_created: str, keywords: List[str], source_type: str = 'PDF', )[source]#

Bases: object

A data object to store metadata information extracted from a PDF document.

date_created: str#

filename: str#

keywords: List[str]#

last_modified: str#

page_count: int#

source_type: str = 'PDF'#
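The field list above can be illustrated with a minimal sketch. `PDFMetadataSketch` is a hypothetical stand-in, not the library class: it assumes `PDFMetadata` behaves like a plain dataclass with the documented fields and the documented `'PDF'` default. The date strings and filename are placeholders, since the format the library actually emits is not stated here.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class PDFMetadataSketch:
    """Hypothetical stand-in mirroring the documented PDFMetadata fields."""

    page_count: int
    filename: str
    last_modified: str
    date_created: str
    keywords: List[str]
    source_type: str = "PDF"  # documented default


# Placeholder values; the real date format is an assumption.
meta = PDFMetadataSketch(
    page_count=12,
    filename="report.pdf",
    last_modified="2024-01-02T00:00:00",
    date_created="2024-01-01T00:00:00",
    keywords=["ingest", "metadata"],
)

# source_type falls back to its default when it is not passed explicitly.
record = asdict(meta)
print(record["source_type"])
```

To experiment with the real class, import `PDFMetadata` from `nv_ingest_api.util.metadata.aggregators` instead of defining the stand-in.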

nv_ingest_api.util.metadata.aggregators.construct_image_metadata_from_base64( base64_image: str, page_idx: int, page_count: int, source_metadata: Dict[str, Any], base_unified_metadata: Dict[str, Any], ) → List[Any][source]#

Extracts image data from a base64-encoded image string, decodes the image to get its dimensions and bounding box, and constructs metadata for the image.

Parameters:

base64_image (str) – A base64-encoded string representing the image.
page_idx (int) – The index of the current page being processed.
page_count (int) – The total number of pages in the PDF document.
source_metadata (Dict[str, Any]) – Metadata related to the source of the PDF document.
base_unified_metadata (Dict[str, Any]) – The base unified metadata structure to be updated with the extracted image information.

Returns:

A list containing the content type, validated metadata dictionary, and a UUID string.

Return type:

List[Any]

Raises:

ValueError – If the image cannot be decoded from the base64 string.
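The decode step described above can be sketched with the standard library alone. This is not the library's implementation: `png_dimensions_from_base64` and `make_png` are hypothetical helpers, the sketch handles PNG input only, and the bounding box layout `(0, 0, width, height)` is an assumption about what "dimensions and bounding box" means for a whole image.

```python
import base64
import binascii
import struct
import zlib
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions_from_base64(base64_image: str) -> Tuple[int, int]:
    """Return (width, height) of a base64-encoded PNG by reading its IHDR chunk."""
    try:
        raw = base64.b64decode(base64_image, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("could not decode base64 image data") from exc
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(raw) < 24 or raw[:8] != PNG_SIGNATURE or raw[12:16] != b"IHDR":
        raise ValueError("decoded data is not a PNG image")
    width, height = struct.unpack(">II", raw[16:24])
    return width, height


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type and data."""
    body = kind + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))


def make_png(width: int, height: int) -> bytes:
    """Build a small all-black 8-bit grayscale PNG for the demonstration."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    rows = b"".join(b"\x00" + b"\x00" * width for _ in range(height))
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )


encoded = base64.b64encode(make_png(4, 3)).decode("ascii")
width, height = png_dimensions_from_base64(encoded)
bbox = (0, 0, width, height)  # assumption: the box spans the full image
print(width, height, bbox)  # prints: 4 3 (0, 0, 4, 3)
```

As in the documented function, input that cannot be decoded surfaces as a `ValueError`, so callers can handle malformed base64 and non-image payloads in one place.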

nv_ingest_api.util.metadata.aggregators.construct_image_metadata_from_pdf_image( pdf_image: PdfImage, page_idx: int, page_count: int, source_metadata: Dict[str, Any], base_unified_metadata: Dict[str, Any], ) → List[Any][source]#

Extracts image data from a PdfImage object, converts it to a base64-encoded string, and constructs metadata for the image.

Parameters:

pdf_image (PdfImage) – The PdfImage object from which the image will be extracted.
page_idx (int) – The index of the current page being processed.
page_count (int) – The total number of pages in the PDF document.
source_metadata (Dict[str, Any]) – Metadata related to the source of the PDF document.
base_unified_metadata (Dict[str, Any]) – The base unified metadata structure to be updated with the extracted image information.

Returns:

A list containing the content type, validated metadata dictionary, and a UUID string.

Return type:

List[Any]

Raises:

PdfiumError – If the image cannot be extracted due to an issue with the PdfImage object.

nv_ingest_api.util.metadata.aggregators.construct_text_metadata( accumulated_text, keywords, page_idx, block_idx, line_idx, span_idx, page_count, text_depth, source_metadata, base_unified_metadata, delimiter=' ', bbox_max_dimensions: Tuple[int, int] = (-1, -1), nearby_objects: Dict[str, Any] | None = None, )[source]#

nv_ingest_api.util.metadata.aggregators.extract_pdf_metadata( doc: PdfDocument, source_id: str, ) → PDFMetadata[source]#

Extracts metadata and relevant information from a PDF document.

Parameters:

doc (PdfDocument) – The PDF document from which metadata is extracted.
source_id (str) – The identifier for the source document, typically the filename.

Returns:

An object containing the extracted metadata:

page_count – The total number of pages in the PDF.
filename – The source filename or identifier.
last_modified – The last modified date of the PDF document.
date_created – The creation date of the PDF document.
keywords – Keywords associated with the PDF document.
source_type – The type/format of the source, e.g., 'PDF'.

Return type:

PDFMetadata

Raises:

PdfiumError – If there is an issue processing the PDF document.

Module contents#