nv_ingest_api.util.metadata package#

Submodules#

nv_ingest_api.util.metadata.aggregators module#

class nv_ingest_api.util.metadata.aggregators.Base64Image(
image: str,
bbox: Tuple[int, int, int, int],
width: int,
height: int,
max_width: int,
max_height: int,
)[source]#

Bases: object

bbox: Tuple[int, int, int, int]#
height: int#
image: str#
max_height: int#
max_width: int#
width: int#

class nv_ingest_api.util.metadata.aggregators.CroppedImageWithContent(
content: str,
image: str,
bbox: Tuple[int, int, int, int],
max_width: int,
max_height: int,
type_string: str,
content_format: str = '',
)[source]#

Bases: object

bbox: Tuple[int, int, int, int]#
content: str#
content_format: str = ''#
image: str#
max_height: int#
max_width: int#
type_string: str#

class nv_ingest_api.util.metadata.aggregators.LatexTable(
latex: pandas.core.frame.DataFrame,
bbox: Tuple[int, int, int, int],
max_width: int,
max_height: int,
)[source]#

Bases: object

bbox: Tuple[int, int, int, int]#
latex: DataFrame#
max_height: int#
max_width: int#

class nv_ingest_api.util.metadata.aggregators.PDFMetadata(
page_count: int,
filename: str,
last_modified: str,
date_created: str,
keywords: List[str],
source_type: str = 'PDF',
)[source]#

Bases: object

A data object to store metadata information extracted from a PDF document.

date_created: str#
filename: str#
keywords: List[str]#
last_modified: str#
page_count: int#
source_type: str = 'PDF'#
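
The field listing above can be mirrored in a few lines. This is a minimal sketch, assuming PDFMetadata behaves like a plain dataclass (the reference only lists typed fields and one default, so that is an assumption). PDFMetadataSketch and the sample values are hypothetical stand-ins, not the library class.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class PDFMetadataSketch:
    """Hypothetical stand-in that mirrors the documented PDFMetadata fields."""

    page_count: int
    filename: str
    last_modified: str
    date_created: str
    keywords: List[str]
    source_type: str = "PDF"  # documented default


# Sample values are invented for illustration only.
meta = PDFMetadataSketch(
    page_count=3,
    filename="report.pdf",
    last_modified="2024-01-02T00:00:00",
    date_created="2024-01-01T00:00:00",
    keywords=["ingest", "pdf"],
)

# A dataclass converts cleanly to a dict, e.g. for merging into other metadata.
meta_dict = asdict(meta)
```

Because source_type has a default, callers only need to pass it when the source is not a PDF.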
nv_ingest_api.util.metadata.aggregators.construct_image_metadata_from_base64(
base64_image: str,
page_idx: int,
page_count: int,
source_metadata: Dict[str, Any],
base_unified_metadata: Dict[str, Any],
) → List[Any][source]#

Extracts image data from a base64-encoded image string, decodes the image to get its dimensions and bounding box, and constructs metadata for the image.

Parameters:
  • base64_image (str) – A base64-encoded string representing the image.

  • page_idx (int) – The index of the current page being processed.

  • page_count (int) – The total number of pages in the PDF document.

  • source_metadata (Dict[str, Any]) – Metadata related to the source of the PDF document.

  • base_unified_metadata (Dict[str, Any]) – The base unified metadata structure to be updated with the extracted image information.

Returns:

A list containing the content type, validated metadata dictionary, and a UUID string.

Return type:

List[Any]

Raises:

ValueError – If the image cannot be decoded from the base64 string.
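
The decode step described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: it handles PNG input only, reads the size straight from the PNG header instead of using an imaging library, and treats the bounding box as the full image extent. The helper names (png_size_from_base64, make_gray_png) are hypothetical.

```python
import base64
import binascii
import struct
import zlib
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size_from_base64(base64_image: str) -> Tuple[int, int]:
    """Return (width, height) of a base64-encoded PNG, or raise ValueError."""
    try:
        raw = base64.b64decode(base64_image, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("input is not valid base64") from exc
    # A PNG starts with an 8-byte signature followed by the IHDR chunk:
    # 4-byte length, 4-byte type, then width and height as big-endian uint32.
    if len(raw) < 24 or raw[:8] != PNG_SIGNATURE or raw[12:16] != b"IHDR":
        raise ValueError("decoded bytes are not a PNG image")
    width, height = struct.unpack(">II", raw[16:24])
    return width, height


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_gray_png(width: int, height: int) -> bytes:
    """Create a tiny all-black 8-bit grayscale PNG for the demo."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0) followed by one byte per pixel.
    rows = b"".join(b"\x00" + b"\x00" * width for _ in range(height))
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )


encoded = base64.b64encode(make_gray_png(4, 3)).decode("ascii")
width, height = png_size_from_base64(encoded)
bbox = (0, 0, width, height)  # assumption: the box spans the whole image
```

Input that is not valid base64, or that decodes to something other than a PNG, raises ValueError in this sketch, matching the error type documented above.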

nv_ingest_api.util.metadata.aggregators.construct_image_metadata_from_pdf_image(
pdf_image: PdfImage,
page_idx: int,
page_count: int,
source_metadata: Dict[str, Any],
base_unified_metadata: Dict[str, Any],
) → List[Any][source]#

Extracts image data from a PdfImage object, converts it to a base64-encoded string, and constructs metadata for the image.

Parameters:
  • pdf_image (PdfImage) – The PdfImage object from which the image will be extracted.

  • page_idx (int) – The index of the current page being processed.

  • page_count (int) – The total number of pages in the PDF document.

  • source_metadata (Dict[str, Any]) – Metadata related to the source of the PDF document.

  • base_unified_metadata (Dict[str, Any]) – The base unified metadata structure to be updated with the extracted image information.

Returns:

A list containing the content type, validated metadata dictionary, and a UUID string.

Return type:

List[Any]

Raises:

PdfiumError – If the image cannot be extracted due to an issue with the PdfImage object.
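
Both image constructors document the same three-element return value: content type, metadata dictionary, UUID string. A caller might unpack and sanity-check it as sketched below. The helper name unpack_extraction_entry, the "image" label, and the sample dictionary keys are hypothetical; the real content type may be an enum rather than a string.

```python
import uuid
from typing import Any, Dict, List, Tuple


def unpack_extraction_entry(entry: List[Any]) -> Tuple[Any, Dict[str, Any], str]:
    """Split a [content_type, metadata, uuid] list and check its shape."""
    if len(entry) != 3:
        raise ValueError(f"expected 3 elements, got {len(entry)}")
    content_type, metadata, entry_id = entry
    if not isinstance(metadata, dict):
        raise TypeError("second element should be the metadata dictionary")
    # uuid.UUID raises ValueError when the string is not a well-formed UUID.
    uuid.UUID(str(entry_id))
    return content_type, metadata, str(entry_id)


# Hypothetical entry shaped like the documented return value.
sample_entry = ["image", {"page": 0}, str(uuid.uuid4())]
content_type, metadata, entry_id = unpack_extraction_entry(sample_entry)
```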

nv_ingest_api.util.metadata.aggregators.construct_text_metadata(
accumulated_text,
keywords,
page_idx,
block_idx,
line_idx,
span_idx,
page_count,
text_depth,
source_metadata,
base_unified_metadata,
delimiter=' ',
bbox_max_dimensions: Tuple[int, int] = (-1, -1),
nearby_objects: Dict[str, Any] | None = None,
)[source]#

nv_ingest_api.util.metadata.aggregators.extract_pdf_metadata(
doc: PdfDocument,
source_id: str,
) → PDFMetadata[source]#

Extracts metadata and relevant information from a PDF document.

Parameters:
  • doc (PdfDocument) – The loaded PDF document from which metadata is extracted.

  • source_id (str) – The identifier for the source document, typically the filename.

Returns:

An object containing the extracted metadata and information, including:

  • page_count: The total number of pages in the PDF.

  • filename: The source filename or identifier.

  • last_modified: The last modified date of the PDF document.

  • date_created: The creation date of the PDF document.

  • keywords: Keywords associated with the PDF document.

  • source_type: The type/format of the source, e.g., “PDF”.

Return type:

PDFMetadata

Raises:

PdfiumError – If there is an issue processing the PDF document.
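
How the documented fields could be filled from a document object is sketched below. This reference does not show the function body, so everything here is an assumption: the sketch relies on an interface modeled on pypdfium2's PdfDocument (len(doc) for the page count, get_metadata_dict() for the info dictionary), splits keywords on commas, and uses hypothetical stand-ins (FakeDocument, PDFMetadataSketch, extract_pdf_metadata_sketch) so that it runs without a PDF file.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PDFMetadataSketch:
    """Hypothetical stand-in mirroring the documented PDFMetadata fields."""

    page_count: int
    filename: str
    last_modified: str
    date_created: str
    keywords: List[str]
    source_type: str = "PDF"


class FakeDocument:
    """Hypothetical document exposing only what this sketch needs."""

    def __init__(self, pages: int, info: Dict[str, str]) -> None:
        self._pages = pages
        self._info = info

    def __len__(self) -> int:
        return self._pages

    def get_metadata_dict(self) -> Dict[str, str]:
        return dict(self._info)


def extract_pdf_metadata_sketch(doc, source_id: str) -> PDFMetadataSketch:
    """Collect page count and info-dictionary values into one object."""
    info = doc.get_metadata_dict()
    # Assumption: keywords arrive as one comma-separated string.
    keywords = [k.strip() for k in info.get("Keywords", "").split(",") if k.strip()]
    return PDFMetadataSketch(
        page_count=len(doc),
        filename=source_id,
        last_modified=info.get("ModDate", ""),
        date_created=info.get("CreationDate", ""),
        keywords=keywords,
    )


fake_doc = FakeDocument(
    pages=2,
    info={
        "Keywords": "ingest, pdf, ",
        "ModDate": "D:20240102000000",
        "CreationDate": "D:20240101000000",
    },
)
result = extract_pdf_metadata_sketch(fake_doc, "report.pdf")
```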

Module contents#