nv_ingest_api.util.metadata package
Submodules
nv_ingest_api.util.metadata.aggregators module
- class nv_ingest_api.util.metadata.aggregators.Base64Image(
- image: str,
- bbox: Tuple[int, int, int, int],
- width: int,
- height: int,
- max_width: int,
- max_height: int,
- )
- Bases: object
 - bbox: Tuple[int, int, int, int]
 - height: int
 - image: str
 - max_height: int
 - max_width: int
 - width: int
 
- class nv_ingest_api.util.metadata.aggregators.CroppedImageWithContent(
- content: str,
- image: str,
- bbox: Tuple[int, int, int, int],
- max_width: int,
- max_height: int,
- type_string: str,
- content_format: str = '',
- )
- Bases: object
 - bbox: Tuple[int, int, int, int]
 - content: str
 - content_format: str = ''
 - image: str
 - max_height: int
 - max_width: int
 - type_string: str
 
- class nv_ingest_api.util.metadata.aggregators.LatexTable(
- latex: pandas.core.frame.DataFrame,
- bbox: Tuple[int, int, int, int],
- max_width: int,
- max_height: int,
- )
- Bases: object
 - bbox: Tuple[int, int, int, int]
 - latex: DataFrame
 - max_height: int
 - max_width: int
 
- class nv_ingest_api.util.metadata.aggregators.PDFMetadata(
- page_count: int,
- filename: str,
- last_modified: str,
- date_created: str,
- keywords: List[str],
- source_type: str = 'PDF',
- )
- Bases: object
- A data object to store metadata information extracted from a PDF document.
 - date_created: str
 - filename: str
 - keywords: List[str]
 - last_modified: str
 - page_count: int
 - source_type: str = 'PDF'
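
The fields listed above can be pictured with a small stand-in container. This is a minimal sketch, not the library class: `PDFMetadataSketch` is a hypothetical name, and it assumes `PDFMetadata` behaves like a plain data object whose only default is `source_type = 'PDF'`, as the signature suggests. The date strings are illustrative values, not a documented format.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class PDFMetadataSketch:
    """Hypothetical stand-in that mirrors the documented PDFMetadata fields."""

    page_count: int
    filename: str
    last_modified: str
    date_created: str
    keywords: List[str]
    source_type: str = "PDF"  # same default as in the documented signature


# Build an instance without passing source_type, so the default applies.
meta = PDFMetadataSketch(
    page_count=3,
    filename="report.pdf",
    last_modified="2024-01-02T00:00:00",  # illustrative value only
    date_created="2024-01-01T00:00:00",  # illustrative value only
    keywords=["ingest", "example"],
)

# Convert to a plain dict, e.g. for merging into a larger metadata structure.
meta_dict = asdict(meta)
```

Keeping the fields typed and flat makes the object easy to serialize alongside per-element metadata.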
 
- nv_ingest_api.util.metadata.aggregators.construct_image_metadata_from_base64(
- base64_image: str,
- page_idx: int,
- page_count: int,
- source_metadata: Dict[str, Any],
- base_unified_metadata: Dict[str, Any],
- )
- Extracts image data from a base64-encoded image string, decodes the image to get its dimensions and bounding box, and constructs metadata for the image.
- Parameters:
- base64_image (str) – A base64-encoded string representing the image. 
- page_idx (int) – The index of the current page being processed. 
- page_count (int) – The total number of pages in the PDF document. 
- source_metadata (Dict[str, Any]) – Metadata related to the source of the PDF document. 
- base_unified_metadata (Dict[str, Any]) – The base unified metadata structure to be updated with the extracted image information. 
 
- Returns:
- A list containing the content type, validated metadata dictionary, and a UUID string. 
- Return type:
- List[Any] 
- Raises:
- ValueError – If the image cannot be decoded from the base64 string. 
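
The decode-then-describe flow documented above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `png_dimensions`, `sketch_image_metadata`, `tiny_png`, and every metadata key name are hypothetical, the sketch handles PNG input only, and it assumes the bounding box spans the full image.

```python
import base64
import struct
import uuid
import zlib
from typing import Any, Dict, List, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(base64_image: str) -> Tuple[int, int]:
    """Decode a base64 PNG and read width and height from its IHDR chunk."""
    try:
        raw = base64.b64decode(base64_image, validate=True)
    except ValueError as exc:  # binascii.Error is a ValueError subclass
        raise ValueError("input is not valid base64") from exc
    # Layout: 8-byte signature, 4-byte length, b"IHDR", then width and height.
    if len(raw) < 24 or raw[:8] != PNG_SIGNATURE or raw[12:16] != b"IHDR":
        raise ValueError("could not decode a PNG image from the base64 string")
    width, height = struct.unpack(">II", raw[16:24])
    return width, height


def sketch_image_metadata(
    base64_image: str,
    page_idx: int,
    page_count: int,
    source_metadata: Dict[str, Any],
    base_unified_metadata: Dict[str, Any],
) -> List[Any]:
    """Return [content type, metadata dict, UUID string]; keys are hypothetical."""
    width, height = png_dimensions(base64_image)
    metadata = dict(base_unified_metadata)  # shallow copy, input stays untouched
    metadata.update(
        {
            "content": base64_image,
            "source_metadata": source_metadata,
            "page": {"index": page_idx, "count": page_count},
            "image": {"width": width, "height": height, "bbox": (0, 0, width, height)},
        }
    )
    return ["image", metadata, str(uuid.uuid4())]


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png(width: int, height: int) -> bytes:
    """Build a small all-black 8-bit grayscale PNG for the demo."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    rows = b"".join(b"\x00" + b"\x00" * width for _ in range(height))
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )


encoded = base64.b64encode(tiny_png(4, 2)).decode("ascii")
content_type, metadata, element_id = sketch_image_metadata(
    encoded, page_idx=0, page_count=1,
    source_metadata={"source_id": "demo.pdf"}, base_unified_metadata={},
)
```

Raising `ValueError` for undecodable input mirrors the documented contract; a production version would typically rely on an image library rather than parsing the PNG header by hand.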
 
- nv_ingest_api.util.metadata.aggregators.construct_image_metadata_from_pdf_image(
- pdf_image: PdfImage,
- page_idx: int,
- page_count: int,
- source_metadata: Dict[str, Any],
- base_unified_metadata: Dict[str, Any],
- )
- Extracts image data from a PdfImage object, converts it to a base64-encoded string, and constructs metadata for the image.
- Parameters:
- pdf_image (PdfImage) – The PdfImage object from which the image will be extracted. 
- page_idx (int) – The index of the current page being processed. 
- page_count (int) – The total number of pages in the PDF document. 
- source_metadata (dict) – Metadata related to the source of the PDF document. 
- base_unified_metadata (dict) – The base unified metadata structure to be updated with the extracted image information. 
 
- Returns:
- A list containing the content type, validated metadata dictionary, and a UUID string. 
- Return type:
- List[Any] 
- Raises:
- PdfiumError – If the image cannot be extracted due to an issue with the PdfImage object.
 
- nv_ingest_api.util.metadata.aggregators.construct_text_metadata(
- accumulated_text,
- keywords,
- page_idx,
- block_idx,
- line_idx,
- span_idx,
- page_count,
- text_depth,
- source_metadata,
- base_unified_metadata,
- delimiter=' ',
- bbox_max_dimensions: Tuple[int, int] = (-1, -1),
- nearby_objects: Dict[str, Any] | None = None,
- )
 
- nv_ingest_api.util.metadata.aggregators.extract_pdf_metadata(
- doc: PdfDocument,
- source_id: str,
- )
- Extracts metadata and relevant information from a PDF document.
- Parameters:
- doc (PdfDocument) – The PDF document from which metadata is extracted. 
- source_id (str) – The identifier for the source document, typically the filename. 
 
- Returns:
- An object containing extracted metadata and information, including:
 - page_count: The total number of pages in the PDF.
 - filename: The source filename or identifier.
 - last_modified: The last modified date of the PDF document.
 - date_created: The creation date of the PDF document.
 - keywords: Keywords associated with the PDF document.
 - source_type: The type/format of the source, e.g., “PDF”.
- Return type:
- PDFMetadata
- Raises:
- PdfiumError – If there is an issue processing the PDF document.
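
How a document object might be reduced to the fields listed above can be sketched as follows. This is not the library's code: `sketch_extract_pdf_metadata`, `PDFMetadataSketch`, and `FakeDoc` are hypothetical names. The sketch assumes `doc` supports `len()` for the page count and a `get_metadata_dict()` method returning standard PDF info keys such as `ModDate`, `CreationDate`, and `Keywords`, which is how pypdfium2's `PdfDocument` is understood to behave. Splitting keywords on commas and semicolons is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PDFMetadataSketch:
    """Hypothetical stand-in that mirrors the documented PDFMetadata fields."""

    page_count: int
    filename: str
    last_modified: str
    date_created: str
    keywords: List[str]
    source_type: str = "PDF"


def sketch_extract_pdf_metadata(doc, source_id: str) -> PDFMetadataSketch:
    """Collect document-level metadata from a PdfDocument-like object."""
    info: Dict[str, str] = doc.get_metadata_dict()  # assumed method, see lead-in
    raw_keywords = info.get("Keywords", "")
    # Assumption: keywords arrive as one string separated by commas or semicolons.
    keywords = [k.strip() for k in raw_keywords.replace(";", ",").split(",") if k.strip()]
    return PDFMetadataSketch(
        page_count=len(doc),
        filename=source_id,
        last_modified=info.get("ModDate", ""),
        date_created=info.get("CreationDate", ""),
        keywords=keywords,
    )


class FakeDoc:
    """Tiny test double so the sketch runs without a PDF library installed."""

    def __init__(self, pages: int, info: Dict[str, str]) -> None:
        self._pages = pages
        self._info = info

    def __len__(self) -> int:
        return self._pages

    def get_metadata_dict(self) -> Dict[str, str]:
        return dict(self._info)


doc = FakeDoc(
    pages=2,
    info={
        "Keywords": "ingest; pdf",
        "ModDate": "D:20240102000000",
        "CreationDate": "D:20240101000000",
    },
)
result = sketch_extract_pdf_metadata(doc, source_id="demo.pdf")
```

Because the function only relies on duck typing, the same code path can be exercised with a real document object or with a double like `FakeDoc` in unit tests.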