nv_ingest_api.util.metadata package
Submodules
nv_ingest_api.util.metadata.aggregators module
- class nv_ingest_api.util.metadata.aggregators.Base64Image(
- image: str,
- bbox: Tuple[int, int, int, int],
- width: int,
- height: int,
- max_width: int,
- max_height: int,
Bases:
object
- bbox: Tuple[int, int, int, int]
- height: int
- image: str
- max_height: int
- max_width: int
- width: int
- class nv_ingest_api.util.metadata.aggregators.CroppedImageWithContent(
- content: str,
- image: str,
- bbox: Tuple[int, int, int, int],
- max_width: int,
- max_height: int,
- type_string: str,
- content_format: str = '',
Bases:
object
- bbox: Tuple[int, int, int, int]
- content: str
- content_format: str = ''
- image: str
- max_height: int
- max_width: int
- type_string: str
- class nv_ingest_api.util.metadata.aggregators.LatexTable(
- latex: pandas.core.frame.DataFrame,
- bbox: Tuple[int, int, int, int],
- max_width: int,
- max_height: int,
Bases:
object
- bbox: Tuple[int, int, int, int]
- latex: DataFrame
- max_height: int
- max_width: int
- class nv_ingest_api.util.metadata.aggregators.PDFMetadata(
- page_count: int,
- filename: str,
- last_modified: str,
- date_created: str,
- keywords: List[str],
- source_type: str = 'PDF',
Bases:
object
A data object to store metadata information extracted from a PDF document.
- date_created: str
- filename: str
- keywords: List[str]
- last_modified: str
- page_count: int
- source_type: str = 'PDF'
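As an illustration of the field layout documented above, the following sketch defines a hypothetical stand-in class, `PDFMetadataSketch`, that mirrors the listed fields and the `'PDF'` default. It assumes the real `PDFMetadata` behaves like a plain data container; the sample values, including the timestamp format, are invented for the example and are not taken from the library.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class PDFMetadataSketch:
    """Hypothetical stand-in that mirrors the documented PDFMetadata fields."""

    page_count: int
    filename: str
    last_modified: str
    date_created: str
    keywords: List[str]
    source_type: str = "PDF"  # same default as the documented signature


# Sample values are assumptions made up for this example.
meta = PDFMetadataSketch(
    page_count=12,
    filename="report.pdf",
    last_modified="2024-01-15T10:30:00",
    date_created="2023-12-01T09:00:00",
    keywords=["finance", "quarterly"],
)

# Convert to a plain dictionary, e.g. for logging or serialization.
meta_dict = asdict(meta)
```

Because `source_type` has a default, it can be omitted at construction time, as in the sketch.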
- nv_ingest_api.util.metadata.aggregators.construct_image_metadata_from_base64(
- base64_image: str,
- page_idx: int,
- page_count: int,
- source_metadata: Dict[str, Any],
- base_unified_metadata: Dict[str, Any],
Extracts image data from a base64-encoded image string, decodes the image to get its dimensions and bounding box, and constructs metadata for the image.
- Parameters:
base64_image (str) – A base64-encoded string representing the image.
page_idx (int) – The index of the current page being processed.
page_count (int) – The total number of pages in the PDF document.
source_metadata (Dict[str, Any]) – Metadata related to the source of the PDF document.
base_unified_metadata (Dict[str, Any]) – The base unified metadata structure to be updated with the extracted image information.
- Returns:
A list containing the content type, validated metadata dictionary, and a UUID string.
- Return type:
List[Any]
- Raises:
ValueError – If the image cannot be decoded from the base64 string.
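The "decode the image to get its dimensions and bounding box" step can be sketched with the standard library alone. This is not the library's implementation: the helper names (`make_png_base64`, `png_dimensions_and_bbox`) are hypothetical, the sketch only handles PNG input, and it assumes the bounding box spans the whole image as `(0, 0, width, height)`.

```python
import base64
import binascii
import struct
import zlib
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_png_base64(width: int, height: int) -> str:
    """Create a small all-black 8-bit grayscale PNG and base64-encode it."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is one filter byte (0) followed by `width` pixel bytes.
    raw = b"".join(b"\x00" + b"\x00" * width for _ in range(height))
    png = (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )
    return base64.b64encode(png).decode("ascii")


def png_dimensions_and_bbox(
    base64_image: str,
) -> Tuple[int, int, Tuple[int, int, int, int]]:
    """Return (width, height, bbox) for a base64-encoded PNG.

    Raises ValueError if the string is not valid base64 or not a PNG.
    """
    try:
        data = base64.b64decode(base64_image, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("Could not decode image from base64 string") from exc
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("Decoded bytes are not a PNG image")
    # Width and height are the first two big-endian uint32 fields of IHDR.
    width, height = struct.unpack(">II", data[16:24])
    # Assumption: the bounding box covers the full image.
    return width, height, (0, 0, width, height)


encoded = make_png_base64(4, 3)
width, height, bbox = png_dimensions_and_bbox(encoded)
```

A general-purpose version would typically rely on an imaging library to support formats other than PNG; the header parsing here only keeps the example dependency-free.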
- nv_ingest_api.util.metadata.aggregators.construct_image_metadata_from_pdf_image(
- pdf_image: PdfImage,
- page_idx: int,
- page_count: int,
- source_metadata: Dict[str, Any],
- base_unified_metadata: Dict[str, Any],
Extracts image data from a PdfImage object, converts it to a base64-encoded string, and constructs metadata for the image.
- Parameters:
pdf_image (PdfImage) – The PdfImage object from which the image will be extracted.
page_idx (int) – The index of the current page being processed.
page_count (int) – The total number of pages in the PDF document.
source_metadata (Dict[str, Any]) – Metadata related to the source of the PDF document.
base_unified_metadata (Dict[str, Any]) – The base unified metadata structure to be updated with the extracted image information.
- Returns:
A list containing the content type, validated metadata dictionary, and a UUID string.
- Return type:
List[Any]
- Raises:
PdfiumError – If the image cannot be extracted due to an issue with the PdfImage object.
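Both image helpers are documented as returning a three-element list: the content type, the validated metadata dictionary, and a UUID string. The sketch below shows one way a caller might unpack that shape. `fake_image_entry` is a hypothetical stand-in, and the `"image"` content type value and the metadata keys are assumptions made for illustration, not the library's actual output.

```python
import uuid
from typing import Any, Dict, List


def fake_image_entry() -> List[Any]:
    """Hypothetical stand-in producing the documented three-element shape."""
    # The keys below are invented placeholders, not the real metadata schema.
    metadata: Dict[str, Any] = {
        "content": "<base64 image data>",
        "source_metadata": {"source_name": "report.pdf"},
    }
    return ["image", metadata, str(uuid.uuid4())]


# Unpack in the documented order: content type, metadata, UUID string.
content_type, metadata, entry_id = fake_image_entry()

# The third element is a string, so parse it when a UUID object is needed.
parsed_id = uuid.UUID(entry_id)
```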
- nv_ingest_api.util.metadata.aggregators.construct_text_metadata(
- accumulated_text,
- keywords,
- page_idx,
- block_idx,
- line_idx,
- span_idx,
- page_count,
- text_depth,
- source_metadata,
- base_unified_metadata,
- delimiter=' ',
- bbox_max_dimensions: Tuple[int, int] = (-1, -1),
- nearby_objects: Dict[str, Any] | None = None,
- nv_ingest_api.util.metadata.aggregators.extract_pdf_metadata(
- doc: PdfDocument,
- source_id: str,
Extracts metadata and relevant information from a PDF document.
- Parameters:
doc (PdfDocument) – The PDF document object from which metadata is extracted.
source_id (str) – The identifier for the source document, typically the filename.
- Returns:
An object containing extracted metadata and information, including:
  - page_count: The total number of pages in the PDF.
  - filename: The source filename or identifier.
  - last_modified: The last modified date of the PDF document.
  - date_created: The creation date of the PDF document.
  - keywords: Keywords associated with the PDF document.
  - source_type: The type/format of the source, e.g., “PDF”.
- Return type:
PDFMetadata
- Raises:
PdfiumError – If there is an issue processing the PDF document.