nv_ingest_api.internal.extract.pdf package

Subpackages

Submodules

nv_ingest_api.internal.extract.pdf.pdf_extractor module

nv_ingest_api.internal.extract.pdf.pdf_extractor.extract_primitives_from_pdf_internal(
df_extraction_ledger: DataFrame,
task_config: Dict[str, Any],
extractor_config: Any,
execution_trace_log: List[Any] | None = None,
) → Tuple[DataFrame, Dict]

Process a DataFrame of PDF documents by orchestrating extraction for each row.

This function applies the row-level orchestration function to every row in the DataFrame, aggregates the results, and returns a new DataFrame with the extracted data along with any trace information.
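The apply-then-aggregate pattern described above might look roughly like the following. This is a minimal sketch, not the library's implementation: `orchestrate_row` and `process_ledger` are hypothetical stand-ins, the per-row result shape is assumed from the documented output columns, and the `trace_info` key in the returned dictionary is an illustrative assumption.

```python
from typing import Any, Dict, List, Optional, Tuple
import uuid

import pandas as pd


def orchestrate_row(
    row: pd.Series, task_config: Dict[str, Any]
) -> List[Tuple[str, Dict[str, Any], str]]:
    """Hypothetical stand-in for the row-level orchestration function.

    Returns zero or more (document_type, metadata, uuid) tuples per row.
    """
    metadata = {"content_length": len(row["content"])}
    return [("text", metadata, str(uuid.uuid4()))]


def process_ledger(
    df: pd.DataFrame,
    task_config: Dict[str, Any],
    execution_trace_log: Optional[List[Any]] = None,
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Run the row-level function on every row and aggregate the results."""
    extracted: List[Tuple[str, Dict[str, Any], str]] = []
    for _, row in df.iterrows():
        # Any exception raised here propagates and aborts the whole batch.
        extracted.extend(orchestrate_row(row, task_config))

    result = pd.DataFrame(extracted, columns=["document_type", "metadata", "uuid"])
    # The "trace_info" key is an assumption made for this sketch.
    return result, {"trace_info": execution_trace_log or []}
```

Collecting the per-row results in a flat list before building the DataFrame lets a single input row contribute any number of output rows.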

Parameters:
  • df_extraction_ledger (pd.DataFrame) – A pandas DataFrame containing PDF documents. Must include a ‘content’ column with base64-encoded PDF data.

  • task_config (dict) – A dictionary of configuration parameters. Expected to include ‘task_properties’ and ‘validated_config’ keys.

  • extractor_config (Any) – A dictionary of configuration parameters for the extraction process (annotated as Any in the signature).

  • execution_trace_log (list, optional) – A list for accumulating trace information during extraction. Defaults to None.
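As an illustration of the inputs listed above, the sketch below prepares ledger records with a base64-encoded `content` field and a `task_config` carrying the two expected keys. The helper `make_ledger_records` is hypothetical, the byte string is a placeholder rather than a valid PDF, and the empty dictionaries stand in for configuration values this page does not specify.

```python
import base64
from typing import Any, Dict, List


def make_ledger_records(pdf_blobs: List[bytes]) -> List[Dict[str, Any]]:
    """Hypothetical helper: one record per document, with base64 text content."""
    return [
        {"content": base64.b64encode(blob).decode("ascii")}
        for blob in pdf_blobs
    ]


# Placeholder bytes for illustration only; real input would be PDF file bytes.
records = make_ledger_records([b"%PDF-1.7 placeholder"])

# Only the key names come from the documentation; the values are assumptions.
task_config: Dict[str, Any] = {
    "task_properties": {},
    "validated_config": {},
}
```

Passing `records` to `pandas.DataFrame(records)` would give a DataFrame with the required `content` column.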

Returns:

A tuple where the first element is a DataFrame with the extracted data (with columns: document_type, metadata, uuid) and the second element is a dictionary containing trace information.

Return type:

tuple of (pd.DataFrame, dict)

Raises:

Exception – If an error occurs during the extraction process on any row.
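Because a failure on any single row raises for the whole call, callers may want to guard the invocation. The wrapper below is one possible sketch: `run_extraction_safely` is a hypothetical helper that accepts any callable with the documented return shape, so it makes no assumptions about the library beyond what this page states.

```python
import logging
from typing import Any, Callable, Optional, Tuple

logger = logging.getLogger(__name__)


def run_extraction_safely(
    extract_fn: Callable[..., Tuple[Any, dict]], *args: Any, **kwargs: Any
) -> Optional[Tuple[Any, dict]]:
    """Call extract_fn and return its result, or None if it raises."""
    try:
        return extract_fn(*args, **kwargs)
    except Exception:
        # logger.exception records the traceback along with the message.
        logger.exception("PDF extraction failed for the batch")
        return None
```

Returning `None` is a design choice for this sketch; re-raising after logging would be equally reasonable when the caller cannot proceed without results.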

Module contents