nv_ingest_api.internal.extract.pdf package

Subpackages

nv_ingest_api.internal.extract.pdf.engines package

Submodules

nv_ingest_api.internal.extract.pdf.pdf_extractor module

nv_ingest_api.internal.extract.pdf.pdf_extractor.extract_primitives_from_pdf_internal(df_extraction_ledger: DataFrame, task_config: Dict[str, Any], extractor_config: Any, execution_trace_log: List[Any] | None = None) → Tuple[DataFrame, Dict]

Process a DataFrame of PDF documents by orchestrating extraction for each row.

This function applies the row-level orchestration function to every row in the DataFrame, aggregates the results, and returns a new DataFrame with the extracted data along with any trace information.

Parameters:

df_extraction_ledger (pd.DataFrame) – A pandas DataFrame containing PDF documents. Must include a ‘content’ column with base64-encoded PDF data.
task_config (dict) – A dictionary of configuration parameters. Expected to include ‘task_properties’ and ‘validated_config’ keys.
extractor_config (Any) – A dictionary of configuration parameters for the extraction process.
execution_trace_log (list, optional) – A list for accumulating trace information during extraction. Defaults to None.
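
The inputs described above could be prepared roughly as follows. This is a minimal sketch using only the standard library: `build_ledger_records` is a hypothetical helper, the PDF bytes are a stand-in, and the inner contents of 'task_properties' and 'validated_config' are left as empty placeholders because their schema is not documented here.

```python
import base64


def build_ledger_records(pdf_blobs):
    """Return one record per PDF, each with a base64-encoded 'content' field."""
    return [
        {"content": base64.b64encode(blob).decode("ascii")}
        for blob in pdf_blobs
    ]


# Stand-in bytes for illustration; a real PDF would be read from disk instead.
records = build_ledger_records([b"%PDF-1.4 example"])

# The two documented top-level keys. Their values are placeholders (assumption).
task_config = {"task_properties": {}, "validated_config": {}}

# With pandas installed, the records would become the ledger DataFrame:
#   df_extraction_ledger = pd.DataFrame(records)
```

Keeping the base64 step in one helper makes it easy to confirm that every row carries a 'content' value before the ledger is handed to the extractor.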

Returns:

A tuple where the first element is a DataFrame with the extracted data (with columns: document_type, metadata, uuid) and the second element is a dictionary containing trace information.

Return type:

tuple of (pd.DataFrame, dict)

Raises:

Exception – If an error occurs during the extraction process on any row.
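
The row-by-row orchestration, aggregation, and error behavior described above can be pictured with the sketch below. It is not the library's implementation: `_orchestrate_row` and `extract_all` are hypothetical names, plain lists of dicts stand in for the DataFrame, and the "text" document type and the trace keys are illustrative assumptions.

```python
import base64
import uuid


def _orchestrate_row(row):
    """Hypothetical row-level step: decode the content and emit one result."""
    pdf_bytes = base64.b64decode(row["content"], validate=True)
    return [
        {
            "document_type": "text",  # illustrative value, not from the source
            "metadata": {"size_bytes": len(pdf_bytes)},
            "uuid": str(uuid.uuid4()),
        }
    ]


def extract_all(rows, trace_log=None):
    """Apply the row-level step to every row and aggregate the results."""
    trace = {"rows_processed": 0}  # assumed trace shape
    extracted = []
    for index, row in enumerate(rows):
        try:
            extracted.extend(_orchestrate_row(row))
        except Exception as exc:
            # A failure on any single row aborts the whole run.
            raise RuntimeError(f"extraction failed on row {index}") from exc
        trace["rows_processed"] += 1
    if trace_log is not None:
        trace_log.append(trace)
    return extracted, trace
```

A caller would unpack the result as `extracted, trace = extract_all(rows)` and wrap the call in `try`/`except` if a single bad document should not stop the surrounding pipeline.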
