nv_ingest_api.internal.extract.pdf package#
Subpackages#
- nv_ingest_api.internal.extract.pdf.engines package
  - Subpackages
  - Submodules
  - nv_ingest_api.internal.extract.pdf.engines.adobe module
  - nv_ingest_api.internal.extract.pdf.engines.llama module
  - nv_ingest_api.internal.extract.pdf.engines.nemoretriever module
  - nv_ingest_api.internal.extract.pdf.engines.pdfium module
  - nv_ingest_api.internal.extract.pdf.engines.tika module
  - nv_ingest_api.internal.extract.pdf.engines.unstructured_io module
  - Module contents
Submodules#
nv_ingest_api.internal.extract.pdf.pdf_extractor module#
nv_ingest_api.internal.extract.pdf.pdf_extractor.extract_primitives_from_pdf_internal(
    df_extraction_ledger: DataFrame,
    task_config: Dict[str, Any],
    extractor_config: Any,
    execution_trace_log: List[Any] | None = None,
)
Process a DataFrame of PDF documents by orchestrating extraction for each row.
This function applies the row-level orchestration function to every row in the DataFrame, aggregates the results, and returns a new DataFrame with the extracted data along with any trace information.
- Parameters:
df_extraction_ledger (pd.DataFrame) – A pandas DataFrame containing PDF documents. Must include a ‘content’ column with base64-encoded PDF data.
task_config (dict) – A dictionary of configuration parameters. Expected to include ‘task_properties’ and ‘validated_config’ keys.
extractor_config (Any) – Configuration parameters for the extraction process, described as a dictionary although the annotation is Any.
execution_trace_log (list, optional) – A list for accumulating trace information during extraction. Defaults to None.
- Returns:
A tuple where the first element is a DataFrame with the extracted data (with columns: document_type, metadata, uuid) and the second element is a dictionary containing trace information.
- Return type:
tuple of (pd.DataFrame, dict)
- Raises:
Exception – If an error occurs during the extraction process on any row.
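The following is a minimal sketch of how the inputs described above might be prepared. It is not taken from the library. The helpers `build_ledger_rows` and `build_task_config` and the `source_id` column are hypothetical names introduced for illustration, and the empty `task_properties` and `None` `validated_config` values are placeholders, because their expected contents are not documented here. Only the base64-encoded ‘content’ column and the two `task_config` keys come from the parameter descriptions.

```python
import base64
from typing import Any, Dict, List


def build_ledger_rows(pdf_blobs: Dict[str, bytes]) -> List[Dict[str, Any]]:
    """Hypothetical helper: turn raw PDF bytes into ledger rows.

    Only the 'content' column (base64-encoded PDF data) is documented as
    required. 'source_id' is an illustrative extra column, not part of
    the documented contract.
    """
    return [
        {"source_id": name, "content": base64.b64encode(blob).decode("ascii")}
        for name, blob in pdf_blobs.items()
    ]


def build_task_config(task_properties: Dict[str, Any], validated_config: Any) -> Dict[str, Any]:
    """Hypothetical helper: assemble the two keys task_config is expected to include."""
    return {"task_properties": task_properties, "validated_config": validated_config}


# Placeholder bytes standing in for a real PDF file.
rows = build_ledger_rows({"example.pdf": b"%PDF-1.4\n%%EOF\n"})

# Placeholder values: the real contents of these keys are not documented here.
task_config = build_task_config(task_properties={}, validated_config=None)

# With pandas and nv_ingest_api installed, the call would look roughly like this
# (extractor_config is whatever configuration object your deployment uses):
#
#   import pandas as pd
#   from nv_ingest_api.internal.extract.pdf.pdf_extractor import (
#       extract_primitives_from_pdf_internal,
#   )
#
#   df_extraction_ledger = pd.DataFrame(rows)
#   result_df, trace_info = extract_primitives_from_pdf_internal(
#       df_extraction_ledger, task_config, extractor_config, execution_trace_log=[]
#   )
#   # result_df is documented to have columns: document_type, metadata, uuid
```

Because the function is documented to raise an exception when extraction fails on any row, callers that process mixed-quality batches may want to wrap the call in their own error handling.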