nv_ingest_api.internal.extract.pdf package#
Subpackages#
- nv_ingest_api.internal.extract.pdf.engines package
- Subpackages
- Submodules
- nv_ingest_api.internal.extract.pdf.engines.adobe module
- nv_ingest_api.internal.extract.pdf.engines.llama module
- nv_ingest_api.internal.extract.pdf.engines.nemoretriever module
- nv_ingest_api.internal.extract.pdf.engines.pdfium module
- nv_ingest_api.internal.extract.pdf.engines.tika module
- nv_ingest_api.internal.extract.pdf.engines.unstructured_io module
- Module contents
 
Submodules#
nv_ingest_api.internal.extract.pdf.pdf_extractor module#
nv_ingest_api.internal.extract.pdf.pdf_extractor.extract_primitives_from_pdf_internal(
    df_extraction_ledger: DataFrame,
    task_config: Dict[str, Any],
    extractor_config: Any,
    execution_trace_log: List[Any] | None = None,
)
Process a DataFrame of PDF documents by orchestrating extraction for each row.
This function applies the row-level orchestration function to every row in the DataFrame, aggregates the results, and returns a new DataFrame with the extracted data along with any trace information.
- Parameters:
- df_extraction_ledger (pd.DataFrame) – A pandas DataFrame containing PDF documents. Must include a ‘content’ column with base64-encoded PDF data. 
- task_config (dict) – A dictionary of configuration parameters. Expected to include ‘task_properties’ and ‘validated_config’ keys. 
- extractor_config (Any) – A dictionary of configuration parameters for the extraction process. 
- execution_trace_log (list, optional) – A list for accumulating trace information during extraction. Defaults to None. 
 
- Returns:
- A tuple where the first element is a DataFrame with the extracted data (with columns: document_type, metadata, uuid) and the second element is a dictionary containing trace information. 
- Return type:
- tuple of (pd.DataFrame, dict) 
- Raises:
- Exception – If an error occurs during the extraction process on any row.
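The call pattern described above can be sketched as follows. This is a minimal illustration, not the library's documented usage: the helpers `build_extraction_ledger` and `run_extraction` are hypothetical names introduced here, the placeholder bytes are not a valid PDF, and the ledger may need columns beyond 'content' that this page does not list. The contents of `task_config` and `extractor_config` are left to the caller because their schema is not specified here beyond the 'task_properties' and 'validated_config' keys.

```python
import base64
from typing import Any, Dict, List

import pandas as pd


def build_extraction_ledger(pdf_blobs: List[bytes]) -> pd.DataFrame:
    """Hypothetical helper: build a ledger with a base64-encoded 'content' column."""
    return pd.DataFrame(
        {"content": [base64.b64encode(blob).decode("utf-8") for blob in pdf_blobs]}
    )


def run_extraction(
    ledger: pd.DataFrame,
    task_config: Dict[str, Any],
    extractor_config: Any,
):
    """Hypothetical wrapper around the documented function."""
    # Imported lazily so the ledger helper works without nv_ingest_api installed.
    from nv_ingest_api.internal.extract.pdf.pdf_extractor import (
        extract_primitives_from_pdf_internal,
    )

    trace_log: List[Any] = []
    # Per the documentation above, the result is a (DataFrame, dict) tuple.
    extracted_df, trace_info = extract_primitives_from_pdf_internal(
        df_extraction_ledger=ledger,
        task_config=task_config,
        extractor_config=extractor_config,
        execution_trace_log=trace_log,
    )
    return extracted_df, trace_info


# Placeholder bytes for illustration only; a real call needs actual PDF data.
ledger = build_extraction_ledger([b"%PDF-1.4 placeholder"])
```

Because the function is documented to raise an Exception if any row fails, callers that process mixed-quality input may want to wrap `run_extraction` in their own error handling.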