nv_ingest_api.internal.extract.docx package

Subpackages

nv_ingest_api.internal.extract.docx.engines package
- Subpackages
  - nv_ingest_api.internal.extract.docx.engines.docxreader_helpers package
- Module contents

Submodules

nv_ingest_api.internal.extract.docx.docx_extractor module

nv_ingest_api.internal.extract.docx.docx_extractor.extract_primitives_from_docx_internal(df_extraction_ledger: DataFrame, task_config: Dict[str, Any] | BaseModel, extraction_config: DocxExtractorSchema, execution_trace_log: Dict[str, Any] | None = None) → Tuple[DataFrame, Dict | None]

Processes a pandas DataFrame containing DOCX files encoded in base64, extracting text from each document and replacing the original content with the extracted text.

This function applies a decoding and extraction routine to each row of the input DataFrame. The routine is the decode_and_extract function, partially applied with the task configuration, the extraction configuration, and the optional trace information. Each row yields a list of results; these lists are exploded into one entry per row, missing values are dropped, and the remaining entries are compiled into a new DataFrame with columns for document type, metadata, and a UUID identifier.
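The apply, explode, and drop sequence described above can be sketched with pandas as follows. This is a minimal illustration, not the library's implementation: `decode_and_extract_stub` and `extract_sketch` are hypothetical names, and the stub treats the decoded bytes as plain UTF-8 text (one entry per line) instead of parsing a real DOCX file.

```python
import base64
import functools
import uuid

import pandas as pd


def decode_and_extract_stub(row, task_config, extraction_config, execution_trace_log=None):
    """Hypothetical stand-in for the real decode-and-extract routine.

    Returns a list of [document_type, metadata, uuid] entries for one row,
    or None when the row has no content.
    """
    if not row["content"]:
        return None
    # Assumption: the decoded payload is plain text, not an actual DOCX archive.
    text = base64.b64decode(row["content"]).decode("utf-8")
    return [
        ["text", {"source_id": row["source_id"], "content": line}, str(uuid.uuid4())]
        for line in text.splitlines()
    ]


def extract_sketch(df, task_config, extraction_config, execution_trace_log=None):
    # Bind the configuration once so the resulting callable takes only a row.
    per_row = functools.partial(
        decode_and_extract_stub,
        task_config=task_config,
        extraction_config=extraction_config,
        execution_trace_log=execution_trace_log,
    )
    # One list of extracted entries (or None) per input row.
    results = df.apply(per_row, axis=1)
    # Spread each list over separate rows, then drop rows that produced nothing.
    results = results.explode().dropna()
    return pd.DataFrame(results.tolist(), columns=["document_type", "metadata", "uuid"])


ledger = pd.DataFrame(
    {
        "source_id": ["a.docx", "b.docx"],
        "content": [base64.b64encode(b"first\nsecond").decode("ascii"), ""],
    }
)
extracted = extract_sketch(ledger, task_config={}, extraction_config=None)
```

In this sketch the second row has empty content, so the stub returns None for it and `dropna()` removes it; the first row contributes two entries.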

Parameters:

df_extraction_ledger (pd.DataFrame) – The input DataFrame containing DOCX files in base64 encoding. Expected columns include ‘source_id’ and ‘content’.
task_config (Union[Dict[str, Any], BaseModel]) – Configuration instructions for the document processing task. This can be provided as a dictionary or a Pydantic model.
extraction_config (DocxExtractorSchema) – A configuration object that guides the document extraction process.
execution_trace_log (Optional[Dict[str, Any]], default=None) – An optional dictionary containing trace information for debugging or logging.
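As a rough illustration of the input shape, the sketch below builds one ledger row with the 'source_id' and 'content' columns named above. The helper `make_ledger_record` is a hypothetical name, the bytes are a placeholder and not a valid DOCX file, and the real pipeline may expect further columns that this page does not list.

```python
import base64


def make_ledger_record(source_id: str, docx_bytes: bytes) -> dict:
    """Build one row holding the 'source_id' and base64-encoded 'content'."""
    return {
        "source_id": source_id,
        # base64 output is ASCII, so it can be stored as a plain string.
        "content": base64.b64encode(docx_bytes).decode("ascii"),
    }


# Placeholder bytes; a real caller would read them from a .docx file on disk.
records = [make_ledger_record("report.docx", b"PK\x03\x04 placeholder")]
```

With pandas available, `pd.DataFrame(records)` would turn these records into a frame with the two expected columns.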

Returns:

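A tuple. The first element is a DataFrame in which the original DOCX content is replaced by the extracted text; it contains the columns “document_type”, “metadata”, and “uuid”. The second element is an optional dictionary, typed Dict | None in the signature.

Return type:

Tuple[pd.DataFrame, Optional[Dict]]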

Raises:

Exception – If an error occurs during the document extraction process, the exception is logged and re-raised.

Module contents