nv_ingest_api.internal.extract.pptx package
Subpackages
Submodules
nv_ingest_api.internal.extract.pptx.pptx_extractor module
- nv_ingest_api.internal.extract.pptx.pptx_extractor.extract_primitives_from_pptx_internal(df_extraction_ledger: DataFrame, task_config: Dict[str, Any] | BaseModel, extraction_config: Any, execution_trace_log: Dict[str, Any] | None = None) → DataFrame
- Process a DataFrame containing base64-encoded PPTX files and extract primitive data.
- This function applies a decoding and extraction routine to each row of the DataFrame (via _decode_and_extract_from_pptx), then explodes any list results into separate rows, drops missing values, and compiles the extracted data into a new DataFrame. The resulting DataFrame includes columns for document type, extracted metadata, and a unique identifier (UUID).
- Parameters:
- df_extraction_ledger (pd.DataFrame) – Input DataFrame with PPTX files in base64 encoding. Expected to include columns ‘source_id’ and ‘content’. 
- task_config (Union[Dict[str, Any], BaseModel]) – Configuration for the PPTX extraction task, as a dict or Pydantic model. 
- extraction_config (Any) – Configuration object for PPTX extraction (e.g. PPTXExtractorSchema). 
- execution_trace_log (Optional[Dict[str, Any]], optional) – Optional dictionary containing trace information for debugging. 
 
- Returns:
- DataFrame with extracted PPTX content containing columns: “document_type”, “metadata”, and “uuid”. 
- Return type:
- pd.DataFrame 
- Raises:
- Exception – Reraises any exception encountered during extraction with additional context.
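
The apply, explode, drop, and compile flow described above can be illustrated with plain pandas. This is only a rough sketch under stated assumptions, not the library's implementation: `decode_and_extract_stub` and `extract_sketch` are hypothetical names, the stub treats the decoded bytes as UTF-8 text rather than parsing a real PPTX file, and the `[document_type, metadata, uuid]` entry layout is inferred from the documented output columns.

```python
import base64
import uuid
from typing import Any, List, Optional

import pandas as pd


def decode_and_extract_stub(row: pd.Series) -> Optional[List[List[Any]]]:
    """Hypothetical stand-in for the per-row step; NOT the library's routine.

    Returns a list of [document_type, metadata, uuid] entries, or None when
    the row's content cannot be decoded.
    """
    try:
        raw = base64.b64decode(row["content"], validate=True)
        # Assumption for illustration: the payload is plain text, one
        # "primitive" per line. A real PPTX payload would need a PPTX parser.
        text = raw.decode("utf-8")
    except Exception:
        return None
    return [
        ["text", {"source_id": row["source_id"], "content": line}, str(uuid.uuid4())]
        for line in text.splitlines()
    ]


def extract_sketch(df_ledger: pd.DataFrame) -> pd.DataFrame:
    """Run the stub on each row, explode list results, drop missing values."""
    per_row = pd.Series(
        [decode_and_extract_stub(row) for _, row in df_ledger.iterrows()],
        index=df_ledger.index,
        dtype=object,
    )
    # explode() turns each list element into its own row; rows that produced
    # None stay missing and are removed by dropna().
    entries = per_row.explode().dropna()
    return pd.DataFrame(entries.tolist(), columns=["document_type", "metadata", "uuid"])


# Input ledger with the documented 'source_id' and 'content' columns.
content = base64.b64encode(b"Title slide\nAgenda").decode("ascii")
ledger = pd.DataFrame(
    {"source_id": ["deck-1", "deck-2"], "content": [content, "!!not-base64!!"]}
)
result = extract_sketch(ledger)
print(result[["document_type", "uuid"]])
```

In this sketch the undecodable second row simply yields no output rows. The documented function instead re-raises extraction errors with additional context, so a failure there would surface as an exception rather than being dropped silently.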