nv_ingest_api.internal.extract.image package
Subpackages
Submodules
nv_ingest_api.internal.extract.image.chart_extractor module
- nv_ingest_api.internal.extract.image.chart_extractor.extract_chart_data_from_image_internal(
    df_extraction_ledger: DataFrame,
    task_config: IngestTaskChartExtraction | Dict[str, Any],
    extraction_config: ChartExtractorSchema,
    execution_trace_log: Dict | None = None,
)
Extracts chart data from a DataFrame in bulk rather than row by row.
- Parameters:
df_extraction_ledger (pd.DataFrame) – DataFrame containing the content from which chart data is to be extracted.
task_config (IngestTaskChartExtraction | Dict[str, Any]) – Task properties and configuration, given either as a task schema object or as a plain dictionary.
extraction_config (ChartExtractorSchema) – The validated configuration object for chart extraction.
execution_trace_log (Optional[Dict], optional) – Optional trace information for debugging or logging. Defaults to None.
- Returns:
A tuple containing the updated DataFrame and the trace information.
- Return type:
Tuple[pd.DataFrame, Dict]
- Raises:
Exception – If any error occurs during the chart data extraction process.
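The "bulk rather than row by row" idea can be illustrated with a minimal sketch. It does not call the library: `fake_batch_infer`, `extract_in_bulk`, and the `is_chart` and `chart_content` columns are hypothetical names invented for this example, and the real function's row selection and output layout may differ. The sketch also returns only the DataFrame and leaves trace handling out.

```python
from typing import List

import pandas as pd


def fake_batch_infer(images: List[str]) -> List[str]:
    """Hypothetical stand-in for a batched model call: one request covers every image."""
    return [f"chart-text-for-{image}" for image in images]


def extract_in_bulk(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Select the rows to process ("is_chart" is a made-up flag column).
    mask = df["is_chart"]
    indices = df.index[mask].tolist()
    images = df.loc[mask, "content"].tolist()

    # 2. Make a single batched call instead of one call per row.
    results = fake_batch_infer(images) if images else []

    # 3. Write each result back to the row it came from.
    out = df.copy()
    out["chart_content"] = None
    for idx, result in zip(indices, results):
        out.at[idx, "chart_content"] = result
    return out


ledger = pd.DataFrame(
    {"content": ["img-a", "img-b", "img-c"], "is_chart": [True, False, True]}
)
updated = extract_in_bulk(ledger)
```

Collecting the inputs first means the expensive call happens once per DataFrame rather than once per row, which is usually the point of a bulk design.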
nv_ingest_api.internal.extract.image.image_extractor module
- nv_ingest_api.internal.extract.image.image_extractor.extract_primitives_from_image_internal(
    df_extraction_ledger: DataFrame,
    task_config: Dict[str, Any] | BaseModel,
    extraction_config: Any,
    execution_trace_log: Dict[str, Any] | None = None,
)
Process a DataFrame containing base64-encoded image files and extract primitives from each image.
This function applies the decode_and_extract_from_image routine to every row of the input DataFrame. It then explodes any list results into separate rows, drops missing values, and compiles the extracted data into a new DataFrame with columns “document_type”, “metadata”, and “uuid”. In addition, trace information is collected if provided.
- Parameters:
df_extraction_ledger (pd.DataFrame) – Input DataFrame containing image files in base64 encoding. Expected to include columns ‘source_id’ and ‘content’.
task_config (Union[Dict[str, Any], BaseModel]) – A dictionary or Pydantic model with instructions and parameters for the image processing task.
extraction_config (Any) – A configuration object validated for processing images (e.g., containing image_extraction_config).
execution_trace_log (Optional[Dict[str, Any]], default=None) – An optional dictionary for tracing and logging additional information during processing.
- Returns:
A tuple of two items: a DataFrame of extracted image primitives with the columns “document_type”, “metadata”, and “uuid”, and a dictionary containing trace information under the key “trace_info”.
- Return type:
Tuple[pd.DataFrame, Dict[str, Any]]
- Raises:
Exception – If an error occurs during the image processing stage, the exception is logged and re-raised.
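The apply, explode, drop, and compile steps described for this function can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's code: `fake_decode_and_extract` is a hypothetical stand-in for the per-row routine, and the contents of each primitive are invented. Only the column names “source_id”, “content”, “document_type”, “metadata”, and “uuid” come from the description above.

```python
import base64
import uuid
from typing import Any, Dict, List, Optional

import pandas as pd


def fake_decode_and_extract(row: pd.Series) -> Optional[List[Dict[str, Any]]]:
    """Hypothetical per-row routine: returns a list of primitives, or None on failure."""
    raw = base64.b64decode(row["content"])
    if not raw:
        return None  # nothing could be extracted from this row
    # Pretend every decodable image yields two primitives.
    return [
        {
            "document_type": "image",
            "metadata": {"source_id": row["source_id"], "part": part},
            "uuid": str(uuid.uuid4()),
        }
        for part in range(2)
    ]


ledger = pd.DataFrame(
    {
        "source_id": ["a.png", "b.png"],
        "content": [base64.b64encode(b"fake-bytes").decode("ascii"), ""],
    }
)

# Apply the routine to every row; each cell holds a list of primitives or None.
per_row = ledger.apply(fake_decode_and_extract, axis=1)

# Explode lists into one primitive per row, then drop rows that produced nothing.
primitives = per_row.explode().dropna()

# Compile the surviving primitives into the three-column result.
extracted = pd.DataFrame(
    primitives.tolist(), columns=["document_type", "metadata", "uuid"]
)
```

Here the second input row has empty content, so it contributes nothing, while the first row fans out into two output rows.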
nv_ingest_api.internal.extract.image.infographic_extractor module
- nv_ingest_api.internal.extract.image.infographic_extractor.extract_infographic_data_from_image_internal(
    df_extraction_ledger: DataFrame,
    task_config: Dict[str, Any],
    extraction_config: InfographicExtractorSchema,
    execution_trace_log: Dict | None = None,
)
Extracts infographic data from a DataFrame in bulk, following the chart extraction pattern.
- Parameters:
df_extraction_ledger (pd.DataFrame) – DataFrame containing the content from which infographic data is to be extracted.
task_config (Dict[str, Any]) – Dictionary containing task properties and configurations.
extraction_config (InfographicExtractorSchema) – The validated configuration object for infographic extraction.
execution_trace_log (Optional[Dict], optional) – Optional trace information for debugging or logging. Defaults to None.
- Returns:
A tuple containing the updated DataFrame and the trace information.
- Return type:
Tuple[pd.DataFrame, Dict]
nv_ingest_api.internal.extract.image.table_extractor module
- nv_ingest_api.internal.extract.image.table_extractor.extract_table_data_from_image_internal(
    df_extraction_ledger: DataFrame,
    task_config: IngestTaskTableExtraction | Dict[str, Any],
    extraction_config: TableExtractorSchema,
    execution_trace_log: Dict | None = None,
)
Extracts table data from a DataFrame in bulk rather than row by row, following the chart extraction pattern.
- Parameters:
df_extraction_ledger (pd.DataFrame) – DataFrame containing the content from which table data is to be extracted.
task_config (IngestTaskTableExtraction | Dict[str, Any]) – Task properties and configuration, given either as a task schema object or as a plain dictionary.
extraction_config (TableExtractorSchema) – The validated configuration object for table extraction.
execution_trace_log (Optional[Dict], optional) – Optional trace information for debugging or logging. Defaults to None.
- Returns:
A tuple containing the updated DataFrame and the trace information.
- Return type:
Tuple[pd.DataFrame, Dict]
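The extractors in this package share a call shape: a ledger DataFrame, a task configuration, an extraction configuration, and an optional trace dictionary go in, and a (DataFrame, trace) tuple comes out. The sketch below shows how a caller might unpack that tuple. `stub_extractor` is a hypothetical stub that only mirrors the parameter names and performs no extraction. Wrapping the trace under “trace_info” is documented above for the image extractor only, so treating it as the general convention is an assumption.

```python
from typing import Any, Dict, Optional, Tuple

import pandas as pd


def stub_extractor(
    df_extraction_ledger: pd.DataFrame,
    task_config: Dict[str, Any],
    extraction_config: Any,
    execution_trace_log: Optional[Dict[str, Any]] = None,
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Hypothetical stub with the same parameter names; it does no real work."""
    if execution_trace_log is None:
        execution_trace_log = {}
    # Record something in the caller's trace dictionary ("rows_seen" is made up).
    execution_trace_log["rows_seen"] = len(df_extraction_ledger)
    # Assumption: the trace is returned under a "trace_info" key.
    return df_extraction_ledger.copy(), {"trace_info": execution_trace_log}


ledger = pd.DataFrame({"content": ["<base64 image>"]})
trace: Dict[str, Any] = {}

updated_df, trace_result = stub_extractor(
    ledger,
    task_config={},
    extraction_config=None,
    execution_trace_log=trace,
)
```

Passing in a dictionary lets the caller keep a reference to the trace and inspect it after the call, independent of how the return value wraps it.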