nv_ingest.framework.orchestration.morpheus.stages.extractors package
Submodules
nv_ingest.framework.orchestration.morpheus.stages.extractors.audio_extraction_stage module
- nv_ingest.framework.orchestration.morpheus.stages.extractors.audio_extraction_stage.generate_audio_extractor_stage(
- c: Config,
- stage_config: Dict[str, Any],
- task: str = 'audio_data_extract',
- task_desc: str = 'audio_data_extraction',
- pe_count: int = 1,
- )
Generates a multiprocessing stage to perform audio data extraction.
- Parameters:
c (Config) – Morpheus global configuration object.
stage_config (Dict[str, Any]) – Configuration parameters for the audio content extractor, passed as a dictionary validated against the AudioExtractorSchema.
task (str, optional) – The task name for the stage worker function, defining the specific audio extraction process. Default is “audio_data_extract”.
task_desc (str, optional) – A descriptor used for latency tracing and logging during audio extraction. Default is “audio_data_extraction”.
pe_count (int, optional) – The number of process engines to use for audio data extraction. This value controls how many worker processes will run concurrently. Default is 1.
- Returns:
A configured Morpheus stage with an applied worker function that handles audio data extraction.
- Return type:
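The idea of passing `stage_config` as a plain dictionary that is validated against a schema can be sketched as follows. This is a minimal illustration only: `AudioConfigSketch`, its fields `endpoint` and `timeout_s`, and `validate_stage_config` are hypothetical stand-ins, not the fields or API of the real AudioExtractorSchema.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AudioConfigSketch:
    """Hypothetical stand-in for a schema; the real field set is not shown here."""

    endpoint: str
    timeout_s: float = 30.0

    def __post_init__(self) -> None:
        # Reject values that would only fail later, inside a worker.
        if not self.endpoint:
            raise ValueError("endpoint must be a non-empty string")
        if self.timeout_s <= 0:
            raise ValueError("timeout_s must be positive")


def validate_stage_config(stage_config: Dict[str, Any]) -> AudioConfigSketch:
    # Unknown or missing keys raise TypeError and bad values raise ValueError,
    # so a misconfigured stage fails before any worker starts.
    return AudioConfigSketch(**stage_config)


validated = validate_stage_config({"endpoint": "localhost:50051"})
```

Validating once, at stage-generation time, means every worker receives a configuration that is already known to be well formed.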
nv_ingest.framework.orchestration.morpheus.stages.extractors.chart_extraction_stage module
- nv_ingest.framework.orchestration.morpheus.stages.extractors.chart_extraction_stage.generate_chart_extractor_stage(
- c: Config,
- extractor_config: Dict[str, Any],
- task: str = 'chart_data_extract',
- task_desc: str = 'chart_data_extraction',
- pe_count: int = 1,
- )
Generates a multiprocessing stage to perform chart data extraction from PDF content.
- Parameters:
c (Config) – Morpheus global configuration object.
extractor_config (Dict[str, Any]) – Configuration parameters for the chart content extractor, passed as a dictionary validated against the ChartExtractorSchema.
task (str, optional) – The task name for the stage worker function, defining the specific chart extraction process. Default is “chart_data_extract”.
task_desc (str, optional) – A descriptor used for latency tracing and logging during chart extraction. Default is “chart_data_extraction”.
pe_count (int, optional) – The number of process engines to use for chart data extraction. This value controls how many worker processes will run concurrently. Default is 1.
- Returns:
A configured Morpheus stage with an applied worker function that handles chart data extraction from PDF content.
- Return type:
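The role of `task_desc` as a label for latency tracing can be illustrated with a small wrapper that stamps entry and exit times on each message. This is a sketch under assumptions: the `Message` shape, the `trace` key, the `trace::entry::` / `trace::exit::` key format, and `with_latency_trace` itself are hypothetical and are not taken from the library.

```python
import time
from typing import Any, Callable, Dict

Message = Dict[str, Any]


def with_latency_trace(
    task_desc: str, fn: Callable[[Message], Message]
) -> Callable[[Message], Message]:
    """Wrap fn so each call records entry and exit timestamps under task_desc."""

    def wrapper(message: Message) -> Message:
        trace = message.setdefault("trace", {})
        trace[f"trace::entry::{task_desc}"] = time.time_ns()
        result = fn(message)
        # Reuse the incoming trace dict if the worker returned a fresh message.
        result.setdefault("trace", trace)[f"trace::exit::{task_desc}"] = time.time_ns()
        return result

    return wrapper


# An identity function stands in for the real chart extraction worker.
extract = with_latency_trace("chart_data_extraction", lambda m: m)
out = extract({"payload": "example"})
```

Because the descriptor is part of the key, two stages with different `task_desc` values can write into the same trace without overwriting each other.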
nv_ingest.framework.orchestration.morpheus.stages.extractors.docx_extractor_stage module
- nv_ingest.framework.orchestration.morpheus.stages.extractors.docx_extractor_stage.generate_docx_extractor_stage(
- c: Config,
- extraction_config: dict,
- task: str = 'docx-extract',
- task_desc: str = 'docx_content_extractor',
- pe_count: int = 8,
- )
Generates a multiprocessing stage to perform DOCX content extraction.
- Parameters:
c (Config) – Morpheus global configuration object.
extraction_config (dict) – Configuration parameters for the DOCX content extractor.
task (str, optional) – The task name to match for the stage worker function. Default is “docx-extract”.
task_desc (str, optional) – A descriptor used for latency tracing. Default is “docx_content_extractor”.
pe_count (int, optional) – The number of process engines to use for DOCX content extraction. Default is 8.
- Returns:
A Morpheus stage with the applied worker function.
- Return type:
- Raises:
Exception – If an error occurs during stage generation.
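Since stage generation may raise, a caller that assembles a pipeline might want to log the failure with context before letting it propagate. The following is a minimal sketch of that pattern; `build_stage_or_raise` and `fake_docx_stage_factory` are hypothetical helpers written for this example, and the fake factory only mimics a failure mode rather than reproducing `generate_docx_extractor_stage`.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def build_stage_or_raise(factory: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call a stage factory, logging any failure before re-raising it."""
    try:
        return factory(*args, **kwargs)
    except Exception:
        # logger.exception records the traceback; the bare raise keeps the
        # original exception type for the caller.
        logger.exception(
            "Stage generation failed for %s", getattr(factory, "__name__", factory)
        )
        raise


def fake_docx_stage_factory(
    extraction_config: Dict[str, Any], pe_count: int = 8
) -> Dict[str, Any]:
    # Placeholder for a real stage factory (assumption for illustration only).
    if pe_count < 1:
        raise ValueError("pe_count must be at least 1")
    return {"config": extraction_config, "pe_count": pe_count}


stage = build_stage_or_raise(fake_docx_stage_factory, {}, pe_count=8)
```

Re-raising rather than swallowing the error keeps a misconfigured pipeline from starting with a missing stage.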
nv_ingest.framework.orchestration.morpheus.stages.extractors.image_extractor_stage module
- nv_ingest.framework.orchestration.morpheus.stages.extractors.image_extractor_stage.generate_image_extractor_stage(
- c: Config,
- extraction_config: Dict[str, Any],
- task: str = 'extract',
- task_desc: str = 'image_content_extractor',
- pe_count: int = 24,
- )
Generates a multiprocessing stage to perform image content extraction.
- Parameters:
c (Config) – Morpheus global configuration object.
extraction_config (Dict[str, Any]) – Configuration parameters for the image content extractor.
task (str, optional) – The task name to match for the stage worker function. Default is “extract”.
task_desc (str, optional) – A descriptor used for latency tracing. Default is “image_content_extractor”.
pe_count (int, optional) – The number of process engines to use for image content extraction. Default is 24.
- Returns:
A Morpheus stage with the applied worker function.
- Return type:
nv_ingest.framework.orchestration.morpheus.stages.extractors.infographic_extraction_stage module
- nv_ingest.framework.orchestration.morpheus.stages.extractors.infographic_extraction_stage.generate_infographic_extractor_stage(
- c: Config,
- extraction_config: Dict[str, Any],
- task: str = 'infographic_data_extract',
- task_desc: str = 'infographic_data_extraction',
- pe_count: int = 1,
- )
Generates a multiprocessing stage to perform infographic data extraction from PDF content.
- Parameters:
c (Config) – Morpheus global configuration object.
extraction_config (Dict[str, Any]) – Configuration parameters for the infographic content extractor, passed as a dictionary validated against the TableExtractorSchema.
task (str, optional) – The task name for the stage worker function, defining the specific infographic extraction process. Default is “infographic_data_extract”.
task_desc (str, optional) – A descriptor used for latency tracing and logging during infographic extraction. Default is “infographic_data_extraction”.
pe_count (int, optional) – The number of process engines to use for infographic data extraction. This value controls how many worker processes will run concurrently. Default is 1.
- Returns:
A configured Morpheus stage with an applied worker function that handles infographic data extraction from PDF content.
- Return type:
nv_ingest.framework.orchestration.morpheus.stages.extractors.pdf_extractor_stage module
- nv_ingest.framework.orchestration.morpheus.stages.extractors.pdf_extractor_stage.generate_pdf_extractor_stage(
- c: Any,
- extractor_config: Dict[str, Any],
- task: str = 'extract',
- task_desc: str = 'pdf_content_extractor',
- pe_count: int = 24,
- )
Generate a multiprocessing stage for PDF extraction.
This function validates the extractor configuration, creates a partial function wrapper to inject the validated configuration into the config dict, and returns a MultiProcessingBaseStage for parallel PDF extraction.
- Parameters:
c (Any) – The global configuration object for the pipeline.
extractor_config (Dict[str, Any]) – A dictionary containing configuration parameters for the PDF extractor.
task (str, optional) – The name of the extraction task. Defaults to “extract”.
task_desc (str, optional) – A descriptor for the task used in latency tracing. Defaults to “pdf_content_extractor”.
pe_count (int, optional) – The number of process engines to use for extraction. Defaults to 24.
- Returns:
A MultiProcessingBaseStage object configured for PDF extraction.
- Return type:
Any
- Raises:
Exception – If an error occurs during the creation of the PDF extractor stage.
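The description above mentions a partial function wrapper that injects the validated configuration into the worker. A minimal sketch of that pattern with `functools.partial` might look like this. The names `validate`, `process_pdf`, `task_props`, and the `max_queue_size` and `method` keys are assumptions made for illustration and do not reflect the library's actual worker signature or schema.

```python
from functools import partial
from typing import Any, Dict


def validate(extractor_config: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder validation: the real schema check is not shown here.
    if not isinstance(extractor_config, dict):
        raise TypeError("extractor_config must be a dict")
    return dict(extractor_config)


def process_pdf(
    payload: str, task_props: Dict[str, Any], validated_config: Dict[str, Any]
) -> Dict[str, Any]:
    # Hypothetical worker: echoes its inputs so the injection is visible.
    return {
        "payload": payload,
        "method": task_props.get("method"),
        "config": validated_config,
    }


validated_config = validate({"max_queue_size": 4})

# Bind the validated configuration once; the stage then calls the worker
# with only the per-message arguments.
worker = partial(process_pdf, validated_config=validated_config)

result = worker("doc-1", {"method": "example-method"})
```

Binding the configuration up front keeps the per-message call signature small and ensures every invocation sees the same validated settings.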
nv_ingest.framework.orchestration.morpheus.stages.extractors.pptx_extractor_stage module
- nv_ingest.framework.orchestration.morpheus.stages.extractors.pptx_extractor_stage.generate_pptx_extractor_stage(
- c: Config,
- extraction_config: dict,
- task: str = 'pptx-extract',
- task_desc: str = 'pptx_content_extractor',
- pe_count: int = 8,
- )
Generates a multiprocessing stage to perform PPTX content extraction.
- Parameters:
c (Config) – Morpheus global configuration object.
extraction_config (dict) – Configuration parameters for the PPTX content extractor.
task (str, optional) – The task name to match for the stage worker function. Default is “pptx-extract”.
task_desc (str, optional) – A descriptor used for latency tracing. Default is “pptx_content_extractor”.
pe_count (int, optional) – The number of process engines to use for PPTX content extraction. Default is 8.
- Returns:
A Morpheus stage with the applied worker function.
- Return type:
- Raises:
Exception – If an error occurs during stage generation.
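The `task` parameter is described as the task name to match for the stage worker function. One way to picture that matching is a filter that hands a message to the worker only when it carries a task of the given name. The message layout (a `tasks` list of dictionaries with a `type` key) and the helpers `has_task` and `route` are hypothetical and serve only to illustrate the idea.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def has_task(message: Message, task: str) -> bool:
    """Return True if the message carries a pending task with the given name."""
    return any(t.get("type") == task for t in message.get("tasks", []))


def route(messages: List[Message], task: str) -> Tuple[List[Message], List[Message]]:
    """Split messages into those the stage should process and those it passes through."""
    matched = [m for m in messages if has_task(m, task)]
    skipped = [m for m in messages if not has_task(m, task)]
    return matched, skipped


messages = [
    {"id": 1, "tasks": [{"type": "pptx-extract"}]},
    {"id": 2, "tasks": [{"type": "docx-extract"}]},
    {"id": 3, "tasks": []},
]
matched, skipped = route(messages, "pptx-extract")
```

Under this reading, several extractor stages can sit in one pipeline and each acts only on the messages addressed to it.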
nv_ingest.framework.orchestration.morpheus.stages.extractors.table_extraction_stage module
- nv_ingest.framework.orchestration.morpheus.stages.extractors.table_extraction_stage.generate_table_extractor_stage(
- c: Config,
- extraction_config: Dict[str, Any],
- task: str = 'table_data_extract',
- task_desc: str = 'table_data_extraction',
- pe_count: int = 1,
- )
Generates a multiprocessing stage to perform table data extraction from PDF content.
- Parameters:
c (Config) – Morpheus global configuration object.
extraction_config (Dict[str, Any]) – Configuration parameters for the table content extractor, passed as a dictionary validated against the TableExtractorSchema.
task (str, optional) – The task name for the stage worker function, defining the specific table extraction process. Default is “table_data_extract”.
task_desc (str, optional) – A descriptor used for latency tracing and logging during table extraction. Default is “table_data_extraction”.
pe_count (int, optional) – The number of process engines to use for table data extraction. This value controls how many worker processes will run concurrently. Default is 1.
- Returns:
A configured Morpheus stage with an applied worker function that handles table data extraction from PDF content.
- Return type:
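The effect of `pe_count` as a cap on concurrent workers can be sketched with a bounded executor. Note the substitution: this example uses threads from `concurrent.futures.ThreadPoolExecutor` as a stand-in for the stage's process engines so that it runs anywhere without process start-up concerns. `run_with_workers` and `count_cells` are hypothetical names introduced for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_workers(fn: Callable[[T], R], items: Sequence[T], pe_count: int = 1) -> List[R]:
    """Apply fn to every item using at most pe_count concurrent workers."""
    if pe_count < 1:
        raise ValueError("pe_count must be at least 1")
    with ThreadPoolExecutor(max_workers=pe_count) as pool:
        # map() yields results in input order regardless of completion order.
        return list(pool.map(fn, items))


def count_cells(row: str) -> int:
    # Toy "table extraction": count pipe-separated cells in a row.
    return row.count("|") + 1


rows = run_with_workers(count_cells, ["a|b", "c|d|e"], pe_count=2)
```

With `pe_count=1` the items are handled one at a time; raising it allows more items in flight at once, at the cost of more resources per stage.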