nv_ingest_api.internal.extract.docx.engines.docxreader_helpers package

Submodules

nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docx_helper module

nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docx_helper.python_docx(*, docx_stream: IO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extract_charts: bool, extraction_config: dict, execution_trace_log: List | None = None)

Helper function that uses python-docx to extract text from a DOCX document supplied as a byte stream.

A document has three levels: document, paragraphs, and runs. To align with the PDF extraction, paragraphs are aliased as blocks. python-docx leaves page and line numbering to the renderer, so the entire document is treated as a single page.

Run-level parsing is skipped but can be added if needed.

Parameters:

docx_stream (IO) – Byte stream of the DOCX document.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_tables (bool) – Specifies whether to extract tables.
extract_charts (bool) – Specifies whether to extract charts.
extraction_config (dict) – A dictionary of configuration parameters for the extraction process.
execution_trace_log (list, optional) – A list for accumulating trace information during extraction. Defaults to None.

Returns:

A string of extracted text.

Return type:

str
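
The document, paragraph, and run levels described above can be illustrated without python-docx. The following is a minimal standard-library sketch, not the helper's implementation: it builds a stripped-down archive containing only `word/document.xml` (a real DOCX file has more parts), then joins the runs of each paragraph into one block. The names `make_docx_stream` and `read_blocks` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Two paragraphs (w:p); the first one is split into two runs (w:r).
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second block</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def make_docx_stream() -> io.BytesIO:
    """Build an in-memory archive holding only the main document part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    buffer.seek(0)
    return buffer


def read_blocks(docx_stream) -> list:
    """Return one text block per paragraph, concatenating its runs."""
    with zipfile.ZipFile(docx_stream) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    blocks = []
    for paragraph in root.iterfind(".//w:p", NS):
        run_texts = [t.text or "" for t in paragraph.iterfind("./w:r/w:t", NS)]
        blocks.append("".join(run_texts))
    return blocks


blocks = read_blocks(make_docx_stream())
```

Because no page information is stored at this level, every block in the sketch would belong to the same single page, matching the assumption stated above.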

nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader module

class nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader.DocxProperties(document: Document, source_metadata: Dict)

Bases: object

Parses the document's core properties and updates the source metadata.

class nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader.DocxReader(docx, source_metadata: Dict, paragraph_format: str = 'markdown', table_format: str = 'markdown', handle_text_styles: bool = True, image_tag='<image {}>', table_tag='<table {}>', extraction_config: Dict | None = None)

Bases: object

Reads a DOCX file and extracts its content as text, images, and tables.

Parameters:

docx – Byte stream of the DOCX document.
paragraph_format (str) – Format of the paragraphs. Supported formats are: [‘text’, ‘markdown’].
table_format (str) – Format of the tables. Supported formats are: [‘markdown’, ‘markdown_light’, ‘csv’, ‘tag’].
handle_text_styles (bool) – Whether to apply the paragraph style (heading, list, title, subtitle). Not recommended if the document was converted from PDF.
image_tag (str) – Tag that replaces each image in the text. Must contain exactly one placeholder for the image index.
table_tag (str) – Tag that replaces each table in the text. Must contain exactly one placeholder for the table index.
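
The placeholder requirement for image_tag and table_tag can be shown with plain `str.format`. This is a sketch under the assumption that the placeholder is a positional `{}` field, as in the defaults shown in the signature; `render_tag` is a hypothetical helper, not part of DocxReader.

```python
IMAGE_TAG = "<image {}>"  # default from the DocxReader signature
TABLE_TAG = "<table {}>"  # default from the DocxReader signature


def render_tag(template: str, index: int) -> str:
    """Fill the single index placeholder of a tag template."""
    # Assumption: exactly one positional "{}" field is required.
    if template.count("{}") != 1:
        raise ValueError("tag template must contain exactly one '{}' placeholder")
    return template.format(index)


image_marker = render_tag(IMAGE_TAG, 3)
table_marker = render_tag(TABLE_TAG, 0)
```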

static apply_text_style(style: str, text: str, level: int = 0) → str

Apply a specific text style (e.g., heading, list, title, subtitle) to the given text.

Parameters:

style (str) – The style to apply. Supported styles include headings (“Heading 1” to “Heading 9”), list items (“List”), and document structures (“Title”, “Subtitle”).
text (str) – The text to style.
level (int, optional) – The indentation level for the styled text. Default is 0.

Returns:

The text with the specified style and indentation applied.

Return type:

str
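
The source does not state which markup each style produces, so the mapping below is an assumption chosen for illustration: headings become `#` prefixes, list items become `-` bullets, and each level adds two spaces of indentation. `sketch_apply_text_style` is a hypothetical stand-in, not the library's method.

```python
import re


def sketch_apply_text_style(style: str, text: str, level: int = 0) -> str:
    """Hypothetical style-to-markdown mapping; the real output may differ."""
    indent = "  " * level  # assumed: two spaces per indentation level
    heading = re.fullmatch(r"Heading ([1-9])", style)
    if heading:
        return f"{indent}{'#' * int(heading.group(1))} {text}"
    if style == "Title":
        return f"{indent}# {text}"
    if style == "Subtitle":
        return f"{indent}## {text}"
    if style.startswith("List"):
        return f"{indent}- {text}"
    # Unknown styles fall through with indentation only.
    return f"{indent}{text}"
```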

static docx_content_type_to_image_type(content_type: MIME_TYPE) → str

Convert a DOCX content type string to an image type.

Parameters:

content_type (MIME_TYPE) – The content type string from the image header, e.g., “image/jpeg”.

Returns:

The image type extracted from the content type string.

Return type:

str
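
A conversion of this kind can be sketched by taking the subtype of the MIME string. The function below is a hypothetical stand-in that treats the content type as a plain `str`; how the library handles unexpected values is not documented here, so raising `ValueError` is an assumption.

```python
def sketch_content_type_to_image_type(content_type: str) -> str:
    """Return the subtype of an image MIME string, e.g. 'image/jpeg' -> 'jpeg'."""
    main_type, _, subtype = content_type.partition("/")
    # Assumption: anything that is not "image/<subtype>" is rejected.
    if main_type != "image" or not subtype:
        raise ValueError(f"not an image content type: {content_type!r}")
    return subtype
```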

extract_data(base_unified_metadata: Dict, text_depth: TextTypeEnum, extract_text: bool, extract_charts: bool, extract_tables: bool, extract_images: bool) → list[list[str | dict]]

Iterate over paragraphs and tables in a DOCX document to extract data.

Parameters:

base_unified_metadata (dict) – The base metadata to associate with all extracted content.
text_depth (TextTypeEnum) – The depth of text extraction (e.g., block-level, document-level).
extract_text (bool) – Whether to extract text from the document.
extract_charts (bool) – Whether to extract charts from the document.
extract_tables (bool) – Whether to extract tables from the document.
extract_images (bool) – Whether to extract images from the document.

Returns:

A list of extracted items; each item is a list whose elements are strings or dictionaries.

Return type:

list[list[str | dict]]

format_cell(cell: _Cell) → Tuple[str, List[Image]]

Format a table cell into Markdown text and extract associated images.

Parameters:

cell (_Cell) – The table cell to format.

Returns:

The formatted text of the cell with markdown styling applied.
A list of images extracted from the cell.

Return type:

tuple of (str, list of Image)

format_paragraph(paragraph: Paragraph) → Tuple[str, List[Image]]

Format a paragraph into styled text and extract associated images.

Parameters:

paragraph (Paragraph) – The paragraph to format. This includes text and potentially embedded images.

Returns:

The formatted paragraph text with markdown styling applied.
A list of extracted images from the paragraph.

Return type:

tuple of (str, list of Image)

format_table(table: Table) → Tuple[str | None, List[Image], DataFrame]

Format a table into text, extract images, and represent it as a DataFrame.

Parameters:

table (Table) – The table to format.

Returns:

The formatted table as text, using the specified format (e.g., markdown, CSV).
A list of images extracted from the table.
A DataFrame representation of the table’s content.

Return type:

tuple of (str or None, list of Image, DataFrame)
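
To make the difference between two of the supported table formats concrete, here is a standard-library sketch that renders the same rows as ‘markdown’ and as ‘csv’. It works on plain lists of strings rather than a python-docx Table or a DataFrame, and the exact layout the library emits may differ; `rows_to_markdown` and `rows_to_csv` are hypothetical names. The ‘markdown_light’ and ‘tag’ formats are left out because their output is not described here.

```python
import csv
import io
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a pipe table, treating the first row as the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def rows_to_csv(rows: List[List[str]]) -> str:
    """Render rows as CSV text with Unix line endings."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = [["name", "qty"], ["bolt", "4"]]
markdown_text = rows_to_markdown(rows)
csv_text = rows_to_csv(rows)
```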

format_text(text: str, bold: bool, italic: bool, underline: bool) → str

Apply markdown styling (bold, italic, underline) to the given text.

Parameters:

text (str) – The text to format.
bold (bool) – Whether to apply bold styling.
italic (bool) – Whether to apply italic styling.
underline (bool) – Whether to apply underline styling.

Returns:

The formatted text with the applied styles.

Return type:

str
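
As an illustration of stacking the three styles, the sketch below wraps text in `**` for bold, `*` for italic, and `<u>` tags for underline. These markers and the wrapping order are assumptions (Markdown has no native underline syntax); the method's actual output is not shown in this reference. `sketch_format_text` is a hypothetical name.

```python
def sketch_format_text(text: str, bold: bool, italic: bool, underline: bool) -> str:
    """Wrap text in assumed markdown/HTML markers for each enabled style."""
    if not text.strip():
        # Assumption: whitespace-only text is returned unstyled.
        return text
    if bold:
        text = f"**{text}**"
    if italic:
        text = f"*{text}*"
    if underline:
        text = f"<u>{text}</u>"
    return text
```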

is_text_empty(text: str) → bool

Check if the given text is empty or matches the empty text pattern.

Parameters:

text (str) – The text to check.

Returns:

True if the text is empty or matches the empty text pattern, False otherwise.

Return type:

bool
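
The empty text pattern itself is not shown in this reference. The sketch below assumes it matches strings made up only of whitespace; the pattern the class actually uses may be broader. `EMPTY_TEXT_PATTERN` and `sketch_is_text_empty` are hypothetical names.

```python
import re

# Assumed pattern: zero or more whitespace characters and nothing else.
EMPTY_TEXT_PATTERN = re.compile(r"^\s*$")


def sketch_is_text_empty(text: str) -> bool:
    """Return True for empty or whitespace-only text."""
    return EMPTY_TEXT_PATTERN.match(text) is not None
```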
