nv_ingest_api.internal.extract.docx.engines.docxreader_helpers package

Submodules

nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docx_helper module

nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docx_helper.python_docx(
*,
docx_stream: IO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extract_charts: bool,
extraction_config: dict,
execution_trace_log: List | None = None,
)

Helper function that uses python-docx to extract text from a DOCX document supplied as a byte stream.

A document has three levels: document, paragraphs, and runs. To align with the PDF extraction, paragraphs are aliased as blocks. python-docx leaves page and line numbers to the renderer, so the entire document is assumed to be a single page.

Run-level parsing is skipped but can be added as needed.
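The document, paragraph, and run levels described above can be illustrated without python-docx by walking WordprocessingML directly. The following is a minimal sketch, not the helper's implementation: `DOCUMENT_XML` is a hand-written stand-in for the `word/document.xml` part of a .docx file, and `paragraphs_as_blocks` and the `page`/`block` keys are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written stand-in for word/document.xml (assumption: illustration only).
DOCUMENT_XML = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>
""".strip()


def paragraphs_as_blocks(xml_text):
    """Treat each paragraph (w:p) as one block on a single assumed page."""
    root = ET.fromstring(xml_text)
    blocks = []
    for block_idx, paragraph in enumerate(root.iterfind("w:body/w:p", NS)):
        # Runs (w:r) are only joined here; no run-level parsing is done.
        run_texts = [t.text or "" for t in paragraph.iterfind("w:r/w:t", NS)]
        blocks.append({"page": 0, "block": block_idx, "text": "".join(run_texts)})
    return blocks


blocks = paragraphs_as_blocks(DOCUMENT_XML)
for block in blocks:
    print(block)
```

Every block carries the same page index because the file format itself does not record where page breaks fall after rendering.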

Parameters:
  • docx_stream (IO) – The DOCX document as a byte stream.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extract_charts (bool) – Specifies whether to extract charts.

  • extraction_config (dict) – A dictionary of configuration parameters for the extraction process.

  • execution_trace_log (list, optional) – A list for accumulating trace information during extraction. Defaults to None.

Returns:

A string of extracted text.

Return type:

str

nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader module

class nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader.DocxProperties(
document: Document,
source_metadata: Dict,
)

Bases: object

Parse the document's core properties and update the source metadata.

class nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader.DocxReader(
docx,
source_metadata: Dict,
paragraph_format: str = 'markdown',
table_format: str = 'markdown',
handle_text_styles: bool = True,
image_tag='<image {}>',
table_tag='<table {}>',
extraction_config: Dict | None = None,
)

Bases: object

Read a docx file and extract its content as text, images and tables.

Parameters:
  • docx – The DOCX document as a byte stream.

  • paragraph_format (str) – Format of the paragraphs. Supported formats are: [‘text’, ‘markdown’]

  • table_format (str) – Format of the tables. Supported formats are: [‘markdown’, ‘markdown_light’, ‘csv’, ‘tag’]

  • handle_text_styles (bool) – Whether to apply paragraph-level styles (heading, list, title, subtitle). Not recommended if the document was converted from PDF.

  • image_tag (str) – Tag to replace the images in the text. Must contain one placeholder for the image index.

  • table_tag (str) – Tag to replace the tables in the text. Must contain one placeholder for the table index.
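The single-placeholder requirement on image_tag and table_tag can be sketched as follows. This assumes the placeholder is filled with `str.format`, which is consistent with the `{}` in the default values but is not confirmed by this page; `make_tag` is a hypothetical helper, not part of DocxReader.

```python
# Default templates copied from the DocxReader signature.
image_tag = "<image {}>"
table_tag = "<table {}>"


def make_tag(template: str, index: int) -> str:
    """Fill the single '{}' placeholder with an index (hypothetical helper)."""
    if template.count("{}") != 1:
        raise ValueError("template must contain exactly one '{}' placeholder")
    return template.format(index)


print(make_tag(image_tag, 0))
print(make_tag(table_tag, 2))
```

A custom template such as `"[IMG-{}]"` would pass the same check, while a template with no placeholder would be rejected before any text is produced.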

static apply_text_style(
style: str,
text: str,
level: int = 0,
) -> str

Apply a specific text style (e.g., heading, list, title, subtitle) to the given text.

Parameters:
  • style (str) – The style to apply. Supported styles include headings (“Heading 1” to “Heading 9”), list items (“List”), and document structures (“Title”, “Subtitle”).

  • text (str) – The text to style.

  • level (int, optional) – The indentation level for the styled text. Default is 0.

Returns:

The text with the specified style and indentation applied.

Return type:

str
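As a rough illustration of style handling, the sketch below maps the style names listed above onto Markdown. The exact output of apply_text_style is not shown on this page, so every mapping here is an assumption: the heading and list markers, the title and subtitle rendering, and the use of one tab per indentation level. `sketch_apply_text_style` is a hypothetical stand-in.

```python
import re


def sketch_apply_text_style(style: str, text: str, level: int = 0) -> str:
    """Hypothetical stand-in for apply_text_style; the output format is assumed."""
    heading = re.fullmatch(r"Heading ([1-9])", style)
    if heading:
        # "Heading 3" -> "### text"
        text = "#" * int(heading.group(1)) + " " + text
    elif style.startswith("List"):
        # Any style name beginning with "List" becomes a bullet item.
        text = "- " + text
    elif style == "Title":
        text = "# " + text
    elif style == "Subtitle":
        text = "## " + text
    # Assumption: one tab per indentation level; unknown styles pass through.
    return "\t" * level + text


print(sketch_apply_text_style("Heading 2", "Scope"))
print(sketch_apply_text_style("List", "first item", level=1))
```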

static docx_content_type_to_image_type(
content_type: MIME_TYPE,
) -> str

Convert a DOCX content type string to an image type.

Parameters:

content_type (MIME_TYPE) – The content type string from the image header, e.g., “image/jpeg”.

Returns:

The image type extracted from the content type string.

Return type:

str
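One plausible reading of this conversion is to take the subtype after the slash in the MIME string. The sketch below assumes that behavior and uses a plain `str` in place of the `MIME_TYPE` type; `sketch_content_type_to_image_type` is a hypothetical name.

```python
def sketch_content_type_to_image_type(content_type: str) -> str:
    """Return the subtype of a MIME string such as 'image/jpeg' (assumed behavior)."""
    main_type, sep, subtype = content_type.partition("/")
    if not sep or not subtype:
        raise ValueError(f"not a MIME type: {content_type!r}")
    return subtype


print(sketch_content_type_to_image_type("image/jpeg"))
print(sketch_content_type_to_image_type("image/png"))
```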

extract_data(
base_unified_metadata: Dict,
text_depth: TextTypeEnum,
extract_text: bool,
extract_charts: bool,
extract_tables: bool,
extract_images: bool,
) -> list[list[str | dict]]

Iterate over paragraphs and tables in a DOCX document to extract data.

Parameters:
  • base_unified_metadata (dict) – The base metadata to associate with all extracted content.

  • text_depth (TextTypeEnum) – The depth of text extraction (e.g., block-level, document-level).

  • extract_text (bool) – Whether to extract text from the document.

  • extract_charts (bool) – Whether to extract charts from the document.

  • extract_tables (bool) – Whether to extract tables from the document.

  • extract_images (bool) – Whether to extract images from the document.

Returns:

A list of extracted items from the document. Per the signature, each item is itself a list whose elements are strings or dictionaries.

Return type:

list[list[str | dict]]

format_cell(
cell: _Cell,
) -> Tuple[str, List[Image]]

Format a table cell into Markdown text and extract associated images.

Parameters:

cell (_Cell) – The table cell to format.

Returns:

  • The formatted text of the cell with markdown styling applied.

  • A list of images extracted from the cell.

Return type:

tuple of (str, list of Image)

format_paragraph(
paragraph: Paragraph,
) -> Tuple[str, List[Image]]

Format a paragraph into styled text and extract associated images.

Parameters:

paragraph (Paragraph) – The paragraph to format. This includes text and potentially embedded images.

Returns:

  • The formatted paragraph text with markdown styling applied.

  • A list of extracted images from the paragraph.

Return type:

tuple of (str, list of Image)

format_table(
table: Table,
) -> Tuple[str | None, List[Image], DataFrame]

Format a table into text, extract images, and represent it as a DataFrame.

Parameters:

table (Table) – The table to format.

Returns:

  • The formatted table as text, using the specified format (e.g., markdown, CSV).

  • A list of images extracted from the table.

  • A DataFrame representation of the table’s content.

Return type:

tuple of (str or None, list of Image, DataFrame)
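To make the text formats concrete, the sketch below renders the same rows as 'markdown' and as 'csv' using only the standard library. It is an illustration under assumptions, not the method's implementation: the real method works on a python-docx Table and also returns a DataFrame, the exact Markdown layout is assumed, and the 'markdown_light' and 'tag' formats are left out because their output is not described here. `render_rows` is a hypothetical name.

```python
import csv
import io


def render_rows(rows, table_format="markdown"):
    """Render a list of string rows as Markdown or CSV (hypothetical helper)."""
    if table_format == "csv":
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue()
    if table_format == "markdown":
        # Assumption: the first row is the header.
        header, *body = rows
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    raise ValueError(f"unsupported table_format in this sketch: {table_format!r}")


rows = [["name", "qty"], ["bolt", "4"]]
print(render_rows(rows, "markdown"))
print(render_rows(rows, "csv"))
```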

format_text(
text: str,
bold: bool,
italic: bool,
underline: bool,
) -> str

Apply markdown styling (bold, italic, underline) to the given text.

Parameters:
  • text (str) – The text to format.

  • bold (bool) – Whether to apply bold styling.

  • italic (bool) – Whether to apply italic styling.

  • underline (bool) – Whether to apply underline styling.

Returns:

The formatted text with the applied styles.

Return type:

str
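A minimal sketch of this kind of inline styling is shown below. The markers are assumptions: `**` for bold and `*` for italic are standard Markdown, while Markdown has no native underline, so the sketch falls back to an HTML `<u>` element. The order in which styles are nested is also assumed, and `sketch_format_text` is a hypothetical stand-in.

```python
def sketch_format_text(text: str, bold: bool, italic: bool, underline: bool) -> str:
    """Wrap text in Markdown-style markers (assumed markers and nesting order)."""
    if not text.strip():
        # Leave whitespace-only text alone so no empty markers are emitted.
        return text
    if bold:
        text = f"**{text}**"
    if italic:
        text = f"*{text}*"
    if underline:
        text = f"<u>{text}</u>"
    return text


print(sketch_format_text("note", bold=True, italic=False, underline=False))
print(sketch_format_text("note", bold=True, italic=True, underline=True))
```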

is_text_empty(text: str) -> bool

Check if the given text is empty or matches the empty text pattern.

Parameters:

text (str) – The text to check.

Returns:

True if the text is empty or matches the empty text pattern, False otherwise.

Return type:

bool
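The check can be sketched with a regular expression. The actual empty text pattern is not given on this page, so `EMPTY_TEXT_PATTERN` below is an assumption that treats whitespace-only strings as empty; `sketch_is_text_empty` is a hypothetical stand-in.

```python
import re

# Assumed pattern: the empty string or whitespace only.
EMPTY_TEXT_PATTERN = re.compile(r"^\s*$")


def sketch_is_text_empty(text: str) -> bool:
    """Return True when text matches the assumed empty-text pattern."""
    return EMPTY_TEXT_PATTERN.match(text) is not None


print(sketch_is_text_empty(""))
print(sketch_is_text_empty(" \t\n"))
print(sketch_is_text_empty("Revenue"))
```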

Module contents