nv_ingest_api.internal.extract.docx.engines.docxreader_helpers package
Submodules
nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docx_helper module
- nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docx_helper.python_docx(
- *,
- docx_stream: IO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_tables: bool,
- extract_charts: bool,
- extraction_config: dict,
- execution_trace_log: List | None = None,
Helper function that uses python-docx to extract text from a bytestream document.
A document has three levels: document, paragraphs, and runs. To align with the PDF extraction, paragraphs are aliased as blocks. python-docx leaves page numbers and line numbers to the renderer, so the entire document is treated as a single page.
Run-level parsing is skipped but can be added as needed.
- Parameters:
docx_stream (IO) – Bytestream of the DOCX document.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_tables (bool) – Specifies whether to extract tables.
extract_charts (bool) – Specifies whether to extract charts.
extraction_config (dict) – A dictionary of configuration parameters for the extraction process.
execution_trace_log (list, optional) – A list for accumulating trace information during extraction. Defaults to None.
- Returns:
A string of extracted text.
- Return type:
str
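The document, paragraph, and run levels described above can be illustrated without python-docx at all. The following is a minimal sketch using only the standard library, not the helper's actual implementation: it builds a stripped-down archive containing only `word/document.xml` (not a file Word would open), then reads each `w:p` element as one block and each `w:t` element as the text of one run. The names `make_docx_stream`, `paragraphs_as_blocks`, and the `"page": 0` convention are hypothetical and chosen only to mirror the single-page assumption.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Two paragraphs; the first is split across two runs.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def make_docx_stream() -> io.BytesIO:
    """Build an in-memory ZIP holding only word/document.xml (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
    buf.seek(0)
    return buf


def paragraphs_as_blocks(docx_stream) -> list:
    """Return one dict per paragraph, treating the whole document as page 0."""
    with zipfile.ZipFile(docx_stream) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    blocks = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Each w:t element carries the text of one run in this sample.
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        blocks.append({"page": 0, "text": "".join(runs), "runs": runs})
    return blocks


blocks = paragraphs_as_blocks(make_docx_stream())
for block in blocks:
    print(block)
```

Joining the runs per paragraph is what "skipping run-level parsing" amounts to in this sketch: the run boundaries are still available in `runs` if finer granularity is needed later.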
nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader module
- class nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader.DocxProperties(
- document: Document,
- source_metadata: Dict,
Bases:
object
Parse the document's core properties and update the metadata.
- class nv_ingest_api.internal.extract.docx.engines.docxreader_helpers.docxreader.DocxReader(
- docx,
- source_metadata: Dict,
- paragraph_format: str = 'markdown',
- table_format: str = 'markdown',
- handle_text_styles: bool = True,
- image_tag='<image {}>',
- table_tag='<table {}>',
- extraction_config: Dict | None = None,
Bases:
object
Read a docx file and extract its content as text, images and tables.
- Parameters:
docx – Bytestream of the DOCX file.
paragraph_format (str) – Format of the paragraphs. Supported formats are: [‘text’, ‘markdown’]
table_format (str) – Format of the tables. Supported formats are: [‘markdown’, ‘markdown_light’, ‘csv’, ‘tag’]
handle_text_styles (bool) – Whether to apply the paragraph's style (heading, list, title, subtitle) to its text. Not recommended if the document was converted from PDF.
image_tag (str) – Tag to replace the images in the text. Must contain one placeholder for the image index.
table_tag (str) – Tag to replace the tables in the text. Must contain one placeholder for the table index.
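The "one placeholder" requirement on `image_tag` and `table_tag` can be checked up front. The sketch below assumes the tags are filled in with `str.format`, which the default values `'<image {}>'` and `'<table {}>'` suggest but the documentation does not state. `render_tag` and `has_single_placeholder` are hypothetical helpers, not part of `DocxReader`.

```python
from string import Formatter

image_tag = "<image {}>"
table_tag = "<table {}>"


def render_tag(tag_template: str, index: int) -> str:
    """Fill the single placeholder with the image or table index."""
    return tag_template.format(index)


def has_single_placeholder(tag_template: str) -> bool:
    """Return True if the template contains exactly one replacement field."""
    fields = [
        field_name
        for _, field_name, _, _ in Formatter().parse(tag_template)
        if field_name is not None
    ]
    return len(fields) == 1


print(render_tag(image_tag, 3))
print(has_single_placeholder("<figure>"))
```

A template with no placeholder would silently produce the same tag for every image, so validating custom tags before constructing the reader may be worthwhile.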
- static apply_text_style(
- style: str,
- text: str,
- level: int = 0,
Apply a specific text style (e.g., heading, list, title, subtitle) to the given text.
- Parameters:
style (str) – The style to apply. Supported styles include headings (“Heading 1” to “Heading 9”), list items (“List”), and document structures (“Title”, “Subtitle”).
text (str) – The text to style.
level (int, optional) – The indentation level for the styled text. Default is 0.
- Returns:
The text with the specified style and indentation applied.
- Return type:
str
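As a rough illustration of what style application might look like when the paragraph format is markdown, the sketch below maps the documented style names to markdown prefixes. The mapping itself (`#` per heading level, `# ` for "Title", `## ` for "Subtitle", `- ` for list items, two spaces per indentation level) is an assumption made for this example and may differ from what `apply_text_style` actually emits.

```python
import re


def apply_text_style_sketch(style: str, text: str, level: int = 0) -> str:
    """Hypothetical markdown rendering of a paragraph style (not the library's code)."""
    indent = "  " * level  # assumed: two spaces per indentation level
    heading = re.fullmatch(r"Heading ([1-9])", style)
    if heading:
        # "Heading 3" becomes "### text"
        return f"{indent}{'#' * int(heading.group(1))} {text}"
    if style == "Title":
        return f"{indent}# {text}"
    if style == "Subtitle":
        return f"{indent}## {text}"
    if style.startswith("List"):
        return f"{indent}- {text}"
    # Unknown styles fall through unchanged apart from indentation.
    return f"{indent}{text}"


print(apply_text_style_sketch("Heading 2", "Introduction"))
print(apply_text_style_sketch("List", "first item", level=1))
```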
- static docx_content_type_to_image_type(
- content_type: MIME_TYPE,
Convert a DOCX content type string to an image type.
- Parameters:
content_type (MIME_TYPE) – The content type string from the image header, e.g., “image/jpeg”.
- Returns:
The image type extracted from the content type string.
- Return type:
str
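For a content type such as "image/jpeg", the conversion presumably amounts to taking the subtype after the slash. The following sketch shows that idea under that assumption; `content_type_to_image_type_sketch` is a hypothetical stand-in and does not reproduce any special-casing the real method may perform.

```python
def content_type_to_image_type_sketch(content_type: str) -> str:
    """Return the subtype portion of a MIME type, e.g. "image/png" gives "png"."""
    _main_type, _sep, subtype = content_type.partition("/")
    return subtype


print(content_type_to_image_type_sketch("image/jpeg"))
```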
- extract_data(
- base_unified_metadata: Dict,
- text_depth: TextTypeEnum,
- extract_text: bool,
- extract_charts: bool,
- extract_tables: bool,
- extract_images: bool,
Iterate over paragraphs and tables in a DOCX document to extract data.
- Parameters:
base_unified_metadata (dict) – The base metadata to associate with all extracted content.
text_depth (TextTypeEnum) – The depth of text extraction (e.g., block-level, document-level).
extract_text (bool) – Whether to extract text from the document.
extract_charts (bool) – Whether to extract charts from the document.
extract_tables (bool) – Whether to extract tables from the document.
extract_images (bool) – Whether to extract images from the document.
- Returns:
A dictionary containing the extracted data from the document.
- Return type:
dict
- format_cell(
- cell: _Cell,
Format a table cell into Markdown text and extract associated images.
- Parameters:
cell (_Cell) – The table cell to format.
- Returns:
The formatted text of the cell with markdown styling applied.
A list of images extracted from the cell.
- Return type:
tuple of (str, list of Image)
- format_paragraph(
- paragraph: Paragraph,
Format a paragraph into styled text and extract associated images.
- Parameters:
paragraph (Paragraph) – The paragraph to format. This includes text and potentially embedded images.
- Returns:
The formatted paragraph text with markdown styling applied.
A list of extracted images from the paragraph.
- Return type:
tuple of (str, list of Image)
- format_table(
- table: Table,
Format a table into text, extract images, and represent it as a DataFrame.
- Parameters:
table (Table) – The table to format.
- Returns:
The formatted table as text, using the specified format (e.g., markdown, CSV).
A list of images extracted from the table.
A DataFrame representation of the table’s content.
- Return type:
tuple of (str or None, list of Image, DataFrame)
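To make the difference between two of the supported `table_format` values concrete, the sketch below renders the same rows as markdown and as CSV using only the standard library. It is an illustration under assumptions: the real `format_table` works on a python-docx `Table` and also returns images and a DataFrame, none of which is modeled here, and the `markdown_light` and `tag` formats are not covered. The function names are hypothetical.

```python
import csv
import io


def table_to_markdown(rows: list) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def table_to_csv(rows: list) -> str:
    """Render rows as CSV text with newline line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = [["name", "qty"], ["apple", "3"]]
print(table_to_markdown(rows))
print(table_to_csv(rows))
```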
- format_text(
- text: str,
- bold: bool,
- italic: bool,
- underline: bool,
Apply markdown styling (bold, italic, underline) to the given text.
- Parameters:
text (str) – The text to format.
bold (bool) – Whether to apply bold styling.
italic (bool) – Whether to apply italic styling.
underline (bool) – Whether to apply underline styling.
- Returns:
The formatted text with the applied styles.
- Return type:
str
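A minimal sketch of run-level markdown styling follows. Standard markdown has no underline syntax, so this example falls back to an HTML `<u>` tag; that choice, the nesting order, and the decision to leave whitespace-only text untouched are all assumptions for illustration and may not match what `format_text` produces.

```python
def format_text_sketch(text: str, bold: bool, italic: bool, underline: bool) -> str:
    """Hypothetical markdown styling of a text fragment (not the library's code)."""
    if not text.strip():
        # Wrapping whitespace in markers would produce stray "****" sequences.
        return text
    if bold:
        text = f"**{text}**"
    if italic:
        text = f"*{text}*"
    if underline:
        text = f"<u>{text}</u>"  # assumed: HTML fallback for underline
    return text


print(format_text_sketch("note", bold=True, italic=False, underline=False))
print(format_text_sketch("note", bold=False, italic=False, underline=True))
```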