nemo_retriever.utils.convert package

Submodules

nemo_retriever.utils.convert.to_pdf module

Convert DOCX/PPTX files to PDF bytes via LibreOffice headless.

class nemo_retriever.utils.convert.to_pdf.DocToPdfConversionActor

Bases: ArchetypeOperator

class nemo_retriever.utils.convert.to_pdf.DocToPdfConversionCPUActor

Bases: AbstractOperator, CPUOperator

Ray Data actor that converts DOCX/PPTX batches to PDF.

Used with ray.data.Dataset.map_batches in the same style as PDFSplitActor.

postprocess(data: Any, **kwargs: Any) → Any

preprocess(data: Any, **kwargs: Any) → Any

process(data: Any, **kwargs: Any) → Any

nemo_retriever.utils.convert.to_pdf.convert_batch_to_pdf(batch_df: Any) → DataFrame

Convert a batch of files to PDF, passing PDFs through unchanged.

Expects a pandas.DataFrame with at least bytes and path columns (the same schema produced by ray.data.read_binary_files). Rows whose path ends with a supported non-PDF extension are converted; rows that are already PDFs are returned as-is. On error, an error record is emitted (matching the pattern in pdf/split.py).
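The row-level behavior described above can be sketched without pandas or LibreOffice. This is a minimal illustration, not the library's implementation: `sketch_convert_rows` is a hypothetical name, rows are plain dicts rather than DataFrame rows, the converter is injected as a callable, and the layout of the error record (an `error` key with `bytes` set to `None`) is an assumption, since the actual pattern lives in pdf/split.py and is not shown here.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def sketch_convert_rows(
    rows: List[Row],
    convert: Callable[[bytes, str], bytes],
) -> List[Row]:
    """Hypothetical row-wise version of the documented batch behavior."""
    out: List[Row] = []
    for row in rows:
        ext = Path(row["path"]).suffix.lower()
        if ext == ".pdf":
            # PDFs pass through unchanged.
            out.append(row)
            continue
        try:
            # Replace the original bytes with the converted PDF bytes.
            out.append({**row, "bytes": convert(row["bytes"], ext)})
        except Exception as exc:
            # Assumption: the error record shape is illustrative only.
            out.append(
                {
                    "path": row["path"],
                    "bytes": None,
                    "error": f"{type(exc).__name__}: {exc}",
                }
            )
    return out
```

A failing row yields an error record instead of aborting the batch, so one bad document does not stop the remaining rows from being converted.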

nemo_retriever.utils.convert.to_pdf.convert_to_pdf_bytes(file_bytes: bytes, extension: str) → bytes

Convert file bytes to PDF bytes.

If extension is ".pdf", return file_bytes unchanged. For ".docx" / ".pptx", write to a temp dir, invoke libreoffice --headless --convert-to pdf, and return the resulting PDF bytes.

Raises:
  • FileNotFoundError – If the libreoffice binary is not on $PATH.

  • subprocess.CalledProcessError – If LibreOffice conversion fails.

  • RuntimeError – If the expected PDF output file is missing after conversion.
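The documented steps and failure modes can be sketched as follows. This is a hedged approximation, not the module's source: `sketch_convert_to_pdf_bytes` is a hypothetical name, the temp-file naming and the `--outdir` usage are assumptions, and the `ValueError` for other extensions is an invented placeholder because the documentation does not say how unsupported extensions are handled.

```python
import subprocess
import tempfile
from pathlib import Path

# Assumption: the convertible extensions are taken from the description above.
CONVERTIBLE_EXTENSIONS = {".docx", ".pptx"}


def sketch_convert_to_pdf_bytes(file_bytes: bytes, extension: str) -> bytes:
    """Hypothetical re-creation of the documented conversion flow."""
    if extension == ".pdf":
        # Already a PDF: return the input unchanged.
        return file_bytes
    if extension not in CONVERTIBLE_EXTENSIONS:
        # Assumption: behavior for other extensions is not documented.
        raise ValueError(f"unsupported extension: {extension!r}")

    with tempfile.TemporaryDirectory() as tmp_dir:
        src = Path(tmp_dir) / f"input{extension}"
        src.write_bytes(file_bytes)
        # subprocess.run raises FileNotFoundError when the binary is missing
        # and, with check=True, CalledProcessError on a non-zero exit status.
        subprocess.run(
            [
                "libreoffice",
                "--headless",
                "--convert-to",
                "pdf",
                "--outdir",
                tmp_dir,
                str(src),
            ],
            check=True,
            capture_output=True,
        )
        pdf_path = src.with_suffix(".pdf")
        if not pdf_path.exists():
            raise RuntimeError(f"expected output not found: {pdf_path.name}")
        return pdf_path.read_bytes()
```

Writing into a temporary directory keeps each conversion isolated and ensures the intermediate files are removed once the PDF bytes have been read.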

Module contents

Document-to-PDF conversion utilities.

class nemo_retriever.utils.convert.DocToPdfConversionActor

Bases: ArchetypeOperator

nemo_retriever.utils.convert.convert_to_pdf_bytes(file_bytes: bytes, extension: str) → bytes

Convert file bytes to PDF bytes.

If extension is ".pdf", return file_bytes unchanged. For ".docx" / ".pptx", write to a temp dir, invoke libreoffice --headless --convert-to pdf, and return the resulting PDF bytes.

Raises:
  • FileNotFoundError – If the libreoffice binary is not on $PATH.

  • subprocess.CalledProcessError – If LibreOffice conversion fails.

  • RuntimeError – If the expected PDF output file is missing after conversion.
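A caller may want to tell the three documented exceptions apart. The wrapper below is one possible sketch: `try_convert` and its message strings are hypothetical, and the converter is passed in as a callable so the example runs without LibreOffice. In real use, `convert` would presumably be `nemo_retriever.utils.convert.convert_to_pdf_bytes`.

```python
import subprocess
from typing import Callable, Optional, Tuple


def try_convert(
    convert: Callable[[bytes, str], bytes],
    file_bytes: bytes,
    extension: str,
) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (pdf_bytes, None) on success or (None, message) on failure."""
    try:
        return convert(file_bytes, extension), None
    except FileNotFoundError:
        # The libreoffice binary is not on $PATH.
        return None, "libreoffice binary not found"
    except subprocess.CalledProcessError as exc:
        # LibreOffice ran but reported a failure.
        return None, f"conversion failed with exit status {exc.returncode}"
    except RuntimeError as exc:
        # LibreOffice exited cleanly but produced no PDF file.
        return None, f"no PDF produced: {exc}"
```

Mapping each exception to its own message makes it easier to separate environment problems (a missing binary) from problems with a particular document.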