nemo_retriever.utils.convert package
Submodules
nemo_retriever.utils.convert.to_pdf module
Convert DOCX/PPTX files to PDF bytes via LibreOffice headless.
- class nemo_retriever.utils.convert.to_pdf.DocToPdfConversionActor
Bases: ArchetypeOperator
- class nemo_retriever.utils.convert.to_pdf.DocToPdfConversionCPUActor
Bases: AbstractOperator, CPUOperator
Ray Data actor that converts DOCX/PPTX batches to PDF.
Used with ray.data.Dataset.map_batches in the same style as PDFSplitActor.
- nemo_retriever.utils.convert.to_pdf.convert_batch_to_pdf(
- batch_df: Any,
Convert a batch of files to PDF, passing PDFs through unchanged.
Expects a pandas.DataFrame with at least bytes and path columns (the same schema produced by ray.data.read_binary_files). Rows whose path ends with a supported non-PDF extension are converted; rows that are already PDFs are returned as-is. On error, an error record is emitted (matching the pattern in pdf/split.py).
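The per-row dispatch described above can be sketched roughly as follows. This is an illustration only, not the library's source: it works on plain dicts rather than a pandas.DataFrame to stay dependency-free, and the names convert_rows_sketch and CONVERTIBLE, the shape of the error record, and the choice to pass rows with other extensions through untouched are all assumptions.

```python
from pathlib import Path

# Assumption: the "supported non-PDF extensions" are the two named in this module.
CONVERTIBLE = {".docx", ".pptx"}


def convert_rows_sketch(rows, convert):
    """Hypothetical helper: convert DOCX/PPTX rows, pass everything else through.

    rows    -- iterable of dicts with "path" and "bytes" keys
    convert -- callable (file_bytes, extension) -> pdf_bytes
    """
    results = []
    for row in rows:
        ext = Path(row["path"]).suffix.lower()
        if ext not in CONVERTIBLE:
            # PDFs (and, by assumption, any other extension) are returned as-is.
            results.append(dict(row))
            continue
        try:
            results.append({**row, "bytes": convert(row["bytes"], ext)})
        except Exception as exc:
            # Assumed error-record shape: keep the path, drop the payload,
            # and record what went wrong instead of failing the whole batch.
            results.append(
                {
                    "path": row["path"],
                    "bytes": None,
                    "error": f"{type(exc).__name__}: {exc}",
                }
            )
    return results
```

Emitting an error record per failed row, rather than raising, keeps one unreadable document from aborting the rest of the batch.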
- nemo_retriever.utils.convert.to_pdf.convert_to_pdf_bytes(file_bytes: bytes, extension: str) → bytes
Convert file bytes to PDF bytes.
If extension is ".pdf", return file_bytes unchanged. For ".docx"/".pptx", write to a temp dir, invoke libreoffice --headless --convert-to pdf, and return the resulting PDF bytes.
- Raises:
FileNotFoundError – If the libreoffice binary is not on $PATH.
subprocess.CalledProcessError – If LibreOffice conversion fails.
RuntimeError – If the expected PDF output file is missing after conversion.
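As a rough illustration of the flow described above, a function with the same contract might look like the sketch below. It is not the library's actual implementation: the name convert_to_pdf_bytes_sketch, the input file name, the use of --outdir, and the ValueError for unsupported extensions are assumptions made for the example.

```python
import subprocess
import tempfile
from pathlib import Path

# Assumption: only these two non-PDF extensions are handled.
SUPPORTED_EXTENSIONS = {".docx", ".pptx"}


def convert_to_pdf_bytes_sketch(file_bytes: bytes, extension: str) -> bytes:
    """Illustrative stand-in for convert_to_pdf_bytes (hypothetical)."""
    if extension == ".pdf":
        return file_bytes  # PDFs pass through unchanged
    if extension not in SUPPORTED_EXTENSIONS:
        # Assumption: the documented function does not say what happens here.
        raise ValueError(f"unsupported extension: {extension!r}")

    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"input{extension}"
        src.write_bytes(file_bytes)
        # subprocess.run raises FileNotFoundError when the binary is not on
        # $PATH and, with check=True, CalledProcessError on a non-zero exit.
        subprocess.run(
            [
                "libreoffice",
                "--headless",
                "--convert-to",
                "pdf",
                "--outdir",
                tmp,
                str(src),
            ],
            check=True,
            capture_output=True,
        )
        out = src.with_suffix(".pdf")
        if not out.exists():
            raise RuntimeError(f"expected PDF output missing: {out.name}")
        # Read the result before the temporary directory is cleaned up.
        return out.read_bytes()
```

Working inside a temporary directory keeps the intermediate input and output files from leaking onto disk, and the explicit existence check covers the case where LibreOffice exits successfully without writing a PDF.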
Module contents
Document-to-PDF conversion utilities.
- class nemo_retriever.utils.convert.DocToPdfConversionActor
Bases: ArchetypeOperator
- nemo_retriever.utils.convert.convert_to_pdf_bytes(file_bytes: bytes, extension: str) → bytes
Convert file bytes to PDF bytes.
If extension is ".pdf", return file_bytes unchanged. For ".docx"/".pptx", write to a temp dir, invoke libreoffice --headless --convert-to pdf, and return the resulting PDF bytes.
- Raises:
FileNotFoundError – If the libreoffice binary is not on $PATH.
subprocess.CalledProcessError – If LibreOffice conversion fails.
RuntimeError – If the expected PDF output file is missing after conversion.