nemo_retriever.utils package#
Subpackages#
- nemo_retriever.utils.benchmark package
- Submodules
- nemo_retriever.utils.benchmark.all_actor module
- nemo_retriever.utils.benchmark.audio_extract_actor module
- nemo_retriever.utils.benchmark.common module
- nemo_retriever.utils.benchmark.extract_actor module
- nemo_retriever.utils.benchmark.ocr_actor module
- nemo_retriever.utils.benchmark.page_elements_actor module
- nemo_retriever.utils.benchmark.split_actor module
- Module contents
- nemo_retriever.utils.compare package
- nemo_retriever.utils.convert package
- nemo_retriever.utils.image package
- nemo_retriever.utils.pipeline package
Submodules#
nemo_retriever.utils.detection_summary module#
Shared detection summary logic.
Provides a single function that accumulates per-page detection counters from
an iterable of (page_key, metadata_dict, row_dict) tuples. Both the
batch pipeline (reading from LanceDB) and inprocess pipeline (reading from
a DataFrame) can produce these tuples, allowing the summary computation to
be shared.
- nemo_retriever.utils.detection_summary.collect_detection_summary_from_df(
- df,
Collect detection summary from a pandas DataFrame.
- nemo_retriever.utils.detection_summary.collect_detection_summary_from_lancedb(
- uri: str,
- table_name: str,
Collect detection summary from a LanceDB table.
- nemo_retriever.utils.detection_summary.compute_detection_summary(
- rows: Iterable[Tuple[Any, Dict[str, Any], Dict[str, Any]]],
Compute deduped detection totals from an iterable of page data.
Each element is (page_key, metadata_dict, row_dict), where:
- page_key is a hashable value used to deduplicate exploded content rows (e.g. (source_id, page_number)).
- metadata_dict is the parsed JSON metadata (may contain counters from the LanceDB metadata column or from direct DataFrame columns).
- row_dict is the raw row dict, used as a fallback for counters stored as top-level DataFrame columns (e.g. table, chart lists).
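The deduplication described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `sketch_detection_summary`, the counter names in `COUNTER_KEYS`, and the rule that a counter may be either a number or a list of detections are all assumptions made for illustration.

```python
from typing import Any, Dict, Hashable, Iterable, Tuple

# Hypothetical counter names, chosen for illustration only.
COUNTER_KEYS = ("table", "chart")


def sketch_detection_summary(
    rows: Iterable[Tuple[Hashable, Dict[str, Any], Dict[str, Any]]],
) -> Dict[str, int]:
    """Sum per-page counters, counting each page_key only once."""
    seen = set()
    totals = {"pages": 0, **{key: 0 for key in COUNTER_KEYS}}
    for page_key, metadata, row in rows:
        if page_key in seen:
            continue  # exploded content rows repeat the same page
        seen.add(page_key)
        totals["pages"] += 1
        for key in COUNTER_KEYS:
            # Prefer the parsed metadata, fall back to the raw row dict.
            value = metadata.get(key, row.get(key))
            # Assumption: a counter is either a number or a list of detections.
            if isinstance(value, (list, tuple)):
                totals[key] += len(value)
            elif isinstance(value, (int, float)):
                totals[key] += int(value)
    return totals


rows = [
    (("doc-a", 1), {"table": 2}, {}),
    (("doc-a", 1), {"table": 2}, {}),       # duplicate page, skipped
    (("doc-a", 2), {}, {"chart": ["c1"]}),  # falls back to the row dict
]
summary = sketch_detection_summary(rows)
```

Because every source only has to yield the same three-element tuples, the LanceDB and DataFrame readers can share one accumulation loop of this shape.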
- nemo_retriever.utils.detection_summary.iter_dataframe_rows(df)[source]#
Yield (page_key, meta, row_dict) tuples from a pandas DataFrame.
- nemo_retriever.utils.detection_summary.print_detection_summary(
- summary: Dict[str, Any] | None,
Print a detection summary to stdout.
- nemo_retriever.utils.detection_summary.print_pages_per_second(
- processed_pages: int | None,
- ingest_elapsed_s: float,
Print pages-per-second throughput to stdout.
- nemo_retriever.utils.detection_summary.print_run_summary(
- processed_pages: int | None,
- input_path: Path,
- vdb_op: str,
- vdb_kwargs: Dict[str, Any] | None,
- total_time: float,
- ingest_only_total_time: float,
- ray_dataset_download_total_time: float,
- vdb_upload_total_time: float,
- evaluation_total_time: float = 0.0,
- evaluation_metrics: Dict[str, float] | None = None,
- recall_total_time: float = 0.0,
- recall_metrics: Dict[str, float] | None = None,
- processed_files: int | None = None,
- evaluation_label: str = 'Recall',
- evaluation_count: int | None = None,
Print a human-readable run summary and return all metrics as a dict.
The returned dict is the authoritative structured representation of every metric collected during the run. Callers should persist it to a JSON file so that the harness can read it directly instead of parsing stdout.
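As a sketch of the persistence step recommended above, the returned dict can be written to a JSON file and read back by a harness. The `metrics` dict here is a stand-in with hypothetical keys, not the actual return value of print_run_summary, and the file name is arbitrary.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the dict returned by print_run_summary; keys are hypothetical.
metrics = {"processed_pages": 120, "total_time": 42.5}

out_path = Path(tempfile.mkdtemp()) / "run_summary.json"
out_path.write_text(json.dumps(metrics, indent=2, sort_keys=True), encoding="utf-8")

# A harness can later load the file instead of parsing stdout.
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```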
nemo_retriever.utils.hf_cache module#
- nemo_retriever.utils.hf_cache.collect_hf_runtime_env(
- *,
- default_hf_hub_offline: str = '0',
- extra_keys: Iterable[str] = (),
Collect HF-related environment variables to forward to Ray workers.
- Parameters:
default_hf_hub_offline – Value to emit for HF_HUB_OFFLINE when it is not set in the parent process environment. The default keeps online Hub checks enabled.
extra_keys – Additional environment variable names to forward if they are set. Duplicates of built-in keys are ignored after their first occurrence.
- Returns:
Environment variables for Ray runtime_env["env_vars"]. Explicitly blank environment values are preserved.
- Return type:
dict[str, str]
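The behavior documented above could look roughly like the following sketch. It is an illustrative re-implementation under stated assumptions: the contents of `BUILTIN_KEYS` are a guess (the real built-in key list is not shown on this page), and the `environ` parameter is added only so the example can run without touching the process environment.

```python
import os
from typing import Dict, Iterable, Mapping, Optional

# Hypothetical built-in key list; the real helper's list is not shown here.
BUILTIN_KEYS = ("HF_HOME", "HF_TOKEN")


def sketch_collect_hf_env(
    *,
    default_hf_hub_offline: str = "0",
    extra_keys: Iterable[str] = (),
    environ: Optional[Mapping[str, str]] = None,  # illustration-only parameter
) -> Dict[str, str]:
    env = os.environ if environ is None else environ
    out: Dict[str, str] = {
        # Use the parent value when set, otherwise the supplied default.
        "HF_HUB_OFFLINE": env.get("HF_HUB_OFFLINE", default_hf_hub_offline)
    }
    for key in (*BUILTIN_KEYS, *extra_keys):
        # Forward only keys that are set; a blank value still counts as set.
        # Keys already collected are skipped, so duplicates are ignored.
        if key in env and key not in out:
            out[key] = env[key]
    return out


env_vars = sketch_collect_hf_env(
    extra_keys=["MY_VAR", "HF_HOME"],
    environ={"HF_HOME": "/cache", "MY_VAR": ""},
)
```

The resulting dict has the shape Ray expects under `runtime_env["env_vars"]`, a flat mapping of string names to string values.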
nemo_retriever.utils.hf_model_registry module#
Central registry of pinned HuggingFace model revisions.
Every from_pretrained call in the codebase should pass
revision=get_hf_revision(model_id) and direct hf_hub_download calls
should use hf_hub_download_with_pinned_revision so that we always
download an exact, immutable snapshot rather than tracking the mutable
main branch.
To bump a model version, update the corresponding SHA in
HF_MODEL_REVISIONS and re-test.
- nemo_retriever.utils.hf_model_registry.get_hf_revision(model_id: str, *, strict: bool = True) str | None[source]#
Return the pinned commit SHA for model_id.
- Parameters:
model_id – HuggingFace model identifier (e.g. "nvidia/parakeet-ctc-1.1b").
strict – When True (the default), raise ValueError if model_id has no pinned revision. When False, log a warning and return None so that from_pretrained falls back to the main branch.
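The strict and non-strict lookup paths can be sketched as follows. This mirrors the documented behavior only; the registry entry, the placeholder SHA, and the function name `sketch_get_hf_revision` are hypothetical and do not reflect the actual contents of HF_MODEL_REVISIONS.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

# Hypothetical registry contents; the SHA is a placeholder, not a real pin.
HF_MODEL_REVISIONS: Dict[str, str] = {
    "example-org/example-model": "0123456789abcdef0123456789abcdef01234567",
}


def sketch_get_hf_revision(model_id: str, *, strict: bool = True) -> Optional[str]:
    revision = HF_MODEL_REVISIONS.get(model_id)
    if revision is not None:
        return revision
    if strict:
        # Strict mode refuses to fall back to a mutable branch.
        raise ValueError(f"No pinned revision for {model_id!r}")
    logger.warning("No pinned revision for %r; falling back to main", model_id)
    return None


pinned = sketch_get_hf_revision("example-org/example-model")
unpinned = sketch_get_hf_revision("example-org/unknown", strict=False)
```

Passing `revision=None` to `from_pretrained` leaves the default branch in effect, which is why the non-strict path returns None rather than a branch name.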
- nemo_retriever.utils.hf_model_registry.hf_hub_download(*args: Any, **kwargs: Any) str[source]#
Proxy to Hugging Face’s downloader, imported lazily.
- nemo_retriever.utils.hf_model_registry.hf_hub_download_with_pinned_revision(
- *args: Any,
- **kwargs: Any,
Call hf_hub_download with a registry revision when one is known.
- Parameters:
*args – Positional arguments forwarded to huggingface_hub.hf_hub_download. When present, the first positional argument is treated as repo_id.
**kwargs – Keyword arguments forwarded to huggingface_hub.hf_hub_download. If repo_id has a registered pin and revision is omitted, this helper adds the pinned revision before downloading.
- Returns:
The local path returned by huggingface_hub.hf_hub_download.
- Return type:
str
- Raises:
RuntimeError – If Hugging Face Hub raises while resolving the asset; the original exception is chained with startup-focused context.
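A minimal sketch of the argument handling described above, using a fake downloader so that it runs without network access. The names `sketch_download_with_pin`, `fake_download`, and `PINS` are hypothetical, and catching every `Exception` is a simplification of whichever Hub errors the real helper wraps.

```python
from typing import Any, Callable, Dict

PINS: Dict[str, str] = {"example-org/example-model": "abc123"}  # placeholder pin


def sketch_download_with_pin(
    download: Callable[..., str], *args: Any, **kwargs: Any
) -> str:
    # The first positional argument, when present, is treated as repo_id.
    repo_id = args[0] if args else kwargs.get("repo_id")
    pin = PINS.get(repo_id)
    if pin is not None and "revision" not in kwargs:
        kwargs["revision"] = pin  # add the pin only when revision is omitted
    try:
        return download(*args, **kwargs)
    except Exception as exc:  # simplification: the real error set is unknown
        raise RuntimeError(f"Could not resolve asset from {repo_id!r}") from exc


def fake_download(repo_id: str, filename: str, revision: str = "main") -> str:
    """Stand-in for huggingface_hub.hf_hub_download; returns a fake path."""
    return f"/cache/{repo_id}/{revision}/{filename}"


pinned_path = sketch_download_with_pin(
    fake_download, "example-org/example-model", "config.json"
)
explicit_path = sketch_download_with_pin(
    fake_download, "example-org/example-model", "config.json", revision="dev"
)
```

An explicit `revision` always wins over the registry pin, so callers can still opt out for a single download.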
- nemo_retriever.utils.hf_model_registry.install_pinned_hf_hub_download(module: Any) None[source]#
Patch an upstream module-level hf_hub_download to use registry pins.
- Parameters:
module – Imported upstream module object expected to expose a top-level hf_hub_download function. If the attribute is absent, the helper logs a warning and leaves the module unchanged.
- Returns:
The module is mutated in place when patching succeeds.
- Return type:
None
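The in-place patching pattern can be sketched with a stand-in module object. Everything here is illustrative: `sketch_install_pinned` takes the replacement callable as an explicit second argument (the documented helper takes only the module), and `types.SimpleNamespace` plays the role of an imported upstream module.

```python
import logging
import types
from typing import Any, Callable

logger = logging.getLogger(__name__)


def sketch_install_pinned(module: Any, pinned_download: Callable[..., str]) -> None:
    if not hasattr(module, "hf_hub_download"):
        # Nothing to patch: warn and leave the module unchanged.
        logger.warning("%r has no hf_hub_download; leaving it unchanged", module)
        return
    module.hf_hub_download = pinned_download  # mutate the module in place


def pinned(*args: Any, **kwargs: Any) -> str:
    return "pinned"


upstream = types.SimpleNamespace(hf_hub_download=lambda *a, **k: "unpinned")
sketch_install_pinned(upstream, pinned)

bare = types.SimpleNamespace()  # exposes no hf_hub_download attribute
sketch_install_pinned(bare, pinned)
```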
nemo_retriever.utils.input_files module#
- nemo_retriever.utils.input_files.expand_input_file_patterns(
- input_paths: str | PathLike[str] | Iterable[str | PathLike[str]],
Expand local path and glob inputs, rejecting literal local paths that are missing or that point to directories.
Empty explicit glob matches are allowed so callers can intentionally describe optional file sets.
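A rough sketch of these rules using only the standard library is shown below. The exception types, the glob-detection test, and the sorted ordering are assumptions; the real function may differ on each of these points.

```python
import glob
import os
import tempfile
from typing import Iterable, List, Union

PathInput = Union[str, "os.PathLike[str]"]


def sketch_expand_patterns(
    input_paths: Union[PathInput, Iterable[PathInput]],
) -> List[str]:
    # Accept a single path or pattern as well as an iterable of them.
    if isinstance(input_paths, (str, os.PathLike)):
        input_paths = [input_paths]
    expanded: List[str] = []
    for raw in input_paths:
        text = os.fspath(raw)
        if any(ch in text for ch in "*?["):
            # Explicit glob: an empty match list is allowed.
            expanded.extend(sorted(glob.glob(text)))
        elif os.path.isdir(text):
            raise IsADirectoryError(f"Input path is a directory: {text}")
        elif not os.path.exists(text):
            raise FileNotFoundError(f"Input path not found: {text}")
        else:
            expanded.append(text)
    return expanded


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.pdf"):
        open(os.path.join(tmp, name), "w").close()
    found = [
        os.path.basename(p)
        for p in sketch_expand_patterns(os.path.join(tmp, "*.pdf"))
    ]
    optional = sketch_expand_patterns([os.path.join(tmp, "*.docx")])
```

Treating an empty glob as valid while rejecting a missing literal path keeps typos in explicit file names loud, yet lets optional file sets stay quiet.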
- nemo_retriever.utils.input_files.input_type_for_path(input_path: str | PathLike[str]) str | None[source]#
Return the supported ingest input family for input_path’s extension.
- nemo_retriever.utils.input_files.raise_input_path_not_found(
- input_path: object,
- cause: BaseException | None = None,
Raise a consistent missing-input-path error.
- Parameters:
input_path – Path, pattern, or list of paths attempted by the caller or file reader.
cause – Optional lower-level exception to preserve as the chained cause.
- Raises:
FileNotFoundError – Always raised with a product-level missing-input-path message.
nemo_retriever.utils.nvtx module#
nemo_retriever.utils.parquet_to_lancedb module#
nemo_retriever.utils.ray_resource_hueristics module#
- pydantic model nemo_retriever.utils.ray_resource_hueristics.ClusterResources[source]#
Bases: BaseModel
Detected compute resources and where they came from.
Show JSON schema
{ "title": "ClusterResources", "description": "Detected compute resources and where they came from.", "type": "object", "properties": { "total_resources": { "$ref": "#/$defs/Resources" }, "available_resources": { "$ref": "#/$defs/Resources" } }, "$defs": { "Resources": { "description": "Resources and where they came from.", "properties": { "cpu_count": { "title": "Cpu Count", "type": "integer" }, "gpu_count": { "title": "Gpu Count", "type": "integer" } }, "required": [ "cpu_count", "gpu_count" ], "title": "Resources", "type": "object" } }, "required": [ "total_resources", "available_resources" ] }
- Config:
frozen: bool = True
- Fields:
- pydantic model nemo_retriever.utils.ray_resource_hueristics.GpuInfo[source]#
Bases: BaseModel
Show JSON schema
{ "title": "GpuInfo", "type": "object", "properties": { "driver_version": { "title": "Driver Version", "type": "string" }, "gpu_name": { "title": "Gpu Name", "type": "string" }, "gpu_uuid": { "title": "Gpu Uuid", "type": "string" }, "gpu_brand": { "title": "Gpu Brand", "type": "string" }, "total_mib": { "title": "Total Mib", "type": "integer" }, "used_mib": { "title": "Used Mib", "type": "integer" }, "free_mib": { "title": "Free Mib", "type": "integer" } }, "required": [ "driver_version", "gpu_name", "gpu_uuid", "gpu_brand", "total_mib", "used_mib", "free_mib" ] }
- Fields:
- field driver_version: str [Required]#
- field free_mib: int [Required]#
- field gpu_brand: str [Required]#
- field gpu_name: str [Required]#
- field gpu_uuid: str [Required]#
- field total_mib: int [Required]#
- field used_mib: int [Required]#
- pydantic model nemo_retriever.utils.ray_resource_hueristics.NodeGpuInfo[source]#
Bases: BaseModel
Show JSON schema
{ "title": "NodeGpuInfo", "type": "object", "properties": { "gpus": { "additionalProperties": { "$ref": "#/$defs/GpuInfo" }, "title": "Gpus", "type": "object" } }, "$defs": { "GpuInfo": { "properties": { "driver_version": { "title": "Driver Version", "type": "string" }, "gpu_name": { "title": "Gpu Name", "type": "string" }, "gpu_uuid": { "title": "Gpu Uuid", "type": "string" }, "gpu_brand": { "title": "Gpu Brand", "type": "string" }, "total_mib": { "title": "Total Mib", "type": "integer" }, "used_mib": { "title": "Used Mib", "type": "integer" }, "free_mib": { "title": "Free Mib", "type": "integer" } }, "required": [ "driver_version", "gpu_name", "gpu_uuid", "gpu_brand", "total_mib", "used_mib", "free_mib" ], "title": "GpuInfo", "type": "object" } }, "required": [ "gpus" ] }
- pydantic model nemo_retriever.utils.ray_resource_hueristics.RequestedPlan[source]#
Bases: BaseModel
Contains the requested Ray DAG plan for the batch ingest.
Show JSON schema
{ "title": "RequestedPlan", "description": "Contains the requested Ray DAG plan for the batch ingest.", "type": "object", "properties": { "embed_initial_actors": { "title": "Embed Initial Actors", "type": "integer" }, "embed_min_actors": { "title": "Embed Min Actors", "type": "integer" }, "embed_max_actors": { "title": "Embed Max Actors", "type": "integer" }, "embed_gpus_per_actor": { "title": "Embed Gpus Per Actor", "type": "number" }, "embed_batch_size": { "title": "Embed Batch Size", "type": "integer" }, "nemotron_parse_initial_actors": { "title": "Nemotron Parse Initial Actors", "type": "integer" }, "nemotron_parse_min_actors": { "title": "Nemotron Parse Min Actors", "type": "integer" }, "nemotron_parse_max_actors": { "title": "Nemotron Parse Max Actors", "type": "integer" }, "nemotron_parse_gpus_per_actor": { "title": "Nemotron Parse Gpus Per Actor", "type": "number" }, "nemotron_parse_batch_size": { "title": "Nemotron Parse Batch Size", "type": "integer" }, "ocr_initial_actors": { "title": "Ocr Initial Actors", "type": "integer" }, "ocr_min_actors": { "title": "Ocr Min Actors", "type": "integer" }, "ocr_max_actors": { "title": "Ocr Max Actors", "type": "integer" }, "ocr_gpus_per_actor": { "title": "Ocr Gpus Per Actor", "type": "number" }, "ocr_batch_size": { "title": "Ocr Batch Size", "type": "integer" }, "page_elements_initial_actors": { "title": "Page Elements Initial Actors", "type": "integer" }, "page_elements_min_actors": { "title": "Page Elements Min Actors", "type": "integer" }, "page_elements_max_actors": { "title": "Page Elements Max Actors", "type": "integer" }, "page_elements_gpus_per_actor": { "title": "Page Elements Gpus Per Actor", "type": "number" }, "page_elements_batch_size": { "title": "Page Elements Batch Size", "type": "integer" }, "table_structure_initial_actors": { "title": "Table Structure Initial Actors", "type": "integer" }, "table_structure_min_actors": { "title": "Table Structure Min Actors", "type": "integer" }, 
"table_structure_max_actors": { "title": "Table Structure Max Actors", "type": "integer" }, "table_structure_gpus_per_actor": { "title": "Table Structure Gpus Per Actor", "type": "number" }, "table_structure_batch_size": { "title": "Table Structure Batch Size", "type": "integer" }, "graphic_elements_initial_actors": { "title": "Graphic Elements Initial Actors", "type": "integer" }, "graphic_elements_min_actors": { "title": "Graphic Elements Min Actors", "type": "integer" }, "graphic_elements_max_actors": { "title": "Graphic Elements Max Actors", "type": "integer" }, "graphic_elements_gpus_per_actor": { "title": "Graphic Elements Gpus Per Actor", "type": "number" }, "graphic_elements_batch_size": { "title": "Graphic Elements Batch Size", "type": "integer" }, "caption_gpus_per_actor": { "title": "Caption Gpus Per Actor", "type": "number" }, "pdf_extract_batch_size": { "title": "Pdf Extract Batch Size", "type": "integer" }, "pdf_extract_cpus_per_task": { "title": "Pdf Extract Cpus Per Task", "type": "number" }, "pdf_extract_tasks": { "title": "Pdf Extract Tasks", "type": "integer" } }, "required": [ "embed_initial_actors", "embed_min_actors", "embed_max_actors", "embed_gpus_per_actor", "embed_batch_size", "nemotron_parse_initial_actors", "nemotron_parse_min_actors", "nemotron_parse_max_actors", "nemotron_parse_gpus_per_actor", "nemotron_parse_batch_size", "ocr_initial_actors", "ocr_min_actors", "ocr_max_actors", "ocr_gpus_per_actor", "ocr_batch_size", "page_elements_initial_actors", "page_elements_min_actors", "page_elements_max_actors", "page_elements_gpus_per_actor", "page_elements_batch_size", "table_structure_initial_actors", "table_structure_min_actors", "table_structure_max_actors", "table_structure_gpus_per_actor", "table_structure_batch_size", "graphic_elements_initial_actors", "graphic_elements_min_actors", "graphic_elements_max_actors", "graphic_elements_gpus_per_actor", "graphic_elements_batch_size", "caption_gpus_per_actor", "pdf_extract_batch_size", 
"pdf_extract_cpus_per_task", "pdf_extract_tasks" ] }
- Config:
frozen: bool = True
- Fields:
- field caption_gpus_per_actor: float [Required]#
- field embed_batch_size: int [Required]#
- field embed_gpus_per_actor: float [Required]#
- field embed_initial_actors: int [Required]#
- field embed_max_actors: int [Required]#
- field embed_min_actors: int [Required]#
- field graphic_elements_batch_size: int [Required]#
- field graphic_elements_gpus_per_actor: float [Required]#
- field graphic_elements_initial_actors: int [Required]#
- field graphic_elements_max_actors: int [Required]#
- field graphic_elements_min_actors: int [Required]#
- field nemotron_parse_batch_size: int [Required]#
- field nemotron_parse_gpus_per_actor: float [Required]#
- field nemotron_parse_initial_actors: int [Required]#
- field nemotron_parse_max_actors: int [Required]#
- field nemotron_parse_min_actors: int [Required]#
- field ocr_batch_size: int [Required]#
- field ocr_gpus_per_actor: float [Required]#
- field ocr_initial_actors: int [Required]#
- field ocr_max_actors: int [Required]#
- field ocr_min_actors: int [Required]#
- field page_elements_batch_size: int [Required]#
- field page_elements_gpus_per_actor: float [Required]#
- field page_elements_initial_actors: int [Required]#
- field page_elements_max_actors: int [Required]#
- field page_elements_min_actors: int [Required]#
- field pdf_extract_batch_size: int [Required]#
- field pdf_extract_cpus_per_task: float [Required]#
- field pdf_extract_tasks: int [Required]#
- field table_structure_batch_size: int [Required]#
- field table_structure_gpus_per_actor: float [Required]#
- field table_structure_initial_actors: int [Required]#
- field table_structure_max_actors: int [Required]#
- field table_structure_min_actors: int [Required]#
- pydantic model nemo_retriever.utils.ray_resource_hueristics.Resources[source]#
Bases: BaseModel
Resources and where they came from.
Show JSON schema
{ "title": "Resources", "description": "Resources and where they came from.", "type": "object", "properties": { "cpu_count": { "title": "Cpu Count", "type": "integer" }, "gpu_count": { "title": "Gpu Count", "type": "integer" } }, "required": [ "cpu_count", "gpu_count" ] }
- Config:
frozen: bool = True
- Fields:
- field cpu_count: int [Required]#
- field gpu_count: int [Required]#
- nemo_retriever.utils.ray_resource_hueristics.gather_cluster_resources(
- ray: object,
Gather total and available CPU/GPU resources from a Ray cluster.
- nemo_retriever.utils.ray_resource_hueristics.gather_local_resources() Resources[source]#
Gather local CPU/GPU resources without requiring Ray.
- nemo_retriever.utils.ray_resource_hueristics.get_gpu_memory_info_remote() object[source]#
Return a Ray ObjectRef for _get_gpu_memory_info executed remotely.
- nemo_retriever.utils.ray_resource_hueristics.resolve_requested_plan(
- *,
- cluster_resources: ClusterResources,
- override_embed_initial_actors: int | None = None,
- override_embed_min_actors: int | None = None,
- override_embed_max_actors: int | None = None,
- override_embed_gpus_per_actor: float | None = None,
- override_embed_batch_size: int | None = None,
- override_nemotron_parse_initial_actors: int | None = None,
- override_nemotron_parse_min_actors: int | None = None,
- override_nemotron_parse_max_actors: int | None = None,
- override_nemotron_parse_gpus_per_actor: float | None = None,
- override_nemotron_parse_batch_size: int | None = None,
- override_ocr_initial_actors: int | None = None,
- override_ocr_min_actors: int | None = None,
- override_ocr_max_actors: int | None = None,
- override_ocr_gpus_per_actor: float | None = None,
- override_ocr_batch_size: int | None = None,
- override_page_elements_initial_actors: int | None = None,
- override_page_elements_min_actors: int | None = None,
- override_page_elements_max_actors: int | None = None,
- override_page_elements_gpus_per_actor: float | None = None,
- override_page_elements_batch_size: int | None = None,
- override_table_structure_initial_actors: int | None = None,
- override_table_structure_min_actors: int | None = None,
- override_table_structure_max_actors: int | None = None,
- override_table_structure_gpus_per_actor: float | None = None,
- override_table_structure_batch_size: int | None = None,
- override_graphic_elements_initial_actors: int | None = None,
- override_graphic_elements_min_actors: int | None = None,
- override_graphic_elements_max_actors: int | None = None,
- override_graphic_elements_gpus_per_actor: float | None = None,
- override_graphic_elements_batch_size: int | None = None,
- override_pdf_extract_batch_size: int | None = None,
- override_pdf_extract_cpus_per_task: float | None = None,
- override_pdf_extract_tasks: int | None = None,
- allow_no_gpu: bool = False,
- caption_enabled: bool = False,
- override_caption_gpus_per_actor: float | None = None,
nemo_retriever.utils.remote_auth module#
nemo_retriever.utils.table_and_chart module#
Table/chart/infographic content reconstruction utilities.
Ports bbox-matching and content-reconstruction algorithms from
nemo_retriever.api.util.image_processing.table_and_chart and adds adapter
functions that convert the retriever’s detection/OCR formats into the
pixel-coordinate representations expected by the core joining routines.
- nemo_retriever.utils.table_and_chart.assign_boxes(
- ocr_box: ndarray,
- boxes: ndarray,
- delta: float = 2.0,
- min_overlap: float = 0.25,
Area-normalized overlap matching for table structure.
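To make "area-normalized overlap" concrete, here is a pure-Python sketch in which the intersection is divided by the OCR box's own area. This is an interpretation, not the library's code: it uses plain tuples instead of ndarrays, normalizes by the OCR box area as an assumption, and leaves out the delta parameter because its role is not described here.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) in pixels


def intersection_area(a: Box, b: Box) -> float:
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def sketch_assign_boxes(
    ocr_box: Box, boxes: List[Box], min_overlap: float = 0.25
) -> List[int]:
    """Return indices of boxes covering at least min_overlap of ocr_box's area."""
    area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    if area <= 0:
        return []
    return [
        index
        for index, box in enumerate(boxes)
        if intersection_area(ocr_box, box) / area >= min_overlap
    ]


cells = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20)]
# The OCR box straddles the first two cells: 12/36 and 24/36 of its area.
matches = sketch_assign_boxes((8, 2, 14, 8), cells)
```

Normalizing by the small box's area, rather than by the union, lets a short text snippet match a much larger cell that contains it.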
- nemo_retriever.utils.table_and_chart.build_markdown(df: DataFrame) list[source]#
Convert a dataframe with row_ids/col_ids/text into a markdown matrix.
- nemo_retriever.utils.table_and_chart.display_markdown(data: list, use_header: bool = False) str[source]#
Convert a list-of-lists into a markdown table string.
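One plausible rendering of a list-of-lists as a markdown table is sketched below. The exact output format of display_markdown is not documented on this page, so the separator row and the handling of `use_header=False` are assumptions.

```python
from typing import Any, List


def sketch_display_markdown(data: List[List[Any]], use_header: bool = False) -> str:
    lines: List[str] = []
    for index, row in enumerate(data):
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
        if index == 0 and use_header:
            # Markdown marks the first row as a header with a separator row.
            lines.append("| " + " | ".join("---" for _ in row) + " |")
    return "\n".join(lines)


table = sketch_display_markdown([["a", "b"], [1, 2]], use_header=True)
```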
- nemo_retriever.utils.table_and_chart.join_graphic_elements_and_ocr_output(
- ge_dets: List[Dict[str, Any]],
- ocr_preds: Any,
- crop_hw: Tuple[int, int],
Adapter: convert retriever graphic-elements detections + OCR items, then call the core joining + concatenation functions.
- Parameters:
ge_dets (list[dict]) – From _prediction_to_detections() with chart-element label_names and bbox_xyxy_norm in [0, 1].
ocr_preds (list | dict) – Raw OCR output from NemotronOCRV1.invoke().
crop_hw ((int, int)) – (H, W) of the crop image.
- nemo_retriever.utils.table_and_chart.join_table_structure_and_ocr_output(
- structure_dets: List[Dict[str, Any]],
- ocr_preds: Any,
- crop_hw: Tuple[int, int],
Adapter: convert retriever table-structure detections + OCR items, then call the core joining function.
- Parameters:
structure_dets (list[dict]) – From _prediction_to_detections() with label_names cell/row/column and bbox_xyxy_norm in [0, 1].
ocr_preds (list | dict) – Raw OCR output from NemotronOCRV1.invoke().
crop_hw ((int, int)) – (H, W) of the crop image.
- nemo_retriever.utils.table_and_chart.match_bboxes(
- yolox_box: ndarray,
- ocr_boxes: ndarray,
- already_matched: list | None = None,
- delta: float = 2.0,
Union-based IoU matching for chart graphic elements.
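For comparison with the area-normalized rule used for tables, union-based IoU divides the intersection by the area of the union of both boxes. The sketch below is illustrative only: it uses plain tuples instead of ndarrays, the `threshold` value is hypothetical, and the delta parameter is left out because its role is not described here.

```python
from typing import List, Optional, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sketch_match_bboxes(
    element_box: Box,
    ocr_boxes: List[Box],
    already_matched: Optional[List[int]] = None,
    threshold: float = 0.01,  # hypothetical cut-off
) -> List[int]:
    skip = set(already_matched or [])
    return [
        index
        for index, box in enumerate(ocr_boxes)
        if index not in skip and iou(element_box, box) > threshold
    ]


title_box = (0, 0, 10, 10)
ocr_boxes = [(5, 0, 15, 10), (50, 50, 60, 60), (0, 0, 10, 10)]
# Index 2 overlaps perfectly but is excluded as already matched.
matched = sketch_match_bboxes(title_box, ocr_boxes, already_matched=[2])
```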
- nemo_retriever.utils.table_and_chart.merge_text_in_cell(
- df_cell: DataFrame,
Merge text from multiple OCR items inside one table cell.
- nemo_retriever.utils.table_and_chart.process_yolox_graphic_elements(
- yolox_text_dict: Dict[str, str],
Concatenate chart text by semantic region.
- nemo_retriever.utils.table_and_chart.remove_empty_row(mat: list) list[source]#
Remove empty rows from a matrix.
Module contents#
Utility command/tooling subpackages for nemo_retriever.