nemo_retriever.utils package#

Submodules#

nemo_retriever.utils.detection_summary module#

Shared detection summary logic.

Provides a single function that accumulates per-page detection counters from an iterable of (page_key, metadata_dict, row_dict) tuples. Both the batch pipeline (reading from LanceDB) and the in-process pipeline (reading from a DataFrame) can produce these tuples, so the summary computation is shared between them.

nemo_retriever.utils.detection_summary.collect_detection_summary_from_df( df, ) → Dict[str, Any][source]#: Collect detection summary from a pandas DataFrame.

nemo_retriever.utils.detection_summary.collect_detection_summary_from_lancedb( uri: str, table_name: str, ) → Dict[str, Any] | None[source]#: Collect detection summary from a LanceDB table.

nemo_retriever.utils.detection_summary.compute_detection_summary( rows: Iterable[Tuple[Any, Dict[str, Any], Dict[str, Any]]], ) → Dict[str, Any][source]#

Compute deduplicated detection totals from an iterable of page data.

Each element is (page_key, metadata_dict, row_dict) where:

page_key is a hashable value used to deduplicate exploded content rows (e.g. (source_id, page_number)).
metadata_dict is the parsed JSON metadata (may contain counters from the LanceDB metadata column or from direct DataFrame columns).
row_dict is the raw row dict, used as fallback for counters stored as top-level DataFrame columns (e.g. table, chart lists).
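
The deduplication and fallback behavior described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the helper name `sketch_detection_totals`, the `pages` counter, and the shape of the returned dict are assumptions made for illustration. Only the `table` and `chart` counter names are taken from the description above.

```python
from typing import Any, Dict, Hashable, Iterable, Tuple

# Counter names taken from the docstring example; the real set may be larger.
COUNTER_KEYS = ("table", "chart")


def _count(value: Any) -> int:
    """Interpret a counter stored either as an int or as a list of detections."""
    if isinstance(value, (list, tuple)):
        return len(value)
    if isinstance(value, int):
        return value
    return 0


def sketch_detection_totals(
    rows: Iterable[Tuple[Hashable, Dict[str, Any], Dict[str, Any]]],
) -> Dict[str, int]:
    """Hypothetical stand-in: sum counters once per unique page_key."""
    seen = set()
    totals = {key: 0 for key in COUNTER_KEYS}
    totals["pages"] = 0  # assumed extra counter, not confirmed by the source
    for page_key, meta, row in rows:
        if page_key in seen:
            continue  # exploded content rows repeat the same page
        seen.add(page_key)
        totals["pages"] += 1
        for key in COUNTER_KEYS:
            # Prefer the parsed metadata, fall back to the top-level column.
            value = meta[key] if key in meta else row.get(key)
            totals[key] += _count(value)
    return totals


rows = [
    (("doc-a", 1), {"table": 2}, {"chart": ["c1"]}),
    (("doc-a", 1), {"table": 2}, {"chart": ["c1"]}),  # duplicate page, skipped
    (("doc-a", 2), {}, {"table": ["t1"], "chart": []}),
]
summary = sketch_detection_totals(rows)
```

Using a tuple such as (source_id, page_number) as the key keeps the check hashable and cheap, which is why the sketch stores seen keys in a set.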

nemo_retriever.utils.detection_summary.iter_dataframe_rows(df)[source]#: Yield (page_key, meta, row_dict) tuples from a pandas DataFrame.

nemo_retriever.utils.detection_summary.print_detection_summary( summary: Dict[str, Any] | None, ) → None[source]#: Print a detection summary to stdout.

nemo_retriever.utils.detection_summary.print_pages_per_second( processed_pages: int | None, ingest_elapsed_s: float, ) → None[source]#: Print pages-per-second throughput to stdout.

nemo_retriever.utils.detection_summary.print_run_summary( processed_pages: int | None, input_path: Path, vdb_op: str, vdb_kwargs: Dict[str, Any] | None, total_time: float, ingest_only_total_time: float, ray_dataset_download_total_time: float, vdb_upload_total_time: float, evaluation_total_time: float = 0.0, evaluation_metrics: Dict[str, float] | None = None, recall_total_time: float = 0.0, recall_metrics: Dict[str, float] | None = None, processed_files: int | None = None, evaluation_label: str = 'Recall', evaluation_count: int | None = None, ) → Dict[str, Any][source]#

Print a human-readable run summary and return all metrics as a dict.

The returned dict is the authoritative structured representation of every metric collected during the run. Callers should persist it to a JSON file so that the harness can read it directly instead of parsing stdout.
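
A minimal sketch of persisting the returned dict, assuming the caller already holds it. The helper name `persist_run_summary`, the output file name, and the metric keys shown are hypothetical; only the advice to write JSON comes from the docstring above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def persist_run_summary(metrics: Dict[str, Any], path: Path) -> None:
    """Write the metrics dict as JSON so a harness can load it directly."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str keeps non-JSON values such as Path objects serializable.
    text = json.dumps(metrics, indent=2, sort_keys=True, default=str)
    path.write_text(text, encoding="utf-8")


# Stand-in for the dict returned by print_run_summary; keys are illustrative.
metrics = {
    "processed_pages": 120,
    "total_time": 42.5,
    "input_path": Path("data/pdfs"),
}

out_path = Path(tempfile.mkdtemp()) / "run_summary.json"
persist_run_summary(metrics, out_path)
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```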

nemo_retriever.utils.detection_summary.write_detection_summary( path: Path, summary: Dict[str, Any] | None, ) → None[source]#: Write a detection summary dict to a JSON file.

nemo_retriever.utils.hf_cache module#

nemo_retriever.utils.hf_cache.collect_hf_runtime_env( *, default_hf_hub_offline: str = '0', extra_keys: Iterable[str] = (), ) → dict[str, str][source]#

Collect HF-related environment variables to forward to Ray workers.

Parameters:

default_hf_hub_offline – Value to emit for HF_HUB_OFFLINE when it is not set in the parent process environment. The default keeps online Hub checks enabled.
extra_keys – Additional environment variable names to forward if they are set. Duplicates of built-in keys are ignored after their first occurrence.

Returns:

Environment variables for Ray runtime_env["env_vars"]. Explicitly blank environment values are preserved.

Return type:

dict[str, str]
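
The collection rules above (default for HF_HUB_OFFLINE, duplicate keys ignored, blank values preserved) can be sketched as follows. This is an illustrative stand-in, not the module's code: the list of built-in keys is an assumption, and the `environ` parameter is added here only so the example runs without touching the real process environment.

```python
import os
from typing import Dict, Iterable, Mapping, Optional

# Assumed built-in keys; the real list in hf_cache may differ.
ASSUMED_BUILTIN_KEYS = ("HF_HOME", "HF_TOKEN", "HF_HUB_OFFLINE")


def sketch_collect_env(
    *,
    default_hf_hub_offline: str = "0",
    extra_keys: Iterable[str] = (),
    environ: Optional[Mapping[str, str]] = None,  # hypothetical test hook
) -> Dict[str, str]:
    env = os.environ if environ is None else environ
    collected: Dict[str, str] = {}
    for key in (*ASSUMED_BUILTIN_KEYS, *extra_keys):
        if key in collected:
            continue  # a repeated key is ignored after its first occurrence
        if key in env:
            collected[key] = env[key]  # blank strings are kept as-is
    # Emit the default only when the parent environment did not set it.
    collected.setdefault("HF_HUB_OFFLINE", default_hf_hub_offline)
    return collected


fake_env = {"HF_HOME": "/models/hf", "HF_TOKEN": "", "MY_PROXY": "http://proxy:3128"}
env_vars = sketch_collect_env(
    extra_keys=["MY_PROXY", "HF_HOME", "UNSET_KEY"],
    environ=fake_env,
)
```

A dict of this shape is what Ray expects under `runtime_env={"env_vars": ...}` when calling `ray.init`.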

nemo_retriever.utils.hf_cache.configure_global_hf_cache_base( explicit_hf_cache_dir: str | None = None, ) → str[source]#: Apply resolved HF cache base to standard Hugging Face env vars.

nemo_retriever.utils.hf_cache.resolve_hf_cache_dir(explicit_hf_cache_dir: str | None = None) → str[source]#: Resolve Hugging Face cache dir from explicit arg, env, then default.
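
The "explicit arg, env, then default" precedence can be sketched as below. The environment variable consulted (HF_HOME) and the default location (~/.cache/huggingface) are assumptions based on Hugging Face's usual conventions; the source does not state which variable or default the helper uses. The `environ` parameter is a hypothetical hook for testing.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def sketch_resolve_cache_dir(
    explicit_hf_cache_dir: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,  # hypothetical test hook
) -> str:
    env = os.environ if environ is None else environ
    # 1. An explicit argument always wins.
    if explicit_hf_cache_dir:
        return str(Path(explicit_hf_cache_dir).expanduser())
    # 2. Otherwise fall back to the environment (assumed variable name).
    from_env = env.get("HF_HOME")
    if from_env:
        return str(Path(from_env).expanduser())
    # 3. Finally use an assumed default under the user's home directory.
    return str(Path.home() / ".cache" / "huggingface")


explicit_choice = sketch_resolve_cache_dir("/data/hf", {"HF_HOME": "/other"})
env_choice = sketch_resolve_cache_dir(None, {"HF_HOME": "/other"})
default_choice = sketch_resolve_cache_dir(None, {})
```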

nemo_retriever.utils.hf_model_registry module#

Central registry of pinned HuggingFace model revisions.

Every from_pretrained call in the codebase should pass revision=get_hf_revision(model_id), and direct hf_hub_download calls should use hf_hub_download_with_pinned_revision, so that we always download an exact, immutable snapshot rather than tracking the mutable main branch.

To bump a model version, update the corresponding SHA in HF_MODEL_REVISIONS and re-test.

nemo_retriever.utils.hf_model_registry.get_hf_revision(model_id: str, *, strict: bool = True) → str | None[source]#

Return the pinned commit SHA for model_id.

Parameters:

model_id – HuggingFace model identifier (e.g. "nvidia/parakeet-ctc-1.1b").
strict – When True (the default), raise ValueError if model_id has no pinned revision. When False, log a warning and return None so that from_pretrained falls back to the main branch.
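
The strict and non-strict lookup behavior can be sketched with a small stand-in registry. The names `SKETCH_REVISIONS` and `sketch_get_revision`, the model identifier, and the SHA are placeholders invented for this example; they are not entries from HF_MODEL_REVISIONS.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

# Illustrative registry; the SHA below is a placeholder, not a real pin.
SKETCH_REVISIONS: Dict[str, str] = {
    "example-org/example-model": "0123456789abcdef0123456789abcdef01234567",
}


def sketch_get_revision(model_id: str, *, strict: bool = True) -> Optional[str]:
    revision = SKETCH_REVISIONS.get(model_id)
    if revision is not None:
        return revision
    if strict:
        # Failing loudly prevents silently tracking a mutable branch.
        raise ValueError(f"No pinned revision registered for {model_id!r}")
    logger.warning("No pinned revision for %s; using the default branch", model_id)
    return None


pinned = sketch_get_revision("example-org/example-model")
unpinned = sketch_get_revision("unknown/model", strict=False)
```

A caller would then pass the result as the `revision` keyword of `from_pretrained`; passing None leaves the library's default branch selection in place.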

nemo_retriever.utils.hf_model_registry.hf_hub_download(*args: Any, **kwargs: Any) → str[source]#: Proxy to Hugging Face’s downloader, imported lazily.

nemo_retriever.utils.hf_model_registry.hf_hub_download_with_pinned_revision( *args: Any, **kwargs: Any, ) → str[source]#

Call hf_hub_download with a registry revision when one is known.

Parameters:

*args – Positional arguments forwarded to huggingface_hub.hf_hub_download. When present, the first positional argument is treated as repo_id.
**kwargs – Keyword arguments forwarded to huggingface_hub.hf_hub_download. If repo_id has a registered pin and revision is omitted, this helper adds the pinned revision before downloading.

Returns:

The local path returned by huggingface_hub.hf_hub_download.

Return type:

str

Raises:

RuntimeError – If Hugging Face Hub raises while resolving the asset; the original exception is chained with startup-focused context.
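
A minimal sketch of the argument handling described above, using a fake downloader so it runs offline. The wrapper name, the `downloader` parameter, the pin table, and the error message are assumptions for illustration. Catching every Exception is a simplification of "Hugging Face Hub raises".

```python
from typing import Any, Callable, Dict, List, Tuple

# Placeholder pin table; not the real registry contents.
SKETCH_PINS: Dict[str, str] = {
    "example-org/example-model": "0123456789abcdef0123456789abcdef01234567",
}


def sketch_download_with_pin(
    downloader: Callable[..., str], *args: Any, **kwargs: Any
) -> str:
    # The first positional argument, when present, is treated as repo_id.
    repo_id = args[0] if args else kwargs.get("repo_id")
    pinned = SKETCH_PINS.get(repo_id) if isinstance(repo_id, str) else None
    # Only add the pin when the caller did not choose a revision.
    if pinned is not None and "revision" not in kwargs:
        kwargs["revision"] = pinned
    try:
        return downloader(*args, **kwargs)
    except Exception as exc:
        # Chain the original error so the root cause stays visible.
        raise RuntimeError(f"Failed to resolve asset from {repo_id!r}") from exc


calls: List[Tuple[tuple, dict]] = []


def fake_downloader(*args: Any, **kwargs: Any) -> str:
    """Stand-in for huggingface_hub.hf_hub_download; records its arguments."""
    calls.append((args, kwargs))
    return "/tmp/fake/path"


path = sketch_download_with_pin(
    fake_downloader, "example-org/example-model", filename="config.json"
)
sketch_download_with_pin(
    fake_downloader, repo_id="example-org/example-model",
    filename="config.json", revision="my-branch",
)
```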

nemo_retriever.utils.hf_model_registry.install_pinned_hf_hub_download(module: Any) → None[source]#

Patch an upstream module-level hf_hub_download to use registry pins.

Parameters:

module – Imported upstream module object expected to expose a top-level hf_hub_download function. If the attribute is absent, the helper logs a warning and leaves the module unchanged.

Returns:

The module is mutated in place when patching succeeds.

Return type:

None
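
The in-place patching behavior can be sketched with a throwaway module object. This is a hedged illustration rather than the helper's code: `sketch_install_pin` takes the revision as an explicit, hypothetical `pinned_revision` argument, whereas the real helper looks pins up in the registry.

```python
import logging
import types
from typing import Any

logger = logging.getLogger(__name__)


def sketch_install_pin(module: Any, pinned_revision: str) -> None:
    original = getattr(module, "hf_hub_download", None)
    if original is None:
        # Nothing to patch: warn and leave the module unchanged.
        logger.warning("%r has no hf_hub_download; skipping patch", module)
        return

    def patched(*args: Any, **kwargs: Any) -> Any:
        # Respect an explicit revision, otherwise inject the pin.
        kwargs.setdefault("revision", pinned_revision)
        return original(*args, **kwargs)

    module.hf_hub_download = patched  # mutate the module in place


def _fake_download(*args: Any, **kwargs: Any) -> Any:
    """Stand-in downloader that reports which revision it received."""
    return kwargs.get("revision")


upstream = types.ModuleType("fake_upstream")
upstream.hf_hub_download = _fake_download
sketch_install_pin(upstream, "abc123")
result = upstream.hf_hub_download("some/repo", filename="weights.bin")

empty = types.ModuleType("no_downloader")
sketch_install_pin(empty, "abc123")
```

Patching the module attribute only affects code that looks the function up through the module at call time; names imported earlier with `from module import hf_hub_download` keep the original function.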

nemo_retriever.utils.input_files module#

nemo_retriever.utils.input_files.expand_input_file_patterns( input_paths: str | PathLike[str] | Iterable[str | PathLike[str]], ) → list[str][source]#

Expand local path and glob inputs, and reject literal local paths that are missing or that point to a directory.

Empty explicit glob matches are allowed so callers can intentionally describe optional file sets.
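
The distinction between literal paths and glob patterns can be sketched as follows. The function name, the way glob characters are detected, and the exception raised for directories are assumptions; the source only states that such inputs are rejected.

```python
import glob
import os
import tempfile
from typing import Iterable, List, Union

PathInput = Union[str, "os.PathLike[str]"]


def sketch_expand(input_paths: Union[PathInput, Iterable[PathInput]]) -> List[str]:
    if isinstance(input_paths, (str, os.PathLike)):
        input_paths = [input_paths]
    expanded: List[str] = []
    for raw in input_paths:
        text = os.fspath(raw)
        if any(ch in text for ch in "*?["):
            # A glob may legitimately match nothing (optional file sets).
            expanded.extend(sorted(glob.glob(text, recursive=True)))
            continue
        if not os.path.exists(text):
            raise FileNotFoundError(f"Input path not found: {text}")
        if os.path.isdir(text):
            # Assumed exception type for directory literals.
            raise IsADirectoryError(f"Expected a file, got a directory: {text}")
        expanded.append(text)
    return expanded


tmp = tempfile.mkdtemp()
for name in ("a.pdf", "b.pdf"):
    open(os.path.join(tmp, name), "w").close()

matched = sketch_expand(os.path.join(tmp, "*.pdf"))
optional = sketch_expand(os.path.join(tmp, "*.docx"))  # empty match is allowed
```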

nemo_retriever.utils.input_files.input_type_for_path(input_path: str | PathLike[str]) → str | None[source]#: Return the supported ingest input family for input_path’s extension.

nemo_retriever.utils.input_files.raise_input_path_not_found( input_path: object, cause: BaseException | None = None, ) → NoReturn[source]#

Raise a consistent missing-input-path error.

Parameters:

input_path – Path, pattern, or list of paths attempted by the caller or file reader.
cause – Optional lower-level exception to preserve as the chained cause.

Raises:

FileNotFoundError – Always raised with a product-level missing-input-path message.
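
Preserving the lower-level exception as the chained cause can be done with Python's `raise ... from ...` form, sketched below. The function name and the message wording are assumptions; the actual product-level message is not shown in the source.

```python
from typing import NoReturn, Optional


def sketch_raise_not_found(
    input_path: object, cause: Optional[BaseException] = None
) -> NoReturn:
    # "from cause" stores the original error on __cause__ for tracebacks.
    raise FileNotFoundError(f"Input path not found: {input_path!r}") from cause


chained = None
message = ""
try:
    try:
        open("/definitely/missing/file.pdf")
    except OSError as exc:
        sketch_raise_not_found("/definitely/missing/file.pdf", cause=exc)
except FileNotFoundError as err:
    chained = err.__cause__
    message = str(err)
```

Annotating the return type as NoReturn lets type checkers understand that code after the call is unreachable.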

nemo_retriever.utils.input_files.resolve_input_files( input_path: Path, input_type: str, ) → list[Path][source]#

nemo_retriever.utils.input_files.resolve_input_patterns( input_path: Path, input_type: str, ) → list[str][source]#

nemo_retriever.utils.nvtx module#

nemo_retriever.utils.parquet_to_lancedb module#

nemo_retriever.utils.ray_resource_hueristics module#

pydantic model nemo_retriever.utils.ray_resource_hueristics.ClusterResources[source]#

Bases: BaseModel

Detected compute resources and where they came from.

JSON schema:

{
   "title": "ClusterResources",
   "description": "Detected compute resources and where they came from.",
   "type": "object",
   "properties": {
      "total_resources": {
         "$ref": "#/$defs/Resources"
      },
      "available_resources": {
         "$ref": "#/$defs/Resources"
      }
   },
   "$defs": {
      "Resources": {
         "description": "Resources and where they came from.",
         "properties": {
            "cpu_count": {
               "title": "Cpu Count",
               "type": "integer"
            },
            "gpu_count": {
               "title": "Gpu Count",
               "type": "integer"
            }
         },
         "required": [
            "cpu_count",
            "gpu_count"
         ],
         "title": "Resources",
         "type": "object"
      }
   },
   "required": [
      "total_resources",
      "available_resources"
   ]
}

Config:

frozen: bool = True

Fields:

available_resources (nemo_retriever.utils.ray_resource_hueristics.Resources)
total_resources (nemo_retriever.utils.ray_resource_hueristics.Resources)

field available_resources: Resources [Required]#

field total_resources: Resources [Required]#

available_cpu_count() → int[source]#

available_gpu_count() → int[source]#

total_cpu_count() → int[source]#

total_gpu_count() → int[source]#

pydantic model nemo_retriever.utils.ray_resource_hueristics.GpuInfo[source]#

Bases: BaseModel

JSON schema:

{
   "title": "GpuInfo",
   "type": "object",
   "properties": {
      "driver_version": {
         "title": "Driver Version",
         "type": "string"
      },
      "gpu_name": {
         "title": "Gpu Name",
         "type": "string"
      },
      "gpu_uuid": {
         "title": "Gpu Uuid",
         "type": "string"
      },
      "gpu_brand": {
         "title": "Gpu Brand",
         "type": "string"
      },
      "total_mib": {
         "title": "Total Mib",
         "type": "integer"
      },
      "used_mib": {
         "title": "Used Mib",
         "type": "integer"
      },
      "free_mib": {
         "title": "Free Mib",
         "type": "integer"
      }
   },
   "required": [
      "driver_version",
      "gpu_name",
      "gpu_uuid",
      "gpu_brand",
      "total_mib",
      "used_mib",
      "free_mib"
   ]
}

Fields:

driver_version (str)
free_mib (int)
gpu_brand (str)
gpu_name (str)
gpu_uuid (str)
total_mib (int)
used_mib (int)

field driver_version: str [Required]#

field free_mib: int [Required]#

field gpu_brand: str [Required]#

field gpu_name: str [Required]#

field gpu_uuid: str [Required]#

field total_mib: int [Required]#

field used_mib: int [Required]#

pydantic model nemo_retriever.utils.ray_resource_hueristics.NodeGpuInfo[source]#

Bases: BaseModel

JSON schema:

{
   "title": "NodeGpuInfo",
   "type": "object",
   "properties": {
      "gpus": {
         "additionalProperties": {
            "$ref": "#/$defs/GpuInfo"
         },
         "title": "Gpus",
         "type": "object"
      }
   },
   "$defs": {
      "GpuInfo": {
         "properties": {
            "driver_version": {
               "title": "Driver Version",
               "type": "string"
            },
            "gpu_name": {
               "title": "Gpu Name",
               "type": "string"
            },
            "gpu_uuid": {
               "title": "Gpu Uuid",
               "type": "string"
            },
            "gpu_brand": {
               "title": "Gpu Brand",
               "type": "string"
            },
            "total_mib": {
               "title": "Total Mib",
               "type": "integer"
            },
            "used_mib": {
               "title": "Used Mib",
               "type": "integer"
            },
            "free_mib": {
               "title": "Free Mib",
               "type": "integer"
            }
         },
         "required": [
            "driver_version",
            "gpu_name",
            "gpu_uuid",
            "gpu_brand",
            "total_mib",
            "used_mib",
            "free_mib"
         ],
         "title": "GpuInfo",
         "type": "object"
      }
   },
   "required": [
      "gpus"
   ]
}

Fields:

gpus (dict[int, nemo_retriever.utils.ray_resource_hueristics.GpuInfo])

field gpus: dict[int, GpuInfo] [Required]#

pydantic model nemo_retriever.utils.ray_resource_hueristics.RequestedPlan[source]#

Bases: BaseModel

Contains the requested Ray DAG plan for the batch ingest.

JSON schema:

{
   "title": "RequestedPlan",
   "description": "Contains the requested Ray DAG plan for the batch ingest.",
   "type": "object",
   "properties": {
      "embed_initial_actors": {
         "title": "Embed Initial Actors",
         "type": "integer"
      },
      "embed_min_actors": {
         "title": "Embed Min Actors",
         "type": "integer"
      },
      "embed_max_actors": {
         "title": "Embed Max Actors",
         "type": "integer"
      },
      "embed_gpus_per_actor": {
         "title": "Embed Gpus Per Actor",
         "type": "number"
      },
      "embed_batch_size": {
         "title": "Embed Batch Size",
         "type": "integer"
      },
      "nemotron_parse_initial_actors": {
         "title": "Nemotron Parse Initial Actors",
         "type": "integer"
      },
      "nemotron_parse_min_actors": {
         "title": "Nemotron Parse Min Actors",
         "type": "integer"
      },
      "nemotron_parse_max_actors": {
         "title": "Nemotron Parse Max Actors",
         "type": "integer"
      },
      "nemotron_parse_gpus_per_actor": {
         "title": "Nemotron Parse Gpus Per Actor",
         "type": "number"
      },
      "nemotron_parse_batch_size": {
         "title": "Nemotron Parse Batch Size",
         "type": "integer"
      },
      "ocr_initial_actors": {
         "title": "Ocr Initial Actors",
         "type": "integer"
      },
      "ocr_min_actors": {
         "title": "Ocr Min Actors",
         "type": "integer"
      },
      "ocr_max_actors": {
         "title": "Ocr Max Actors",
         "type": "integer"
      },
      "ocr_gpus_per_actor": {
         "title": "Ocr Gpus Per Actor",
         "type": "number"
      },
      "ocr_batch_size": {
         "title": "Ocr Batch Size",
         "type": "integer"
      },
      "page_elements_initial_actors": {
         "title": "Page Elements Initial Actors",
         "type": "integer"
      },
      "page_elements_min_actors": {
         "title": "Page Elements Min Actors",
         "type": "integer"
      },
      "page_elements_max_actors": {
         "title": "Page Elements Max Actors",
         "type": "integer"
      },
      "page_elements_gpus_per_actor": {
         "title": "Page Elements Gpus Per Actor",
         "type": "number"
      },
      "page_elements_batch_size": {
         "title": "Page Elements Batch Size",
         "type": "integer"
      },
      "table_structure_initial_actors": {
         "title": "Table Structure Initial Actors",
         "type": "integer"
      },
      "table_structure_min_actors": {
         "title": "Table Structure Min Actors",
         "type": "integer"
      },
      "table_structure_max_actors": {
         "title": "Table Structure Max Actors",
         "type": "integer"
      },
      "table_structure_gpus_per_actor": {
         "title": "Table Structure Gpus Per Actor",
         "type": "number"
      },
      "table_structure_batch_size": {
         "title": "Table Structure Batch Size",
         "type": "integer"
      },
      "graphic_elements_initial_actors": {
         "title": "Graphic Elements Initial Actors",
         "type": "integer"
      },
      "graphic_elements_min_actors": {
         "title": "Graphic Elements Min Actors",
         "type": "integer"
      },
      "graphic_elements_max_actors": {
         "title": "Graphic Elements Max Actors",
         "type": "integer"
      },
      "graphic_elements_gpus_per_actor": {
         "title": "Graphic Elements Gpus Per Actor",
         "type": "number"
      },
      "graphic_elements_batch_size": {
         "title": "Graphic Elements Batch Size",
         "type": "integer"
      },
      "caption_gpus_per_actor": {
         "title": "Caption Gpus Per Actor",
         "type": "number"
      },
      "pdf_extract_batch_size": {
         "title": "Pdf Extract Batch Size",
         "type": "integer"
      },
      "pdf_extract_cpus_per_task": {
         "title": "Pdf Extract Cpus Per Task",
         "type": "number"
      },
      "pdf_extract_tasks": {
         "title": "Pdf Extract Tasks",
         "type": "integer"
      }
   },
   "required": [
      "embed_initial_actors",
      "embed_min_actors",
      "embed_max_actors",
      "embed_gpus_per_actor",
      "embed_batch_size",
      "nemotron_parse_initial_actors",
      "nemotron_parse_min_actors",
      "nemotron_parse_max_actors",
      "nemotron_parse_gpus_per_actor",
      "nemotron_parse_batch_size",
      "ocr_initial_actors",
      "ocr_min_actors",
      "ocr_max_actors",
      "ocr_gpus_per_actor",
      "ocr_batch_size",
      "page_elements_initial_actors",
      "page_elements_min_actors",
      "page_elements_max_actors",
      "page_elements_gpus_per_actor",
      "page_elements_batch_size",
      "table_structure_initial_actors",
      "table_structure_min_actors",
      "table_structure_max_actors",
      "table_structure_gpus_per_actor",
      "table_structure_batch_size",
      "graphic_elements_initial_actors",
      "graphic_elements_min_actors",
      "graphic_elements_max_actors",
      "graphic_elements_gpus_per_actor",
      "graphic_elements_batch_size",
      "caption_gpus_per_actor",
      "pdf_extract_batch_size",
      "pdf_extract_cpus_per_task",
      "pdf_extract_tasks"
   ]
}

Config:

frozen: bool = True

Fields:

caption_gpus_per_actor (float)
embed_batch_size (int)
embed_gpus_per_actor (float)
embed_initial_actors (int)
embed_max_actors (int)
embed_min_actors (int)
graphic_elements_batch_size (int)
graphic_elements_gpus_per_actor (float)
graphic_elements_initial_actors (int)
graphic_elements_max_actors (int)
graphic_elements_min_actors (int)
nemotron_parse_batch_size (int)
nemotron_parse_gpus_per_actor (float)
nemotron_parse_initial_actors (int)
nemotron_parse_max_actors (int)
nemotron_parse_min_actors (int)
ocr_batch_size (int)
ocr_gpus_per_actor (float)
ocr_initial_actors (int)
ocr_max_actors (int)
ocr_min_actors (int)
page_elements_batch_size (int)
page_elements_gpus_per_actor (float)
page_elements_initial_actors (int)
page_elements_max_actors (int)
page_elements_min_actors (int)
pdf_extract_batch_size (int)
pdf_extract_cpus_per_task (float)
pdf_extract_tasks (int)
table_structure_batch_size (int)
table_structure_gpus_per_actor (float)
table_structure_initial_actors (int)
table_structure_max_actors (int)
table_structure_min_actors (int)

field caption_gpus_per_actor: float [Required]#

field embed_batch_size: int [Required]#

field embed_gpus_per_actor: float [Required]#

field embed_initial_actors: int [Required]#

field embed_max_actors: int [Required]#

field embed_min_actors: int [Required]#

field graphic_elements_batch_size: int [Required]#

field graphic_elements_gpus_per_actor: float [Required]#

field graphic_elements_initial_actors: int [Required]#

field graphic_elements_max_actors: int [Required]#

field graphic_elements_min_actors: int [Required]#

field nemotron_parse_batch_size: int [Required]#

field nemotron_parse_gpus_per_actor: float [Required]#

field nemotron_parse_initial_actors: int [Required]#

field nemotron_parse_max_actors: int [Required]#

field nemotron_parse_min_actors: int [Required]#

field ocr_batch_size: int [Required]#

field ocr_gpus_per_actor: float [Required]#

field ocr_initial_actors: int [Required]#

field ocr_max_actors: int [Required]#

field ocr_min_actors: int [Required]#

field page_elements_batch_size: int [Required]#

field page_elements_gpus_per_actor: float [Required]#

field page_elements_initial_actors: int [Required]#

field page_elements_max_actors: int [Required]#

field page_elements_min_actors: int [Required]#

field pdf_extract_batch_size: int [Required]#

field pdf_extract_cpus_per_task: float [Required]#

field pdf_extract_tasks: int [Required]#

field table_structure_batch_size: int [Required]#

field table_structure_gpus_per_actor: float [Required]#

field table_structure_initial_actors: int [Required]#

field table_structure_max_actors: int [Required]#

field table_structure_min_actors: int [Required]#

get_embed_batch_size() → int[source]#

get_embed_gpus_per_actor() → float[source]#

get_embed_initial_actors() → int[source]#

get_embed_max_actors() → int[source]#

get_embed_min_actors() → int[source]#

get_graphic_elements_batch_size() → int[source]#

get_graphic_elements_gpus_per_actor() → float[source]#

get_graphic_elements_initial_actors() → int[source]#

get_graphic_elements_max_actors() → int[source]#

get_graphic_elements_min_actors() → int[source]#

get_nemotron_parse_batch_size() → int[source]#

get_nemotron_parse_gpus_per_actor() → float[source]#

get_nemotron_parse_initial_actors() → int[source]#

get_nemotron_parse_max_actors() → int[source]#

get_nemotron_parse_min_actors() → int[source]#

get_ocr_batch_size() → int[source]#

get_ocr_gpus_per_actor() → float[source]#

get_ocr_initial_actors() → int[source]#

get_ocr_max_actors() → int[source]#

get_ocr_min_actors() → int[source]#

get_page_elements_batch_size() → int[source]#

get_page_elements_gpus_per_actor() → float[source]#

get_page_elements_initial_actors() → int[source]#

get_page_elements_max_actors() → int[source]#

get_page_elements_min_actors() → int[source]#

get_pdf_extract_batch_size() → int[source]#

get_pdf_extract_cpus_per_task() → float[source]#

get_pdf_extract_tasks() → int[source]#

get_table_structure_batch_size() → int[source]#

get_table_structure_gpus_per_actor() → float[source]#

get_table_structure_initial_actors() → int[source]#

get_table_structure_max_actors() → int[source]#

get_table_structure_min_actors() → int[source]#

pydantic model nemo_retriever.utils.ray_resource_hueristics.Resources[source]#

Bases: BaseModel

Resources and where they came from.

JSON schema:

{
   "title": "Resources",
   "description": "Resources and where they came from.",
   "type": "object",
   "properties": {
      "cpu_count": {
         "title": "Cpu Count",
         "type": "integer"
      },
      "gpu_count": {
         "title": "Gpu Count",
         "type": "integer"
      }
   },
   "required": [
      "cpu_count",
      "gpu_count"
   ]
}

Config:

frozen: bool = True

Fields:

cpu_count (int)
gpu_count (int)

field cpu_count: int [Required]#

field gpu_count: int [Required]#

nemo_retriever.utils.ray_resource_hueristics.gather_cluster_resources( ray: object, ) → ClusterResources[source]#: Gather total and available CPU/GPU resources from a Ray cluster.

nemo_retriever.utils.ray_resource_hueristics.gather_local_resources() → Resources[source]#: Gather local CPU/GPU resources without requiring Ray.

nemo_retriever.utils.ray_resource_hueristics.get_gpu_memory_info_remote() → object[source]#: Return a Ray ObjectRef for _get_gpu_memory_info executed remotely.

nemo_retriever.utils.ray_resource_hueristics.resolve_requested_plan( *, cluster_resources: ClusterResources, override_embed_initial_actors: int | None = None, override_embed_min_actors: int | None = None, override_embed_max_actors: int | None = None, override_embed_gpus_per_actor: float | None = None, override_embed_batch_size: int | None = None, override_nemotron_parse_initial_actors: int | None = None, override_nemotron_parse_min_actors: int | None = None, override_nemotron_parse_max_actors: int | None = None, override_nemotron_parse_gpus_per_actor: float | None = None, override_nemotron_parse_batch_size: int | None = None, override_ocr_initial_actors: int | None = None, override_ocr_min_actors: int | None = None, override_ocr_max_actors: int | None = None, override_ocr_gpus_per_actor: float | None = None, override_ocr_batch_size: int | None = None, override_page_elements_initial_actors: int | None = None, override_page_elements_min_actors: int | None = None, override_page_elements_max_actors: int | None = None, override_page_elements_gpus_per_actor: float | None = None, override_page_elements_batch_size: int | None = None, override_table_structure_initial_actors: int | None = None, override_table_structure_min_actors: int | None = None, override_table_structure_max_actors: int | None = None, override_table_structure_gpus_per_actor: float | None = None, override_table_structure_batch_size: int | None = None, override_graphic_elements_initial_actors: int | None = None, override_graphic_elements_min_actors: int | None = None, override_graphic_elements_max_actors: int | None = None, override_graphic_elements_gpus_per_actor: float | None = None, override_graphic_elements_batch_size: int | None = None, override_pdf_extract_batch_size: int | None = None, override_pdf_extract_cpus_per_task: float | None = None, override_pdf_extract_tasks: int | None = None, allow_no_gpu: bool = False, caption_enabled: bool = False, override_caption_gpus_per_actor: float | None = None, ) → RequestedPlan[source]#
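
The signature suggests that each `override_*` argument replaces a computed value when it is not None. A minimal sketch of that pattern follows, reduced to two OCR fields. The heuristic shown (one actor per available GPU, one GPU per actor) is invented for the example and is not the module's actual sizing logic. `SketchPlan` is a frozen dataclass standing in for the frozen pydantic model.

```python
from dataclasses import dataclass
from typing import Optional, TypeVar

T = TypeVar("T")


def pick(override: Optional[T], heuristic: T) -> T:
    """Use the override when one was given, even if it is 0 or 0.0."""
    return heuristic if override is None else override


@dataclass(frozen=True)
class SketchPlan:
    ocr_max_actors: int
    ocr_gpus_per_actor: float


def sketch_resolve_plan(
    *,
    available_gpus: int,
    override_ocr_max_actors: Optional[int] = None,
    override_ocr_gpus_per_actor: Optional[float] = None,
) -> SketchPlan:
    # Placeholder heuristics, assumed only for this illustration.
    heuristic_max_actors = max(1, available_gpus)
    heuristic_gpus_per_actor = 1.0
    return SketchPlan(
        ocr_max_actors=pick(override_ocr_max_actors, heuristic_max_actors),
        ocr_gpus_per_actor=pick(override_ocr_gpus_per_actor, heuristic_gpus_per_actor),
    )


default_plan = sketch_resolve_plan(available_gpus=4)
tuned_plan = sketch_resolve_plan(
    available_gpus=4, override_ocr_max_actors=2, override_ocr_gpus_per_actor=0.5
)
```

Comparing against None instead of testing truthiness matters here, because a caller may deliberately pass 0 or 0.0 as an override.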

nemo_retriever.utils.remote_auth module#

nemo_retriever.utils.remote_auth.collect_remote_auth_runtime_env( *, extra_keys: Iterable[str] = (), ) → dict[str, str][source]#: Collect non-HF remote auth env vars historically forwarded to Ray workers.

nemo_retriever.utils.remote_auth.resolve_remote_api_key( explicit_api_key: str | None = None, ) → str | None[source]#: Resolve bearer token for hosted NIM endpoints.

nemo_retriever.utils.table_and_chart module#

Table/chart/infographic content reconstruction utilities.

Ports bbox-matching and content-reconstruction algorithms from nemo_retriever.api.util.image_processing.table_and_chart and adds adapter functions that convert the retriever’s detection/OCR formats into the pixel-coordinate representations expected by the core joining routines.

nemo_retriever.utils.table_and_chart.assign_boxes( ocr_box: ndarray, boxes: ndarray, delta: float = 2.0, min_overlap: float = 0.25, ) → ndarray[source]#: Area-normalized overlap matching for table structure.

nemo_retriever.utils.table_and_chart.build_markdown(df: DataFrame) → list[source]#: Convert a dataframe with row_ids/col_ids/text into a markdown matrix.

nemo_retriever.utils.table_and_chart.display_markdown(data: list, use_header: bool = False) → str[source]#: Convert a list-of-lists into a markdown table string.
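
Rendering a list-of-lists as a markdown table can be sketched as below. This is an illustrative stand-in: the treatment of ragged rows and the empty header emitted when use_header is False are assumptions, and the real function's exact output format may differ.

```python
from typing import List


def sketch_display_markdown(data: List[list], use_header: bool = False) -> str:
    if not data:
        return ""
    width = max(len(row) for row in data)
    # Stringify cells and pad short rows so every row has the same width.
    rows = [[str(cell) for cell in row] + [""] * (width - len(row)) for row in data]
    if use_header:
        header, body = rows[0], rows[1:]
    else:
        header, body = [""] * width, rows  # assumed: blank header row

    def fmt(cells: List[str]) -> str:
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(header), fmt(["---"] * width)]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


table = sketch_display_markdown([["Name", "Qty"], ["apple", 3]], use_header=True)
```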

nemo_retriever.utils.table_and_chart.join_graphic_elements_and_ocr_output( ge_dets: List[Dict[str, Any]], ocr_preds: Any, crop_hw: Tuple[int, int], ) → str[source]#

Adapter: convert retriever graphic-elements detections + OCR items, then call the core joining + concatenation functions.

Parameters:

ge_dets (list[dict]) – From _prediction_to_detections() with chart-element label_names and bbox_xyxy_norm in [0, 1].
ocr_preds (list | dict) – Raw OCR output from NemotronOCRV1.invoke().
crop_hw ((int, int)) – (H, W) of the crop image.

nemo_retriever.utils.table_and_chart.join_table_structure_and_ocr_output( structure_dets: List[Dict[str, Any]], ocr_preds: Any, crop_hw: Tuple[int, int], ) → str[source]#

Adapter: convert retriever table-structure detections + OCR items, then call the core joining function.

Parameters:

structure_dets (list[dict]) – From _prediction_to_detections() with label_names cell/row/column and bbox_xyxy_norm in [0, 1].
ocr_preds (list | dict) – Raw OCR output from NemotronOCRV1.invoke().
crop_hw ((int, int)) – (H, W) of the crop image.
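
Both adapters take boxes as bbox_xyxy_norm in [0, 1] together with crop_hw given as (H, W). The conversion to pixel coordinates can be sketched as below; the function name is hypothetical and the adapters may additionally clip or round values.

```python
from typing import Sequence, Tuple


def norm_xyxy_to_pixels(
    bbox_xyxy_norm: Sequence[float], crop_hw: Tuple[int, int]
) -> Tuple[float, float, float, float]:
    height, width = crop_hw  # note the (H, W) order
    x1, y1, x2, y2 = bbox_xyxy_norm
    # x coordinates scale with width, y coordinates with height.
    return (x1 * width, y1 * height, x2 * width, y2 * height)


box = norm_xyxy_to_pixels([0.25, 0.5, 0.75, 1.0], crop_hw=(200, 400))
```

Because crop_hw is height first, swapping the two values is an easy mistake that stretches boxes along the wrong axis.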

nemo_retriever.utils.table_and_chart.match_bboxes( yolox_box: ndarray, ocr_boxes: ndarray, already_matched: list | None = None, delta: float = 2.0, ) → ndarray[source]#: Union-based IoU matching for chart graphic elements.

nemo_retriever.utils.table_and_chart.merge_text_in_cell( df_cell: DataFrame, ) → DataFrame[source]#: Merge text from multiple OCR items inside one table cell.

nemo_retriever.utils.table_and_chart.process_yolox_graphic_elements( yolox_text_dict: Dict[str, str], ) → str[source]#: Concatenate chart text by semantic region.

nemo_retriever.utils.table_and_chart.remove_empty_row(mat: list) → list[source]#: Remove empty rows from a matrix.

nemo_retriever.utils.table_and_chart.reorder_boxes( boxes: ndarray, texts: list, confs: list, mode: str = 'top_left', dbscan_eps: float = 10, ) → Tuple[list, list, list][source]#: Reorder OCR boxes in reading order using DBSCAN clustering.

nemo_retriever.utils.table_and_chart.reorder_ocr_for_infographic( ocr_preds: Any, crop_hw: Tuple[int, int], ) → str[source]#: Adapter: convert OCR items to pixel-coord quad boxes, reorder in reading order, and return joined text.

Module contents#

Utility command/tooling subpackages for nemo_retriever.