nemo_retriever.utils package

Subpackages

Submodules

nemo_retriever.utils.detection_summary module

Shared detection summary logic.

Provides a single function that accumulates per-page detection counters from an iterable of (page_key, metadata_dict, row_dict) tuples. Both the batch pipeline (reading from LanceDB) and the in-process pipeline (reading from a DataFrame) can produce these tuples, so the summary computation is shared between them.

nemo_retriever.utils.detection_summary.collect_detection_summary_from_df(
df,
) → Dict[str, Any]

Collect detection summary from a pandas DataFrame.

nemo_retriever.utils.detection_summary.collect_detection_summary_from_lancedb(
uri: str,
table_name: str,
) → Dict[str, Any] | None

Collect detection summary from a LanceDB table.

nemo_retriever.utils.detection_summary.compute_detection_summary(
rows: Iterable[Tuple[Any, Dict[str, Any], Dict[str, Any]]],
) → Dict[str, Any]

Compute deduplicated detection totals from an iterable of page data.

Each element is (page_key, metadata_dict, row_dict) where:

  • page_key is a hashable value used to deduplicate exploded content rows (e.g. (source_id, page_number)).

  • metadata_dict is the parsed JSON metadata (may contain counters from the LanceDB metadata column or from direct DataFrame columns).

  • row_dict is the raw row dict, used as fallback for counters stored as top-level DataFrame columns (e.g. table, chart lists).
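The deduplication step can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: the function name sum_deduped_counters, the counter names in COUNTER_KEYS, and the pages_seen output key are hypothetical, and the real summary may track different fields.

```python
from typing import Any, Dict, Hashable, Iterable, Tuple

# Hypothetical counter names; the real summary keys are not documented here.
COUNTER_KEYS = ("table", "chart")


def _count(value: Any) -> int:
    """Treat lists as a number of detections and ints as ready-made counts."""
    if isinstance(value, (list, tuple)):
        return len(value)
    if isinstance(value, bool):
        return 0  # bool is a subclass of int; do not count it
    if isinstance(value, int):
        return value
    return 0


def sum_deduped_counters(
    rows: Iterable[Tuple[Hashable, Dict[str, Any], Dict[str, Any]]],
) -> Dict[str, int]:
    seen = set()
    totals = {key: 0 for key in COUNTER_KEYS}
    pages = 0
    for page_key, meta, row in rows:
        if page_key in seen:
            continue  # exploded content row for a page that was already counted
        seen.add(page_key)
        pages += 1
        for key in COUNTER_KEYS:
            # Prefer the parsed metadata, fall back to the top-level row column.
            value = meta[key] if key in meta else row.get(key)
            totals[key] += _count(value)
    return {"pages_seen": pages, **totals}
```

Because page_key is only used for set membership, any hashable value such as a (source_id, page_number) tuple works.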

nemo_retriever.utils.detection_summary.iter_dataframe_rows(df)

Yield (page_key, meta, row_dict) tuples from a pandas DataFrame.

nemo_retriever.utils.detection_summary.print_detection_summary(
summary: Dict[str, Any] | None,
) → None

Print a detection summary to stdout.

nemo_retriever.utils.detection_summary.print_pages_per_second(
processed_pages: int | None,
ingest_elapsed_s: float,
) → None

Print pages-per-second throughput to stdout.

nemo_retriever.utils.detection_summary.print_run_summary(
processed_pages: int | None,
input_path: Path,
vdb_op: str,
vdb_kwargs: Dict[str, Any] | None,
total_time: float,
ingest_only_total_time: float,
ray_dataset_download_total_time: float,
vdb_upload_total_time: float,
evaluation_total_time: float = 0.0,
evaluation_metrics: Dict[str, float] | None = None,
recall_total_time: float = 0.0,
recall_metrics: Dict[str, float] | None = None,
processed_files: int | None = None,
evaluation_label: str = 'Recall',
evaluation_count: int | None = None,
) → Dict[str, Any]

Print a human-readable run summary and return all metrics as a dict.

The returned dict is the authoritative structured representation of every metric collected during the run. Callers should persist it to a JSON file so that the harness can read it directly instead of parsing stdout.
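Persisting the returned dict might look like the sketch below. The helper name persist_run_summary, the file name, and the keys in example_summary are assumptions made for illustration; only the idea of writing the dict to JSON comes from the description above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def persist_run_summary(summary: Dict[str, Any], path: Path) -> None:
    """Write the metrics dict as JSON so a harness can load it without parsing stdout."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(summary, indent=2, sort_keys=True), encoding="utf-8")


# Stand-in for the dict returned by print_run_summary; the keys are illustrative only.
example_summary = {"processed_pages": 120, "total_time": 42.5}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "run_summary.json"
    persist_run_summary(example_summary, out_path)
    # The harness side: read the structured metrics back.
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))
```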

nemo_retriever.utils.detection_summary.write_detection_summary(
path: Path,
summary: Dict[str, Any] | None,
) → None

Write a detection summary dict to a JSON file.

nemo_retriever.utils.hf_cache module

nemo_retriever.utils.hf_cache.collect_hf_runtime_env(
*,
default_hf_hub_offline: str = '0',
extra_keys: Iterable[str] = (),
) → dict[str, str]

Collect HF-related environment variables to forward to Ray workers.

Parameters:
  • default_hf_hub_offline – Value to emit for HF_HUB_OFFLINE when it is not set in the parent process environment. The default keeps online Hub checks enabled.

  • extra_keys – Additional environment variable names to forward if they are set. Duplicates of built-in keys are ignored after their first occurrence.

Returns:

Environment variables for Ray runtime_env["env_vars"]. Explicitly blank environment values are preserved.

Return type:

dict[str, str]
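The collection rules described above (forward keys that are set, keep blank values, skip duplicate keys, and default HF_HUB_OFFLINE) can be sketched as follows. The function name collect_env_sketch, the ASSUMED_HF_KEYS tuple, and the environ parameter are assumptions for illustration; the library's actual built-in key list is not shown in this reference.

```python
import os
from typing import Dict, Iterable, Mapping, Optional

# Assumed built-in keys; the real list lives in the library and may differ.
ASSUMED_HF_KEYS = ("HF_HOME", "HF_TOKEN", "HF_HUB_OFFLINE")


def collect_env_sketch(
    *,
    default_hf_hub_offline: str = "0",
    extra_keys: Iterable[str] = (),
    environ: Optional[Mapping[str, str]] = None,  # hypothetical hook for testing
) -> Dict[str, str]:
    env = os.environ if environ is None else environ
    out: Dict[str, str] = {}
    for key in (*ASSUMED_HF_KEYS, *extra_keys):
        if key in out:
            continue  # duplicates are ignored after their first occurrence
        if key in env:
            out[key] = env[key]  # explicitly blank values are preserved
    # Emit the default only when the parent environment did not set the variable.
    out.setdefault("HF_HUB_OFFLINE", default_hf_hub_offline)
    return out
```

The resulting mapping has the shape expected for Ray's runtime_env["env_vars"], namely a flat dict of string keys to string values.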

nemo_retriever.utils.hf_cache.configure_global_hf_cache_base(
explicit_hf_cache_dir: str | None = None,
) → str

Apply resolved HF cache base to standard Hugging Face env vars.

nemo_retriever.utils.hf_cache.resolve_hf_cache_dir(explicit_hf_cache_dir: str | None = None) → str

Resolve Hugging Face cache dir from explicit arg, env, then default.
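A minimal sketch of that precedence order (explicit argument, then environment, then default) is shown below. The choice of HF_HOME as the environment variable and ~/.cache/huggingface as the default are assumptions based on common Hugging Face conventions; the library may consult different variables. The function name and the environ parameter are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir_sketch(
    explicit: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,  # hypothetical hook for testing
) -> str:
    env = os.environ if environ is None else environ
    if explicit:
        return str(Path(explicit).expanduser())  # 1. explicit argument wins
    from_env = env.get("HF_HOME")  # assumed variable name
    if from_env:
        return str(Path(from_env).expanduser())  # 2. environment
    return str(Path.home() / ".cache" / "huggingface")  # 3. assumed default
```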

nemo_retriever.utils.hf_model_registry module

Central registry of pinned HuggingFace model revisions.

Every from_pretrained call in the codebase should pass revision=get_hf_revision(model_id), and direct hf_hub_download calls should go through hf_hub_download_with_pinned_revision. This ensures that we always download an exact, immutable snapshot rather than tracking the mutable main branch.

To bump a model version, update the corresponding SHA in HF_MODEL_REVISIONS and re-test.

nemo_retriever.utils.hf_model_registry.get_hf_revision(model_id: str, *, strict: bool = True) → str | None

Return the pinned commit SHA for model_id.

Parameters:
  • model_id – HuggingFace model identifier (e.g. "nvidia/parakeet-ctc-1.1b").

  • strict – When True (the default), raise ValueError if model_id has no pinned revision. When False, log a warning and return None so that from_pretrained falls back to the main branch.
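The strict and non-strict lookup behavior can be sketched as below. EXAMPLE_REVISIONS and get_revision_sketch are hypothetical stand-ins for HF_MODEL_REVISIONS and get_hf_revision, and the SHA is a placeholder rather than a real commit.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

# Placeholder registry: the model ID comes from the docs above, the SHA is made up.
EXAMPLE_REVISIONS: Dict[str, str] = {
    "nvidia/parakeet-ctc-1.1b": "0" * 40,
}


def get_revision_sketch(model_id: str, *, strict: bool = True) -> Optional[str]:
    revision = EXAMPLE_REVISIONS.get(model_id)
    if revision is not None:
        return revision
    if strict:
        raise ValueError(f"No pinned revision registered for {model_id!r}")
    # Non-strict mode: warn and let the caller fall back to the main branch.
    logger.warning("No pinned revision for %s; falling back to main", model_id)
    return None
```

Passing the returned value as revision= to from_pretrained is what pins the download; a None value leaves the library default in place.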

nemo_retriever.utils.hf_model_registry.hf_hub_download(*args: Any, **kwargs: Any) → str

Proxy to Hugging Face’s downloader, imported lazily.

nemo_retriever.utils.hf_model_registry.hf_hub_download_with_pinned_revision(
*args: Any,
**kwargs: Any,
) → str

Call hf_hub_download with a registry revision when one is known.

Parameters:
  • *args – Positional arguments forwarded to huggingface_hub.hf_hub_download. When present, the first positional argument is treated as repo_id.

  • **kwargs – Keyword arguments forwarded to huggingface_hub.hf_hub_download. If repo_id has a registered pin and revision is omitted, this helper adds the pinned revision before downloading.

Returns:

The local path returned by huggingface_hub.hf_hub_download.

Return type:

str

Raises:

RuntimeError – If Hugging Face Hub raises while resolving the asset; the original exception is chained with startup-focused context.
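The revision injection and error chaining described above can be sketched as follows. To keep the example runnable without huggingface_hub, the downloader is passed in as a parameter, which is an assumption of this sketch and not part of the documented signature. EXAMPLE_PINS, its contents, and the error message are likewise hypothetical.

```python
from typing import Any, Callable, Dict

# Hypothetical pin table; the repo ID and revision are placeholders.
EXAMPLE_PINS: Dict[str, str] = {"org/model": "abc123"}


def download_with_pin_sketch(
    downloader: Callable[..., str], *args: Any, **kwargs: Any
) -> str:
    # The first positional argument, when present, is treated as repo_id.
    repo_id = args[0] if args else kwargs.get("repo_id")
    pinned = EXAMPLE_PINS.get(repo_id) if isinstance(repo_id, str) else None
    if pinned is not None and "revision" not in kwargs:
        kwargs["revision"] = pinned  # only added when the caller omitted it
    try:
        return downloader(*args, **kwargs)
    except Exception as exc:
        # Chain the original exception so the root cause stays visible.
        raise RuntimeError(
            f"Failed to resolve asset from {repo_id!r} during startup"
        ) from exc
```

An explicit revision passed by the caller is left untouched, so a one-off override remains possible.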

nemo_retriever.utils.hf_model_registry.install_pinned_hf_hub_download(module: Any) → None

Patch an upstream module-level hf_hub_download to use registry pins.

Parameters:

module – Imported upstream module object expected to expose a top-level hf_hub_download function. If the attribute is absent, the helper logs a warning and leaves the module unchanged.

Returns:

The module is mutated in place when patching succeeds.

Return type:

None
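In-place patching of a module-level function might look like the sketch below. It is a simplified illustration, not the library's code: install_pin_wrapper_sketch, EXAMPLE_PINS, and the fake_upstream module are hypothetical, and the wrapper only demonstrates adding a revision keyword.

```python
import logging
import types
from typing import Any, Dict

logger = logging.getLogger(__name__)

# Hypothetical pin table; the repo ID and revision are placeholders.
EXAMPLE_PINS: Dict[str, str] = {"org/model": "abc123"}


def install_pin_wrapper_sketch(module: Any) -> None:
    """Replace module.hf_hub_download with a wrapper that adds a pinned revision."""
    original = getattr(module, "hf_hub_download", None)
    if original is None:
        logger.warning("%r has no hf_hub_download; leaving it unchanged", module)
        return

    def pinned(*args: Any, **kwargs: Any) -> str:
        repo_id = args[0] if args else kwargs.get("repo_id")
        if "revision" not in kwargs and repo_id in EXAMPLE_PINS:
            kwargs["revision"] = EXAMPLE_PINS[repo_id]
        return original(*args, **kwargs)

    module.hf_hub_download = pinned  # mutate the module in place


# Stand-in for an imported upstream module.
fake_upstream = types.ModuleType("fake_upstream")
fake_upstream.hf_hub_download = lambda *args, **kwargs: f"{args[0]}@{kwargs.get('revision')}"
install_pin_wrapper_sketch(fake_upstream)
patched_result = fake_upstream.hf_hub_download("org/model", "config.json")
```

Code inside the upstream module that looks up hf_hub_download by its module-level name at call time will then pick up the wrapper.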

nemo_retriever.utils.input_files module

nemo_retriever.utils.input_files.expand_input_file_patterns(
input_paths: str | PathLike[str] | Iterable[str | PathLike[str]],
) → list[str]

Expand local path and glob inputs, rejecting literal local paths that are missing or that point to a directory.

Empty explicit glob matches are allowed so callers can intentionally describe optional file sets.
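The distinction between literal paths and glob patterns can be sketched as follows. The function name expand_patterns_sketch, the rule used to detect a glob, and the exception raised for directories are assumptions of this sketch; the reference above only states that missing and directory literals are rejected.

```python
import glob
import os
import tempfile
from typing import Iterable, List, Union

PathInput = Union[str, "os.PathLike[str]"]


def expand_patterns_sketch(input_paths: Union[PathInput, Iterable[PathInput]]) -> List[str]:
    if isinstance(input_paths, (str, os.PathLike)):
        input_paths = [input_paths]
    expanded: List[str] = []
    for raw in input_paths:
        text = os.fspath(raw)
        if any(ch in text for ch in "*?["):  # assumed test for "is a glob"
            # An empty match list is allowed for explicit globs.
            expanded.extend(sorted(glob.glob(text, recursive=True)))
            continue
        if not os.path.exists(text):
            raise FileNotFoundError(f"Input path not found: {text}")
        if os.path.isdir(text):
            # Exception type is an assumption of this sketch.
            raise IsADirectoryError(f"Expected a file, got a directory: {text}")
        expanded.append(text)
    return expanded


with tempfile.TemporaryDirectory() as tmp:
    pdf = os.path.join(tmp, "a.pdf")
    open(pdf, "w").close()
    literal_result = [os.path.basename(p) for p in expand_patterns_sketch(pdf)]
    glob_result = [os.path.basename(p) for p in expand_patterns_sketch([os.path.join(tmp, "*.pdf")])]
    empty_result = expand_patterns_sketch(os.path.join(tmp, "*.docx"))
```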

nemo_retriever.utils.input_files.input_type_for_path(input_path: str | PathLike[str]) → str | None

Return the supported ingest input family for input_path’s extension.

nemo_retriever.utils.input_files.raise_input_path_not_found(
input_path: object,
cause: BaseException | None = None,
) → NoReturn

Raise a consistent missing-input-path error.

Parameters:
  • input_path – Path, pattern, or list of paths attempted by the caller or file reader.

  • cause – Optional lower-level exception to preserve as the chained cause.

Raises:

FileNotFoundError – Always raised with a product-level missing-input-path message.

nemo_retriever.utils.input_files.resolve_input_files(
input_path: Path,
input_type: str,
) → list[Path]

nemo_retriever.utils.input_files.resolve_input_patterns(
input_path: Path,
input_type: str,
) → list[str]

nemo_retriever.utils.nvtx module

nemo_retriever.utils.parquet_to_lancedb module

nemo_retriever.utils.ray_resource_hueristics module

pydantic model nemo_retriever.utils.ray_resource_hueristics.ClusterResources

Bases: BaseModel

Detected compute resources and where they came from.

JSON schema:
{
   "title": "ClusterResources",
   "description": "Detected compute resources and where they came from.",
   "type": "object",
   "properties": {
      "total_resources": {
         "$ref": "#/$defs/Resources"
      },
      "available_resources": {
         "$ref": "#/$defs/Resources"
      }
   },
   "$defs": {
      "Resources": {
         "description": "Resources and where they came from.",
         "properties": {
            "cpu_count": {
               "title": "Cpu Count",
               "type": "integer"
            },
            "gpu_count": {
               "title": "Gpu Count",
               "type": "integer"
            }
         },
         "required": [
            "cpu_count",
            "gpu_count"
         ],
         "title": "Resources",
         "type": "object"
      }
   },
   "required": [
      "total_resources",
      "available_resources"
   ]
}

Config:
  • frozen: bool = True

Fields:
field available_resources: Resources [Required]
field total_resources: Resources [Required]

available_cpu_count() → int
available_gpu_count() → int
total_cpu_count() → int
total_gpu_count() → int

pydantic model nemo_retriever.utils.ray_resource_hueristics.GpuInfo

Bases: BaseModel

JSON schema:
{
   "title": "GpuInfo",
   "type": "object",
   "properties": {
      "driver_version": {
         "title": "Driver Version",
         "type": "string"
      },
      "gpu_name": {
         "title": "Gpu Name",
         "type": "string"
      },
      "gpu_uuid": {
         "title": "Gpu Uuid",
         "type": "string"
      },
      "gpu_brand": {
         "title": "Gpu Brand",
         "type": "string"
      },
      "total_mib": {
         "title": "Total Mib",
         "type": "integer"
      },
      "used_mib": {
         "title": "Used Mib",
         "type": "integer"
      },
      "free_mib": {
         "title": "Free Mib",
         "type": "integer"
      }
   },
   "required": [
      "driver_version",
      "gpu_name",
      "gpu_uuid",
      "gpu_brand",
      "total_mib",
      "used_mib",
      "free_mib"
   ]
}

Fields:
field driver_version: str [Required]
field free_mib: int [Required]
field gpu_brand: str [Required]
field gpu_name: str [Required]
field gpu_uuid: str [Required]
field total_mib: int [Required]
field used_mib: int [Required]

pydantic model nemo_retriever.utils.ray_resource_hueristics.NodeGpuInfo

Bases: BaseModel

JSON schema:
{
   "title": "NodeGpuInfo",
   "type": "object",
   "properties": {
      "gpus": {
         "additionalProperties": {
            "$ref": "#/$defs/GpuInfo"
         },
         "title": "Gpus",
         "type": "object"
      }
   },
   "$defs": {
      "GpuInfo": {
         "properties": {
            "driver_version": {
               "title": "Driver Version",
               "type": "string"
            },
            "gpu_name": {
               "title": "Gpu Name",
               "type": "string"
            },
            "gpu_uuid": {
               "title": "Gpu Uuid",
               "type": "string"
            },
            "gpu_brand": {
               "title": "Gpu Brand",
               "type": "string"
            },
            "total_mib": {
               "title": "Total Mib",
               "type": "integer"
            },
            "used_mib": {
               "title": "Used Mib",
               "type": "integer"
            },
            "free_mib": {
               "title": "Free Mib",
               "type": "integer"
            }
         },
         "required": [
            "driver_version",
            "gpu_name",
            "gpu_uuid",
            "gpu_brand",
            "total_mib",
            "used_mib",
            "free_mib"
         ],
         "title": "GpuInfo",
         "type": "object"
      }
   },
   "required": [
      "gpus"
   ]
}

Fields:
field gpus: dict[int, GpuInfo] [Required]

pydantic model nemo_retriever.utils.ray_resource_hueristics.RequestedPlan

Bases: BaseModel

Contains the requested Ray DAG plan for the batch ingest.

JSON schema:
{
   "title": "RequestedPlan",
   "description": "Contains the requested Ray DAG plan for the batch ingest.",
   "type": "object",
   "properties": {
      "embed_initial_actors": {
         "title": "Embed Initial Actors",
         "type": "integer"
      },
      "embed_min_actors": {
         "title": "Embed Min Actors",
         "type": "integer"
      },
      "embed_max_actors": {
         "title": "Embed Max Actors",
         "type": "integer"
      },
      "embed_gpus_per_actor": {
         "title": "Embed Gpus Per Actor",
         "type": "number"
      },
      "embed_batch_size": {
         "title": "Embed Batch Size",
         "type": "integer"
      },
      "nemotron_parse_initial_actors": {
         "title": "Nemotron Parse Initial Actors",
         "type": "integer"
      },
      "nemotron_parse_min_actors": {
         "title": "Nemotron Parse Min Actors",
         "type": "integer"
      },
      "nemotron_parse_max_actors": {
         "title": "Nemotron Parse Max Actors",
         "type": "integer"
      },
      "nemotron_parse_gpus_per_actor": {
         "title": "Nemotron Parse Gpus Per Actor",
         "type": "number"
      },
      "nemotron_parse_batch_size": {
         "title": "Nemotron Parse Batch Size",
         "type": "integer"
      },
      "ocr_initial_actors": {
         "title": "Ocr Initial Actors",
         "type": "integer"
      },
      "ocr_min_actors": {
         "title": "Ocr Min Actors",
         "type": "integer"
      },
      "ocr_max_actors": {
         "title": "Ocr Max Actors",
         "type": "integer"
      },
      "ocr_gpus_per_actor": {
         "title": "Ocr Gpus Per Actor",
         "type": "number"
      },
      "ocr_batch_size": {
         "title": "Ocr Batch Size",
         "type": "integer"
      },
      "page_elements_initial_actors": {
         "title": "Page Elements Initial Actors",
         "type": "integer"
      },
      "page_elements_min_actors": {
         "title": "Page Elements Min Actors",
         "type": "integer"
      },
      "page_elements_max_actors": {
         "title": "Page Elements Max Actors",
         "type": "integer"
      },
      "page_elements_gpus_per_actor": {
         "title": "Page Elements Gpus Per Actor",
         "type": "number"
      },
      "page_elements_batch_size": {
         "title": "Page Elements Batch Size",
         "type": "integer"
      },
      "table_structure_initial_actors": {
         "title": "Table Structure Initial Actors",
         "type": "integer"
      },
      "table_structure_min_actors": {
         "title": "Table Structure Min Actors",
         "type": "integer"
      },
      "table_structure_max_actors": {
         "title": "Table Structure Max Actors",
         "type": "integer"
      },
      "table_structure_gpus_per_actor": {
         "title": "Table Structure Gpus Per Actor",
         "type": "number"
      },
      "table_structure_batch_size": {
         "title": "Table Structure Batch Size",
         "type": "integer"
      },
      "graphic_elements_initial_actors": {
         "title": "Graphic Elements Initial Actors",
         "type": "integer"
      },
      "graphic_elements_min_actors": {
         "title": "Graphic Elements Min Actors",
         "type": "integer"
      },
      "graphic_elements_max_actors": {
         "title": "Graphic Elements Max Actors",
         "type": "integer"
      },
      "graphic_elements_gpus_per_actor": {
         "title": "Graphic Elements Gpus Per Actor",
         "type": "number"
      },
      "graphic_elements_batch_size": {
         "title": "Graphic Elements Batch Size",
         "type": "integer"
      },
      "caption_gpus_per_actor": {
         "title": "Caption Gpus Per Actor",
         "type": "number"
      },
      "pdf_extract_batch_size": {
         "title": "Pdf Extract Batch Size",
         "type": "integer"
      },
      "pdf_extract_cpus_per_task": {
         "title": "Pdf Extract Cpus Per Task",
         "type": "number"
      },
      "pdf_extract_tasks": {
         "title": "Pdf Extract Tasks",
         "type": "integer"
      }
   },
   "required": [
      "embed_initial_actors",
      "embed_min_actors",
      "embed_max_actors",
      "embed_gpus_per_actor",
      "embed_batch_size",
      "nemotron_parse_initial_actors",
      "nemotron_parse_min_actors",
      "nemotron_parse_max_actors",
      "nemotron_parse_gpus_per_actor",
      "nemotron_parse_batch_size",
      "ocr_initial_actors",
      "ocr_min_actors",
      "ocr_max_actors",
      "ocr_gpus_per_actor",
      "ocr_batch_size",
      "page_elements_initial_actors",
      "page_elements_min_actors",
      "page_elements_max_actors",
      "page_elements_gpus_per_actor",
      "page_elements_batch_size",
      "table_structure_initial_actors",
      "table_structure_min_actors",
      "table_structure_max_actors",
      "table_structure_gpus_per_actor",
      "table_structure_batch_size",
      "graphic_elements_initial_actors",
      "graphic_elements_min_actors",
      "graphic_elements_max_actors",
      "graphic_elements_gpus_per_actor",
      "graphic_elements_batch_size",
      "caption_gpus_per_actor",
      "pdf_extract_batch_size",
      "pdf_extract_cpus_per_task",
      "pdf_extract_tasks"
   ]
}

Config:
  • frozen: bool = True

Fields:
field caption_gpus_per_actor: float [Required]
field embed_batch_size: int [Required]
field embed_gpus_per_actor: float [Required]
field embed_initial_actors: int [Required]
field embed_max_actors: int [Required]
field embed_min_actors: int [Required]
field graphic_elements_batch_size: int [Required]
field graphic_elements_gpus_per_actor: float [Required]
field graphic_elements_initial_actors: int [Required]
field graphic_elements_max_actors: int [Required]
field graphic_elements_min_actors: int [Required]
field nemotron_parse_batch_size: int [Required]
field nemotron_parse_gpus_per_actor: float [Required]
field nemotron_parse_initial_actors: int [Required]
field nemotron_parse_max_actors: int [Required]
field nemotron_parse_min_actors: int [Required]
field ocr_batch_size: int [Required]
field ocr_gpus_per_actor: float [Required]
field ocr_initial_actors: int [Required]
field ocr_max_actors: int [Required]
field ocr_min_actors: int [Required]
field page_elements_batch_size: int [Required]
field page_elements_gpus_per_actor: float [Required]
field page_elements_initial_actors: int [Required]
field page_elements_max_actors: int [Required]
field page_elements_min_actors: int [Required]
field pdf_extract_batch_size: int [Required]
field pdf_extract_cpus_per_task: float [Required]
field pdf_extract_tasks: int [Required]
field table_structure_batch_size: int [Required]
field table_structure_gpus_per_actor: float [Required]
field table_structure_initial_actors: int [Required]
field table_structure_max_actors: int [Required]
field table_structure_min_actors: int [Required]

get_embed_batch_size() → int
get_embed_gpus_per_actor() → float
get_embed_initial_actors() → int
get_embed_max_actors() → int
get_embed_min_actors() → int
get_graphic_elements_batch_size() → int
get_graphic_elements_gpus_per_actor() → float
get_graphic_elements_initial_actors() → int
get_graphic_elements_max_actors() → int
get_graphic_elements_min_actors() → int
get_nemotron_parse_batch_size() → int
get_nemotron_parse_gpus_per_actor() → float
get_nemotron_parse_initial_actors() → int
get_nemotron_parse_max_actors() → int
get_nemotron_parse_min_actors() → int
get_ocr_batch_size() → int
get_ocr_gpus_per_actor() → float
get_ocr_initial_actors() → int
get_ocr_max_actors() → int
get_ocr_min_actors() → int
get_page_elements_batch_size() → int
get_page_elements_gpus_per_actor() → float
get_page_elements_initial_actors() → int
get_page_elements_max_actors() → int
get_page_elements_min_actors() → int
get_pdf_extract_batch_size() → int
get_pdf_extract_cpus_per_task() → float
get_pdf_extract_tasks() → int
get_table_structure_batch_size() → int
get_table_structure_gpus_per_actor() → float
get_table_structure_initial_actors() → int
get_table_structure_max_actors() → int
get_table_structure_min_actors() → int

pydantic model nemo_retriever.utils.ray_resource_hueristics.Resources

Bases: BaseModel

Resources and where they came from.

JSON schema:
{
   "title": "Resources",
   "description": "Resources and where they came from.",
   "type": "object",
   "properties": {
      "cpu_count": {
         "title": "Cpu Count",
         "type": "integer"
      },
      "gpu_count": {
         "title": "Gpu Count",
         "type": "integer"
      }
   },
   "required": [
      "cpu_count",
      "gpu_count"
   ]
}

Config:
  • frozen: bool = True

Fields:
field cpu_count: int [Required]
field gpu_count: int [Required]

nemo_retriever.utils.ray_resource_hueristics.gather_cluster_resources(
ray: object,
) → ClusterResources

Gather total and available CPU/GPU resources from a Ray cluster.

nemo_retriever.utils.ray_resource_hueristics.gather_local_resources() → Resources

Gather local CPU/GPU resources without requiring Ray.

nemo_retriever.utils.ray_resource_hueristics.get_gpu_memory_info_remote() → object

Return a Ray ObjectRef for _get_gpu_memory_info executed remotely.

nemo_retriever.utils.ray_resource_hueristics.resolve_requested_plan(
*,
cluster_resources: ClusterResources,
override_embed_initial_actors: int | None = None,
override_embed_min_actors: int | None = None,
override_embed_max_actors: int | None = None,
override_embed_gpus_per_actor: float | None = None,
override_embed_batch_size: int | None = None,
override_nemotron_parse_initial_actors: int | None = None,
override_nemotron_parse_min_actors: int | None = None,
override_nemotron_parse_max_actors: int | None = None,
override_nemotron_parse_gpus_per_actor: float | None = None,
override_nemotron_parse_batch_size: int | None = None,
override_ocr_initial_actors: int | None = None,
override_ocr_min_actors: int | None = None,
override_ocr_max_actors: int | None = None,
override_ocr_gpus_per_actor: float | None = None,
override_ocr_batch_size: int | None = None,
override_page_elements_initial_actors: int | None = None,
override_page_elements_min_actors: int | None = None,
override_page_elements_max_actors: int | None = None,
override_page_elements_gpus_per_actor: float | None = None,
override_page_elements_batch_size: int | None = None,
override_table_structure_initial_actors: int | None = None,
override_table_structure_min_actors: int | None = None,
override_table_structure_max_actors: int | None = None,
override_table_structure_gpus_per_actor: float | None = None,
override_table_structure_batch_size: int | None = None,
override_graphic_elements_initial_actors: int | None = None,
override_graphic_elements_min_actors: int | None = None,
override_graphic_elements_max_actors: int | None = None,
override_graphic_elements_gpus_per_actor: float | None = None,
override_graphic_elements_batch_size: int | None = None,
override_pdf_extract_batch_size: int | None = None,
override_pdf_extract_cpus_per_task: float | None = None,
override_pdf_extract_tasks: int | None = None,
allow_no_gpu: bool = False,
caption_enabled: bool = False,
override_caption_gpus_per_actor: float | None = None,
) → RequestedPlan

nemo_retriever.utils.remote_auth module

nemo_retriever.utils.remote_auth.collect_remote_auth_runtime_env(
*,
extra_keys: Iterable[str] = (),
) → dict[str, str]

Collect non-HF remote auth env vars historically forwarded to Ray workers.

nemo_retriever.utils.remote_auth.resolve_remote_api_key(
explicit_api_key: str | None = None,
) → str | None

Resolve bearer token for hosted NIM endpoints.

nemo_retriever.utils.table_and_chart module

Table/chart/infographic content reconstruction utilities.

Ports bbox-matching and content-reconstruction algorithms from nemo_retriever.api.util.image_processing.table_and_chart and adds adapter functions that convert the retriever’s detection/OCR formats into the pixel-coordinate representations expected by the core joining routines.

nemo_retriever.utils.table_and_chart.assign_boxes(
ocr_box: ndarray,
boxes: ndarray,
delta: float = 2.0,
min_overlap: float = 0.25,
) → ndarray

Area-normalized overlap matching for table structure.

nemo_retriever.utils.table_and_chart.build_markdown(df: DataFrame) → list

Convert a dataframe with row_ids/col_ids/text into a markdown matrix.

nemo_retriever.utils.table_and_chart.display_markdown(data: list, use_header: bool = False) → str

Convert a list-of-lists into a markdown table string.
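One way such a conversion could work is sketched below. The function name to_markdown_table and the exact output layout (cell padding, separator row, handling of ragged rows) are assumptions; the library's display_markdown may format its output differently.

```python
from typing import List, Sequence


def to_markdown_table(data: Sequence[Sequence[object]], use_header: bool = False) -> str:
    """Render rows as pipe-delimited lines; optionally treat the first row as a header."""
    if not data:
        return ""
    width = max(len(row) for row in data)

    def fmt(row: Sequence[object]) -> str:
        # Pad short rows so every line has the same number of cells.
        cells: List[str] = [str(cell) for cell in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(data[0])]
    if use_header:
        lines.append("| " + " | ".join(["---"] * width) + " |")
    lines.extend(fmt(row) for row in data[1:])
    return "\n".join(lines)
```

Without the separator row, most Markdown renderers will not display the result as a table, which is why the header flag matters for downstream rendering.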

nemo_retriever.utils.table_and_chart.join_graphic_elements_and_ocr_output(
ge_dets: List[Dict[str, Any]],
ocr_preds: Any,
crop_hw: Tuple[int, int],
) → str

Adapter: convert retriever graphic-elements detections + OCR items, then call the core joining + concatenation functions.

Parameters:
  • ge_dets (list[dict]) – From _prediction_to_detections() with chart-element label_names and bbox_xyxy_norm in [0, 1].

  • ocr_preds (list | dict) – Raw OCR output from NemotronOCRV1.invoke().

  • crop_hw ((int, int)) – (H, W) of the crop image.

nemo_retriever.utils.table_and_chart.join_table_structure_and_ocr_output(
structure_dets: List[Dict[str, Any]],
ocr_preds: Any,
crop_hw: Tuple[int, int],
) → str

Adapter: convert retriever table-structure detections + OCR items, then call the core joining function.

Parameters:
  • structure_dets (list[dict]) – From _prediction_to_detections() with label_names cell/row/column and bbox_xyxy_norm in [0, 1].

  • ocr_preds (list | dict) – Raw OCR output from NemotronOCRV1.invoke().

  • crop_hw ((int, int)) – (H, W) of the crop image.

nemo_retriever.utils.table_and_chart.match_bboxes(
yolox_box: ndarray,
ocr_boxes: ndarray,
already_matched: list | None = None,
delta: float = 2.0,
) → ndarray

Union-based IoU matching for chart graphic elements.
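For orientation, the two overlap measures named in this module (union-based IoU here, area-normalized overlap in assign_boxes) can be sketched in plain Python as below. The function names, the choice to normalize by the first box's area, and the threshold parameter are assumptions; the sketch does not model the delta or already_matched parameters, whose semantics are not described in this reference.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) in pixel coordinates


def _intersection(a: Box, b: Box) -> float:
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return width * height


def _area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union: overlap divided by the combined covered area."""
    inter = _intersection(a, b)
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


def overlap_fraction(inner: Box, outer: Box) -> float:
    """Area-normalized overlap: share of `inner` that lies inside `outer`."""
    area = _area(inner)
    return _intersection(inner, outer) / area if area > 0 else 0.0


def match_indices(target: Box, candidates: Sequence[Box], threshold: float) -> List[int]:
    """Indices of candidates whose IoU with the target reaches the threshold."""
    return [i for i, box in enumerate(candidates) if iou(target, box) >= threshold]
```

IoU penalizes size mismatch between the two boxes, while the area-normalized variant stays high whenever a small box sits inside a much larger one, which suits text fragments inside table cells.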

nemo_retriever.utils.table_and_chart.merge_text_in_cell(
df_cell: DataFrame,
) → DataFrame

Merge text from multiple OCR items inside one table cell.

nemo_retriever.utils.table_and_chart.process_yolox_graphic_elements(
yolox_text_dict: Dict[str, str],
) → str

Concatenate chart text by semantic region.

nemo_retriever.utils.table_and_chart.remove_empty_row(mat: list) → list

Remove empty rows from a matrix.

nemo_retriever.utils.table_and_chart.reorder_boxes(
boxes: ndarray,
texts: list,
confs: list,
mode: str = 'top_left',
dbscan_eps: float = 10,
) → Tuple[list, list, list]

Reorder OCR boxes in reading order using DBSCAN clustering.

nemo_retriever.utils.table_and_chart.reorder_ocr_for_infographic(
ocr_preds: Any,
crop_hw: Tuple[int, int],
) → str

Adapter: convert OCR items to pixel-coord quad boxes, reorder in reading order, and return joined text.

Module contents

Utility command/tooling subpackages for nemo_retriever.