nv_ingest_api.internal.primitives.nim.model_interface package#

Submodules#

nv_ingest_api.internal.primitives.nim.model_interface.cached module#

class nv_ingest_api.internal.primitives.nim.model_interface.cached.CachedModelInterface[source]#

Bases: ModelInterface

An interface for handling inference with a Cached model, supporting both gRPC and HTTP protocols, including batched input.

format_input(
data: Dict[str, Any],
protocol: str,
max_batch_size: int,
**kwargs,
) Any[source]#

Format input data for the specified protocol (“grpc” or “http”), handling batched images. Additionally, returns batched data that coalesces the original image arrays and their dimensions in the same order as provided.

Parameters:
  • data (dict of str -> Any) – The input data dictionary, expected to contain “image_arrays” (a list of np.ndarray).

  • protocol (str) – The protocol to use, “grpc” or “http”.

  • max_batch_size (int) – The maximum number of images per batch.

Returns:

A tuple (formatted_batches, formatted_batch_data) where:
  • For gRPC: formatted_batches is a list of NumPy arrays, each of shape (B, H, W, C) with B <= max_batch_size.

  • For HTTP: formatted_batches is a list of JSON-serializable dict payloads.

  • In both cases, formatted_batch_data is a list of dicts with the keys:

    “image_arrays”: the list of original np.ndarray images for that batch, and “image_dims”: a list of (height, width) tuples for each image in the batch.

Return type:

tuple

Raises:
  • KeyError – If “image_arrays” is missing in the data dictionary.

  • ValueError – If the protocol is invalid, or if no valid images are found.

name() str[source]#

Get the name of the model interface.

Returns:

The name of the model interface (“Cached”).

Return type:

str

parse_output(
response: Any,
protocol: str,
data: Dict[str, Any] | None = None,
**kwargs: Any,
) Any[source]#

Parse the output from the Cached model’s inference response.

Parameters:
  • response (Any) – The raw response from the model inference.

  • protocol (str) – The protocol used (“grpc” or “http”).

  • data (dict of str -> Any, optional) – Additional input data (unused here, but available for consistency).

  • **kwargs (Any) – Additional keyword arguments for future compatibility.

Returns:

The parsed output data (e.g., list of strings), depending on the protocol.

Return type:

Any

Raises:
  • ValueError – If the protocol is invalid.

  • RuntimeError – If the HTTP response is not as expected (missing ‘data’ key).

prepare_data_for_inference(
data: Dict[str, Any],
) Dict[str, Any][source]#

Decode base64-encoded images into NumPy arrays, storing them in data[“image_arrays”].

Parameters:

data (dict of str -> Any) –

The input data containing either:
  • “base64_image”: a single base64-encoded image, or

  • “base64_images”: a list of base64-encoded images.

Returns:

The updated data dictionary with decoded image arrays stored in “image_arrays”, where each array has shape (H, W, C).

Return type:

dict of str -> Any

Raises:
  • KeyError – If neither ‘base64_image’ nor ‘base64_images’ is provided.

  • ValueError – If ‘base64_images’ is provided but is not a list.
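
Example (a minimal sketch of the decode-and-batch flow; a small synthetic PNG stands in for a real page image):

    import base64
    import io

    import numpy as np
    from PIL import Image

    from nv_ingest_api.internal.primitives.nim.model_interface.cached import CachedModelInterface

    # Build a small base64-encoded PNG to stand in for a real image.
    buf = io.BytesIO()
    Image.fromarray(np.zeros((32, 32, 3), dtype=np.uint8)).save(buf, format="PNG")
    b64_img = base64.b64encode(buf.getvalue()).decode("utf-8")

    interface = CachedModelInterface()

    # Decode base64-encoded images into NumPy arrays under "image_arrays".
    data = interface.prepare_data_for_inference({"base64_images": [b64_img, b64_img]})

    # Batch the decoded arrays for gRPC inference, at most 8 images per batch.
    batches, batch_data = interface.format_input(data, protocol="grpc", max_batch_size=8)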

process_inference_results(
output: Any,
protocol: str,
**kwargs: Any,
) Any[source]#

Process inference results for the Cached model.

Parameters:
  • output (Any) – The raw output from the model.

  • protocol (str) – The inference protocol used (“grpc” or “http”).

  • **kwargs (Any) – Additional parameters for post-processing (not used here).

Returns:

The processed inference results; in this implementation, the output is returned as-is.

Return type:

Any

nv_ingest_api.internal.primitives.nim.model_interface.decorators module#

nv_ingest_api.internal.primitives.nim.model_interface.decorators.multiprocessing_cache(max_calls)[source]#

A decorator that creates a global cache shared between multiple processes. The cache is invalidated after max_calls number of accesses.

Parameters:

max_calls (int) – The number of calls after which the cache is cleared.

Returns:

The decorated function with global cache and invalidation logic.

Return type:

function
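
Example (a minimal sketch; fetch_model_metadata is a hypothetical function used only for illustration):

    from nv_ingest_api.internal.primitives.nim.model_interface.decorators import multiprocessing_cache

    @multiprocessing_cache(max_calls=100)
    def fetch_model_metadata(endpoint: str) -> dict:
        # Expensive lookup whose result is shared across worker processes;
        # the shared cache is cleared after 100 accesses.
        return {"endpoint": endpoint}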

nv_ingest_api.internal.primitives.nim.model_interface.deplot module#

class nv_ingest_api.internal.primitives.nim.model_interface.deplot.DeplotModelInterface[source]#

Bases: ModelInterface

An interface for handling inference with a Deplot model, supporting both gRPC and HTTP protocols and handling one or more base64-encoded images (‘base64_images’).

format_input(
data: Dict[str, Any],
protocol: str,
max_batch_size: int,
**kwargs,
) Any[source]#

Format input data for the specified protocol (gRPC or HTTP) for Deplot. For HTTP, multiple messages are constructed—one per image batch—along with corresponding batch data carrying the original image arrays and their dimensions.

Parameters:
  • data (dict of str -> Any) – The input data dictionary, expected to contain “image_arrays” (a list of np.ndarray).

  • protocol (str) – The protocol to use, “grpc” or “http”.

  • max_batch_size (int) – The maximum number of images per batch.

  • kwargs (dict) – Additional parameters to pass to the payload preparation (for HTTP).

Returns:

(formatted_batches, formatted_batch_data) where:
  • For gRPC: formatted_batches is a list of NumPy arrays, each of shape (B, H, W, C) with B <= max_batch_size.

  • For HTTP: formatted_batches is a list of JSON-serializable payload dicts.

  • In both cases, formatted_batch_data is a list of dicts containing:

    “image_arrays”: the list of original np.ndarray images for that batch, and “image_dims”: a list of (height, width) tuples for each image in the batch.

Return type:

tuple

Raises:
  • KeyError – If “image_arrays” is missing in the data dictionary.

  • ValueError – If the protocol is invalid, or if no valid images are found.

name() str[source]#

Get the name of the model interface.

Returns:

The name of the model interface (“Deplot”).

Return type:

str

parse_output(
response: Any,
protocol: str,
data: Dict[str, Any] | None = None,
**kwargs,
) Any[source]#

Parse the model’s inference response.

prepare_data_for_inference(
data: Dict[str, Any],
) Dict[str, Any][source]#

Prepare input data by decoding one or more base64-encoded images into NumPy arrays.

Parameters:

data (dict) – The input data containing either ‘base64_image’ (single image) or ‘base64_images’ (multiple images).

Returns:

The updated data dictionary with ‘image_arrays’: a list of decoded NumPy arrays.

Return type:

dict
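
Example (a minimal sketch of the Deplot HTTP flow; b64_chart is assumed to be a base64-encoded chart image):

    from nv_ingest_api.internal.primitives.nim.model_interface.deplot import DeplotModelInterface

    interface = DeplotModelInterface()

    # Decode the single base64 image into "image_arrays".
    data = interface.prepare_data_for_inference({"base64_image": b64_chart})

    # Build one JSON-serializable payload per batch for the HTTP endpoint.
    payloads, batch_data = interface.format_input(data, protocol="http", max_batch_size=1)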

process_inference_results(
output: Any,
protocol: str,
**kwargs,
) Any[source]#

Process inference results for the Deplot model.

Parameters:
  • output (Any) – The raw output from the model.

  • protocol (str) – The protocol used for inference (gRPC or HTTP).

Returns:

The processed inference results.

Return type:

Any

nv_ingest_api.internal.primitives.nim.model_interface.helpers module#

nv_ingest_api.internal.primitives.nim.model_interface.helpers.get_model_name(
http_endpoint: str,
default_model_name,
metadata_endpoint: str = '/v1/metadata',
model_info_field: str = 'modelInfo',
) str[source]#

Get the model name of the server from its metadata endpoint.

Parameters:
  • http_endpoint (str) – The HTTP endpoint of the server.

  • default_model_name – The model name to fall back to if the server metadata does not provide one.

  • metadata_endpoint (str, optional) – The metadata endpoint to query (default: “/v1/metadata”).

  • model_info_field (str, optional) – The field containing the model info in the response (default: “modelInfo”).

Returns:

The model name of the server, or an empty string if unavailable.

Return type:

str

nv_ingest_api.internal.primitives.nim.model_interface.helpers.get_version(
http_endpoint: str,
metadata_endpoint: str = '/v1/metadata',
version_field: str = 'version',
) str[source]#

Get the version of the server from its metadata endpoint.

Parameters:
  • http_endpoint (str) – The HTTP endpoint of the server.

  • metadata_endpoint (str, optional) – The metadata endpoint to query (default: “/v1/metadata”).

  • version_field (str, optional) – The field containing the version in the response (default: “version”).

Returns:

The version of the server, or an empty string if unavailable.

Return type:

str

nv_ingest_api.internal.primitives.nim.model_interface.helpers.is_ready(http_endpoint: str, ready_endpoint: str) bool[source]#

Check if the server at the given endpoint is ready.

Parameters:
  • http_endpoint (str) – The HTTP endpoint of the server.

  • ready_endpoint (str) – The specific ready-check endpoint.

Returns:

True if the server is ready, False otherwise.

Return type:

bool
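
Example (a minimal sketch; the endpoint URL and ready path are illustrative assumptions):

    from nv_ingest_api.internal.primitives.nim.model_interface import helpers

    endpoint = "http://localhost:8000"  # hypothetical NIM service endpoint

    if helpers.is_ready(endpoint, "/v1/health/ready"):  # ready path is an assumption
        version = helpers.get_version(endpoint)
        model_name = helpers.get_model_name(endpoint, default_model_name="my-model")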

nv_ingest_api.internal.primitives.nim.model_interface.helpers.preprocess_image_for_paddle(
array: ndarray,
image_max_dimension: int = 960,
) ndarray[source]#

Preprocesses an input image to be suitable for use with PaddleOCR by resizing, normalizing, padding, and transposing it into the required format.

This function is intended for preprocessing images to be passed as input to PaddleOCR over gRPC. It is not necessary when using the HTTP endpoint.

Steps:

  1. Resizes the image while maintaining its aspect ratio so that the largest dimension is scaled to image_max_dimension pixels (960 by default).

  2. Normalizes the image using the normalize_image function.

  3. Pads the image so that both its height and width are multiples of 32, as required by PaddleOCR.

  4. Transposes the image from (height, width, channel) to (channel, height, width), the format expected by PaddleOCR.

Parameters:
  • array (np.ndarray) – The input image array of shape (height, width, channels), with pixel values in the range [0, 255].

  • image_max_dimension (int, optional) – The target size for the image’s largest dimension (default: 960).

Returns:

A preprocessed image of shape (channels, height, width) with normalized pixel values, padded so that both dimensions are multiples of 32 (padding value 0).

Return type:

np.ndarray

Notes:
  • After normalization, the image is padded to the nearest multiple of 32 in both dimensions, which is a requirement for PaddleOCR.

  • The normalized pixel values are scaled to [0, 1] before padding and transposing.
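
Example (a minimal sketch using a synthetic image):

    import numpy as np

    from nv_ingest_api.internal.primitives.nim.model_interface.helpers import preprocess_image_for_paddle

    # A synthetic 1200 x 800 RGB image with values in [0, 255].
    image = np.random.randint(0, 256, size=(1200, 800, 3), dtype=np.uint8)

    prepared = preprocess_image_for_paddle(image)
    # prepared has shape (channels, height, width), with height and width multiples of 32.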

nv_ingest_api.internal.primitives.nim.model_interface.nemoretriever_parse module#

class nv_ingest_api.internal.primitives.nim.model_interface.nemoretriever_parse.NemoRetrieverParseModelInterface(
model_name: str = 'nvidia/nemoretriever-parse',
)[source]#

Bases: ModelInterface

An interface for handling inference with a NemoRetrieverParse model.

format_input(
data: Dict[str, Any],
protocol: str,
max_batch_size: int,
**kwargs,
) Any[source]#

Format input data for the specified protocol.

Parameters:
  • data (dict) – The input data to format.

  • protocol (str) – The protocol to use (“grpc” or “http”).

  • **kwargs (dict) – Additional parameters for HTTP payload formatting.

Returns:

The formatted input data.

Return type:

Any

Raises:

ValueError – If an invalid protocol is specified.

name() str[source]#

Get the name of the model interface.

Returns:

The name of the model interface.

Return type:

str

parse_output(
response: Any,
protocol: str,
data: Dict[str, Any] | None = None,
**kwargs,
) Any[source]#

Parse the output from the model’s inference response.

Parameters:
  • response (Any) – The response from the model inference.

  • protocol (str) – The protocol used (“grpc” or “http”).

  • data (dict, optional) – Additional input data passed to the function.

Returns:

The parsed output data.

Return type:

Any

Raises:

ValueError – If an invalid protocol is specified.

prepare_data_for_inference(
data: Dict[str, Any],
) Dict[str, Any][source]#

Prepare input data for inference by resizing images and storing their original shapes.

Parameters:

data (dict) – The input data containing a list of images.

Returns:

The updated data dictionary with resized images and original image shapes.

Return type:

dict

process_inference_results(
output: Any,
**kwargs,
) Any[source]#

Process inference results for the NemoRetrieverParse model.

Parameters:

output (Any) – The raw output from the model.

Returns:

The processed inference results.

Return type:

Any
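
Example (a minimal sketch; the input key for the image list is assumed to be "images", and page_array stands in for a decoded page image):

    from nv_ingest_api.internal.primitives.nim.model_interface.nemoretriever_parse import (
        NemoRetrieverParseModelInterface,
    )

    interface = NemoRetrieverParseModelInterface(model_name="nvidia/nemoretriever-parse")

    # Resize images and record their original shapes before batching.
    data = interface.prepare_data_for_inference({"images": [page_array]})
    batches, batch_data = interface.format_input(data, protocol="http", max_batch_size=1)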

nv_ingest_api.internal.primitives.nim.model_interface.paddle module#

class nv_ingest_api.internal.primitives.nim.model_interface.paddle.PaddleOCRModelInterface[source]#

Bases: ModelInterface

An interface for handling inference with a PaddleOCR model, supporting both gRPC and HTTP protocols.

format_input(
data: Dict[str, Any],
protocol: str,
max_batch_size: int,
**kwargs,
) Any[source]#

Format input data for the specified protocol (“grpc” or “http”), supporting batched data.

Parameters:
  • data (dict of str -> Any) – The input data dictionary, expected to contain “image_arrays” (list of np.ndarray) and “image_dims” (list of (height, width) tuples), as produced by prepare_data_for_inference.

  • protocol (str) – The inference protocol, either “grpc” or “http”.

  • max_batch_size (int) – The maximum batch size for batching.

Returns:

A tuple (formatted_batches, formatted_batch_data) where:
  • formatted_batches is a list of batches ready for inference.

  • formatted_batch_data is a list of scratch-pad dictionaries corresponding to each batch, containing the keys “image_arrays” and “image_dims” for later post-processing.

Return type:

tuple

Raises:
  • KeyError – If either “image_arrays” or “image_dims” is not found in data.

  • ValueError – If an invalid protocol is specified.

name() str[source]#

Get the name of the model interface.

Returns:

The name of the model interface.

Return type:

str

parse_output(
response: Any,
protocol: str,
data: Dict[str, Any] | None = None,
**kwargs: Any,
) Any[source]#

Parse the model’s inference response for the given protocol. The parsing may handle batched outputs for multiple images.

Parameters:
  • response (Any) – The raw response from the PaddleOCR model.

  • protocol (str) – The protocol used for inference, “grpc” or “http”.

  • data (dict of str -> Any, optional) – Additional data dictionary that may include “image_dims” for bounding box scaling.

  • **kwargs (Any) – Additional keyword arguments, such as custom table_content_format.

Returns:

The parsed output, typically a list of (content, table_content_format) tuples.

Return type:

Any

Raises:

ValueError – If an invalid protocol is specified.

prepare_data_for_inference(
data: Dict[str, Any],
) Dict[str, Any][source]#

Decode one or more base64-encoded images into NumPy arrays, storing them alongside their dimensions in data.

Parameters:

data (dict of str -> Any) –

The input data containing either:
  • ‘base64_image’: a single base64-encoded image, or

  • ‘base64_images’: a list of base64-encoded images.

Returns:

The updated data dictionary with the following keys added:
  • “image_arrays”: list of decoded NumPy arrays of shape (H, W, C).

  • “image_dims”: list of (height, width) tuples for each decoded image.

Return type:

dict of str -> Any

Raises:
  • KeyError – If neither ‘base64_image’ nor ‘base64_images’ is found in data.

  • ValueError – If ‘base64_images’ is present but is not a list.
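
Example (a minimal sketch of the decode-and-batch flow; b64_table is assumed to be a base64-encoded table image):

    from nv_ingest_api.internal.primitives.nim.model_interface.paddle import PaddleOCRModelInterface

    interface = PaddleOCRModelInterface()

    # Decode the image and record its (height, width) under "image_dims".
    data = interface.prepare_data_for_inference({"base64_images": [b64_table]})

    # Batch for gRPC inference; batch_data retains arrays and dims for post-processing.
    batches, batch_data = interface.format_input(data, protocol="grpc", max_batch_size=4)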

process_inference_results(
output: Any,
**kwargs: Any,
) Any[source]#

Process inference results for the PaddleOCR model.

Parameters:
  • output (Any) – The raw output parsed from the PaddleOCR model.

  • **kwargs (Any) – Additional keyword arguments for customization.

Returns:

The post-processed inference results. By default, this simply returns the output as the table content (or content list).

Return type:

Any

nv_ingest_api.internal.primitives.nim.model_interface.parakeet module#

class nv_ingest_api.internal.primitives.nim.model_interface.parakeet.ParakeetClient(
endpoint: str,
auth_token: str | None = None,
function_id: str | None = None,
use_ssl: bool | None = None,
ssl_cert: str | None = None,
)[source]#

Bases: object

A simple interface for handling inference with a Parakeet model (e.g., for speech transcription and other audio tasks).

infer(
data: dict,
model_name: str,
**kwargs,
) Any[source]#

Perform inference using the specified model and input data.

Parameters:
  • data (dict) – The input data for inference.

  • model_name (str) – The model name.

  • kwargs (dict) – Additional parameters for inference.

Returns:

The processed inference results, coalesced in the same order as the inputs.

Return type:

Any

transcribe(
audio_content: str,
language_code: str = 'en-US',
automatic_punctuation: bool = True,
word_time_offsets: bool = True,
max_alternatives: int = 1,
profanity_filter: bool = False,
verbatim_transcripts: bool = True,
speaker_diarization: bool = False,
boosted_lm_words: List[str] | None = None,
boosted_lm_score: float = 0.0,
diarization_max_speakers: int = 0,
start_history: float = 0.0,
start_threshold: float = 0.0,
stop_history: float = 0.0,
stop_history_eou: bool = False,
stop_threshold: float = 0.0,
stop_threshold_eou: bool = False,
)[source]#

Transcribe an audio file using Riva ASR.

Parameters:
  • audio_content (str) – Base64-encoded audio content to be transcribed.

  • language_code (str, default="en-US") – The language code for transcription.

  • automatic_punctuation (bool, default=True) – Whether to enable automatic punctuation in the transcript.

  • word_time_offsets (bool, default=True) – Whether to include word-level timestamps in the transcript.

  • max_alternatives (int, default=1) – The maximum number of alternative transcripts to return.

  • profanity_filter (bool, default=False) – Whether to filter out profanity from the transcript.

  • verbatim_transcripts (bool, default=True) – Whether to return verbatim transcripts without normalization.

  • speaker_diarization (bool, default=False) – Whether to enable speaker diarization.

  • boosted_lm_words (Optional[List[str]], default=None) – A list of words to boost for language modeling.

  • boosted_lm_score (float, default=0.0) – The boosting score for language model words.

  • diarization_max_speakers (int, default=0) – The maximum number of speakers to differentiate in speaker diarization.

  • start_history (float, default=0.0) – History window size for endpoint detection.

  • start_threshold (float, default=0.0) – The threshold for starting speech detection.

  • stop_history (float, default=0.0) – History window size for stopping speech detection.

  • stop_history_eou (bool, default=False) – Whether to use an end-of-utterance flag for stopping detection.

  • stop_threshold (float, default=0.0) – The threshold for stopping speech detection.

  • stop_threshold_eou (bool, default=False) – Whether to use an end-of-utterance flag for stop threshold.

Returns:

The response containing the transcription results. Returns None if the transcription fails.

Return type:

Optional[riva.client.RecognitionResponse]
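
Example (a minimal sketch; the endpoint and audio file are illustrative assumptions):

    import base64

    from nv_ingest_api.internal.primitives.nim.model_interface.parakeet import ParakeetClient

    client = ParakeetClient(endpoint="grpc.example.com:50051")  # hypothetical Riva/Parakeet endpoint

    with open("speech.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.transcribe(audio_b64, language_code="en-US", word_time_offsets=True)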

nv_ingest_api.internal.primitives.nim.model_interface.parakeet.convert_to_mono_wav(audio_bytes)[source]#

Convert an audio file to mono WAV format using Librosa and SciPy.

Parameters:

audio_bytes (bytes) – The raw audio data in bytes.

Returns:

The processed audio in mono WAV format.

Return type:

bytes

nv_ingest_api.internal.primitives.nim.model_interface.parakeet.create_audio_inference_client(
endpoints: Tuple[str, str],
infer_protocol: str | None = None,
auth_token: str | None = None,
function_id: str | None = None,
use_ssl: bool = False,
ssl_cert: str | None = None,
)[source]#

Create a ParakeetClient for interfacing with an audio model inference server.

Parameters:
  • endpoints (tuple) – A tuple containing the gRPC and HTTP endpoints. Only the gRPC endpoint is used.

  • infer_protocol (str, optional) – The protocol to use (“grpc” or “http”). If not specified, defaults to “grpc” if a valid gRPC endpoint is provided. HTTP endpoints are not supported for audio inference.

  • auth_token (str, optional) – Authorization token for authentication (default: None).

  • function_id (str, optional) – NVCF function ID of the invocation (default: None).

  • use_ssl (bool, optional) – Whether to use SSL for secure communication (default: False).

  • ssl_cert (str, optional) – Path to the SSL certificate file if use_ssl is enabled (default: None).

Returns:

The initialized ParakeetClient configured for audio inference over gRPC.

Return type:

ParakeetClient

Raises:

ValueError – If an invalid infer_protocol is specified or if an HTTP endpoint is provided.
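
Example (a minimal sketch; the gRPC endpoint is an illustrative assumption, and the HTTP slot is left empty because only gRPC is supported):

    from nv_ingest_api.internal.primitives.nim.model_interface.parakeet import create_audio_inference_client

    client = create_audio_inference_client(
        endpoints=("grpc.example.com:50051", ""),
        infer_protocol="grpc",
        auth_token=None,
        use_ssl=False,
    )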

nv_ingest_api.internal.primitives.nim.model_interface.parakeet.process_transcription_response(response)[source]#

Process a Riva transcription response (a protobuf message) to extract:
  • final_transcript: the complete transcript.

  • segments: a list of segments with start/end times and text.

Parameters:

response – The Riva transcription response message.

Returns:

A tuple (segments, final_transcript), where segments is a list of dicts with keys “start”, “end”, and “text”, and final_transcript (str) is the overall transcript.

Return type:

tuple
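
Example (a minimal sketch, assuming the (segments, final_transcript) return order described above and a response obtained from ParakeetClient):

    segments, final_transcript = process_transcription_response(response)

    for seg in segments:
        print(f"[{seg['start']} - {seg['end']}] {seg['text']}")
    print(final_transcript)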

nv_ingest_api.internal.primitives.nim.model_interface.text_embedding module#

class nv_ingest_api.internal.primitives.nim.model_interface.text_embedding.EmbeddingModelInterface[source]#

Bases: ModelInterface

An interface for handling inference with an embedding model endpoint. This implementation supports HTTP inference for generating embeddings from text prompts.

format_input(
data: Dict[str, Any],
protocol: str,
max_batch_size: int,
**kwargs,
) Tuple[List[Any], List[Dict[str, Any]]][source]#

Format the input payload for the embedding endpoint. This method constructs one payload per batch, where each payload includes a list of text prompts. Additionally, it returns batch data that preserves the original order of prompts.

Parameters:
  • data (dict) – The input data containing “prompts” (a list of text prompts).

  • protocol (str) – Only “http” is supported.

  • max_batch_size (int) – Maximum number of prompts per payload.

  • kwargs (dict) – Additional parameters including model_name, encoding_format, input_type, and truncate.

Returns:

A tuple (payloads, batch_data_list) where:
  • payloads is a list of JSON-serializable payload dictionaries.

  • batch_data_list is a list of dictionaries containing the key “prompts” corresponding to each batch.

Return type:

tuple

name() str[source]#

Return the name of this model interface.

parse_output(
response: Any,
protocol: str,
data: Dict[str, Any] | None = None,
**kwargs,
) Any[source]#

Parse the HTTP response from the embedding endpoint. Expects a response structure with a “data” key.

Parameters:
  • response (Any) – The raw HTTP response (assumed to be already decoded as JSON).

  • protocol (str) – Only “http” is supported.

  • data (dict, optional) – The original input data.

  • kwargs (dict) – Additional keyword arguments.

Returns:

A list of generated embeddings extracted from the response.

Return type:

list

prepare_data_for_inference(
data: Dict[str, Any],
) Dict[str, Any][source]#

Prepare input data for embedding inference. Ensures that “prompts” contains a list of strings representing the text to be embedded.

process_inference_results(
output: Any,
protocol: str,
**kwargs,
) Any[source]#

Process inference results for the embedding model. For this implementation, the output is expected to be a list of embeddings.

Returns:

The processed list of embeddings.

Return type:

list
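
Example (a minimal sketch of the HTTP embedding flow; the model name is a placeholder, not a real identifier):

    from nv_ingest_api.internal.primitives.nim.model_interface.text_embedding import EmbeddingModelInterface

    interface = EmbeddingModelInterface()

    data = interface.prepare_data_for_inference({"prompts": ["first passage", "second passage"]})
    payloads, batch_data = interface.format_input(
        data,
        protocol="http",
        max_batch_size=16,
        model_name="<embedding-model-name>",  # placeholder
    )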

nv_ingest_api.internal.primitives.nim.model_interface.vlm module#

class nv_ingest_api.internal.primitives.nim.model_interface.vlm.VLMModelInterface[source]#

Bases: ModelInterface

An interface for handling inference with a VLM model endpoint (e.g., NVIDIA LLaMA-based VLM). This implementation supports HTTP inference with one or more base64-encoded images and a caption prompt.

format_input(
data: Dict[str, Any],
protocol: str,
max_batch_size: int,
**kwargs,
) Tuple[List[Any], List[Dict[str, Any]]][source]#

Format the input payload for the VLM endpoint. This method constructs one payload per batch, where each payload includes one message per image in the batch. Additionally, it returns batch data that preserves the original order of images by including the list of base64 images and the prompt for each batch.

Parameters:
  • data (dict) – The input data containing “base64_images” (a list of base64-encoded images) and “prompt”.

  • protocol (str) – Only “http” is supported.

  • max_batch_size (int) – Maximum number of images per payload.

  • kwargs (dict) – Additional parameters including model_name, max_tokens, temperature, top_p, and stream.

Returns:

A tuple (payloads, batch_data_list) where:
  • payloads is a list of JSON-serializable payload dictionaries.

  • batch_data_list is a list of dictionaries containing the keys “base64_images” and “prompt” corresponding to each batch.

Return type:

tuple

name() str[source]#

Return the name of this model interface.

parse_output(
response: Any,
protocol: str,
data: Dict[str, Any] | None = None,
**kwargs,
) Any[source]#

Parse the HTTP response from the VLM endpoint. Expects a response structure with a “choices” key.

Parameters:
  • response (Any) – The raw HTTP response (assumed to be already decoded as JSON).

  • protocol (str) – Only “http” is supported.

  • data (dict, optional) – The original input data.

  • kwargs (dict) – Additional keyword arguments.

Returns:

A list of generated captions extracted from the response.

Return type:

list

prepare_data_for_inference(
data: Dict[str, Any],
) Dict[str, Any][source]#

Prepare input data for VLM inference. Accepts either a single base64 image or a list of images. Ensures that a ‘prompt’ is provided.

Raises:
  • KeyError – If neither “base64_image” nor “base64_images” is provided or if “prompt” is missing.

  • ValueError – If “base64_images” exists but is not a list.
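
Example (a minimal sketch of the HTTP captioning flow; b64_img and the model name are illustrative placeholders):

    from nv_ingest_api.internal.primitives.nim.model_interface.vlm import VLMModelInterface

    interface = VLMModelInterface()

    data = interface.prepare_data_for_inference({
        "base64_images": [b64_img],
        "prompt": "Describe the image.",
    })
    payloads, batch_data = interface.format_input(
        data,
        protocol="http",
        max_batch_size=4,
        model_name="<vlm-model-name>",  # placeholder
    )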

process_inference_results(
output: Any,
protocol: str,
**kwargs,
) Any[source]#

Process inference results for the VLM model. For this implementation, the output is expected to be a list of captions.

Returns:

The processed list of captions.

Return type:

list

nv_ingest_api.internal.primitives.nim.model_interface.yolox module#

class nv_ingest_api.internal.primitives.nim.model_interface.yolox.YoloxGraphicElementsModelInterface(yolox_version: str | None = None)[source]#

Bases: YoloxModelInterfaceBase

An interface for handling inference with the yolox-graphic-elements model, supporting both gRPC and HTTP protocols.

name() str[source]#

Returns the name of the Yolox model interface.

Returns:

The name of the model interface.

Return type:

str

postprocess_annotations(
annotation_dicts,
**kwargs,
)[source]#

class nv_ingest_api.internal.primitives.nim.model_interface.yolox.YoloxModelInterfaceBase(
image_preproc_width: int | None = None,
image_preproc_height: int | None = None,
nim_max_image_size: int | None = None,
num_classes: int | None = None,
conf_threshold: float | None = None,
iou_threshold: float | None = None,
min_score: float | None = None,
final_score: float | None = None,
class_labels: List[str] | None = None,
)[source]#

Bases: ModelInterface

An interface for handling inference with a Yolox object detection model, supporting both gRPC and HTTP protocols.

format_input(
data: Dict[str, Any],
protocol: str,
max_batch_size: int,
**kwargs,
) Tuple[List[Any], List[Dict[str, Any]]][source]#

Format input data for the specified protocol, returning a tuple of:

(formatted_batches, formatted_batch_data)

where:
  • For gRPC: formatted_batches is a list of NumPy arrays, each of shape (B, H, W, C) with B <= max_batch_size.

  • For HTTP: formatted_batches is a list of JSON-serializable dict payloads.

  • In both cases, formatted_batch_data is a list of dicts that coalesce the original images and their original shapes in the same order as provided.

Parameters:
  • data (dict) –

    The input data to format. Must include:
    • “images”: a list of numpy.ndarray images.

    • “original_image_shapes”: a list of tuples with each image’s (height, width), as set by prepare_data_for_inference.

  • protocol (str) – The protocol to use (“grpc” or “http”).

  • max_batch_size (int) – The maximum number of images per batch.

Returns:

A tuple (formatted_batches, formatted_batch_data).

Return type:

tuple

Raises:

ValueError – If the protocol is invalid.

parse_output(
response: Any,
protocol: str,
data: Dict[str, Any] | None = None,
**kwargs,
) Any[source]#

Parse the output from the model’s inference response.

Parameters:
  • response (Any) – The response from the model inference.

  • protocol (str) – The protocol used (“grpc” or “http”).

  • data (dict, optional) – Additional input data passed to the function.

Returns:

The parsed output data.

Return type:

Any

Raises:

ValueError – If an invalid protocol is specified or the response format is unexpected.

postprocess_annotations(
annotation_dicts,
**kwargs,
)[source]#
prepare_data_for_inference(
data: Dict[str, Any],
) Dict[str, Any][source]#

Prepare input data for inference by resizing images and storing their original shapes.

Parameters:

data (dict) – The input data containing a list of images.

Returns:

The updated data dictionary with resized images and original image shapes.

Return type:

dict

process_inference_results(
output: Any,
protocol: str,
**kwargs,
) List[Dict[str, Any]][source]#

Process the results of the Yolox model inference and return the final annotations.

Parameters:
  • output (Any) – The raw output from the Yolox model.

  • kwargs (dict) – Additional parameters for processing, including thresholds and number of classes.

Returns:

A list of annotation dictionaries for each image in the batch.

Return type:

list[dict]
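
Example (a minimal sketch of the detection flow using the page-elements subclass; the input key is assumed to be "images", and the actual inference call is elided):

    import numpy as np

    from nv_ingest_api.internal.primitives.nim.model_interface.yolox import YoloxPageElementsModelInterface

    interface = YoloxPageElementsModelInterface()

    page = np.zeros((1024, 768, 3), dtype=np.uint8)  # stand-in page image

    # Resize images and record original shapes, then batch for gRPC.
    data = interface.prepare_data_for_inference({"images": [page]})
    batches, batch_data = interface.format_input(data, protocol="grpc", max_batch_size=8)

    # After sending a batch to the server (not shown), parse and post-process:
    # parsed = interface.parse_output(response, protocol="grpc", data=batch_data[0])
    # annotations = interface.process_inference_results(parsed, protocol="grpc")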

transform_normalized_coordinates_to_original(
results,
original_image_shapes,
)[source]#

class nv_ingest_api.internal.primitives.nim.model_interface.yolox.YoloxPageElementsModelInterface(
yolox_model_name: str = 'nemoretriever-page-elements-v2',
)[source]#

Bases: YoloxModelInterfaceBase

An interface for handling inference with yolox-page-elements model, supporting both gRPC and HTTP protocols.

name() str[source]#

Returns the name of the Yolox model interface.

Returns:

The name of the model interface.

Return type:

str

postprocess_annotations(
annotation_dicts,
**kwargs,
)[source]#

class nv_ingest_api.internal.primitives.nim.model_interface.yolox.YoloxTableStructureModelInterface[source]#

Bases: YoloxModelInterfaceBase

An interface for handling inference with the yolox-table-structure model, supporting both gRPC and HTTP protocols.

name() str[source]#

Returns the name of the Yolox model interface.

Returns:

The name of the model interface.

Return type:

str

postprocess_annotations(
annotation_dicts,
**kwargs,
)[source]#
nv_ingest_api.internal.primitives.nim.model_interface.yolox.batched_overlaps(A, B)[source]#

Calculate the overlap between two sets of bounding boxes in a batched manner. The normalization uses only the area of the A boxes (rather than the union), so the result is an overlap ratio rather than a true Intersection over Union (IoU).

Parameters:
  • A (ndarray) – Array of bounding boxes of shape (N, 4) in format [x1, y1, x2, y2].

  • B (ndarray) – Array of bounding boxes of shape (M, 4) in format [x1, y1, x2, y2].

Returns:

Array of overlap values of shape (N, M) representing the overlap between each pair of bounding boxes.

Return type:

ndarray
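
Example (a minimal sketch; the expected values assume overlaps are normalized by the area of the A boxes):

    import numpy as np

    from nv_ingest_api.internal.primitives.nim.model_interface.yolox import batched_overlaps

    A = np.array([[0.0, 0.0, 10.0, 10.0]])
    B = np.array([[0.0, 0.0, 5.0, 5.0], [20.0, 20.0, 30.0, 30.0]])

    overlaps = batched_overlaps(A, B)  # shape (1, 2); roughly [[0.25, 0.0]]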

nv_ingest_api.internal.primitives.nim.model_interface.yolox.bb_iou_array(boxes, new_box)[source]#
nv_ingest_api.internal.primitives.nim.model_interface.yolox.expand_boxes(boxes, r_x=1, r_y=1)[source]#
nv_ingest_api.internal.primitives.nim.model_interface.yolox.expand_chart_bboxes(annotation_dict, labels=None)[source]#

Expand bounding boxes of charts and titles based on the bounding boxes of the other class.

Parameters:

annotation_dict (dict) – Output of postprocess_results, a dictionary with keys “table”, “figure”, and “title”.

Returns:

The input annotation_dict with expanded bboxes for charts.

Return type:

dict

nv_ingest_api.internal.primitives.nim.model_interface.yolox.expand_table_bboxes(annotation_dict, labels=None)[source]#

Additional preprocessing for tables: extend the upper bounds to capture titles, if any.

Parameters:

annotation_dict (dict) – Output of postprocess_results, a dictionary with keys “table”, “figure”, and “title”.

Returns:

The input annotation_dict with expanded bboxes for tables.

Return type:

dict

nv_ingest_api.internal.primitives.nim.model_interface.yolox.find_boxes_inside(boxes, boxes_to_check, threshold=0.9)[source]#

Find all boxes that are inside another box, based on the intersection area divided by the area of the smaller box, and remove them.

nv_ingest_api.internal.primitives.nim.model_interface.yolox.find_matching_box_fast(boxes_list, new_box, match_iou)[source]#

Reimplementation of find_matching_box using NumPy instead of loops, giving a significant speedup for larger arrays (~100x). This was previously the bottleneck, since the function is called for every entry in the array.

nv_ingest_api.internal.primitives.nim.model_interface.yolox.get_bbox_dict_yolox_graphic(
preds,
shape,
class_labels,
threshold_=0.1,
) Dict[str, ndarray][source]#

Extracts bounding boxes from YOLOX model predictions:
  • Applies thresholding.

  • Reformats boxes.

  • Cleans the other detections: removes the ones that are included in other detections.

  • If no title is found, the biggest other box is used if it is larger than 0.3*img_w.

Parameters:
  • preds (np.ndarray) – YOLOX model predictions including bounding boxes, scores, and labels.

  • shape (tuple) – Original image shape.

  • threshold_ (float) – Score threshold to filter bounding boxes.

Returns:

Dictionary of bounding boxes, organized by class.

Return type:

Dict[str, np.ndarray]

nv_ingest_api.internal.primitives.nim.model_interface.yolox.get_bbox_dict_yolox_table(
preds,
shape,
class_labels,
threshold=0.1,
delta=0.0,
)[source]#

Extracts bounding boxes from YOLOX model predictions:
  • Applies thresholding.

  • Reformats boxes.

  • Reorders predictions.

Parameters:
  • preds (np.ndarray) – YOLOX model predictions including bounding boxes, scores, and labels.

  • shape (tuple) – Original image shape.

  • threshold (float) – Score threshold to filter bounding boxes.

  • delta (float) – How much the table was cropped upwards.

Returns:

Dictionary of bounding boxes, organized by class.

Return type:

dict[str, np.ndarray]

nv_ingest_api.internal.primitives.nim.model_interface.yolox.get_biggest_box(boxes, conf_type='avg')[source]#

Merges boxes by using the biggest box.

Parameters:
  • boxes (np array [n x 8]) – Boxes to merge.

  • conf_type (str, optional) – Confidence merging type. Defaults to “avg”.

Returns:

Merged box.

Return type:

np array [8]

nv_ingest_api.internal.primitives.nim.model_interface.yolox.get_weighted_box(boxes, conf_type='avg')[source]#

Merges boxes by using the weighted fusion.

Parameters:
  • boxes (np array [n x 8]) – Boxes to merge.

  • conf_type (str, optional) – Confidence merging type. Defaults to “avg”.

Returns:

Merged box.

Return type:

np array [8]

nv_ingest_api.internal.primitives.nim.model_interface.yolox.get_yolox_model_name(
yolox_http_endpoint,
default_model_name='nemoretriever-page-elements-v2',
)[source]#
nv_ingest_api.internal.primitives.nim.model_interface.yolox.match_with_title(chart_bbox, title_bboxes, iou_th=0.01)[source]#
nv_ingest_api.internal.primitives.nim.model_interface.yolox.merge_boxes(b1, b2)[source]#
nv_ingest_api.internal.primitives.nim.model_interface.yolox.merge_labels(labels, confs)[source]#

Custom function for merging labels. If all labels are the same, return the unique value. Else, return the label of the most confident non-title (class 2) box.

Parameters:
  • labels (np array [n]) – Labels.

  • confs (np array [n]) – Confidence.

Returns:

Label.

Return type:

int

nv_ingest_api.internal.primitives.nim.model_interface.yolox.postprocess_model_prediction(
prediction,
num_classes,
conf_thre=0.7,
nms_thre=0.45,
class_agnostic=False,
)[source]#
nv_ingest_api.internal.primitives.nim.model_interface.yolox.postprocess_results(
results,
original_image_shapes,
image_preproc_width,
image_preproc_height,
class_labels,
min_score=0.0,
)[source]#

For each item (i.e., image) in results, computes annotations of the form:

    {“table”: [[0.0107, 0.0859, 0.7537, 0.1219, 0.9861], …], “figure”: […], “title”: […]}

where each list of 5 floats represents a bounding box in the format [x1, y1, x2, y2, confidence]. Only bboxes with sufficiently high confidence are kept.

nv_ingest_api.internal.primitives.nim.model_interface.yolox.prefilter_boxes(
boxes,
scores,
labels,
weights,
thr,
class_agnostic=False,
)[source]#

Reformats and filters boxes. Output is a dict of boxes to merge separately.

Parameters:
  • boxes (list[np array[n x 4]]) – List of boxes. One list per model.

  • scores (list[np array[n]]) – List of confidences.

  • labels (list[np array[n]]) – List of labels.

  • weights (list) – Model weights.

  • thr (float) – Confidence threshold.

  • class_agnostic (bool, optional) – If True, merge boxes from different classes. Defaults to False.

Returns:

Filtered boxes.

Return type:

dict[np array [? x 8]]

nv_ingest_api.internal.primitives.nim.model_interface.yolox.resize_image(image, target_img_size)[source]#
nv_ingest_api.internal.primitives.nim.model_interface.yolox.weighted_boxes_fusion(
boxes_list,
scores_list,
labels_list,
iou_thr=0.5,
skip_box_thr=0.0,
conf_type='avg',
merge_type='weighted',
class_agnostic=False,
)[source]#

Custom weighted boxes fusion (WBF) implementation that supports a class-agnostic mode and a biggest-box fusion mode. Boxes are expected to be in normalized (x0, y0, x1, y1) format.

Parameters:
  • boxes_list (list[np array[n x 4]]) – List of boxes. One list per model.

  • scores_list (list[np array[n]]) – List of confidences.

  • labels_list (list[np array[n]]) – List of labels.

  • iou_thr (float, optional) – IoU threshold for matching. Defaults to 0.5.

  • skip_box_thr (float, optional) – Exclude boxes with score < skip_box_thr. Defaults to 0.0.

  • conf_type (str, optional) – Confidence merging type. Defaults to “avg”.

  • merge_type (str, optional) – Merge type “weighted” or “biggest”. Defaults to “weighted”.

  • class_agnostic (bool, optional) – If True, merge boxes from different classes. Defaults to False.

Returns:

A tuple of merged boxes (np array [N x 4]), merged confidences (np array [N]), and merged labels (np array [N]).

Return type:

tuple
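
Example (a minimal sketch fusing detections from two hypothetical models; boxes are normalized (x0, y0, x1, y1)):

    import numpy as np

    from nv_ingest_api.internal.primitives.nim.model_interface.yolox import weighted_boxes_fusion

    boxes_list = [
        np.array([[0.10, 0.10, 0.40, 0.40]]),
        np.array([[0.12, 0.09, 0.41, 0.42]]),
    ]
    scores_list = [np.array([0.9]), np.array([0.8])]
    labels_list = [np.array([0]), np.array([0])]

    boxes, scores, labels = weighted_boxes_fusion(boxes_list, scores_list, labels_list, iou_thr=0.5)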

Module contents#