nemo_retriever.model.local package#

Submodules#

nemo_retriever.model.local.llama_nemotron_embed_1b_v2_embedder module#

nemo_retriever.model.local.llama_nemotron_embed_1b_v2_hf_embedder module#

nemo_retriever.model.local.llama_nemotron_embed_vl_1b_v2_embedder module#

nemo_retriever.model.local.nemotron_graphic_elements_v1 module#

nemo_retriever.model.local.nemotron_ocr_v1 module#

nemo_retriever.model.local.nemotron_ocr_v2 module#

nemo_retriever.model.local.nemotron_page_elements_v3 module#

nemo_retriever.model.local.nemotron_parse_v1_2 module#

nemo_retriever.model.local.nemotron_rerank_v2 module#

Local wrapper for nvidia/llama-nemotron-rerank-1b-v2 cross-encoder reranker.

class nemo_retriever.model.local.nemotron_rerank_v2.NemotronRerankV2(
model_name: str = 'nvidia/llama-nemotron-rerank-1b-v2',
device: str | None = None,
hf_cache_dir: str | None = None,
)[source]#

Bases: BaseModel

Local cross-encoder reranker wrapping nvidia/llama-nemotron-rerank-1b-v2.

The model scores (query, document) pairs and returns raw logits; higher values indicate greater relevance. It is fine-tuned from meta-llama/Llama-3.2-1B with bi-directional attention and supports 26 languages with sequences up to 8 192 tokens.

Example:

reranker = NemotronRerankV2()
scores = reranker.score("What is ML?", ["Machine learning is…", "Paris is…"])
# scores -> [20.6, -23.1]  (higher = more relevant)

property input#

Input schema or object.

property input_batch_size: int#

Maximum or default input batch size.

property model_name: str#

Human-readable model name.

property model_runmode: Literal['local', 'NIM', 'build-endpoint']#

Execution mode: local, NIM, or build-endpoint.

property model_type: str#

Model category/type (e.g. llm, vision, embedding).

property output#

Output schema or object.

score(
query: str,
documents: List[str],
*,
max_length: int = 512,
batch_size: int = 32,
) -> List[float][source]#

Score relevance of documents to query.

Parameters:
  • query – The search query.

  • documents – Candidate passages/documents to score.

  • max_length – Tokenizer truncation length (default 512; max supported 8 192).

  • batch_size – Number of (query, doc) pairs to process per GPU forward pass.

Returns:

Raw logit scores aligned with documents (higher = more relevant).

Return type:

List[float]
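
Because score() returns one raw logit per document, in the same order as documents, ranking reduces to sorting the paired values. The following is a minimal sketch that reuses the illustrative scores from the class example instead of calling the model. The helpers rank_by_score and stable_sigmoid are hypothetical names introduced here, not part of the package, and this page does not state whether sigmoid-transformed logits are calibrated probabilities.

```python
import math
from typing import List, Tuple


def rank_by_score(documents: List[str], scores: List[float]) -> List[Tuple[str, float]]:
    """Pair each document with its score and sort most-relevant first."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


def stable_sigmoid(logit: float) -> float:
    """Map a raw logit into the 0-1 range without overflowing for large magnitudes."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


documents = ["Paris is…", "Machine learning is…"]
scores = [-23.1, 20.6]  # illustrative values, as if returned by reranker.score(...)

ranked = rank_by_score(documents, scores)
for doc, score in ranked:
    print(f"{score:8.2f}  {stable_sigmoid(score):.4f}  {doc}")
```

The sigmoid is monotonic, so it never changes the ordering; it only makes the values easier to threshold or display.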

score_pairs(
pairs: List[tuple],
*,
max_length: int = 512,
batch_size: int = 32,
) -> List[float][source]#

Score a list of (query, document) pairs.

Parameters:
  • pairs – Sequence of (query, document) tuples.

  • max_length – Tokenizer truncation length.

  • batch_size – GPU forward-pass batch size.

Returns:

Raw logit scores (higher = more relevant).

Return type:

List[float]
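
score_pairs() is convenient when several queries are scored in one call. The sketch below shows one way to flatten per-query candidates into (query, document) tuples and to regroup a flat score list afterwards. It assumes the returned scores follow the order of pairs, which this page implies but does not state explicitly. build_pairs and group_scores are hypothetical helpers, and the scores are placeholders rather than model output.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def build_pairs(candidates: Dict[str, List[str]]) -> List[Pair]:
    """Flatten {query: [documents]} into a list of (query, document) tuples."""
    return [(query, doc) for query, docs in candidates.items() for doc in docs]


def group_scores(pairs: List[Pair], scores: List[float]) -> Dict[str, List[Tuple[str, float]]]:
    """Regroup a flat score list by query, assuming scores are aligned with pairs."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    grouped: Dict[str, List[Tuple[str, float]]] = {}
    for (query, doc), score in zip(pairs, scores):
        grouped.setdefault(query, []).append((doc, score))
    return grouped


candidates = {"q1": ["a", "b"], "q2": ["c"]}
pairs = build_pairs(candidates)

# In real use this would be: scores = reranker.score_pairs(pairs)
scores = [float(i) for i in range(len(pairs))]  # placeholder scores

grouped = group_scores(pairs, scores)
```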

nemo_retriever.model.local.nemotron_rerank_vl_v2 module#

vLLM-backed local wrapper for nvidia/llama-nemotron-rerank-vl-1b-v2 VL cross-encoder reranker.

class nemo_retriever.model.local.nemotron_rerank_vl_v2.NemotronRerankVLV2VLLM(
model_name: str = 'nvidia/llama-nemotron-rerank-vl-1b-v2',
device: str | None = None,
hf_cache_dir: str | None = None,
gpu_memory_utilization: float = 0.5,
)[source]#

Bases: BaseModel

vLLM-backed VL cross-encoder reranker wrapping nvidia/llama-nemotron-rerank-vl-1b-v2.

Uses vLLM’s pooling runner (llm.score()) instead of HuggingFace AutoModelForSequenceClassification. This provides better throughput through continuous batching and optimised attention kernels.

The public API (score(), score_pairs()) is identical to NemotronRerankVLV2 so callers need not change.
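
Since both wrappers expose the same score() signature, calling code can be written against a structural type rather than a concrete class. This is a sketch under that assumption: the VLReranker protocol, the FakeReranker stand-in (which scores by word overlap so the example runs without a GPU), and top_k are all hypothetical names that do not exist in the package.

```python
from typing import List, Optional, Protocol, Sequence, Tuple


class VLReranker(Protocol):
    """Structural type covering the score() signature documented on this page."""

    def score(
        self,
        query: str,
        documents: List[str],
        *,
        images_b64: Optional[Sequence[Optional[str]]] = None,
    ) -> List[float]: ...


class FakeReranker:
    """Hypothetical stand-in: scores by word overlap instead of running a model."""

    def score(self, query, documents, *, images_b64=None):
        query_words = set(query.lower().split())
        return [float(len(query_words & set(doc.lower().split()))) for doc in documents]


def top_k(
    reranker: VLReranker,
    query: str,
    documents: List[str],
    k: int = 1,
    images_b64: Optional[Sequence[Optional[str]]] = None,
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring documents from any compatible backend."""
    scores = reranker.score(query, documents, images_b64=images_b64)
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


best = top_k(FakeReranker(), "machine learning", ["machine learning is fun", "paris is a city"])
```

Passing a NemotronRerankVLV2VLLM or NemotronRerankVLV2 instance in place of FakeReranker should then require no change to top_k.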

Example:

reranker = NemotronRerankVLV2VLLM()
scores = reranker.score(
    "What is ML?",
    ["Machine learning is…", "Paris is…"],
    images_b64=["iVBOR...", None],
)

property input#

Input schema or object.

property input_batch_size: int#

Maximum or default input batch size.

property model_name: str#

Human-readable model name.

property model_runmode: Literal['local', 'NIM', 'build-endpoint']#

Execution mode: local, NIM, or build-endpoint.

property model_type: str#

Model category/type (e.g. llm, vision, embedding).

property output#

Output schema or object.

score(
query: str,
documents: List[str],
*,
images_b64: Sequence[str | None] | None = None,
max_length: int = 10240,
batch_size: int = 32,
) -> List[float][source]#

Score relevance of documents (with optional images) to query.

Parameters:
  • query – The search query.

  • documents – Candidate passages/documents to score.

  • images_b64 – Optional base64-encoded images aligned with documents. Entries may be None for documents without images (text-only fallback).

  • max_length – Unused (kept for API compatibility). Document text is automatically truncated to fit max_model_len.

  • batch_size – Unused (kept for API compatibility). vLLM handles batching internally via continuous batching.

Returns:

Raw logit scores aligned with documents (higher = more relevant).

Return type:

List[float]
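
The images_b64 argument has to stay aligned with documents, with None marking text-only entries. A minimal sketch of preparing such a list follows. It assumes the wrapper expects bare base64 strings (no data-URI prefix), which this page does not confirm; the "iVBOR..." value in the class example is at least consistent with that, since it is how a base64-encoded PNG begins. encode_images is a hypothetical helper, and the PNG signature bytes stand in for real image data.

```python
import base64
from typing import List, Optional

# First eight bytes of every PNG file; used here as stand-in image data.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def encode_images(images: List[Optional[bytes]]) -> List[Optional[str]]:
    """Base64-encode each image, keeping None for documents without an image."""
    return [
        None if data is None else base64.b64encode(data).decode("ascii")
        for data in images
    ]


documents = ["Machine learning is…", "Paris is…"]
image_bytes = [PNG_SIGNATURE, None]  # real code would read image files instead

images_b64 = encode_images(image_bytes)
if len(images_b64) != len(documents):
    raise ValueError("images_b64 must have one entry per document")

print(images_b64)
```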

score_pairs(
pairs: List[tuple],
*,
images_b64: Sequence[str | None] | None = None,
max_length: int = 10240,
batch_size: int = 32,
) -> List[float][source]#

Score a list of (query, document) pairs with optional images.

Parameters:
  • pairs – Sequence of (query, document) tuples.

  • images_b64 – Optional base64-encoded images aligned with pairs.

  • max_length – Unused (API compatibility). Document text is automatically truncated to fit max_model_len.

  • batch_size – Unused (API compatibility).

Returns:

Raw logit scores (higher = more relevant).

Return type:

List[float]

unload() -> None[source]#

Release GPU memory held by the vLLM engine.
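
Because unload() has to be called explicitly, one option is to wrap the model in a context manager so that memory is released even when scoring raises. This is a sketch of that pattern rather than something the package provides: loaded is a hypothetical helper, and FakeModel is a stand-in that only records whether unload() was called.

```python
from contextlib import contextmanager


@contextmanager
def loaded(model):
    """Yield the model and always call its unload() on exit."""
    try:
        yield model
    finally:
        model.unload()


class FakeModel:
    """Hypothetical stand-in for a reranker; tracks unload() calls only."""

    def __init__(self):
        self.unloaded = False

    def unload(self):
        self.unloaded = True


model = FakeModel()
try:
    with loaded(model) as reranker:
        raise RuntimeError("simulated failure during scoring")
except RuntimeError:
    pass

print(model.unloaded)
```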

nemo_retriever.model.local.nemotron_rerank_vl_v2_hf module#

Local wrapper for nvidia/llama-nemotron-rerank-vl-1b-v2 VL cross-encoder reranker.

class nemo_retriever.model.local.nemotron_rerank_vl_v2_hf.NemotronRerankVLV2(
model_name: str = 'nvidia/llama-nemotron-rerank-vl-1b-v2',
device: str | None = None,
hf_cache_dir: str | None = None,
)[source]#

Bases: BaseModel

Local VL cross-encoder reranker wrapping nvidia/llama-nemotron-rerank-vl-1b-v2.

Scores (query, document, image) triplets and returns raw logits; higher values indicate greater relevance. When an image is None for a given document, the model falls back to text-only scoring for that pair.

Unlike the text-only NemotronRerankV2, which uses AutoTokenizer and a manual prompt template, this model uses AutoProcessor with process_queries_documents_crossencoder() to handle vision token insertion.

Example:

reranker = NemotronRerankVLV2()
scores = reranker.score(
    "What is ML?",
    ["Machine learning is…", "Paris is…"],
    images_b64=["iVBOR...", None],
)

property input#

Input schema or object.

property input_batch_size: int#

Maximum or default input batch size.

property model_name: str#

Human-readable model name.

property model_runmode: Literal['local', 'NIM', 'build-endpoint']#

Execution mode: local, NIM, or build-endpoint.

property model_type: str#

Model category/type (e.g. llm, vision, embedding).

property output#

Output schema or object.

score(
query: str,
documents: List[str],
*,
images_b64: Sequence[str | None] | None = None,
max_length: int = 10240,
batch_size: int = 32,
) -> List[float][source]#

Score relevance of documents (with optional images) to query.

Parameters:
  • query – The search query.

  • documents – Candidate passages/documents to score.

  • images_b64 – Optional base64-encoded images aligned with documents. Entries may be None for documents without images (text-only fallback).

  • max_length – Processor truncation length.

  • batch_size – Number of triplets to process per GPU forward pass.

Returns:

Raw logit scores aligned with documents (higher = more relevant).

Return type:

List[float]
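
The batch_size parameter controls how many (query, document, image) triplets go through each forward pass. The sketch below illustrates that kind of chunking in isolation. It is not the wrapper's actual implementation, which is not shown on this page, and iter_batches is a hypothetical helper.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

Triplet = Tuple[str, str, Optional[str]]


def iter_batches(
    query: str,
    documents: List[str],
    images_b64: Optional[Sequence[Optional[str]]] = None,
    batch_size: int = 32,
) -> Iterator[List[Triplet]]:
    """Yield (query, document, image) triplets in chunks of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if images_b64 is None:
        images_b64 = [None] * len(documents)  # text-only fallback for every document
    if len(images_b64) != len(documents):
        raise ValueError("images_b64 must have one entry per document")
    triplets = [(query, doc, image) for doc, image in zip(documents, images_b64)]
    for start in range(0, len(triplets), batch_size):
        yield triplets[start:start + batch_size]


batches = list(iter_batches("q", [f"doc {i}" for i in range(5)], batch_size=2))
print([len(batch) for batch in batches])
```

A smaller batch_size lowers peak GPU memory at the cost of more forward passes, which matters most when many documents carry images.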

score_pairs(
pairs: List[tuple],
*,
images_b64: Sequence[str | None] | None = None,
max_length: int = 10240,
batch_size: int = 32,
) -> List[float][source]#

Score a list of (query, document) pairs with optional images.

Parameters:
  • pairs – Sequence of (query, document) tuples.

  • images_b64 – Optional base64-encoded images aligned with pairs.

  • max_length – Processor truncation length.

  • batch_size – GPU forward-pass batch size.

Returns:

Raw logit scores (higher = more relevant).

Return type:

List[float]

unload() -> None[source]#

Release GPU memory held by the model and processor.

nemo_retriever.model.local.nemotron_table_structure_v1 module#

nemo_retriever.model.local.nemotron_vlm_captioner module#

nemo_retriever.model.local.parakeet_ctc_1_1b_asr module#

Module contents#

Local model implementations for slim-gest.

This module contains implementations of locally-runnable models that extend the BaseModel abstract class. Exports are lazy-loaded so that importing a single submodule (e.g. parakeet_ctc_1_1b_asr) does not pull in torch-dependent modules, allowing unit tests with minimal deps to run.
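
Lazy-loaded exports of this kind are commonly built on a module-level __getattr__ (PEP 562), which Python calls only when normal attribute lookup on the module fails. The sketch below demonstrates the mechanism on a synthetic module so that it runs anywhere. How this package actually implements its lazy loading is not shown here, and lazy_pkg, _LAZY_EXPORTS, and _lazy_getattr are hypothetical names, with standard-library targets standing in for the heavy torch-dependent submodules.

```python
import importlib
import types

# Hypothetical mapping: exported name -> module that defines it.
_LAZY_EXPORTS = {"OrderedDict": "collections", "dumps": "json"}


def _lazy_getattr(name: str):
    """Import the defining module on first access, then cache the attribute."""
    try:
        module_name = _LAZY_EXPORTS[name]
    except KeyError:
        raise AttributeError(f"module 'lazy_pkg' has no attribute {name!r}") from None
    value = getattr(importlib.import_module(module_name), name)
    setattr(lazy_pkg, name, value)  # later lookups bypass __getattr__
    return value


# Synthetic module standing in for a package __init__.py.
lazy_pkg = types.ModuleType("lazy_pkg")
lazy_pkg.__getattr__ = _lazy_getattr

print("OrderedDict" in vars(lazy_pkg))  # not resolved yet
print(lazy_pkg.OrderedDict)             # triggers the import
print("OrderedDict" in vars(lazy_pkg))  # cached after first access
```

In a real package the same function would simply be defined as __getattr__ at the top level of __init__.py.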