nemo_retriever.model.local package#
Submodules#
nemo_retriever.model.local.llama_nemotron_embed_1b_v2_embedder module#
nemo_retriever.model.local.llama_nemotron_embed_1b_v2_hf_embedder module#
nemo_retriever.model.local.llama_nemotron_embed_vl_1b_v2_embedder module#
nemo_retriever.model.local.nemotron_graphic_elements_v1 module#
nemo_retriever.model.local.nemotron_ocr_v1 module#
nemo_retriever.model.local.nemotron_ocr_v2 module#
nemo_retriever.model.local.nemotron_page_elements_v3 module#
nemo_retriever.model.local.nemotron_parse_v1_2 module#
nemo_retriever.model.local.nemotron_rerank_v2 module#
Local wrapper for nvidia/llama-nemotron-rerank-1b-v2 cross-encoder reranker.
- class nemo_retriever.model.local.nemotron_rerank_v2.NemotronRerankV2(
- model_name: str = 'nvidia/llama-nemotron-rerank-1b-v2',
- device: str | None = None,
- hf_cache_dir: str | None = None,
Bases: BaseModel
Local cross-encoder reranker wrapping nvidia/llama-nemotron-rerank-1b-v2.
The model scores (query, document) pairs and returns raw logits; higher values indicate greater relevance. It is fine-tuned from meta-llama/Llama-3.2-1B with bi-directional attention and supports 26 languages with sequences up to 8 192 tokens.
Example:
reranker = NemotronRerankV2()
scores = reranker.score("What is ML?", ["Machine learning is…", "Paris is…"])
# scores -> [20.6, -23.1] (higher = more relevant)
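Because the reranker returns raw logits rather than bounded values, callers who want numbers in (0, 1) can apply a logistic sigmoid as a post-processing step. The following is a minimal sketch under that assumption. `logits_to_probabilities` is a hypothetical helper and is not part of nemo_retriever, and the results should not be read as calibrated probabilities.

```python
import math
from typing import List


def logits_to_probabilities(logits: List[float]) -> List[float]:
    """Map raw logits to (0, 1) with a numerically stable sigmoid."""
    probs: List[float] = []
    for x in logits:
        if x >= 0:
            probs.append(1.0 / (1.0 + math.exp(-x)))
        else:
            # Use exp(x) for negative inputs so exp() never overflows.
            e = math.exp(x)
            probs.append(e / (1.0 + e))
    return probs


# Logits taken from the example above.
probs = logits_to_probabilities([20.6, -23.1])
```

The sigmoid is monotonic, so the ordering of documents is the same before and after the transformation.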
- property input#
Input schema or object.
- property input_batch_size: int#
Maximum or default input batch size.
- property model_name: str#
Human-readable model name.
- property model_runmode: Literal['local', 'NIM', 'build-endpoint']#
Execution mode: local, NIM, or build-endpoint.
- property model_type: str#
Model category/type (e.g. llm, vision, embedding).
- property output#
Output schema or object.
- score(
- query: str,
- documents: List[str],
- *,
- max_length: int = 512,
- batch_size: int = 32,
Score relevance of documents to query.
- Parameters:
query – The search query.
documents – Candidate passages/documents to score.
max_length – Tokenizer truncation length (default 512; max supported 8 192).
batch_size – Number of (query, doc) pairs to process per GPU forward pass.
- Returns:
Raw logit scores aligned with documents (higher = more relevant).
- Return type:
List[float]
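Since the returned scores are aligned with documents, a typical next step is to sort the candidates by score. The sketch below shows one way to do that. `rank_documents` and its `top_k` parameter are hypothetical names introduced for illustration and are not part of nemo_retriever.

```python
from typing import List, Optional, Sequence, Tuple


def rank_documents(
    documents: Sequence[str],
    scores: Sequence[float],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (document, score) tuples sorted from most to least relevant."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Scores as they might come back from reranker.score(query, documents).
ranked = rank_documents(["Paris is…", "Machine learning is…"], [-23.1, 20.6])
```

With a real reranker instance this would be called as `rank_documents(documents, reranker.score(query, documents), top_k=5)`.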
- score_pairs(
- pairs: List[tuple],
- *,
- max_length: int = 512,
- batch_size: int = 32,
Score a list of (query, document) pairs.
- Parameters:
pairs – Sequence of (query, document) tuples.
max_length – Tokenizer truncation length.
batch_size – GPU forward-pass batch size.
- Returns:
Raw logit scores (higher = more relevant).
- Return type:
List[float]
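Scoring explicit pairs is useful when several queries, each with its own candidates, should be scored in a single call. The sketch below flattens a query-to-documents mapping into pairs and regroups the results. It assumes the returned scores follow the order of the input pairs, which the reference above does not state explicitly. `score_by_query` is a hypothetical helper, not part of nemo_retriever.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def score_by_query(
    score_pairs_fn: Callable[[List[Pair]], Sequence[float]],
    candidates: Dict[str, List[str]],
) -> Dict[str, List[float]]:
    """Flatten {query: [documents]} into pairs, score once, regroup by query."""
    pairs: List[Pair] = [(q, d) for q, docs in candidates.items() for d in docs]
    flat_scores = list(score_pairs_fn(pairs))
    if len(flat_scores) != len(pairs):
        raise ValueError("expected exactly one score per pair")

    grouped: Dict[str, List[float]] = {}
    offset = 0
    for q, docs in candidates.items():
        # Assumption: scores come back in the same order as the pairs.
        grouped[q] = flat_scores[offset : offset + len(docs)]
        offset += len(docs)
    return grouped
```

With a real reranker this would be called as `score_by_query(reranker.score_pairs, {"What is ML?": [...], "Capital of France?": [...]})`.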
nemo_retriever.model.local.nemotron_rerank_vl_v2 module#
vLLM-backed local wrapper for nvidia/llama-nemotron-rerank-vl-1b-v2 VL cross-encoder reranker.
- class nemo_retriever.model.local.nemotron_rerank_vl_v2.NemotronRerankVLV2VLLM(
- model_name: str = 'nvidia/llama-nemotron-rerank-vl-1b-v2',
- device: str | None = None,
- hf_cache_dir: str | None = None,
- gpu_memory_utilization: float = 0.5,
Bases: BaseModel
vLLM-backed VL cross-encoder reranker wrapping nvidia/llama-nemotron-rerank-vl-1b-v2.
Uses vLLM’s pooling runner (llm.score()) instead of HuggingFace AutoModelForSequenceClassification. This provides better throughput through continuous batching and optimised attention kernels.
The public API (score(), score_pairs()) is identical to NemotronRerankVLV2 so callers need not change.
Example:
reranker = NemotronRerankVLV2VLLM()
scores = reranker.score(
    "What is ML?",
    ["Machine learning is…", "Paris is…"],
    images_b64=["iVBOR...", None],
)
- property input#
Input schema or object.
- property input_batch_size: int#
Maximum or default input batch size.
- property model_name: str#
Human-readable model name.
- property model_runmode: Literal['local', 'NIM', 'build-endpoint']#
Execution mode: local, NIM, or build-endpoint.
- property model_type: str#
Model category/type (e.g. llm, vision, embedding).
- property output#
Output schema or object.
- score(
- query: str,
- documents: List[str],
- *,
- images_b64: Sequence[str | None] | None = None,
- max_length: int = 10240,
- batch_size: int = 32,
Score relevance of documents (with optional images) to query.
- Parameters:
query – The search query.
documents – Candidate passages/documents to score.
images_b64 – Optional base64-encoded images aligned with documents. Entries may be None for documents without images (text-only fallback).
max_length – Unused (kept for API compatibility). Document text is automatically truncated to fit max_model_len.
batch_size – Unused (kept for API compatibility). vLLM handles batching internally via continuous batching.
- Returns:
Raw logit scores aligned with documents (higher = more relevant).
- Return type:
List[float]
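The images_b64 argument has to line up with documents position by position, with None marking text-only entries. The sketch below builds such a list from raw image bytes. It assumes the wrapper expects plain standard base64 of the encoded image file with no data-URI prefix, which is consistent with the "iVBOR..." example (the base64 prefix of a PNG signature) but is not confirmed by the reference. `encode_images_b64` is a hypothetical helper.

```python
import base64
from typing import List, Optional, Sequence


def encode_images_b64(images: Sequence[Optional[bytes]]) -> List[Optional[str]]:
    """Base64-encode image bytes, keeping None for documents without an image."""
    return [
        None if raw is None else base64.b64encode(raw).decode("ascii")
        for raw in images
    ]


# The 8-byte PNG signature stands in for a real image file here.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
images_b64 = encode_images_b64([PNG_SIGNATURE, None])
```

The resulting list can then be passed as `images_b64=images_b64` alongside a documents list of the same length.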
- score_pairs(
- pairs: List[tuple],
- *,
- images_b64: Sequence[str | None] | None = None,
- max_length: int = 10240,
- batch_size: int = 32,
Score a list of (query, document) pairs with optional images.
- Parameters:
pairs – Sequence of (query, document) tuples.
images_b64 – Optional base64-encoded images aligned with pairs.
max_length – Unused (API compatibility). Document text is automatically truncated to fit max_model_len.
batch_size – Unused (API compatibility).
- Returns:
Raw logit scores (higher = more relevant).
- Return type:
List[float]
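Because the vLLM-backed class is documented as sharing its public API with NemotronRerankVLV2, calling code can be written against a structural interface and accept either one. The sketch below uses typing.Protocol for that. `VLReranker` and `best_document` are hypothetical names introduced for illustration, and the protocol copies only the score() signature shown above.

```python
from typing import List, Optional, Protocol, Sequence


class VLReranker(Protocol):
    """Structural type matching the documented score() signature."""

    def score(
        self,
        query: str,
        documents: List[str],
        *,
        images_b64: Optional[Sequence[Optional[str]]] = None,
        max_length: int = 10240,
        batch_size: int = 32,
    ) -> List[float]: ...


def best_document(
    reranker: VLReranker,
    query: str,
    documents: List[str],
    images_b64: Optional[Sequence[Optional[str]]] = None,
) -> str:
    """Return the highest-scoring document from whichever backend is passed in."""
    if not documents:
        raise ValueError("documents must not be empty")
    scores = reranker.score(query, documents, images_b64=images_b64)
    best_index = max(range(len(documents)), key=lambda i: scores[i])
    return documents[best_index]
```

Either `NemotronRerankVLV2VLLM()` or `NemotronRerankVLV2()` could be passed as `reranker` without changing `best_document`.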
nemo_retriever.model.local.nemotron_rerank_vl_v2_hf module#
Local wrapper for nvidia/llama-nemotron-rerank-vl-1b-v2 VL cross-encoder reranker.
- class nemo_retriever.model.local.nemotron_rerank_vl_v2_hf.NemotronRerankVLV2(
- model_name: str = 'nvidia/llama-nemotron-rerank-vl-1b-v2',
- device: str | None = None,
- hf_cache_dir: str | None = None,
Bases: BaseModel
Local VL cross-encoder reranker wrapping nvidia/llama-nemotron-rerank-vl-1b-v2.
Scores (query, document, image) triplets and returns raw logits; higher values indicate greater relevance. When an image is None for a given document, the model falls back to text-only scoring for that pair.
Unlike the text-only NemotronRerankV2, which uses AutoTokenizer and a manual prompt template, this model uses AutoProcessor with process_queries_documents_crossencoder() to handle vision token insertion.
Example:
reranker = NemotronRerankVLV2()
scores = reranker.score(
    "What is ML?",
    ["Machine learning is…", "Paris is…"],
    images_b64=["iVBOR...", None],
)
- property input#
Input schema or object.
- property input_batch_size: int#
Maximum or default input batch size.
- property model_name: str#
Human-readable model name.
- property model_runmode: Literal['local', 'NIM', 'build-endpoint']#
Execution mode: local, NIM, or build-endpoint.
- property model_type: str#
Model category/type (e.g. llm, vision, embedding).
- property output#
Output schema or object.
- score(
- query: str,
- documents: List[str],
- *,
- images_b64: Sequence[str | None] | None = None,
- max_length: int = 10240,
- batch_size: int = 32,
Score relevance of documents (with optional images) to query.
- Parameters:
query – The search query.
documents – Candidate passages/documents to score.
images_b64 – Optional base64-encoded images aligned with documents. Entries may be None for documents without images (text-only fallback).
max_length – Processor truncation length.
batch_size – Number of triplets to process per GPU forward pass.
- Returns:
Raw logit scores aligned with documents (higher = more relevant).
- Return type:
List[float]
- score_pairs(
- pairs: List[tuple],
- *,
- images_b64: Sequence[str | None] | None = None,
- max_length: int = 10240,
- batch_size: int = 32,
Score a list of (query, document) pairs with optional images.
- Parameters:
pairs – Sequence of (query, document) tuples.
images_b64 – Optional base64-encoded images aligned with pairs.
max_length – Processor truncation length.
batch_size – GPU forward-pass batch size.
- Returns:
Raw logit scores (higher = more relevant).
- Return type:
List[float]
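To make the batch_size parameter concrete, the sketch below shows how a list of inputs would be split into chunks of at most batch_size items, one chunk per forward pass. This illustrates the general technique only. Whether the wrapper chunks its inputs in exactly this way is an assumption, and `iter_batches` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# 70 inputs with the default batch size give batches of 32, 32 and 6.
batch_sizes = [len(batch) for batch in iter_batches(list(range(70)))]
```

A smaller batch_size lowers peak GPU memory at the cost of more forward passes.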
nemo_retriever.model.local.nemotron_table_structure_v1 module#
nemo_retriever.model.local.nemotron_vlm_captioner module#
nemo_retriever.model.local.parakeet_ctc_1_1b_asr module#
Module contents#
Local model implementations for slim-gest.
This module contains implementations of locally-runnable models that extend the BaseModel abstract class. Exports are lazy-loaded so that importing a single submodule (e.g. parakeet_ctc_1_1b_asr) does not pull in torch-dependent modules, allowing unit tests with minimal deps to run.
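Lazy-loaded exports of this kind are commonly implemented with a module-level `__getattr__` (PEP 562), which defers importing a submodule until one of its names is first accessed. The sketch below shows that pattern in isolation. It is not the package's actual implementation, `make_lazy_getattr` is a hypothetical helper, and the demonstration uses the standard library so it runs without torch.

```python
import importlib
from typing import Any, Callable, Dict


def make_lazy_getattr(package: str, exports: Dict[str, str]) -> Callable[[str], Any]:
    """Build a PEP 562 module-level __getattr__.

    exports maps each exported name to the submodule that defines it.
    """

    def __getattr__(name: str) -> Any:
        try:
            submodule = exports[name]
        except KeyError:
            raise AttributeError(
                f"module {package!r} has no attribute {name!r}"
            ) from None
        # The submodule is imported only now, on first access to the name.
        module = importlib.import_module(f"{package}.{submodule}")
        return getattr(module, name)

    return __getattr__


# Demonstration: JSONDecoder is defined in the json.decoder submodule.
lazy = make_lazy_getattr("json", {"JSONDecoder": "decoder"})
decoder_cls = lazy("JSONDecoder")
```

In a package `__init__.py` the same idea would read `__getattr__ = make_lazy_getattr(__name__, {"NemotronRerankV2": "nemotron_rerank_v2"})`, so importing the package alone would not import the reranker module.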