Reranking Model Fine-Tuning Recipe#

A complete 6-stage pipeline for fine-tuning and deploying cross-encoder reranking models on domain-specific data using synthetic data generation.

Overview#

This recipe fine-tunes NVIDIA’s Llama-Nemotron-Rerank-1B-v2 cross-encoder reranking model on your own domain data. By the end of this pipeline, you’ll have a domain-adapted reranker that improves retrieval precision by re-scoring candidate documents returned by a first-stage retriever.

Why Fine-Tune Reranking Models?#

In a typical retrieval pipeline, a fast embedding model retrieves a broad set of candidate documents, then a cross-encoder reranker re-scores each query–document pair to improve ranking quality. Fine-tuning the reranker adapts it to:

Understand domain-specific relevance signals and terminology
Better discriminate between subtly relevant and irrelevant documents
Improve precision at the top of the ranked list (nDCG@k) on your specific corpus

Embedding vs. Reranking#

Aspect	Embedding Model	Reranking Model
Architecture	Bi-encoder (encodes query and document separately)	Cross-encoder (encodes query and document together)
Speed	Fast (single encoding per document, offline indexable)	Slower (joint encoding per query–document pair)
Accuracy	Good for broad recall	Higher precision at top ranks
Role	First-stage retrieval	Second-stage re-ranking

The two models are complementary — use the embedding model to cast a wide net, then the reranker to sort the catch.

Training Pipeline#

┌─────────────────────────────────────────────────────────────────────────────┐
│                           YOUR DOCUMENT CORPUS                              │
│                    (Text files: .txt, .md, etc.)                            │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│              STAGE 0: SYNTHETIC DATA GENERATION (retriever-sdg)             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌──────────────────────────┐ │
│  │ Document Chunks │ →  │  LLM Generation │ →  │ Q&A Pairs + Evaluations  │ │
│  │                 │    │  (NVIDIA API)   │    │                          │ │
│  └─────────────────┘    └─────────────────┘    └──────────────────────────┘ │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    STAGE 1: TRAINING DATA PREPARATION                       │
│  ┌─────────────────┐    ┌─────────────────┐    ┌──────────────────────────┐ │
│  │ Train/Val/Test  │ →  │  Hard Negative  │ →  │   Multi-hop Unrolling    │ │
│  │     Split       │    │     Mining      │    │                          │ │
│  └─────────────────┘    └─────────────────┘    └──────────────────────────┘ │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│              STAGE 2: MODEL FINE-TUNING (Cross-Encoder, Automodel)          │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │  Classification Loss: Query+Passage → Relevance Score                  ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                     STAGE 3: EVALUATION (BEIR + Reranking)                  │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │   Dense Retrieval → Re-rank → Measure nDCG@k Improvement               ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    STAGE 4: EXPORT (ONNX/TensorRT)                          │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │     Export Model to ONNX and TensorRT for Optimized Inference           ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        STAGE 5: DEPLOY (NIM)                                │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │     Launch NIM Container with Custom Model for Ranking API              ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘

Stage	Command	Description	Output
Stage 0: SDG	`nemotron rerank sdg`	Validate corpus, generate synthetic Q&A pairs from documents	Q&A pairs with quality scores
Stage 1: Data Prep	`nemotron rerank prep`	Convert, mine hard negatives, unroll	Training-ready data
Stage 2: Finetune	`nemotron rerank finetune`	Fine-tune cross-encoder reranking model	Model checkpoint
Stage 3: Eval	`nemotron rerank eval`	Evaluate reranking improvement over first-stage retrieval	Metrics comparison
Stage 4: Export	`nemotron rerank export`	Export to ONNX/TensorRT	Optimized inference models
Stage 5: Deploy	`nemotron rerank deploy`	Deploy NIM ranking service	Running inference service

Installation#

1. Install UV Package Manager#

This project requires UV as its package manager. UV automatically creates and manages a virtual environment under the repository root, and each pipeline stage uses its own isolated environment as well. Do not use pip install — the project relies on UV workspaces and per-stage dependency isolation.

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone and Install Nemotron#

# Clone the repository
git clone https://github.com/NVIDIA-NeMo/Nemotron.git
cd Nemotron

# UV creates a virtual environment at .venv/ and installs all dependencies
uv sync --all-extras

3. Get Your NVIDIA API Key#

The SDG stage (Stage 0) uses NVIDIA’s hosted LLM APIs for synthetic data generation.

Sign up at build.nvidia.com
Create an API key
Set the environment variable:

export NVIDIA_API_KEY=nvapi-your_key_here

4. Configure Execution Profiles (Optional)#

For Docker or Slurm execution, create env.toml in the repository root directory.

Minimal configuration (local execution only):

[wandb]
project = "my-reranking-project"
entity = "my-username"

Full configuration with Docker and Slurm support: See the Execution Profiles section below.

Preparing Your Corpus#

Supported Formats#

Any text files, default: .txt, .md, .text, and files with no extension
Documents should be UTF-8 encoded
Files are processed recursively from the corpus directory

Corpus Size Recommendations#

Corpus Size	Documents	Expected Results
Minimum	50-100 docs (~50K tokens)	Basic domain adaptation
Recommended	500+ docs	Good domain coverage
Optimal	1000+ docs	Best performance

Document Organization#

Organize your documents in a directory structure:

data/corpus/
├── doc1.txt
├── doc2.md
└── subdirectory/
    └── doc3.txt

All files matching the file_extensions config (default: .txt,.md,.text, plus extensionless files) will be processed recursively.

Prerequisites#

Hardware Requirements#

GPU: NVIDIA GPU with at least 80GB VRAM (e.g., A100, H100)
- Stage 0 uses NVIDIA API (no GPU required)
- Stages 1-5: Require GPU for model inference and training
CPU: Modern multi-core processor (16+ cores recommended)
Memory: 128GB+ RAM recommended
Storage: ~50GB free disk space for outputs, models, and containers

Software Requirements#

Python: 3.12 or later
UV: Package manager (installation instructions above)
NVIDIA API Key: Required for synthetic data generation
NGC API Key: Required for authenticated NIM deployment
NVIDIA GPU Drivers: Latest drivers for your GPU
Docker (optional): For containerized execution or NIM deployment
Slurm (optional): For cluster execution

Expected Runtime & Resources#

Stage	GPU VRAM	CPU	Notes
Stage 0 (SDG)	N/A	8+ cores	Uses API (no local GPU); runtime varies by dataset size
Stage 1 (Data Prep)	40GB	16+ cores	Hard negative mining on GPU; runtime varies by dataset size
Stage 2 (Finetune)	80GB	16+ cores	Runtime varies by dataset size and epochs
Stage 3 (Eval)	40GB	8+ cores	Dense retrieval + reranking; runtime varies by dataset size
Stage 4 (Export)	40GB	8+ cores	TensorRT export requires NGC container
Stage 5 (Deploy)	40GB	4+ cores	NIM container initialization

LLM API Usage (Stage 0)#

Stage 0 uses LLM APIs for synthetic data generation. By default, it uses NVIDIA’s hosted LLMs:

Default provider: NVIDIA API (free tier available at build.nvidia.com)
Default model: nvidia/nemotron-3-nano-30b-a3b (fast, reliable for structured generation)
Usage: ~4 API calls per document (artifact extraction, QA generation, dedup, quality eval)
Cost: Free tier has rate limits; contact NVIDIA for production usage

Quick Start#

Local Execution#

# Set environment (important for CUDA compatibility)
export LD_LIBRARY_PATH=""
export NVIDIA_API_KEY=nvapi-your_key_here
export NGC_API_KEY=ngc-your_key_here

# Stage 0: Generate synthetic Q&A pairs from your documents
nemotron rerank sdg -c default corpus_dir=/path/to/your/docs

# Stage 1: Prepare training data (convert, mine hard negatives, unroll)
nemotron rerank prep -c default

# Stage 2: Fine-tune the cross-encoder reranking model
nemotron rerank finetune -c default

# Stage 3: Evaluate base vs fine-tuned reranker
nemotron rerank eval -c default

# Stage 4: Export to ONNX/TensorRT for NIM deployment
nemotron rerank export -c default

# Optional Stage 5: Deploy NeMo Retriever Reranking NIM with the Stage 4 ONNX export.
# Deploy mounts ./output/rerank/stage4_export/onnx by default; set model_dir to
# the tensorrt directory if you exported a TensorRT engine.
nemotron rerank deploy -c default detach=true

# Optional: Verify NIM accuracy matches checkpoint
nemotron rerank eval -c default eval_nim=true eval_base=false

Preview Commands (Dry Run)#

# See what would be executed without running
nemotron rerank finetune -c default --dry-run

Pipeline Flexibility#

Stages are designed to run sequentially, but you can start from any stage if you have the required inputs:

Start From	Requirement	Use Case
Stage 0	Document corpus	Full pipeline from scratch
Stage 1	Q&A pairs (JSON)	Skip SDG if you have labeled data or use NVIDIA’s pre-generated dataset
Stage 2	Training data (Automodel format)	Skip data prep if data is ready
Stage 3	Model checkpoint	Evaluate existing checkpoint
Stage 4	Model checkpoint	Export existing model
Stage 5	Reranking NIM image; ONNX or TensorRT export directory	Deploy the ranking API

See individual stage configs for input format requirements.

Using NVIDIA’s Pre-Generated Dataset#

NVIDIA provides a ready-to-use synthetic retrieval dataset on Hugging Face: Retrieval-Synthetic-NVDocs-v1. This dataset was generated from NVIDIA’s publicly available content using the same SDG pipeline (Stage 0) in this recipe, and contains ~15K documents with 105K+ question-answer pairs across multiple reasoning types.

If you want to fine-tune a reranking model on NVIDIA-related content, you can skip Stage 0 entirely and start directly from Stage 1:

# Download the pre-generated dataset
python -c "
from datasets import load_dataset
ds = load_dataset('nvidia/Retrieval-Synthetic-NVDocs-v1', split='train')
ds.to_json('./output/rerank/stage0_sdg/nv_docs_sdg.json')
"

# Start from Stage 1 (data preparation) using the downloaded data
nemotron rerank prep -c default sdg_input_path=./output/rerank/stage0_sdg

# Continue with the rest of the pipeline
nemotron rerank finetune -c default
nemotron rerank eval -c default

Execution Modes#

The rerank recipe supports multiple execution modes for flexibility between local development and production cluster runs.

Local Execution (Default)#

Run directly on your local machine with GPU:

nemotron rerank finetune -c default
nemotron rerank eval -c default

Docker Execution#

Run inside a Docker container with GPU passthrough using --run local-docker:

# Runs the command inside a Docker container with GPU access
nemotron rerank finetune -c default --run local-docker

# Stage commands through export can run in Docker; deploy is a local-only Docker wrapper.
nemotron rerank sdg -c default --run local-docker
nemotron rerank prep -c default --run local-docker
nemotron rerank eval -c default --run local-docker

Note: Requires local-docker profile in env.toml (see Execution Profiles below)

Slurm Batch Execution#

Submit jobs to a Slurm cluster for production workloads:

# Attached execution (waits for completion, streams logs via SSH)
nemotron rerank finetune -c default --run my-cluster

# Detached execution (submits job and exits immediately)
nemotron rerank finetune -c default --batch my-cluster

# Run the SDG through eval pipeline as one sequential cluster experiment
# Requires remote_job_dir in the execution profile so stages share outputs
nemotron rerank run -c default --batch my-cluster --from sdg --to eval

# Or submit stages individually
nemotron rerank sdg -c default --batch my-cluster
nemotron rerank prep -c default --batch my-cluster
nemotron rerank finetune -c default --batch my-cluster
nemotron rerank eval -c default --batch my-cluster

Execution Profiles#

Execution profiles are defined in env.toml in the repository root directory.

Example env.toml for local and cluster execution:

# Weights & Biases configuration (optional but recommended)
[wandb]
project = "my-reranking-project"
entity = "my-team"

# Local Docker execution profile
[local-docker]
executor = "docker"
container_image = "nvcr.io/nvidia/nemo-automodel:26.04"
runtime = "nvidia"  # Enable GPU passthrough
ipc_mode = "host"
shm_size = "16g"
mounts = [
    "./data:/workspace/data",
    "./output:/workspace/output"
]

# Slurm cluster execution profile
[my-cluster]
executor = "slurm"
account = "my-account"
partition = "interactive"
batch_partition = "batch"
container_image = "nvcr.io/nvidia/nemo-automodel:26.04"
tunnel = "ssh"
host = "cluster.example.com"
user = "username"
remote_job_dir = "/shared/path/to/jobs"
mounts = ["/shared:/shared"]

Runtime Overrides#

Override execution settings on the command line:

# Use more GPUs
nemotron rerank finetune -c default --run my-cluster run.env.gpus_per_node=4

# Use different partition
nemotron rerank finetune -c default --batch my-cluster run.env.partition=batch

# Override time limit
nemotron rerank finetune -c default --batch my-cluster run.env.time=08:00:00

Remote Job Inspection#

Use dry runs to inspect commands and generated job configuration before submitting cluster work:

nemotron rerank finetune -c default --run my-cluster --dry-run
nemotron rerank eval -c default --batch my-cluster --dry-run

Rerank cluster execution requires a profile with remote_job_dir or env_vars.NEMO_RUN_DIR for multi-stage rerank run pipelines so stage outputs share the same run directory. Local Docker execution does not require remote_job_dir; it uses the Docker work directory and declared mounts. deploy is local-only, and --stage is rejected for rerank commands until staged-file execution is implemented.

Configuration#

Each stage has a config/ directory with YAML configuration files.

File	Purpose
`default.yaml`	Production-ready configuration

Key Configuration Options#

The default config points at a small pinned sample corpus. For your own corpus, override the Stage 0 paths:

Stage 0: SDG

corpus_id: my_corpus           # Identifier for your corpus
corpus_dir: ./data/corpus      # Path to your documents
file_extensions: ".txt,.md,.text,"  # File types to process, including extensionless files
output_dir: ./output/rerank/stage0_sdg  # Path to save the generated data
artifact_extraction_model: nvidia/nemotron-3-nano-30b-a3b  # LLM for document artifacts extraction
qa_generation_model: nvidia/nemotron-3-nano-30b-a3b        # LLM for QA generation
quality_judge_model: nvidia/nemotron-3-nano-30b-a3b        # LLM for QA quality evaluation

Stage 1: Data Prep

base_model: nvidia/llama-nemotron-embed-1b-v2  # Embedding model for hard negative mining
quality_threshold: 7.0         # Minimum Q&A quality score (0-10)
hard_negatives_to_mine: 5      # Number of hard negatives per query
query_max_length: 512          # Max query tokens
passage_max_length: 512        # Max passage tokens
train_ratio: 0.8               # Training data split (80%)
val_ratio: 0                   # Validation split (0% — maximizes train/test for small data)
test_ratio: 0.2                # Test split (20%)

Stage 2: Finetune

base_model: nvidia/llama-nemotron-rerank-1b-v2
num_epochs: 3
global_batch_size: 128
learning_rate: 3.0e-6
lr_warmup_steps: 100
lr_decay_style: cosine
weight_decay: 0.01
optimizer_backend: auto           # FusedAdam in the container, FlashAdamW locally without TE
flash_adamw_master_weight_bits: 32 # Effective master-weight precision for FlashAdamW
rerank_max_length: 512         # Max tokens for concatenated query+passage
prompt_template: "question:{query} \n \n passage:{passage}"
train_n_passages: 5            # 1 positive + 4 hard negatives
num_labels: 1
temperature: 1.0

Warning — Overfitting risk: The default num_epochs: 3 is set for the small example dataset shipped with this recipe, where fewer epochs may not produce a visible training signal. For most real-world datasets, 1–2 epochs is sufficient and 3 epochs carries a high risk of overfitting. Lower this value when working with your own data (e.g., nemotron rerank finetune -c default num_epochs=1).

Stage 3: Eval

base_model: nvidia/llama-nemotron-rerank-1b-v2   # Base reranker for comparison
retrieval_model: nvidia/llama-nemotron-embed-1b-v2  # First-stage retriever
k_values: [1, 5, 10, 100]     # K values for nDCG@k, Recall@k
top_k: 100                    # Number of first-stage candidates to re-rank
max_length: 512                # Max sequence length
eval_base: true                # Evaluate base reranker
eval_finetuned: true           # Evaluate fine-tuned reranker
eval_nim: false                # Evaluate NIM endpoint

Stage 4: Export

model_path: ./output/rerank/stage2_finetune/checkpoints/LATEST/model/consolidated
export_to_trt: false           # Export to TensorRT (requires nemo:25.07+ container)
quant_cfg: null                # Quantization: null, "fp8", "int8_sq"
calibration_query: what information is relevant to this query?
prompt_template: "question:{query} \n \n passage:{passage}"
trt_opt_batch: 16              # Optimal batch size for TRT
trt_opt_seq_len: 256           # Optimal sequence length for TRT

Stage 5: Deploy

nim_image: nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2:1.10.0
model_dir: ./output/rerank/stage4_export/onnx  # ONNX or TensorRT export directory
bind_address: 127.0.0.1        # Host interface for the NIM ranking API
host_port: 8000                # Host port for the NIM ranking API
detach: false                  # Run in background
replace_existing: false        # Require explicit opt-in before replacing a running container
keep_failed_container: false   # Remove unhealthy detached containers after timeout

Overriding Configuration#

Override config values on the command line:

# Override training epochs
nemotron rerank finetune -c default num_epochs=5

# Override learning rate
nemotron rerank finetune -c default learning_rate=1e-5

# Override multiple values
nemotron rerank finetune -c default num_epochs=2 learning_rate=1e-5

# Force specific attention implementation
nemotron rerank finetune -c default attn_implementation=flash_attention_2

CLI Commands#

Workspace Info#

# Display workflow overview
nemotron rerank info

Data#

# Generate synthetic Q&A pairs from documents
nemotron rerank sdg -c default corpus_dir=/path/to/docs

# Prepare training data (convert, mine, unroll)
nemotron rerank prep -c default sdg_input_path=/path/to/sdg

Training#

# Fine-tune the cross-encoder reranking model
nemotron rerank finetune -c default train_data_path=/path/to/data

Evaluation#

# Evaluate base and fine-tuned rerankers
nemotron rerank eval -c default finetuned_model_path=/path/to/checkpoint

Export#

# Export model to ONNX
nemotron rerank export -c default model_path=/path/to/checkpoint

# Export to ONNX only (skip TensorRT)
nemotron rerank export -c default export_to_trt=false

# Export with FP8 quantization
nemotron rerank export -c default quant_cfg=fp8

Deploy#

# Deploy the Stage 4 ONNX export in background (detached mode)
nemotron rerank deploy -c default detach=true

# Deploy a TensorRT export instead
nemotron rerank deploy -c default model_dir=./output/rerank/stage4_export/tensorrt

# Serve the image default model instead of a fine-tuned export
nemotron rerank deploy -c default model_dir=null

# Replace a recipe-owned container with the same name
nemotron rerank deploy -c default detach=true replace_existing=true

# Stop the NIM container
docker stop nemotron-rerank-nim

Verify NIM Accuracy#

# Evaluate NIM endpoint against fine-tuned checkpoint
nemotron rerank eval -c default eval_nim=true eval_base=false

# The output will show if NIM metrics match the checkpoint
# ok       indicates metrics match within tolerance (0.03 for @1, 0.01 for @5+)
# MISMATCH indicates potential accuracy loss beyond ONNX/TensorRT conversion noise

Test the Deployed Service#

curl -X POST http://localhost:8000/v1/ranking \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/llama-nemotron-rerank-1b-v2",
    "query": {"text": "what is AI?"},
    "passages": [
      {"text": "AI is artificial intelligence"},
      {"text": "The weather is sunny today"}
    ],
    "truncate": "END"
  }'

Output Structure#

After running the full pipeline:

output/rerank/
├── stage0_sdg/                    # Synthetic Q&A pairs
│   └── generated_batch*.json
├── stage1_prep/                   # Training-ready data
│   ├── train.json                 # Original training data
│   ├── train_mined.automodel.json # With hard negatives
│   ├── train_mined.automodel_unrolled.json  # Final training file
│   ├── corpus/                    # Document corpus
│   └── eval_beir/                 # BEIR-format evaluation data
├── stage2_finetune/               # Model checkpoints
│   └── checkpoints/
│       └── LATEST/model/consolidated/  # Final model
├── stage3_eval/                   # Evaluation results
│   └── eval_results.json
└── stage4_export/                 # Exported models
    ├── onnx/                      # ONNX model files mounted by deploy by default
    │   └── model.onnx
    └── tensorrt/                  # TensorRT engine (if enabled)
        └── model.plan

Evaluation Metrics#

The evaluation stage measures reranking quality using a two-stage approach: first-stage dense retrieval followed by cross-encoder re-ranking. This mirrors how rerankers are used in production.

Metric	Description	Range
nDCG@k	Normalized Discounted Cumulative Gain (ranking quality)	0.0-1.0
Recall@k	Fraction of relevant documents in top-k results	0.0-1.0
Precision@k	Fraction of retrieved documents that are relevant	0.0-1.0
MAP@k	Mean Average Precision	0.0-1.0

Higher scores indicate better re-ranking performance. The key metric to watch is nDCG@10, which captures how well the reranker promotes relevant documents to the top of the list.

Key Components#

Component	Purpose	Repository
retriever-sdg	Synthetic data generation using NeMo Data Designer	GitHub
Automodel	Cross-encoder model training framework	GitHub
BEIR	Evaluation framework for information retrieval	GitHub
NeMo Export-Deploy	ONNX/TensorRT export for optimized inference	GitHub
NVIDIA NIM	Production inference microservice with ranking API	Developer Site

Base Model#

Property	Value
Model	nvidia/llama-nemotron-rerank-1b-v2
Parameters	~1B
Architecture	Cross-encoder (sequence classification)
Max Sequence Length	512 (concatenated query + passage)
Pooling	Average
HuggingFace	Model Card

Support#

For issues, questions, or contributions:

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: Nemotron Documentation