Reranking Model Fine-Tuning Recipe#
A complete 6-stage pipeline for fine-tuning and deploying cross-encoder reranking models on domain-specific data using synthetic data generation.
Overview#
This recipe fine-tunes NVIDIA’s Llama-Nemotron-Rerank-1B-v2 cross-encoder reranking model on your own domain data. By the end of this pipeline, you’ll have a domain-adapted reranker that improves retrieval precision by re-scoring candidate documents returned by a first-stage retriever.
Why Fine-Tune Reranking Models?#
In a typical retrieval pipeline, a fast embedding model retrieves a broad set of candidate documents, then a cross-encoder reranker re-scores each query–document pair to improve ranking quality. Fine-tuning the reranker adapts it to:
Understand domain-specific relevance signals and terminology
Better discriminate between subtly relevant and irrelevant documents
Improve precision at the top of the ranked list (nDCG@k) on your specific corpus
Embedding vs. Reranking#
| Aspect | Embedding Model | Reranking Model |
|---|---|---|
| Architecture | Bi-encoder (encodes query and document separately) | Cross-encoder (encodes query and document together) |
| Speed | Fast (single encoding per document, offline indexable) | Slower (joint encoding per query–document pair) |
| Accuracy | Good for broad recall | Higher precision at top ranks |
| Role | First-stage retrieval | Second-stage re-ranking |
The two models are complementary — use the embedding model to cast a wide net, then the reranker to sort the catch.
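The division of labour between the two models can be illustrated with a minimal sketch. The scoring functions here are toy stand-ins based on token overlap, not the actual embedding or reranking models, and the names `cheap_score`, `joint_score`, and `retrieve_then_rerank` are hypothetical, introduced only for this illustration.

```python
from typing import Callable, List


def tokens(text: str) -> set:
    """Lowercased whitespace tokens."""
    return set(text.lower().split())


def cheap_score(query: str, doc: str) -> float:
    """Toy first-stage score: shared token count (stand-in for embedding similarity)."""
    return float(len(tokens(query) & tokens(doc)))


def joint_score(query: str, doc: str) -> float:
    """Toy second-stage score: Jaccard overlap (stand-in for a cross-encoder score)."""
    q, d = tokens(query), tokens(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage: Callable[[str, str], float],
    second_stage: Callable[[str, str], float],
    top_k: int,
    top_n: int,
) -> List[str]:
    # Stage 1: score every document with the cheap function and keep top_k candidates.
    candidates = sorted(corpus, key=lambda doc: first_stage(query, doc), reverse=True)[:top_k]
    # Stage 2: re-score only those candidates with the more expensive function.
    reranked = sorted(candidates, key=lambda doc: second_stage(query, doc), reverse=True)
    return reranked[:top_n]


corpus = [
    "gpu memory requirements for training large models and other unrelated topics",
    "gpu memory requirements",
    "the weather is sunny today",
]
best = retrieve_then_rerank(
    "gpu memory requirements", corpus, cheap_score, joint_score, top_k=2, top_n=1
)
print(best)  # prints ['gpu memory requirements']
```

The first stage cannot separate the two on-topic documents (both share three tokens with the query), while the second stage promotes the tighter match. The expensive scorer only ever sees `top_k` candidates, which is what keeps the combined pipeline affordable.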
Training Pipeline#
┌─────────────────────────────────────────────────────────────────────────────┐
│ YOUR DOCUMENT CORPUS │
│ (Text files: .txt, .md, etc.) │
└──────────────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 0: SYNTHETIC DATA GENERATION (retriever-sdg) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────────────────┐ │
│ │ Document Chunks │ → │ LLM Generation │ → │ Q&A Pairs + Evaluations │ │
│ │ │ │ (NVIDIA API) │ │ │ │
│ └─────────────────┘ └─────────────────┘ └──────────────────────────┘ │
└──────────────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 1: TRAINING DATA PREPARATION │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────────────────┐ │
│ │ Train/Val/Test │ → │ Hard Negative │ → │ Multi-hop Unrolling │ │
│ │ Split │ │ Mining │ │ │ │
│ └─────────────────┘ └─────────────────┘ └──────────────────────────┘ │
└──────────────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 2: MODEL FINE-TUNING (Cross-Encoder, Automodel) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Classification Loss: Query+Passage → Relevance Score ││
│ └─────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 3: EVALUATION (BEIR + Reranking) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Dense Retrieval → Re-rank → Measure nDCG@k Improvement ││
│ └─────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 4: EXPORT (ONNX/TensorRT) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Export Model to ONNX and TensorRT for Optimized Inference ││
│ └─────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 5: DEPLOY (NIM) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Launch NIM Container with Custom Model for Ranking API ││
│ └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
| Stage | Command | Description | Output |
|---|---|---|---|
| Stage 0: SDG | nemotron rerank sdg | Validate corpus, generate synthetic Q&A pairs from documents | Q&A pairs with quality scores |
| Stage 1: Data Prep | nemotron rerank prep | Convert, mine hard negatives, unroll | Training-ready data |
| Stage 2: Finetune | nemotron rerank finetune | Fine-tune cross-encoder reranking model | Model checkpoint |
| Stage 3: Eval | nemotron rerank eval | Evaluate reranking improvement over first-stage retrieval | Metrics comparison |
| Stage 4: Export | nemotron rerank export | Export to ONNX/TensorRT | Optimized inference models |
| Stage 5: Deploy | nemotron rerank deploy | Deploy NIM ranking service | Running inference service |
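The hard negative mining step in Stage 1 can be sketched as follows. This is an illustration of the general technique, not the recipe's implementation: it assumes similarity scores between one query and each passage have already been computed by the first-stage embedding model, and `mine_hard_negatives` is a hypothetical helper name.

```python
from typing import Dict, List, Set


def mine_hard_negatives(
    similarities: Dict[str, float],
    positive_ids: Set[str],
    num_negatives: int,
) -> List[str]:
    """Return the highest-scoring passages that are not labeled as positives.

    These are "hard" negatives: the retriever considers them close to the
    query, so the reranker has to learn the finer distinction.
    """
    ranked = sorted(similarities, key=lambda doc_id: similarities[doc_id], reverse=True)
    negatives = [doc_id for doc_id in ranked if doc_id not in positive_ids]
    return negatives[:num_negatives]


# Hypothetical similarity scores for a single query.
sims = {"d1": 0.91, "d2": 0.88, "d3": 0.35, "d4": 0.80, "d5": 0.10}
hard_negs = mine_hard_negatives(sims, positive_ids={"d1"}, num_negatives=2)
print(hard_negs)  # prints ['d2', 'd4']
```

Passages `d3` and `d5` would be easy negatives; training on them teaches the reranker little, because the first-stage retriever already ranks them low.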
Installation#
1. Install UV Package Manager#
This project requires UV as its package manager. UV automatically creates and manages a virtual environment under the repository root, and each pipeline stage uses its own isolated environment as well. Do not use pip install — the project relies on UV workspaces and per-stage dependency isolation.
curl -LsSf https://astral.sh/uv/install.sh | sh
2. Clone and Install Nemotron#
# Clone the repository
git clone https://github.com/NVIDIA-NeMo/Nemotron.git
cd Nemotron
# UV creates a virtual environment at .venv/ and installs all dependencies
uv sync --all-extras
3. Get Your NVIDIA API Key#
The SDG stage (Stage 0) uses NVIDIA’s hosted LLM APIs for synthetic data generation.
Sign up at build.nvidia.com
Create an API key
Set the environment variable:
export NVIDIA_API_KEY=nvapi-your_key_here
4. Configure Execution Profiles (Optional)#
For Docker or Slurm execution, create env.toml in the repository root directory.
Minimal configuration (local execution only):
[wandb]
project = "my-reranking-project"
entity = "my-username"
Full configuration with Docker and Slurm support: See the Execution Profiles section below.
Preparing Your Corpus#
Supported Formats#
Plain-text files. By default, .txt, .md, .text, and files with no extension are processed
Documents should be UTF-8 encoded
Files are processed recursively from the corpus directory
Corpus Size Recommendations#
| Corpus Size | Documents | Expected Results |
|---|---|---|
| Minimum | 50-100 docs (~50K tokens) | Basic domain adaptation |
| Recommended | 500+ docs | Good domain coverage |
| Optimal | 1000+ docs | Best performance |
Document Organization#
Organize your documents in a directory structure:
data/corpus/
├── doc1.txt
├── doc2.md
└── subdirectory/
└── doc3.txt
All files matching the file_extensions config (default: .txt,.md,.text, plus extensionless files) will be processed recursively.
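To preview which files a given file_extensions value would select, a short script along these lines can help. It is a sketch under one assumption: that an empty item in the comma-separated list (such as the trailing comma in ".txt,.md,.text,") stands for "no extension". The recipe's actual matching logic is not shown here, and `parse_extensions` and `find_corpus_files` are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import List, Set


def parse_extensions(spec: str) -> Set[str]:
    """Split a comma-separated spec; an empty item means 'file without extension'."""
    return {item.strip().lower() for item in spec.split(",")}


def find_corpus_files(corpus_dir: str, file_extensions: str = ".txt,.md,.text,") -> List[Path]:
    """Recursively collect files whose suffix is in the allowed set."""
    allowed = parse_extensions(file_extensions)
    root = Path(corpus_dir)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in allowed)


# Demo on a throwaway directory that mirrors the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "subdirectory").mkdir()
    for name in ("doc1.txt", "doc2.md", "subdirectory/doc3.txt", "subdirectory/NOTES", "figure.png"):
        (root / name).write_text("example", encoding="utf-8")
    found_names = sorted(p.relative_to(root).as_posix() for p in find_corpus_files(tmp))

print(found_names)
```

In the demo, `figure.png` is skipped while the extensionless `NOTES` file is picked up, which is the behavior the default setting describes.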
Prerequisites#
Hardware Requirements#
GPU: NVIDIA GPU with at least 80GB VRAM (e.g., A100, H100)
Stage 0: Uses the NVIDIA API (no GPU required)
Stages 1-5: Require GPU for model inference and training
CPU: Modern multi-core processor (16+ cores recommended)
Memory: 128GB+ RAM recommended
Storage: ~50GB free disk space for outputs, models, and containers
Software Requirements#
Python: 3.12 or later
UV: Package manager (installation instructions above)
NVIDIA API Key: Required for synthetic data generation
NGC API Key: Required for authenticated NIM deployment
NVIDIA GPU Drivers: Latest drivers for your GPU
Docker (optional): For containerized execution or NIM deployment
Slurm (optional): For cluster execution
Expected Runtime & Resources#
| Stage | GPU VRAM | CPU | Notes |
|---|---|---|---|
| Stage 0 (SDG) | N/A | 8+ cores | Uses API (no local GPU); runtime varies by dataset size |
| Stage 1 (Data Prep) | 40GB | 16+ cores | Hard negative mining on GPU; runtime varies by dataset size |
| Stage 2 (Finetune) | 80GB | 16+ cores | Runtime varies by dataset size and epochs |
| Stage 3 (Eval) | 40GB | 8+ cores | Dense retrieval + reranking; runtime varies by dataset size |
| Stage 4 (Export) | 40GB | 8+ cores | TensorRT export requires NGC container |
| Stage 5 (Deploy) | 40GB | 4+ cores | NIM container initialization |
LLM API Usage (Stage 0)#
Stage 0 uses LLM APIs for synthetic data generation. By default, it uses NVIDIA’s hosted LLMs:
Default provider: NVIDIA API (free tier available at build.nvidia.com)
Default model: nvidia/nemotron-3-nano-30b-a3b (fast, reliable for structured generation)
Usage: ~4 API calls per document (artifact extraction, QA generation, dedup, quality eval)
Cost: Free tier has rate limits; contact NVIDIA for production usage
Quick Start#
Local Execution#
# Set environment (important for CUDA compatibility)
export LD_LIBRARY_PATH=""
export NVIDIA_API_KEY=nvapi-your_key_here
export NGC_API_KEY=ngc-your_key_here
# Stage 0: Generate synthetic Q&A pairs from your documents
nemotron rerank sdg -c default corpus_dir=/path/to/your/docs
# Stage 1: Prepare training data (convert, mine hard negatives, unroll)
nemotron rerank prep -c default
# Stage 2: Fine-tune the cross-encoder reranking model
nemotron rerank finetune -c default
# Stage 3: Evaluate base vs fine-tuned reranker
nemotron rerank eval -c default
# Stage 4: Export to ONNX/TensorRT for NIM deployment
nemotron rerank export -c default
# Optional Stage 5: Deploy NeMo Retriever Reranking NIM with the Stage 4 ONNX export.
# Deploy mounts ./output/rerank/stage4_export/onnx by default; set model_dir to
# the tensorrt directory if you exported a TensorRT engine.
nemotron rerank deploy -c default detach=true
# Optional: Verify NIM accuracy matches checkpoint
nemotron rerank eval -c default eval_nim=true eval_base=false
Preview Commands (Dry Run)#
# See what would be executed without running
nemotron rerank finetune -c default --dry-run
Pipeline Flexibility#
Stages are designed to run sequentially, but you can start from any stage if you have the required inputs:
| Start From | Requirement | Use Case |
|---|---|---|
| Stage 0 | Document corpus | Full pipeline from scratch |
| Stage 1 | Q&A pairs (JSON) | Skip SDG if you have labeled data or use NVIDIA’s pre-generated dataset |
| Stage 2 | Training data (Automodel format) | Skip data prep if data is ready |
| Stage 3 | Model checkpoint | Evaluate existing checkpoint |
| Stage 4 | Model checkpoint | Export existing model |
| Stage 5 | Reranking NIM image; ONNX or TensorRT export directory | Deploy the ranking API |
See individual stage configs for input format requirements.
Using NVIDIA’s Pre-Generated Dataset#
NVIDIA provides a ready-to-use synthetic retrieval dataset on Hugging Face: Retrieval-Synthetic-NVDocs-v1. This dataset was generated from NVIDIA’s publicly available content using the same SDG pipeline (Stage 0) in this recipe, and contains ~15K documents with 105K+ question-answer pairs across multiple reasoning types.
If you want to fine-tune a reranking model on NVIDIA-related content, you can skip Stage 0 entirely and start directly from Stage 1:
# Download the pre-generated dataset
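python -c "
import os
from datasets import load_dataset
out_dir = './output/rerank/stage0_sdg'
os.makedirs(out_dir, exist_ok=True)  # make sure the output directory exists before writing
ds = load_dataset('nvidia/Retrieval-Synthetic-NVDocs-v1', split='train')
ds.to_json(os.path.join(out_dir, 'nv_docs_sdg.json'))
"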
# Start from Stage 1 (data preparation) using the downloaded data
nemotron rerank prep -c default sdg_input_path=./output/rerank/stage0_sdg
# Continue with the rest of the pipeline
nemotron rerank finetune -c default
nemotron rerank eval -c default
Execution Modes#
The rerank recipe supports multiple execution modes for flexibility between local development and production cluster runs.
Local Execution (Default)#
Run directly on your local machine with GPU:
nemotron rerank finetune -c default
nemotron rerank eval -c default
Docker Execution#
Run inside a Docker container with GPU passthrough using --run local-docker:
# Runs the command inside a Docker container with GPU access
nemotron rerank finetune -c default --run local-docker
# Stages 0-4 (sdg through export) can run in Docker; deploy is a local-only Docker wrapper.
nemotron rerank sdg -c default --run local-docker
nemotron rerank prep -c default --run local-docker
nemotron rerank eval -c default --run local-docker
Note: Requires the local-docker profile in env.toml (see Execution Profiles below)
Slurm Batch Execution#
Submit jobs to a Slurm cluster for production workloads:
# Attached execution (waits for completion, streams logs via SSH)
nemotron rerank finetune -c default --run my-cluster
# Detached execution (submits job and exits immediately)
nemotron rerank finetune -c default --batch my-cluster
# Run the SDG through eval pipeline as one sequential cluster experiment
# Requires remote_job_dir in the execution profile so stages share outputs
nemotron rerank run -c default --batch my-cluster --from sdg --to eval
# Or submit stages individually
nemotron rerank sdg -c default --batch my-cluster
nemotron rerank prep -c default --batch my-cluster
nemotron rerank finetune -c default --batch my-cluster
nemotron rerank eval -c default --batch my-cluster
Execution Profiles#
Execution profiles are defined in env.toml in the repository root directory.
Example env.toml for local and cluster execution:
# Weights & Biases configuration (optional but recommended)
[wandb]
project = "my-reranking-project"
entity = "my-team"
# Local Docker execution profile
[local-docker]
executor = "docker"
container_image = "nvcr.io/nvidia/nemo-automodel:26.04"
runtime = "nvidia" # Enable GPU passthrough
ipc_mode = "host"
shm_size = "16g"
mounts = [
"./data:/workspace/data",
"./output:/workspace/output"
]
# Slurm cluster execution profile
[my-cluster]
executor = "slurm"
account = "my-account"
partition = "interactive"
batch_partition = "batch"
container_image = "nvcr.io/nvidia/nemo-automodel:26.04"
tunnel = "ssh"
host = "cluster.example.com"
user = "username"
remote_job_dir = "/shared/path/to/jobs"
mounts = ["/shared:/shared"]
Runtime Overrides#
Override execution settings on the command line:
# Use more GPUs
nemotron rerank finetune -c default --run my-cluster run.env.gpus_per_node=4
# Use different partition
nemotron rerank finetune -c default --batch my-cluster run.env.partition=batch
# Override time limit
nemotron rerank finetune -c default --batch my-cluster run.env.time=08:00:00
Remote Job Inspection#
Use dry runs to inspect commands and generated job configuration before submitting cluster work:
nemotron rerank finetune -c default --run my-cluster --dry-run
nemotron rerank eval -c default --batch my-cluster --dry-run
Multi-stage nemotron rerank run pipelines on a cluster require a profile that sets remote_job_dir or env_vars.NEMO_RUN_DIR, so that all stages share the same run directory and can read each other's outputs. Local Docker execution does not require remote_job_dir; it uses the Docker work directory and the declared mounts. The deploy command is local-only, and --stage is rejected for rerank commands until staged-file execution is implemented.
Configuration#
Each stage has a config/ directory with YAML configuration files.
| File | Purpose |
|---|---|
| default.yaml | Production-ready configuration |
Key Configuration Options#
The default config points at a small pinned sample corpus. For your own corpus, override the Stage 0 paths:
Stage 0: SDG
corpus_id: my_corpus # Identifier for your corpus
corpus_dir: ./data/corpus # Path to your documents
file_extensions: ".txt,.md,.text," # File types to process, including extensionless files
output_dir: ./output/rerank/stage0_sdg # Path to save the generated data
artifact_extraction_model: nvidia/nemotron-3-nano-30b-a3b # LLM for document artifacts extraction
qa_generation_model: nvidia/nemotron-3-nano-30b-a3b # LLM for QA generation
quality_judge_model: nvidia/nemotron-3-nano-30b-a3b # LLM for QA quality evaluation
Stage 1: Data Prep
base_model: nvidia/llama-nemotron-embed-1b-v2 # Embedding model for hard negative mining
quality_threshold: 7.0 # Minimum Q&A quality score (0-10)
hard_negatives_to_mine: 5 # Number of hard negatives per query
query_max_length: 512 # Max query tokens
passage_max_length: 512 # Max passage tokens
train_ratio: 0.8 # Training data split (80%)
val_ratio: 0 # Validation split (0% — maximizes train/test for small data)
test_ratio: 0.2 # Test split (20%)
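The interaction between quality_threshold and the split ratios can be sketched as below. This is an illustration only: the field name `quality_score`, the inclusive threshold comparison, the choice to filter before splitting, and the helper name `filter_and_split` are all assumptions, not details confirmed by the recipe.

```python
import random
from typing import Dict, List, Tuple

Pair = Dict[str, object]


def filter_and_split(
    pairs: List[Pair],
    quality_threshold: float = 7.0,
    train_ratio: float = 0.8,
    val_ratio: float = 0.0,
    test_ratio: float = 0.2,
    seed: int = 42,
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Drop low-quality pairs, shuffle deterministically, then split by ratio."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1.0")
    kept = [p for p in pairs if float(p["quality_score"]) >= quality_threshold]
    random.Random(seed).shuffle(kept)
    n_train = int(len(kept) * train_ratio)
    n_val = int(len(kept) * val_ratio)
    train = kept[:n_train]
    val = kept[n_train:n_train + n_val]
    test = kept[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test


# 50 hypothetical pairs with scores 0..9; only scores 7, 8, 9 pass the threshold.
pairs = [{"question": f"q{i}", "quality_score": float(i % 10)} for i in range(50)]
train, val, test = filter_and_split(pairs)
print(len(train), len(val), len(test))  # prints 12 0 3
```

With the default threshold only 15 of the 50 example pairs survive, which shows why a small corpus combined with a strict threshold can leave very little test data.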
Stage 2: Finetune
base_model: nvidia/llama-nemotron-rerank-1b-v2
num_epochs: 3
global_batch_size: 128
learning_rate: 3.0e-6
lr_warmup_steps: 100
lr_decay_style: cosine
weight_decay: 0.01
optimizer_backend: auto # FusedAdam in the container, FlashAdamW locally without TE
flash_adamw_master_weight_bits: 32 # Effective master-weight precision for FlashAdamW
rerank_max_length: 512 # Max tokens for concatenated query+passage
prompt_template: "question:{query} \n \n passage:{passage}"
train_n_passages: 5 # 1 positive + 4 hard negatives
num_labels: 1
temperature: 1.0
Warning (overfitting risk): The default num_epochs: 3 is set for the small example dataset shipped with this recipe, where fewer epochs may not produce a visible training signal. For most real-world datasets, 1–2 epochs is sufficient and 3 epochs carries a high risk of overfitting. Lower this value when working with your own data (e.g., nemotron rerank finetune -c default num_epochs=1).
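One plausible reading of how prompt_template, train_n_passages, and temperature fit together in the Stage 2 settings is sketched below. The source only states that training uses a classification loss over query and passage inputs; the softmax cross-entropy over a group of one positive and four hard negatives shown here is an assumption, and `build_inputs` and `listwise_cross_entropy` are hypothetical names.

```python
import math
from typing import List

PROMPT_TEMPLATE = "question:{query} \n \n passage:{passage}"


def build_inputs(query: str, passages: List[str]) -> List[str]:
    """Format each query-passage pair into the text the cross-encoder reads."""
    return [PROMPT_TEMPLATE.format(query=query, passage=p) for p in passages]


def listwise_cross_entropy(
    scores: List[float], positive_index: int = 0, temperature: float = 1.0
) -> float:
    """Cross-entropy of the positive passage against the whole group.

    scores holds one relevance score per passage (num_labels: 1), with the
    positive passage at positive_index.
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_sum_exp - scaled[positive_index]


inputs = build_inputs("what is AI?", ["AI is artificial intelligence"] + ["unrelated"] * 4)
uniform_loss = listwise_cross_entropy([0.0] * 5)               # model cannot tell passages apart
confident_loss = listwise_cross_entropy([4.0, 0.0, 0.0, 0.0, 0.0])  # positive scored clearly higher
print(len(inputs), round(uniform_loss, 4), round(confident_loss, 4))
```

Under this reading, a model that scores all five passages equally starts at a loss of ln(5), and the loss falls toward zero as the positive passage pulls ahead of the negatives.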
Stage 3: Eval
base_model: nvidia/llama-nemotron-rerank-1b-v2 # Base reranker for comparison
retrieval_model: nvidia/llama-nemotron-embed-1b-v2 # First-stage retriever
k_values: [1, 5, 10, 100] # K values for nDCG@k, Recall@k
top_k: 100 # Number of first-stage candidates to re-rank
max_length: 512 # Max sequence length
eval_base: true # Evaluate base reranker
eval_finetuned: true # Evaluate fine-tuned reranker
eval_nim: false # Evaluate NIM endpoint
Stage 4: Export
model_path: ./output/rerank/stage2_finetune/checkpoints/LATEST/model/consolidated
export_to_trt: false # Export to TensorRT (requires nemo:25.07+ container)
quant_cfg: null # Quantization: null, "fp8", "int8_sq"
calibration_query: what information is relevant to this query?
prompt_template: "question:{query} \n \n passage:{passage}"
trt_opt_batch: 16 # Optimal batch size for TRT
trt_opt_seq_len: 256 # Optimal sequence length for TRT
Stage 5: Deploy
nim_image: nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2:1.10.0
model_dir: ./output/rerank/stage4_export/onnx # ONNX or TensorRT export directory
bind_address: 127.0.0.1 # Host interface for the NIM ranking API
host_port: 8000 # Host port for the NIM ranking API
detach: false # Run in background
replace_existing: false # Require explicit opt-in before replacing a running container
keep_failed_container: false # Remove unhealthy detached containers after timeout
Overriding Configuration#
Override config values on the command line:
# Override training epochs
nemotron rerank finetune -c default num_epochs=5
# Override learning rate
nemotron rerank finetune -c default learning_rate=1e-5
# Override multiple values
nemotron rerank finetune -c default num_epochs=2 learning_rate=1e-5
# Force specific attention implementation
nemotron rerank finetune -c default attn_implementation=flash_attention_2
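The key=value overrides used above, including dotted keys such as run.env.partition=batch, follow a common pattern that can be sketched as follows. This illustrates the general idea only; it is not the CLI's actual parser, and `parse_value` and `apply_overrides` are hypothetical helpers.

```python
import ast
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Best-effort conversion of an override value to a Python type."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    if lowered in ("null", "none"):
        return None
    try:
        return ast.literal_eval(raw)  # numbers such as 2 or 1e-5
    except (ValueError, SyntaxError):
        return raw  # anything else stays a string, e.g. batch or 08:00:00


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply key=value overrides in place; dotted keys address nested sections."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


config = {"num_epochs": 3, "learning_rate": 3.0e-6}
apply_overrides(
    config,
    ["num_epochs=2", "learning_rate=1e-5", "run.env.partition=batch",
     "run.env.time=08:00:00", "detach=true"],
)
print(config)
```

Values that do not parse as a literal are kept as strings, so a time limit like 08:00:00 or a partition name passes through unchanged.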
CLI Commands#
Workspace Info#
# Display workflow overview
nemotron rerank info
Data#
# Generate synthetic Q&A pairs from documents
nemotron rerank sdg -c default corpus_dir=/path/to/docs
# Prepare training data (convert, mine, unroll)
nemotron rerank prep -c default sdg_input_path=/path/to/sdg
Training#
# Fine-tune the cross-encoder reranking model
nemotron rerank finetune -c default train_data_path=/path/to/data
Evaluation#
# Evaluate base and fine-tuned rerankers
nemotron rerank eval -c default finetuned_model_path=/path/to/checkpoint
Export#
# Export model to ONNX
nemotron rerank export -c default model_path=/path/to/checkpoint
# Export to ONNX only (skip TensorRT)
nemotron rerank export -c default export_to_trt=false
# Export with FP8 quantization
nemotron rerank export -c default quant_cfg=fp8
Deploy#
# Deploy the Stage 4 ONNX export in background (detached mode)
nemotron rerank deploy -c default detach=true
# Deploy a TensorRT export instead
nemotron rerank deploy -c default model_dir=./output/rerank/stage4_export/tensorrt
# Serve the image default model instead of a fine-tuned export
nemotron rerank deploy -c default model_dir=null
# Replace a recipe-owned container with the same name
nemotron rerank deploy -c default detach=true replace_existing=true
# Stop the NIM container
docker stop nemotron-rerank-nim
Verify NIM Accuracy#
# Evaluate NIM endpoint against fine-tuned checkpoint
nemotron rerank eval -c default eval_nim=true eval_base=false
# The output will show if NIM metrics match the checkpoint
# ok indicates metrics match within tolerance (0.03 for @1, 0.01 for @5+)
# MISMATCH indicates potential accuracy loss beyond ONNX/TensorRT conversion noise
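The tolerance rule in the comments above (0.03 for @1, 0.01 for @5 and higher) can be expressed as a small check. This is a sketch of the stated rule, not the recipe's evaluation code: the metric-name format, the inclusive comparison, and the helper names are assumptions, and cutoffs between 2 and 4 are rejected because the source does not define a tolerance for them.

```python
from typing import Dict


def tolerance_for(metric_name: str) -> float:
    """Map a metric name such as 'nDCG@10' to its allowed absolute difference."""
    k = int(metric_name.rsplit("@", 1)[1])
    if k == 1:
        return 0.03
    if k >= 5:
        return 0.01
    raise ValueError(f"no tolerance defined for cutoff {k}")


def compare_metrics(checkpoint: Dict[str, float], nim: Dict[str, float]) -> Dict[str, str]:
    """Label each shared metric 'ok' or 'MISMATCH' based on its tolerance."""
    report = {}
    for name in sorted(checkpoint.keys() & nim.keys()):
        diff = abs(checkpoint[name] - nim[name])
        report[name] = "ok" if diff <= tolerance_for(name) else "MISMATCH"
    return report


# Hypothetical metric values, for illustration only.
checkpoint_metrics = {"nDCG@1": 0.70, "nDCG@10": 0.80}
nim_metrics = {"nDCG@1": 0.72, "nDCG@10": 0.78}
report = compare_metrics(checkpoint_metrics, nim_metrics)
print(report)  # prints {'nDCG@1': 'ok', 'nDCG@10': 'MISMATCH'}
```

The same absolute gap of 0.02 passes at @1 but fails at @10, because top-1 metrics move more when a single document changes position and are therefore given the looser tolerance.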
Test the Deployed Service#
curl -X POST http://localhost:8000/v1/ranking \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/llama-nemotron-rerank-1b-v2",
"query": {"text": "what is AI?"},
"passages": [
{"text": "AI is artificial intelligence"},
{"text": "The weather is sunny today"}
],
"truncate": "END"
}'
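The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; `build_ranking_request` and `rank` are hypothetical helper names, and the response is returned as parsed JSON without assuming its schema.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_ranking_request(
    query: str,
    passages: List[str],
    model: str = "nvidia/llama-nemotron-rerank-1b-v2",
    truncate: str = "END",
) -> Dict[str, Any]:
    """Build the JSON body used in the curl example."""
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": passage} for passage in passages],
        "truncate": truncate,
    }


def rank(
    query: str,
    passages: List[str],
    url: str = "http://localhost:8000/v1/ranking",
    timeout: float = 30.0,
) -> Any:
    """POST the request to a running NIM container and return the parsed JSON."""
    body = json.dumps(build_ranking_request(query, passages)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_ranking_request(
    "what is AI?", ["AI is artificial intelligence", "The weather is sunny today"]
)
print(json.dumps(payload, indent=2))
```

Once the container from Stage 5 is running, calling `rank("what is AI?", [...])` sends the payload to the default host_port of 8000; the block above only prints the payload so it can be run without a live service.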
Output Structure#
After running the full pipeline:
output/rerank/
├── stage0_sdg/ # Synthetic Q&A pairs
│ └── generated_batch*.json
├── stage1_prep/ # Training-ready data
│ ├── train.json # Original training data
│ ├── train_mined.automodel.json # With hard negatives
│ ├── train_mined.automodel_unrolled.json # Final training file
│ ├── corpus/ # Document corpus
│ └── eval_beir/ # BEIR-format evaluation data
├── stage2_finetune/ # Model checkpoints
│ └── checkpoints/
│ └── LATEST/model/consolidated/ # Final model
├── stage3_eval/ # Evaluation results
│ └── eval_results.json
└── stage4_export/ # Exported models
├── onnx/ # ONNX model files mounted by deploy by default
│ └── model.onnx
└── tensorrt/ # TensorRT engine (if enabled)
└── model.plan
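Because each stage writes to a predictable location, a quick check of which outputs already exist shows where a partially completed run can resume. The sketch below uses the paths from the tree above; choosing one representative file per stage is an assumption made for illustration, and `completed_stages` is a hypothetical helper.

```python
import tempfile
from pathlib import Path
from typing import List

# One representative output per stage, relative to output/rerank/.
EXPECTED_OUTPUTS = {
    "stage0_sdg": "stage0_sdg/generated_batch*.json",
    "stage1_prep": "stage1_prep/train_mined.automodel_unrolled.json",
    "stage2_finetune": "stage2_finetune/checkpoints/LATEST/model/consolidated",
    "stage3_eval": "stage3_eval/eval_results.json",
    "stage4_export": "stage4_export/onnx/model.onnx",
}


def completed_stages(output_root: str) -> List[str]:
    """Return the stages whose representative output exists under output_root."""
    root = Path(output_root)
    return [stage for stage, pattern in EXPECTED_OUTPUTS.items() if any(root.glob(pattern))]


# Demo on a throwaway directory where only Stages 0 and 1 have produced output.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "stage0_sdg").mkdir()
    (root / "stage0_sdg" / "generated_batch0.json").write_text("[]", encoding="utf-8")
    (root / "stage1_prep").mkdir()
    (root / "stage1_prep" / "train_mined.automodel_unrolled.json").write_text("[]", encoding="utf-8")
    done = completed_stages(tmp)

print(done)  # prints ['stage0_sdg', 'stage1_prep']
```

In the demo state, the next command to run would be the Stage 2 fine-tuning step.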
Evaluation Metrics#
The evaluation stage measures reranking quality using a two-stage approach: first-stage dense retrieval followed by cross-encoder re-ranking. This mirrors how rerankers are used in production.
| Metric | Description | Range |
|---|---|---|
| nDCG@k | Normalized Discounted Cumulative Gain (ranking quality) | 0.0-1.0 |
| Recall@k | Fraction of relevant documents in top-k results | 0.0-1.0 |
| Precision@k | Fraction of retrieved documents that are relevant | 0.0-1.0 |
| MAP@k | Mean Average Precision | 0.0-1.0 |
Higher scores indicate better re-ranking performance. The key metric to watch is nDCG@10, which captures how well the reranker promotes relevant documents to the top of the list.
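For intuition about what nDCG@k and Recall@k reward, here is a minimal sketch with binary relevance labels. It is not the evaluation code used by Stage 3; the function names are chosen for illustration, and graded relevance is left out for brevity.

```python
import math
from typing import List, Set


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), ranks from 1."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary relevance)."""
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


relevant = {"d1"}
first_stage_order = ["d3", "d1", "d2"]   # relevant document retrieved at rank 2
reranked_order = ["d1", "d3", "d2"]      # reranker promotes it to rank 1

ndcg_before = ndcg_at_k(first_stage_order, relevant, k=10)
ndcg_after = ndcg_at_k(reranked_order, relevant, k=10)
print(round(ndcg_before, 4), round(ndcg_after, 4))  # prints 0.6309 1.0
```

Recall@10 is 1.0 for both orderings, since the relevant document is in the candidate set either way. Only nDCG registers the improvement from moving it to rank 1, which is why it is the metric to watch for a reranker.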
Key Components#
| Component | Purpose | Repository |
|---|---|---|
| retriever-sdg | Synthetic data generation using NeMo Data Designer | |
| Automodel | Cross-encoder model training framework | |
| BEIR | Evaluation framework for information retrieval | |
| NeMo Export-Deploy | ONNX/TensorRT export for optimized inference | |
| NVIDIA NIM | Production inference microservice with ranking API | |
Base Model#
| Property | Value |
|---|---|
| Model | nvidia/llama-nemotron-rerank-1b-v2 |
| Parameters | ~1B |
| Architecture | Cross-encoder (sequence classification) |
| Max Sequence Length | 512 (concatenated query + passage) |
| Pooling | Average |
| HuggingFace | |
Further Reading#
NeMo Data Designer Documentation - Synthetic data generation framework
Automodel Documentation - Model training framework
BEIR Benchmark - Information retrieval evaluation
NVIDIA NIM Documentation - Production inference microservices
Llama-Nemotron-Rerank-1B-v2 Model Card - Base model details
Retrieval-Synthetic-NVDocs-v1 Dataset - Pre-generated synthetic retrieval dataset
Support#
For issues, questions, or contributions:
Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: Nemotron Documentation