Retrieval Fine-Tuning (Bi-Encoder and Cross-Encoder)
Introduction
Retrieval fine-tuning optimizes a model for search, retrieval-augmented generation (RAG), semantic similarity, and reranking. NeMo AutoModel provides two retrieval fine-tuning recipes:
- Bi-encoder fine-tuning trains one encoder to produce query and passage embeddings. Use it when you need fast nearest-neighbor search over a document index.
- Cross-encoder fine-tuning trains a reranker that scores a query and passage together. Use it after a retriever has produced a shortlist and you want stronger ranking quality.
Both recipes use retrieval examples where the first passage is positive and the remaining passages are negatives. A common workflow is to train a bi-encoder, use it to mine harder negatives, then train either a stronger bi-encoder or a cross-encoder reranker.
Workflow Overview
Most retrieval projects move through the same loop:
Start with a bi-encoder when you need embeddings for approximate nearest-neighbor search. Add hard-negative mining after the first pass if the model mostly sees easy negatives. Train a cross-encoder when a separate retriever already produces a small candidate set and you want a stronger reranking stage.
Quickstart
Before running the examples:
- Use an AutoModel environment with the full GPU training dependencies installed. The NGC container is the safest path for multi-GPU runs; for source checkouts, see Installation and run uv sync --frozen --extra all.
- Run the example commands from a source checkout or an NGC/container workspace that contains the repository examples/ tree. The YAML configs and mining helpers below are repo-relative; if you use an installed package without the repository, copy the referenced config/script files into your own project and update the paths.
- From a source checkout, use uv run automodel ...; from an installed environment that has local copies of the configs, use automodel ....
- Accept access terms for the configured Hugging Face model and set HF_TOKEN, or replace the model path with a model your environment can download. See the support matrix below before swapping model families.
- Make sure every rank can read the dataset paths or hf:// sources.
The commands below use automodel; if you are running from a source checkout, prefix them with uv run. For direct
torchrun commands, use uv run torchrun ... from a source checkout, or activate an installed environment first.
Start with a one-GPU smoke test. The timestamped checkpoint directory keeps this first command from silently resuming or appending metrics to an older run:
max_train_samples shortens the training rows after the configured hf:// split and corpus are loaded. This is a
training-step smoke test, not a cheap data-loading smoke test; first-run downloads and corpus loading can still take
time and disk. For a tiny data-loading test, point the config at a small local retrieval JSON/JSONL sample.
The first artifact to check is training.jsonl under checkpoint.checkpoint_dir. JSONL metrics are buffered, so
stdout/stderr are still the best live signal during a very short run.
Scale the Llama 3.2 1B bi-encoder example to the GPUs on your machine:
Run the matching cross-encoder example:
Adjust --nproc-per-node to the number of GPUs on your machine. The examples use FSDP2 and bfloat16 by default. The
example scheduler uses global_batch_size: 128 and local_batch_size: 4, so GPU counts that do not divide 32 need an
explicit --step_scheduler.global_batch_size override. For example, 6 GPUs can use
--step_scheduler.global_batch_size 120 or another multiple of 4 * 6.
Choose a Recipe
The bi-encoder computes a query embedding and passage embeddings independently. The cross-encoder formats each query-passage pair into one sequence and predicts a score for each candidate passage.
Supported model families and effective retrieval kwargs:
Known model types with a registry entry fail fast when the requested retrieval task is unsupported rather than falling
back silently. For example, direct ministral3 loading is supported for bi-encoder embeddings but not for the
cross-encoder scoring recipe. If you are extracting a text tower from a parent checkpoint, set
model.extract_submodel: language_model. Extracted text backbones go through the extraction path: supported extracted
types use registered retrieval classes, while unsupported extracted types can fall back to Hugging Face sequence
classification for cross-encoder scoring.
Treat unregistered decoder-only fallback models as an architecture experiment, not just a drop-in model swap. Registered
retrieval backbones such as the Llama bidirectional path use retrieval-specific attention behavior; a vanilla
Hugging Face AutoModel fallback can keep the source model’s original causal behavior, which may be lower quality for
symmetric embedding retrieval.
Prepare Data
Use the retrieval dataset format described in Retrieval Dataset. Choose the data path that matches the workflow you need:
The key field requirements differ by source:
neg_doc must be present for local JSON and JSONL sources. It may be [] only when n_passages: 1; when
n_passages > 1, provide at least one negative.
n_passages: 1 is useful for schema checks or custom negative strategies, but it is not a good default training setup.
The standard bi-encoder and cross-encoder recipes need at least one negative candidate for meaningful contrastive or
reranking supervision, unless you add a custom strategy such as qrels-aware in-batch negatives.
For quick custom experiments, inline JSONL is the simplest format. Use the inline dataset factory for these files, and switch to corpus ID-based JSON before hard-negative mining or full-corpus evaluation:
To migrate inline data to corpus ID JSON, assign a stable document ID to each unique passage, write those passages into
a corpus split with id and text columns, then replace inline pos_doc and neg_doc strings with those IDs. Keep
all known positives for each query in your qrels or source metadata, even if each training row uses only the first
positive. Otherwise, in-batch negatives and mined hard negatives can accidentally treat another relevant passage as a
negative. The detailed source schemas and conversion rules live in Retrieval Dataset.
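The migration steps above can be sketched in a few lines of Python. This is an illustration, not the AutoModel converter: the question field name, the hash-based ID scheme, and the flat list-of-IDs layout for pos_doc and neg_doc are assumptions, so check Retrieval Dataset for the exact schema before writing real files.

```python
import hashlib


def stable_doc_id(text):
    """Deterministic ID from passage text (an assumed scheme, not AutoModel's)."""
    return "doc_" + hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]


def to_ids(passages, corpus):
    """Register each passage in the corpus table and return its ID."""
    ids = []
    for text in passages:
        doc_id = stable_doc_id(text)
        corpus[doc_id] = text  # identical text always maps to the same ID
        ids.append(doc_id)
    return ids


def inline_to_corpus_ids(inline_rows):
    """Split inline rows into a corpus table (id, text) and ID-based rows."""
    corpus = {}
    id_rows = []
    for row in inline_rows:
        id_rows.append({
            "question": row["question"],  # hypothetical query field name
            "pos_doc": to_ids(row["pos_doc"], corpus),
            "neg_doc": to_ids(row["neg_doc"], corpus),
        })
    corpus_rows = [{"id": doc_id, "text": text} for doc_id, text in corpus.items()]
    return corpus_rows, id_rows


inline_rows = [
    {"question": "capital of France?",
     "pos_doc": ["Paris is the capital of France."],
     "neg_doc": ["Berlin is the capital of Germany."]},
    {"question": "capital of Germany?",
     "pos_doc": ["Berlin is the capital of Germany."],
     "neg_doc": ["Paris is the capital of France."]},
]
corpus_rows, id_rows = inline_to_corpus_ids(inline_rows)
```

Deriving the ID from the passage text keeps IDs stable across reruns and collapses exact duplicates into one corpus entry. It does not catch near-duplicates, so a separate deduplication pass is still worthwhile before mining.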
For larger corpora, use the corpus ID-based JSON format from the dataset guide. Use
nemo_automodel.components.datasets.llm.make_retrieval_dataset for corpus ID-based JSON and for hf:// sources that
already follow the AutoModel retrieval schema, such as:
n_passages controls the size of each query group. For example, n_passages: 5 means one positive and four negatives.
Training uses the first item in pos_doc as the positive, then takes negatives from neg_doc in order. If a record has
fewer negatives than requested, negatives are repeated cyclically to fill the group. Treat repetition as a fallback for
shape compatibility; for real training and validation, prefer enough distinct negatives or lower n_passages.
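The group-building rule above can be illustrated with a small sketch. The build_group helper is hypothetical and only mirrors the described behavior (first positive, negatives in order, cyclic repetition when short); it is not the dataset code itself.

```python
from itertools import cycle, islice


def build_group(pos_doc, neg_doc, n_passages):
    """Return [positive, negatives...], repeating negatives cyclically if short."""
    n_negatives = n_passages - 1
    if n_negatives > 0 and not neg_doc:
        raise ValueError("n_passages > 1 requires at least one negative")
    negatives = list(islice(cycle(neg_doc), n_negatives)) if n_negatives > 0 else []
    return [pos_doc[0]] + negatives


# Two negatives stretched to fill four negative slots.
group = build_group(["p1", "p2"], ["n1", "n2"], n_passages=5)
# With n_passages: 1, an empty neg_doc is acceptable.
single = build_group(["p1"], [], n_passages=1)
```

In the first call each negative appears twice in the group, which is why repetition is best treated as a shape fallback rather than extra supervision.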
The training recipe does not load a separate qrels file. Materialize qrels into retrieval records before fine-tuning.
For training, pos_doc[0] is the supervised positive. For mining, keep every known positive for the query in pos_doc
so the miner can exclude those IDs; it does not read an external qrels file. If you expand multi-positive queries into
one row per positive, make sure sibling positives are removed from neg_doc and audited out of mined negatives before
training. The helper in examples/retrieval/data_utils/unroll_pos_docs.py writes original_question_id so mined outputs
can still be joined back to the original qrels. Also keep sibling-positive rows out of the same in-batch-negative
training batch, disable distributed in-batch negatives, or add qrels-aware sampling/masking. Keep the original qrels for
offline Recall@K, MRR@K, and nDCG@K evaluation.
Minimal Config Anatomy
This minimal bi-encoder config shows the pieces that must be present in a runnable retrieval fine-tuning job. The sections below explain the model-specific parts in more detail.
For a cross-encoder, change recipe, model._target_, dataloader.dataset.model_type, and dataloader.collate_fn
to the cross-encoder values shown below. Also set model.num_labels: 1, set the loss temperature under
model.temperature, replace q_max_len / p_max_len with rerank_max_length in the collator, and use a separate
checkpoint.checkpoint_dir such as ./output/llama3_2_1b_cross_encoder/checkpoints.
Configure a Bi-Encoder
A bi-encoder config has four important parts: the model, tokenizer, retrieval dataset, and BiEncoderCollator. This
snippet is an excerpt; keep the scheduler, optimizer, checkpoint, and distributed sections from the full config or one
of the examples.
Important knobs:
- pooling: controls how token hidden states become one embedding. Common single-vector choices are avg, cls, last, and weighted_avg. The hard-negative miner supports only single-vector pooling modes; do not mine with colbert pooling, which returns token-level embeddings.
- l2_normalize: normalizes embeddings before scoring. When enabled, the recipe divides scores by temperature.
- q_max_len and p_max_len: set separate truncation lengths for queries and passages.
- query_prefix and passage_prefix: add task-specific text before tokenization. Keep these aligned between training, hard-negative mining, and inference.
- do_distributed_inbatch_negative: optional model setting that treats passages from other data-parallel ranks as additional negatives. Enable it with model.do_distributed_inbatch_negative: true or the CLI override --model.do_distributed_inbatch_negative true. Today it all-gathers over the default process group, so use it only for pure DP/FSDP retrieval runs (tp_size: 1, cp_size: 1). Same-document masking requires doc_id fields from corpus-backed or custom datasets; inline JSONL does not provide duplicate-document masking. For multi-positive queries expanded into separate rows, keep it disabled unless your sampler or masking prevents sibling positives from becoming negatives. Keep it disabled for ColBERT-style pooling.
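To make the pooling, l2_normalize, and temperature knobs concrete, here is a minimal pure-Python sketch of avg pooling followed by normalized scoring. The helper names are hypothetical, and the use of a plain dot product as the similarity is an assumption; the recipe's tensor implementation may differ in detail.

```python
import math


def avg_pool(token_states, attention_mask):
    """Mean of the token vectors whose mask is 1 (padding is ignored)."""
    kept = [vec for vec, m in zip(token_states, attention_mask) if m]
    dim = len(token_states[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def score(query_emb, passage_emb, temperature):
    """Dot product of the two embeddings, divided by temperature."""
    dot = sum(q * p for q, p in zip(query_emb, passage_emb))
    return dot / temperature


query_states = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]  # last token is padding
query_emb = l2_normalize(avg_pool(query_states, [1, 1, 0]))
passage_emb = l2_normalize([2.0, 0.0])
logit = score(query_emb, passage_emb, temperature=0.05)
```

With unit-length embeddings the dot product is bounded by 1 in magnitude, so dividing by a small temperature is what spreads the scores enough for a contrastive loss to separate positives from negatives.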
The complete example is examples/retrieval/bi_encoder/llama3_2_1b.yaml.
Configure a Cross-Encoder
A cross-encoder config uses the same retrieval dataset factory, but sets model_type: cross_encoder and uses
CrossEncoderCollator. The dataset transform flattens each query with its positive and negative passages so the model
scores each query-passage pair. This snippet is an excerpt; keep the same scheduler, optimizer, checkpoint, and
distributed structure as the bi-encoder config.
Important knobs:
- rerank_max_length: maximum combined query-passage sequence length.
- prompt_template: controls how the pair is serialized before tokenization. It must include {query} and {passage}.
- n_passages: number of candidates per query. The positive passage must remain first in each group because labels point to index 0.
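The flattening step can be pictured with the sketch below. The template string and the flatten_group helper are hypothetical; only the {query} and {passage} placeholders and the positive-at-index-0 convention come from the description above.

```python
prompt_template = "query: {query} passage: {passage}"  # hypothetical template


def flatten_group(query, passages, template):
    """Serialize one query group into pair texts; the positive stays at index 0."""
    for field in ("{query}", "{passage}"):
        if field not in template:
            raise ValueError(f"prompt_template must include {field}")
    pairs = [template.format(query=query, passage=p) for p in passages]
    label = 0  # index of the positive passage within the group
    return pairs, label


pairs, label = flatten_group(
    "what is FSDP?",
    ["FSDP shards parameters across ranks.", "A recipe for sourdough bread."],
    prompt_template,
)
```

Because the label is always 0, any preprocessing that shuffles or sorts passages within a group would silently corrupt supervision.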
The complete example is examples/retrieval/cross_encoder/llama3_2_1b.yaml.
Distributed Launch and Batch Size
Launch single-node examples with automodel <config.yaml> --nproc-per-node <gpus>. The retrieval recipes support
data-parallel training through the configured distributed strategy; pipeline parallelism is not supported for encoder
recipes today.
For multi-node runs, launch with your cluster launcher or an external torchrun command so every node has an explicit
rank and rendezvous endpoint:
Use a shared or pre-populated Hugging Face cache on every node, make dataset paths visible to every rank, and use a
unique checkpoint_dir for each experiment. For multi-node training, checkpoint_dir must be on a shared, persistent
filesystem mounted at the same path from every node; relative ./output/... paths are appropriate only when they resolve
to shared storage. Increase dist_env.timeout_minutes for first model downloads, slow shared filesystems, multi-node
collectives, or large checkpoint writes.
The step scheduler computes gradient accumulation from:
global_batch_size must be divisible by local_batch_size * data_parallel_size, and the result must be at least 1.
In pure data parallelism, data_parallel_size is the total GPU count. With tensor or context parallelism enabled, it is
world_size / (tp_size * cp_size). For example, two 8-GPU nodes with tp_size: 1 and cp_size: 1 have
data_parallel_size: 16; with tp_size: 2, they have data_parallel_size: 8.
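The divisibility rule can be checked before launching a job. The function below is a sketch of the arithmetic described above, not the step scheduler's actual code, and its name and signature are hypothetical.

```python
def grad_accumulation_steps(global_batch_size, local_batch_size, world_size,
                            tp_size=1, cp_size=1):
    """Gradient accumulation steps implied by the batch-size settings."""
    data_parallel_size = world_size // (tp_size * cp_size)
    per_step = local_batch_size * data_parallel_size  # query groups per micro-step
    if global_batch_size < per_step or global_batch_size % per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} must be a positive multiple of "
            f"local_batch_size * data_parallel_size = {per_step}"
        )
    return global_batch_size // per_step


steps_8gpu = grad_accumulation_steps(128, 4, world_size=8)            # 128 / (4 * 8)
steps_6gpu = grad_accumulation_steps(120, 4, world_size=6)            # 120 / (4 * 6)
steps_2node = grad_accumulation_steps(128, 4, world_size=16)          # 128 / (4 * 16)
steps_tp2 = grad_accumulation_steps(128, 4, world_size=16, tp_size=2) # 128 / (4 * 8)
```

Calling it with global_batch_size=128, local_batch_size=4, and world_size=6 raises, which matches the note that 6 GPUs need an explicit global batch size override.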
local_batch_size is the number of query groups per rank. For memory pressure, reduce
step_scheduler.local_batch_size first, then sequence lengths (q_max_len, p_max_len, or rerank_max_length), then
n_passages. Bi-encoders scale memory with query length plus local_batch_size * n_passages passage sequences;
cross-encoders scale with local_batch_size * n_passages combined query-passage sequences.
Current retrieval datasets are map-style datasets loaded in each process, not streaming distributed inputs. Pre-cache HF data on each node or use a shared cache, and budget CPU RAM and local disk per rank for corpus-backed datasets.
Add Validation
Both examples include a commented validation_dataloader block. Enable it when you have a held-out retrieval file.
Use the dataset factory that matches the validation source:
nemo_automodel.components.datasets.llm.retrieval_dataset.make_retrieval_dataset for hf:// or corpus ID-based JSON,
and retrieval_dataset_inline.make_retrieval_dataset only for inline JSONL. This corpus-backed example mirrors the
shipped configs:
Validation logs val_loss, val_acc1, and val_mrr to validation.jsonl under checkpoint.checkpoint_dir. These
metrics measure ranking within each candidate group in the validation file; they are not full-corpus Recall@K or nDCG
metrics. For cross-encoder validation, use model_type: cross_encoder and CrossEncoderCollator instead. In multi-rank
runs, validation uses the same distributed sampler path as training and can drop tail examples to keep rank shapes even;
make the validation set divisible by the data-parallel world size or run validation on one GPU when you need every
candidate group included.
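For intuition about what candidate-group metrics measure, the sketch below computes top-1 accuracy and MRR over groups whose positive sits at index 0. It uses the standard definitions with ties resolved in the positive's favor; that tie rule and the helper name are assumptions, and the recipe's exact computation of val_acc1 and val_mrr may differ.

```python
def group_metrics(score_groups):
    """score_groups: one list of candidate scores per query, positive at index 0."""
    hits, reciprocal_ranks = 0, 0.0
    for scores in score_groups:
        # Rank of the positive = 1 + number of negatives scoring strictly higher.
        rank = 1 + sum(1 for s in scores[1:] if s > scores[0])
        hits += rank == 1
        reciprocal_ranks += 1.0 / rank
    n = len(score_groups)
    return {"acc1": hits / n, "mrr": reciprocal_ranks / n}


metrics = group_metrics([
    [0.9, 0.1, 0.3],  # positive ranked 1st
    [0.2, 0.8, 0.1],  # positive ranked 2nd
    [0.4, 0.6, 0.5],  # positive ranked 3rd
])
```

The rank can never exceed n_passages here, which is why these numbers look far better than full-corpus metrics for the same model.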
Evaluate Retrieval Quality
Candidate-group validation is a smoke test for the training objective. To decide whether a bi-encoder is useful for RAG candidate generation, evaluate against a fixed held-out corpus and qrels:
- Encode corpus passages with the same tokenizer, pooling, normalization, passage prefix, and p_max_len used in training.
- Build an ANN or exact top-k index. With l2_normalize: true, use inner product or cosine similarity.
- Encode held-out queries with the matching query prefix and q_max_len.
- Report full-corpus Recall@K, MRR@K, and nDCG@K for the K values your application uses.
AutoModel does not currently provide a one-command full-corpus retrieval evaluator in this guide. Use your existing IR evaluation stack or a small script around the consolidated checkpoint and report enough run details to make the result repeatable: query count, corpus size, qrels source, judged/unjudged handling, exact versus ANN search settings, K values, baseline checkpoint, and whether confidence intervals or significance tests were used. At minimum, make the script inputs explicit: a consolidated bi-encoder checkpoint, a corpus table with stable document IDs, a query table with stable query IDs, qrels keyed by those IDs, the query/passage prefixes and max lengths, and the K values to report.
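As a starting point for such a script, the sketch below shows exact inner-product search and binary-relevance Recall@K, MRR@K, and nDCG@K for one query. The embeddings are toy values standing in for encoder output, and the helper names are hypothetical; a real evaluation would batch the search and average over all queries.

```python
import math


def top_k(query_emb, corpus_embs, k):
    """Exact inner-product search; returns doc IDs by descending score."""
    scored = sorted(
        corpus_embs.items(),
        key=lambda item: sum(q * d for q, d in zip(query_emb, item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


def metrics_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance Recall@K, MRR@K, and nDCG@K for one query."""
    hit_ranks = [i + 1 for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant_ids]
    recall = len(hit_ranks) / len(relevant_ids)
    mrr = 1.0 / hit_ranks[0] if hit_ranks else 0.0
    dcg = sum(1.0 / math.log2(rank + 1) for rank in hit_ranks)
    ideal_hits = min(k, len(relevant_ids))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return {"recall": recall, "mrr": mrr, "ndcg": dcg / ideal}


corpus_embs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
qrels = {"q1": {"d2", "d3"}}  # every known positive for the query
ranked = top_k([0.9, 0.1], corpus_embs, k=2)
result = metrics_at_k(ranked, qrels["q1"], k=2)
```

Recall here is computed against all judged positives in the qrels, not just pos_doc[0], which is one reason to keep the original qrels after unrolling training rows.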
For cross-encoders, freeze a first-stage retriever, rerank its top-K candidates, and report reranking metrics on that same candidate set. Also report first-stage candidate recall or coverage: if a query’s positive document is missing from the retriever top-K, count that query as a miss rather than dropping it from reranker evaluation. Do not compare cross-encoder candidate-group validation directly to full-corpus bi-encoder metrics.
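One way to honor the count-misses rule is to average over every query in the qrels and report candidate coverage alongside the reranking metric. The function and its input layout are hypothetical; candidates_by_query is assumed to hold each query's candidates already sorted by reranker score.

```python
def rerank_mrr_with_misses(candidates_by_query, qrels, k):
    """MRR@K over ALL queries; a positive missing from the candidates scores 0."""
    total, covered = 0.0, 0
    for qid, relevant in qrels.items():
        reranked = candidates_by_query.get(qid, [])
        if any(doc_id in relevant for doc_id in reranked):
            covered += 1  # first-stage retriever surfaced at least one positive
        for rank, doc_id in enumerate(reranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    n = len(qrels)
    return {"mrr": total / n, "candidate_coverage": covered / n}


qrels = {"q1": {"a"}, "q2": {"b"}}
candidates_by_query = {"q1": ["x", "a"], "q2": ["y", "z"]}  # q2's positive was never retrieved
report = rerank_mrr_with_misses(candidates_by_query, qrels, k=10)
```

Dropping q2 instead would report an MRR of 0.5 rather than 0.25, overstating end-to-end quality.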
Monitor Training
Training metrics are written to training.jsonl under checkpoint.checkpoint_dir. The file logger buffers records in
chunks before writing and flushes the remaining records on close, so tail -f is useful for completed or longer runs
but may not update during a short smoke test:
Use stdout/stderr as the live per-step signal today. Watch loss, grad_norm, learning rate, GPU memory, and step time
before scaling to a longer run. On preempted or timed-out jobs, recent buffered JSONL metrics may be missing even when
stdout/stderr showed them.
The examples include a commented wandb block. Enable it when you want remote tracking, and tune
step_scheduler.log_remote_every_steps to control remote logging cadence.
Enable LoRA
Retrieval recipes support the same PEFT block used by other AutoModel fine-tuning recipes. Uncomment or add peft to
train LoRA adapters instead of updating every weight:
Use LoRA when you need lower memory use or want to ship a small adapter. Use full fine-tuning when you can afford the memory and want maximum adaptation.
Mine Hard Negatives
After an initial bi-encoder run, mine harder negatives with the consolidated encoder checkpoint. Hard-negative mining
expects the corpus ID-based retrieval JSON format described in the dataset guide, not the inline JSONL shortcut. The
input must reference one corpus so the miner can build a passage embedding cache, retrieve candidates, and write mined
negatives back to each query. Every row’s corpus_id must match that single loaded corpus, and each mining row must have
a unique question_id; mismatched corpus IDs and duplicate question IDs are rejected before mining.
The quickstart configs use hf:// sources for the first train/eval path. The miner currently reads a local
corpus-backed retrieval JSON file instead of hf:// URIs directly. For a train -> mine -> retrain loop, first
materialize or preprocess your selected HF subset into the corpus ID JSON schema from
Retrieval Dataset, then set --mining.train_qa_file_path to that local JSON file.
For an AutoModel-schema HF subset such as FEVER, materialize one corpus-backed mining input with:
This writes /path/to/retrieval-data/fever-mining/train.json and a local FEVER_corpus/ directory. Run the command
once per subset/corpus that you want to mine; the helper intentionally processes one corpus per run and refuses to write
into a non-empty output directory unless you pass --overwrite. With --overwrite, it writes replacement files in
temporary paths and swaps them in only after the new subset has loaded and serialized successfully. The mining examples
below use that local train.json path.
The default bi-encoder example trains from both FEVER and SyntheticClassificationData. For a full train -> mine ->
retrain loop, materialize and mine each subset separately, give each mined output and embedding cache its own run-specific
path, then list all mined JSON files in the next config’s data_dir_list. Replacing a multi-source config with only one
mined file intentionally trains on that subset only.
This single-node example uses --standalone:
For multi-node mining, replace --standalone with the same explicit rendezvous flags you use for multi-node training.
Every rank must be able to read model_name_or_path, tokenizer_name_or_path if set, train_qa_file_path, and the
corpus path referenced by that JSON at the same filesystem paths:
Replace epoch_0_step_499 with the explicit checkpoint directory that you want to mine from. If you only have
LATEST.txt, read it first and substitute the resolved epoch_*_step_* directory; the mining script loads the
Hugging Face export directly and does not apply AutoModel’s checkpoint resolver. The miner refuses to overwrite an
existing train_file_output_path by default. Choose a new output path for each mining run, or pass
--mining.overwrite_output true only when replacing that file is intentional. If the output JSON is written to a
different directory from the input JSON, the miner rewrites relative corpus paths so retraining still resolves the
original corpus.
Key mining settings in examples/retrieval/data_utils/mining_config.yaml:
- hard_negatives_to_mine: target number of negatives to add per query. The miner can return fewer when the corpus has too few valid candidates or margin filtering removes high-scoring candidates. Audit per-query counts before training.
- hard_neg_margin and hard_neg_margin_type: filter near-positive candidates. With hard_neg_margin_type: perc, candidates scoring above min_positive_score * hard_neg_margin are removed; with abs, candidates scoring above min_positive_score - hard_neg_margin are removed. Inspect mined samples when positive scores are low or negative.
- query_prefix and passage_prefix: keep these semantically consistent with the bi-encoder training config. The miner concatenates prefixes directly, while BiEncoderCollator inserts a space after non-empty prefixes; include the trailing space in mining prefixes. The miner supports static prefixes only. If training used use_dataset_instruction: true, materialize the same instruction text into the mining input or equivalent static prefixes before mining.
- query_max_length and passage_max_length: keep these consistent with training unless you intentionally change truncation.
- add_bos_token and add_eos_token: match the tokenizer behavior used during training. If omitted, mining falls back to tokenizer defaults, which can differ from the training config.
- use_negatives_from_file: include existing negatives from the input file when mining. Existing negatives are prepended to the output and mined negatives are appended, so deduplicate and audit the output before using it for training.
- overwrite_output: defaults to false. Set it to true only when you intentionally want to replace an existing mined output file; the input and output paths must still be different.
- attn_implementation: optional model-loading escape hatch for mining exports that need sdpa, eager, or flash_attention_2 pinned.
- cache_embeddings_dir: required for distributed mining so ranks can share cached passage embeddings. Rank 0 assembles the final embedding cache and score outputs, so plan memory and local disk accordingly. In multi-node mining, this must be a shared writable path mounted at the same location on every node; node-local cache paths leave rank 0 unable to read remote-rank shards.
- Cache reuse: use a fresh cache directory for each model, dataset, prefix, sequence length, corpus_chunk_size, and world-size combination. The miner validates cache metadata and loads the consolidated arrays to verify fingerprint, shape, and readability before reusing a consolidated cache. The fingerprint includes the mining input file, local model/tokenizer path state, ordered document IDs/content, and embedding settings.
- load_embeddings_from_cache: set this to true only when you intentionally want to reuse every cached query shard, corpus chunk, and consolidated embedding file from the same model/input/prefix/length/corpus_chunk_size/world-size run. Fresh run-specific paths are still easier to reason about, especially for mutable Hub IDs or paths that are overwritten in place.
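The two margin modes are easy to confuse, so here is a sketch of the thresholds as described in the hard_neg_margin setting above. The function name and the (doc_id, score) tuple layout are hypothetical; this is not the miner's code.

```python
def filter_by_margin(candidates, positive_scores, margin, margin_type):
    """Drop candidates scoring above a threshold derived from the lowest positive score."""
    min_positive_score = min(positive_scores)
    if margin_type == "perc":
        threshold = min_positive_score * margin
    elif margin_type == "abs":
        threshold = min_positive_score - margin
    else:
        raise ValueError(f"unknown hard_neg_margin_type: {margin_type}")
    return [(doc_id, s) for doc_id, s in candidates if s <= threshold]


candidates = [("d1", 0.82), ("d2", 0.70), ("d3", 0.40)]
# perc: threshold = 0.80 * 0.95 = 0.76, so d1 is removed.
kept_perc = filter_by_margin(candidates, [0.80, 0.90], margin=0.95, margin_type="perc")
# abs: threshold = 0.80 - 0.2 = 0.60, so d1 and d2 are removed.
kept_abs = filter_by_margin(candidates, [0.80, 0.90], margin=0.2, margin_type="abs")
# Negative positive score: perc threshold = -0.5 * 0.95 = -0.475, which is ABOVE the positive.
kept_negative = filter_by_margin([("d4", -0.49)], [-0.5], margin=0.95, margin_type="perc")
```

The last call keeps d4 even though it outscores the positive, because multiplying a negative score by a margin below 1 raises the threshold instead of lowering it. That is the reason to inspect mined samples when positive scores are low or negative.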
pooling and l2_normalize are saved bi-encoder wrapper metadata, not mining.* config fields. Do not pass
--mining.pooling or --mining.l2_normalize; the miner rejects unknown mining keys. Mine from a saved bi-encoder export
produced with the wrapper settings you want. For an older export that does not carry this metadata, write a new export
before mining:
Hard-negative mining parallelizes embedding generation across ranks, but the final exact scoring step still runs on
rank 0 and materializes the full document embedding matrix there. For very large corpora, use a smaller mining slice
or a custom ANN/blockwise mining workflow instead of expecting this helper to scale to web-scale indexing.
Use the mined output as the next corpus-backed data_dir_list source for another bi-encoder pass or for cross-encoder training. If
the previous run used multiple sources, list the mined file for each source. Hard-negative mining excludes document IDs
listed in each input row’s pos_doc, but it cannot read an external qrels file or know every semantically relevant
duplicate. Put all known positive IDs for the query in the mining input, deduplicate the corpus, inspect mined samples,
filter duplicate IDs and non-finite scores such as -inf from mined outputs, and avoid mining from validation or test
corpora. If you unroll multi-positive training data, mine from rows that still carry every known positive in pos_doc;
otherwise sibling positives can be mined as false negatives. Custom row-level metadata from the input JSON is preserved
in the mined output, while neg_doc and positive-document scores are refreshed.
Run the audit utility before reusing mined output:
--allow-findings keeps this first inspection command from failing the shell when it finds issues. Omit it in CI or
quality gates when findings should fail the job. --min-negatives 1 flags rows that would fail or become degenerate
when retraining with n_passages > 1; increase it if your next training config requires more distinct negatives before
oversampling.
If the report only contains issues that you want to drop automatically, write a cleaned copy:
With --drop-invalid-negatives --output, the command exits successfully when the cleaned output has no remaining audit
findings and still satisfies --min-negatives if you set it. The audit flags missing or non-finite positive scores, and
flags and drops negatives whose IDs also appear in the row’s pos_doc, duplicate negative IDs in the same row, missing
negative scores, and non-finite negative scores.
The cleaned output preserves query lineage fields such as original_question_id, so unrolled examples remain traceable
to their source question.
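If you want to prototype the same checks in your own pipeline, the drop rules listed above reduce to a short filter. This sketch is not the audit utility: the row layout (pos_doc and neg_doc as lists of dicts with id and score keys) is an assumption, so adapt the field access to the actual mined JSON.

```python
import math


def drop_invalid_negatives(row):
    """Return a cleaned copy of one mined row plus the reasons negatives were dropped."""
    positive_ids = {doc["id"] for doc in row["pos_doc"]}
    seen, kept, dropped = set(), [], []
    for neg in row["neg_doc"]:
        score = neg.get("score")
        if neg["id"] in positive_ids:
            dropped.append((neg["id"], "also in pos_doc"))
        elif neg["id"] in seen:
            dropped.append((neg["id"], "duplicate negative"))
        elif score is None:
            dropped.append((neg["id"], "missing score"))
        elif not math.isfinite(score):
            dropped.append((neg["id"], "non-finite score"))
        else:
            kept.append(neg)
        seen.add(neg["id"])
    # Copy every other field unchanged so lineage such as original_question_id survives.
    return {**row, "neg_doc": kept}, dropped


row = {
    "question_id": "q1_0",
    "original_question_id": "q1",
    "pos_doc": [{"id": "d1", "score": 0.9}],
    "neg_doc": [
        {"id": "d1", "score": 0.9},
        {"id": "d2", "score": 0.5},
        {"id": "d2", "score": 0.5},
        {"id": "d3"},
        {"id": "d4", "score": float("-inf")},
    ],
}
cleaned, dropped = drop_invalid_negatives(row)
```

After cleaning, compare len(cleaned["neg_doc"]) against the n_passages - 1 negatives your next config expects, since dropping can leave a row short.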
Save, Resume, and Use the Checkpoint
Set checkpointing in the config:
Each save creates a versioned directory such as:
Checkpoint directory names use the scheduler step at save time. The saved scheduler state advances to the next step, so
for exact paths prefer the Saving checkpoint to ... log line or the LATEST pointer over hand-constructing a step
number.
With save_consolidated: true and full fine-tuning, AutoModel also writes a Hugging Face-compatible model under:
Use the concrete epoch_*_step_* directory printed in your logs. Some workflows also create a LATEST symlink, but
direct Hugging Face and mining loads expect a real exported model path. If your run produced LATEST.txt instead of a
symlink, read that file and substitute the resolved checkpoint directory before calling from_pretrained() or
mine_hard_negatives.py.
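A small helper can do that substitution before you call a loader. This sketch assumes LATEST.txt contains a single checkpoint path, either absolute or relative to checkpoint_dir; that file format and the resolve_latest name are assumptions, so confirm the contents of your own pointer file first.

```python
import tempfile
from pathlib import Path


def resolve_latest(checkpoint_dir):
    """Return a concrete checkpoint directory from a LATEST symlink or LATEST.txt."""
    root = Path(checkpoint_dir)
    link = root / "LATEST"
    if link.is_symlink() or link.is_dir():
        return link.resolve()
    # Assumed format: one path per file, absolute or relative to checkpoint_dir.
    target = Path((root / "LATEST.txt").read_text().strip())
    resolved = target if target.is_absolute() else root / target
    if not resolved.is_dir():
        raise FileNotFoundError(f"pointer target does not exist: {resolved}")
    return resolved.resolve()


# Self-contained demo with a temporary checkpoint layout.
with tempfile.TemporaryDirectory() as tmp:
    step_dir = Path(tmp) / "epoch_0_step_499"
    step_dir.mkdir()
    (Path(tmp) / "LATEST.txt").write_text("epoch_0_step_499\n")
    resolved = resolve_latest(tmp)
    matches = resolved == step_dir.resolve()
    resolved_name = resolved.name
```

Pass the resolved directory's consolidated export subpath, not the pointer itself, to from_pretrained() or the mining script.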
PEFT/LoRA runs save adapter artifacts under the checkpoint model/ directory instead of the full consolidated export
path above. Resume LoRA training from the AutoModel checkpoint directory, but use full fine-tuning when you need the
model/consolidated path for the mining command shown in this guide. If you need mining or serving from LoRA weights,
first produce a HF-loadable merged/exported encoder with your adapter workflow and point
--mining.model_name_or_path at that exported directory.
The LATEST symlink points to the most recent checkpoint when it is valid. To resume from the latest resolved
checkpoint, set:
LATEST is a resolver keyword: AutoModel follows the symlink or pointer file and can fall back to scanning
epoch_*_step_* checkpoint directories if the pointer is not usable. An explicit epoch_*_step_* path is the exact
restore target. If checkpoint.restore_from is omitted, AutoModel auto-detects the latest compatible checkpoint in
checkpoint_dir and resumes from it. Use a new or empty checkpoint_dir for fresh experiments, and rotate or clear
training.jsonl and validation.jsonl if you do not want logs from multiple runs appended together.
When checkpoint.is_async: true, the LATEST symlink can lag the most recent write at job end. For final mining,
export, or evaluation workflows, prefer the explicit epoch_*_step_* checkpoint directory or keep async checkpointing
disabled for the final save.
Use the Model
Use a bi-encoder checkpoint to encode passages, build an approximate nearest-neighbor index, encode queries, and search the index. Keep the same tokenizer, pooling, normalization, prefixes, and max lengths that you used for training.
Minimal bi-encoder loading and scoring sketch:
Use a cross-encoder checkpoint to rerank a shortlist from a retriever. Cross-encoders score each query-passage pair jointly, so they are usually too expensive for first-stage full-corpus search.
Minimal cross-encoder scoring sketch:
Bi-encoder scores are comparable only within the same model, tokenizer, prefix, max-length, pooling, normalization, and indexing setup. Mining scores are raw embedding similarities from that exact setup. Cross-encoder logits are uncalibrated reranking signals; do not mix them with bi-encoder scores or use one global threshold across model versions without calibration.
Troubleshooting
Related Files
- Bi-encoder recipe: nemo_automodel/recipes/retrieval/train_bi_encoder.py
- Cross-encoder recipe: nemo_automodel/recipes/retrieval/train_cross_encoder.py
- Retrieval dataset guide: Retrieval Dataset
- Llama-Embed-Nemotron-8B example: examples/retrieval/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml