# Embedding Model Fine-Tuning Recipe
A complete 6-stage pipeline for fine-tuning and deploying embedding models on domain-specific data using synthetic data generation.
## Quick Start

### Prerequisites

- NVIDIA GPU with at least 80GB VRAM (e.g., A100, H100) for Stages 1–5
- NVIDIA API key for synthetic data generation (Stage 0) – free tier at build.nvidia.com
- Python 3.12+ and the uv package manager
### Installation

```shell
git clone https://github.com/NVIDIA/nemotron
cd nemotron
uv sync --all-extras
```
### Configuration

Set your NVIDIA API key for synthetic data generation:

```shell
export NVIDIA_API_KEY=nvapi-your_key_here
```

Optionally, create an `env.toml` file for Docker or Slurm execution (see Execution through NeMo-Run for details):

```toml
[wandb]
project = "my-embedding-project"
entity = "my-username"

[local-docker]
executor = "docker"
container_image = "nvcr.io/nvidia/pytorch:25.01-py3"
runtime = "nvidia"
ipc_mode = "host"
shm_size = "16g"

[my-cluster]
executor = "slurm"
account = "my-account"
partition = "interactive"
container_image = "nvcr.io/nvidia/pytorch:25.01-py3"
```
### Run the Pipeline

```shell
# Set environment
$ export LD_LIBRARY_PATH=""
$ export NVIDIA_API_KEY=nvapi-your_key_here

# Stage 0: Generate synthetic Q&A pairs (sample corpus auto-downloaded from HuggingFace)
$ nemotron embed sdg -c default

# Or use your own documents:
# nemotron embed sdg -c default corpus_dir=/path/to/your/docs

# Stage 1: Prepare training data (convert, mine hard negatives, unroll)
$ nemotron embed prep -c default

# Stage 2: Fine-tune the embedding model
$ nemotron embed finetune -c default

# Stage 3: Evaluate base vs. fine-tuned model
$ nemotron embed eval -c default

# Stage 4: Export to ONNX/TensorRT for deployment
$ nemotron embed export -c default

# Stage 5: Deploy NIM with the custom model
$ nemotron embed deploy -c default
```
## Resources

Base Model: Llama-Nemotron-Embed-1B-v2 on HuggingFace

Key Components:

- NeMo Data Designer – Synthetic data generation
- NeMo Automodel – Embedding model training
- BEIR – Information retrieval evaluation
- NeMo Export-Deploy – ONNX/TensorRT export
- NVIDIA NIM – Production inference microservices
## Training Pipeline

| Stage | Command | Description | Output |
|---|---|---|---|
| 0 | `nemotron embed sdg` | Generate synthetic Q&A pairs from documents | Q&A pairs with quality scores |
| 1 | `nemotron embed prep` | Convert, mine hard negatives, unroll | Training-ready data |
| 2 | `nemotron embed finetune` | Fine-tune embedding model | Model checkpoint |
| 3 | `nemotron embed eval` | Evaluate on retrieval metrics | Metrics comparison |
| 4 | `nemotron embed export` | Export to ONNX/TensorRT | Optimized inference models |
| 5 | `nemotron embed deploy` | Deploy NIM with custom model | Running inference service |
## Base Model

| Property | Value |
|---|---|
| Model | `nvidia/llama-nemotron-embed-1b-v2` |
| Parameters | ~1B |
| Embedding Dimension | 2048 |
| Max Sequence Length | 8192 |
| Pooling | Average |
## Why Fine-Tune Embedding Models?

Pre-trained embedding models work well for general-purpose retrieval, but may underperform on specialized domains with unique terminology, document structures, or query patterns. Fine-tuning adapts the model to:

- Understand domain-specific vocabulary and concepts
- Better match the types of queries your users will ask
- Improve retrieval accuracy on your specific document corpus
## Stage Summaries

### Stage 0: Synthetic Data Generation

Validates your document corpus, chunks documents, and uses NVIDIA's hosted LLMs to generate synthetic question-answer pairs with quality scoring. Supports configurable LLM providers and parallel request batching.
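The chunking step can be pictured as a sliding word window with overlap. This is an illustrative sketch, not the pipeline's actual chunker (which lives in NeMo Data Designer); the `chunk_size` and `overlap` values are assumptions:

```python
def chunk_document(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-window chunks.

    Overlap keeps context that straddles a chunk boundary available to both
    neighboring chunks, so generated Q&A pairs are less likely to lose it.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk is then sent to the LLM provider to produce question-answer pairs, which are scored for quality before filtering.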
### Stage 1: Training Data Preparation

Splits data into train/val/test sets, mines hard negatives using the base embedding model, and unrolls multi-hop examples into the format expected by Automodel training.
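Hard-negative mining can be sketched as ranking every non-positive document by embedding similarity to the query and keeping the top k, i.e., the documents most likely to be confused with the true answer. The function names and tiny 2-D vectors below are illustrative, not the recipe's API:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mine_hard_negatives(query_emb: list[float], pos_id: str,
                        doc_embs: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the k non-positive doc ids most similar to the query."""
    scored = [(cosine(query_emb, emb), doc_id)
              for doc_id, emb in doc_embs.items() if doc_id != pos_id]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

In the real pipeline the base embedding model produces these vectors on GPU; the principle is the same at any dimension.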
### Stage 2: Model Fine-Tuning

Fine-tunes the Llama-Nemotron-Embed-1B-v2 model using contrastive learning – queries are trained to be closer to their positive documents than to hard negatives.
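The contrastive objective is typically an InfoNCE-style cross-entropy over one positive and several hard negatives. The scalar sketch below is pure Python, not the actual Automodel training code, and the temperature value is an assumption; it shows how the loss shrinks as the positive similarity pulls ahead of the negatives:

```python
import math

def info_nce_loss(sim_pos: float, sim_negs: list[float],
                  temperature: float = 0.05) -> float:
    """Negative log-softmax probability of the positive document.

    Similarities are scaled by a temperature, then treated as logits over
    [positive, negatives]; minimizing this pushes the query embedding toward
    its positive and away from the hard negatives.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

With one positive and four hard negatives per query (as in this recipe's training groups), each training example is a five-way classification problem.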
### Stage 3: Evaluation

Evaluates both the base and fine-tuned models on standard information retrieval metrics (nDCG, Recall, Precision, MAP) using the BEIR framework on your held-out test set.
### Stage 4: Export

Exports the fine-tuned model to ONNX and optionally TensorRT for optimized inference. Supports FP8 and INT8 quantization.
### Stage 5: Deployment

Deploys the exported model as an NVIDIA NIM container for production inference, with built-in accuracy verification against the original checkpoint.
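One way to picture the accuracy verification: embed the same input with both the original checkpoint and the deployed NIM endpoint and require near-identical directions. The cosine tolerance below is an assumption for illustration, not NIM's actual acceptance criterion:

```python
import math

def embeddings_match(ref: list[float], nim: list[float],
                     min_cosine: float = 0.99) -> bool:
    """True when two embeddings of the same input point in (nearly) the
    same direction, i.e., quantization/export drift is within tolerance."""
    dot = sum(a * b for a, b in zip(ref, nim))
    norm = math.sqrt(sum(a * a for a in ref)) * math.sqrt(sum(b * b for b in nim))
    return dot / norm >= min_cosine
```

Cosine similarity is the right comparison here because retrieval ranks by direction, so a uniformly rescaled embedding still counts as a match.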
## Pipeline Flexibility

Stages run sequentially, but you can start from any stage if you have the required inputs:

| Start From | Requirement | Use Case |
|---|---|---|
| Stage 0 | Document corpus | Full pipeline from scratch |
| Stage 1 | Q&A pairs (JSON) | Skip SDG if you have labeled data |
| Stage 2 | Training data (Automodel format) | Skip data prep if data is ready |
| Stage 3 | Model checkpoint | Evaluate existing checkpoint |
| Stage 4 | Model checkpoint | Export existing model |
| Stage 5 | Exported model (ONNX/TensorRT) | Deploy existing model |
## Preparing Your Corpus

### Default Formats

By default, the pipeline processes `.txt`, `.md`, and files with no extension. You can configure which extensions to include via the `file_extensions` option:

```shell
nemotron embed sdg -c default file_extensions=".txt,.md,.rst,.html"
```

- Documents should be UTF-8 encoded
- Files are processed recursively from the corpus directory
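The default file selection can be pictured as a suffix filter applied recursively over the corpus directory. This is a sketch of the behavior described above, not the recipe's actual loader:

```python
from pathlib import Path

# '' matches files with no extension, mirroring the defaults above.
ALLOWED_EXTENSIONS = frozenset({".txt", ".md", ""})

def is_corpus_file(path: Path, extensions=ALLOWED_EXTENSIONS) -> bool:
    """True when the file's suffix (case-insensitive) is allowed."""
    return path.suffix.lower() in extensions

def collect_corpus(corpus_dir: str, extensions=ALLOWED_EXTENSIONS) -> list[Path]:
    """Recursively gather eligible corpus files, sorted for determinism."""
    return sorted(p for p in Path(corpus_dir).rglob("*")
                  if p.is_file() and is_corpus_file(p, extensions))
```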
### Corpus Size Recommendations

| Corpus Size | Documents | Expected Results |
|---|---|---|
| Minimum | 50–100 docs (~50K tokens) | Basic domain adaptation |
| Recommended | 500+ docs | Good domain coverage |
| Optimal | 1000+ docs | Best performance |
### Document Quality Tips

- Length: Aim for 200–2000 tokens per document
- Content: Ensure documents are representative of your domain
- Diversity: Include various document types and topics
- Quality: Clean, well-formatted text yields better synthetic Q&A pairs
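To sanity-check document lengths against the 200–2000 token guideline before running the pipeline, a rough characters-per-token heuristic is usually close enough; the ~4 characters/token figure is a common English approximation, and the model's real tokenizer will count differently:

```python
def approx_token_count(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def in_recommended_range(text: str, lo: int = 200, hi: int = 2000) -> bool:
    """True when the document is within the recommended token range."""
    return lo <= approx_token_count(text) <= hi
```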
## Execution Options

The embed recipe supports multiple execution modes:

| Mode | Command | Use Case |
|---|---|---|
| Local (default) | | Development, single GPU |
| Docker | | Containerized with GPU passthrough |
| Slurm (attached) | | Cluster, streams logs |
| Slurm (detached) | | Cluster, submits and exits |

All stages also support `--dry-run` to preview execution and `--stage` for interactive debugging on a cluster.

See Execution through NeMo-Run for profile configuration.
## Configuration

Each stage has a `config/` directory with YAML configuration files. Override values on the command line:

```shell
nemotron embed finetune -c default num_epochs=5 learning_rate=2e-5
```
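The `key=value` override style can be pictured as a small parser that coerces booleans and numbers before merging into the stage config. This sketches the pattern only; it is not the recipe's actual implementation:

```python
def parse_overrides(args: list[str]) -> dict:
    """Parse `key=value` CLI overrides, coercing bools, ints, and floats."""
    out = {}
    for arg in args:
        key, _, raw = arg.partition("=")
        if raw.lower() in ("true", "false"):
            out[key] = raw.lower() == "true"
        else:
            try:
                out[key] = int(raw)
            except ValueError:
                try:
                    out[key] = float(raw)  # handles scientific notation like 2e-5
                except ValueError:
                    out[key] = raw  # fall back to a plain string
    return out
```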
### Key Options by Stage

Stage 0 (SDG):

| Option | Default | Description |
|---|---|---|
| `corpus_dir` | | Path to your documents (sample auto-downloaded from HuggingFace) |
| `file_extensions` | | File types to process |
| | | LLM for document extraction |
| | | Parallel API requests |

Stage 1 (Data Prep):

| Option | Default | Description |
|---|---|---|
| `quality_threshold` | 7.0 | Minimum Q&A quality score (0–10) |
| | | Hard negatives per query |
| | | Training data split |

Stage 2 (Finetune):

| Option | Default | Description |
|---|---|---|
| `num_epochs` | 3 | Training epochs |
| `global_batch_size` | | Auto-scaled down for small datasets |
| `learning_rate` | 1e-5 | Learning rate |
| | | 1 positive + 4 hard negatives |

Stage 3 (Eval):

| Option | Default | Description |
|---|---|---|
| | | K values for Recall@k, nDCG@k |
| `eval_base` | | Evaluate base model |
| | | Evaluate fine-tuned model |
| `eval_nim` | | Evaluate NIM endpoint |

Stage 4 (Export):

| Option | Default | Description |
|---|---|---|
| `export_to_trt` | | Also export to TensorRT |
| | | Quantization mode |

Stage 5 (Deploy):

| Option | Default | Description |
|---|---|---|
| | | NIM container image |
| `host_port` | | Port for NIM API |
| | | Run in background |
## Hardware Requirements

| Stage | GPU VRAM | CPU | Notes |
|---|---|---|---|
| Stage 0 (SDG) | N/A | 8+ cores | Uses API (no local GPU) |
| Stage 1 (Data Prep) | 40GB | 16+ cores | Hard negative mining on GPU |
| Stage 2 (Finetune) | 80GB | 16+ cores | Contrastive training |
| Stage 3 (Eval) | 40GB | 8+ cores | Evaluation metrics computation |
| Stage 4 (Export) | 40GB | 8+ cores | TensorRT export requires NGC container |
| Stage 5 (Deploy) | 40GB | 4+ cores | NIM container initialization |

Total disk space: ~50GB for outputs, model checkpoints, and containers.
## Evaluation Metrics

The evaluation stage computes standard information retrieval metrics using the BEIR framework:

| Metric | Description | Range |
|---|---|---|
| nDCG@k | Normalized Discounted Cumulative Gain (ranking quality) | 0.0–1.0 |
| Recall@k | Fraction of relevant documents in top-k results | 0.0–1.0 |
| Precision@k | Fraction of retrieved documents that are relevant | 0.0–1.0 |
| MAP@k | Mean Average Precision | 0.0–1.0 |
Good fine-tuning results typically show nDCG@10 and Recall@10 improvement of 15%+ over the base model.
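For intuition, binary-relevance nDCG@k and Recall@k can be computed in a few lines. This is a sketch consistent with the standard definitions for binary relevance judgments, not BEIR's own code:

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """DCG of the ranking divided by the ideal DCG (binary relevance)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of all relevant documents retrieved in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0
```

A relevant document at rank 1 scores nDCG of 1.0; the same document at rank 2 scores 1/log2(3), which is why nDCG rewards ranking quality rather than mere retrieval.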
### Interpreting Results

- No improvement: You may need more training data or higher-quality Q&A pairs
- Worse performance: Check for data quality issues or revisit training hyperparameters
- Overfitting: Good training metrics but poor validation metrics
## Output Structure

```
output/embed/
├── stage0_sdg/                              # Synthetic Q&A pairs
├── stage1_data_prep/                        # Training-ready data
│   ├── train_mined.automodel_unrolled.json  # Final training file
│   ├── eval_beir/                           # BEIR-format evaluation data
│   └── corpus/                              # Document corpus
├── stage2_finetune/                         # Model checkpoints
│   └── checkpoints/LATEST/model/consolidated/
├── stage3_eval/                             # Evaluation results
│   └── eval_results.json
└── stage4_export/                           # Exported models
    ├── onnx/                                # ONNX model files
    └── tensorrt/                            # TensorRT engine
```
## Troubleshooting

- NVIDIA_API_KEY not set: Set your API key with `export NVIDIA_API_KEY=nvapi-your_key_here`.
- CUDA out of memory during training: Reduce the batch size (`global_batch_size=64`) or use gradient accumulation.
- nvJitLink or CUDA symbol errors: Clear LD_LIBRARY_PATH with `export LD_LIBRARY_PATH=""`.
- HybridCache import errors: Clear the HuggingFace cache with `rm -rf ~/.cache/huggingface/modules/transformers_modules/nvidia/`.
- No valid Q&A pairs after filtering: Lower `quality_threshold` (default: 7.0) or check SDG output quality.
- TensorRT export fails: Make sure you are using an NGC container with TensorRT (nemo:25.07+), or try ONNX-only export: `export_to_trt=false`.
- NIM container fails to start: Verify NGC credentials (`docker login nvcr.io`), check port availability, or try a different port (`host_port=8002`).
- NIM accuracy differs from checkpoint: Ensure the same model format (TensorRT vs. ONNX), check quantization settings, and verify the model files are complete.
## Best Practices

### Data Quality

- Start with a small corpus to test the pipeline, then scale up
- Use clean, well-formatted documents representative of your target domain
- Include diverse document types and topics

### Training

- Start with the default hyperparameters (3 epochs, LR 1e-5, batch size auto-scaled)
- Monitor validation metrics to avoid overfitting
- Key parameters to tune: epochs, learning rate, and warmup steps

### Evaluation

- Always compare against the base model
- Test on a held-out test set (not used in training)
- Consider multiple metrics (nDCG, Recall, Precision)
### Deployment

- Test exported models before production use
- Verify NIM accuracy matches the checkpoint: `nemotron embed eval -c default eval_nim=true eval_base=false`
- Monitor inference latency and throughput
## CLI Reference

```shell
# Show available commands
$ nemotron embed --help

# Display the workflow overview
$ nemotron embed info

# Preview any command without executing it
$ nemotron embed finetune -c default --dry-run
```
## Further Reading

- NeMo Data Designer – Synthetic data generation framework
- NeMo Automodel – Model training framework
- BEIR Benchmark – Information retrieval evaluation
- NVIDIA NIM – Production inference microservices
- Llama-Nemotron-Embed-1B-v2 Model Card – Base model details
- Execution through NeMo-Run – Cluster and Docker execution profiles