Nv-DINOv2#
Introduction#
DINOv2 (self-DIstillation with NO labels, version 2) is a self-supervised learning method for training vision transformers without labeled data. It builds on the original DINO framework with architectural and training improvements that yield more robust, transferable, and semantically rich feature representations.
At its core, DINOv2 uses a self-distillation strategy, where a student model learns from a momentum-updated teacher model. The two models receive different augmented views of the same image, and the student is trained to align its representations with those of the teacher. This encourages the model to develop a consistent understanding of the underlying image content, independent of transformations such as cropping, scaling, or color jitter.
DINOv2 has proven effective in capturing high-level semantic features, and has demonstrated strong performance across a wide range of downstream tasks without the need for task-specific supervision.
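The self-distillation idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the NV-DINOv2 implementation: the function names are hypothetical, the student temperature of 0.1 is an assumption, and the teacher output centering used in practice is omitted. The teacher temperature (0.07) and momentum (0.994) are taken from the example spec file later on this page.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.07):
    """Cross-entropy between the teacher and student distributions.

    student_temp=0.1 is an assumption of this sketch; teacher_temp=0.07
    matches teacher_temperature.val_base in the example spec.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.994):
    """Momentum (exponential moving average) update of the teacher weights."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

The loss is small when the student ranks the prototypes the same way as the teacher and grows when the two views disagree, while the teacher only drifts slowly toward the student through the momentum update.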
Benefits#
No labels required: Enables training on large, unlabeled image datasets with minimal human intervention.
High-quality features: Learns semantically rich, transferable representations suitable for tasks like classification, detection, and segmentation.
Robust to augmentation: The use of multiple views during training enhances generalization and invariance to common image transformations.
Data Input for Nv-DINOv2#
Nv-DINOv2 expects input data to be RGB images stored in a single directory. Supported image formats include: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, and .webp.
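Before launching a job, it can help to confirm that the image directory contains only files the loader is expected to accept. The following is a minimal sketch using the extension list above; the helper name `list_supported_images` is hypothetical, matching extensions case-insensitively is an assumption, and the sketch does not check that the images are actually RGB.

```python
from pathlib import Path

# Extensions listed in the data input requirements above.
SUPPORTED_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp",
}


def list_supported_images(images_dir):
    """Return the sorted files directly inside images_dir with a supported extension."""
    root = Path(images_dir)
    return sorted(
        path for path in root.iterdir()
        # Case-insensitive matching is an assumption of this sketch.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `list_supported_images("/path/to/img_dir")` returns the files that match, so an empty result is an early sign that `images_dir` points at the wrong location.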
Creating a Training Experiment Spec File#
Configuring a Custom Dataset#
This section provides example configuration and commands for training Nv-DINOv2 using the dataset format described above.
SPECS=$(tao-client nvdinov2 get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Here is an example spec file for training a ViT-Large NV-DINOv2 model:
encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: False
    disable_masking: False
    pretrained_non_distill_pl_model_path: null
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: /path/to/img_dir
  test_dataset:
    images_dir: "???"
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
train:
  resume_training_checkpoint_path: null
  pretrained_model_path: /path/to/pretrained/model.pth
  num_nodes: 1
  num_gpus: 1
  num_epochs: 10
  checkpoint_interval: 1
  layerwise_decay: 1.0
  clip_grad_norm: 3.0
  optim:
    optim: "adamw"
  schedulers:
    learning_rate:
      val_base: "${eval: '2e-4 * (${dataset.batch_size} * ${train.num_gpus} * ${train.num_nodes} / 1024) ** (1/2)'}"
      val_final: 1e-6
      warm_up_steps: 100000
    last_layer_learning_rate:
      val_base: "${eval: '2e-4 * (${dataset.batch_size} * ${train.num_gpus} * ${train.num_nodes} / 1024) ** (1/2)'}"
      val_final: 1e-6
      warm_up_steps: 100000
      freeze_steps: 1250
    weight_decay:
      val_base: 0.04
      val_final: 0.2
    momentum:
      val_base: 0.994
      val_final: 1.0
    teacher_temperature:
      val_base: 0.07
      val_final: 0.07
      val_start: 0.04
      warm_up_steps: 37500
  num_prototypes: 131072
  results_dir: "${results_dir}/train"
inference:
  checkpoint: "???"
export:
  gpu_id: 0
  checkpoint: "???"
  onnx_file: "???"
  input_width: 518
  input_height: 518
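The `val_base` expression in the spec scales a reference learning rate of 2e-4 by the square root of the total batch size divided by 1024. The sketch below reproduces that arithmetic so the resolved value can be checked by hand; the function name and its keyword arguments are hypothetical, and treating `batch_size` as a per-GPU value is an assumption suggested by the expression itself.

```python
def scaled_learning_rate(batch_size, num_gpus, num_nodes,
                         reference_lr=2e-4, reference_batch=1024):
    """Square-root scaling, mirroring the val_base expression in the spec."""
    total_batch = batch_size * num_gpus * num_nodes
    return reference_lr * (total_batch / reference_batch) ** 0.5


# Values from the example spec: batch_size=16, num_gpus=1, num_nodes=1.
# 16 / 1024 = 1/64, and its square root is 1/8, so the result is 2e-4 / 8 = 2.5e-5.
single_gpu_lr = scaled_learning_rate(16, 1, 1)

# With 8 GPUs on each of 2 nodes the total batch is 256, giving 2e-4 / 2 = 1e-4.
multi_node_lr = scaled_learning_rate(16, 8, 2)
```

Because the scaling follows the square root, quadrupling the total batch size doubles the base learning rate.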
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
model | dict config | – | The configuration of the model architecture. | |
dataset | dict config | – | The configuration of the dataset. | |
train | dict config | – | The configuration of the training task. | |
inference | dict config | – | The configuration of the inference task. | |
encryption_key | string | None | The encryption key to encrypt and decrypt model files. | |
results_dir | string | /results | The directory where experiment results are saved. | |
export | dict config | – | The configuration of the ONNX export task. | |
train#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
pretrained_model_path | string | None | Path to a pretrained NVDINOv2 model to initialize training. | |
layerwise_decay | float | 1.0 | Layerwise learning rate decay factor. | (0, 1] |
clip_grad_norm | float | 3.0 | Maximum gradient norm for gradient clipping. | > 0 |
num_prototypes | int | 131072 | Number of prototypes used in the training. | > 0 |
precision | string | 16-mixed | Mixed precision setting for training. | |
use_custom_attention | bool | True | If set to | True, False |
schedulers | dict config | – | Configuration for learning rate schedulers. | |
optim | dict config | – | Configuration for the optimizer. | |
optim.optim | string | adamw | Optimizer type. | adamw |
schedulers#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
learning_rate | dict config | – | Learning rate scheduler configuration. | |
last_layer_learning_rate | dict config | – | Last layer learning rate scheduler configuration. | |
weight_decay | dict config | – | Weight decay scheduler configuration. | |
momentum | dict config | – | Momentum scheduler configuration. | |
teacher_temperature | dict config | – | Teacher temperature scheduler configuration. | |
learning_rate#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
val_base | float | 7.07e-6 | The value after warm-up for scheduler. | > 0 |
val_final | float | 1e-6 | Final value for scheduler. | > 0 |
val_start | float | 0.0 | Starting value for scheduler. | >= 0 |
warm_up_steps | int | 100000 | Number of warm-up steps. | >= 0 |
max_decay_steps | int | 2500000 | Maximum number of steps over which the value decays. | > 0 |
last_layer_learning_rate#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
val_base | float | 7.07e-6 | The value after warm-up for scheduler. | > 0 |
val_final | float | 1e-6 | Final value for scheduler. | > 0 |
val_start | float | 0.0 | Starting value for scheduler. | >= 0 |
warm_up_steps | int | 100000 | Number of warm-up steps. | >= 0 |
max_decay_steps | int | 2500000 | Maximum number of steps over which the value decays. | > 0 |
freeze_steps | int | 1250 | Number of initial steps during which the layer is frozen. | >= 0 |
weight_decay#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
val_base | float | 0.04 | The value after warm-up for scheduler. | > 0 |
val_final | float | 0.2 | Final value for scheduler. | > 0 |
val_start | float | 0.0 | Starting value for scheduler. | >= 0 |
warm_up_steps | int | 0 | Number of warm-up steps. | >= 0 |
max_decay_steps | int | 2500000 | Maximum number of steps over which the value decays. | > 0 |
momentum#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
val_base | float | 0.994 | The value after warm-up for scheduler. | > 0 |
val_final | float | 1.0 | Final value for scheduler. | > 0 |
val_start | float | 0.0 | Starting value for scheduler. | >= 0 |
warm_up_steps | int | 0 | Number of warm-up steps. | >= 0 |
max_decay_steps | int | 2500000 | Maximum number of steps over which the value decays. | > 0 |
teacher_temperature#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
val_base | float | 0.07 | The value after warm-up for scheduler. | > 0 |
val_final | float | 0.07 | Final value for scheduler. | > 0 |
val_start | float | 0.04 | Starting value for scheduler. | >= 0 |
warm_up_steps | int | 37500 | Number of warm-up steps. | >= 0 |
max_decay_steps | int | 37500 | Maximum number of steps over which the value decays. | > 0 |
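To make the interaction of `val_start`, `val_base`, `val_final`, `warm_up_steps`, and `max_decay_steps` concrete, here is a minimal sketch of one plausible schedule. It assumes a linear warm-up followed by a cosine decay, and it assumes that `max_decay_steps` is counted from step 0 (so the decay spans `max_decay_steps - warm_up_steps` steps). Neither assumption is confirmed by the tables above, the function name is hypothetical, and `freeze_steps` is not modeled.

```python
import math


def scheduled_value(step, val_base, val_final, val_start=0.0,
                    warm_up_steps=0, max_decay_steps=2500000):
    """Linear warm-up to val_base, then cosine decay to val_final (assumed shapes)."""
    if step < warm_up_steps:
        # Linear interpolation from val_start to val_base during warm-up.
        return val_start + (val_base - val_start) * step / warm_up_steps
    # Assumption: max_decay_steps includes the warm-up steps.
    decay_steps = max(max_decay_steps - warm_up_steps, 1)
    progress = min((step - warm_up_steps) / decay_steps, 1.0)
    return val_final + 0.5 * (val_base - val_final) * (1.0 + math.cos(math.pi * progress))


# Teacher temperature defaults: warms up from 0.04 to 0.07, then stays at 0.07.
halfway = scheduled_value(18750, val_base=0.07, val_final=0.07,
                          val_start=0.04, warm_up_steps=37500,
                          max_decay_steps=37500)
```

Under these assumptions the weight decay defaults (no warm-up, `val_base` 0.04, `val_final` 0.2) start at 0.04 and rise to 0.2 by step 2500000, which shows that "decay" here simply means a transition from the base value to the final value.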
model#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
distill | dict config | – | Configuration for the NVDINOv2 distillation module. | |
backbone | dict config | – | Configuration for the NVDINOv2 backbone. | |
head | dict config | – | Configuration for the NVDINOv2 projection and prediction head. | |
distill#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
enable | bool | False | If set to | True, False |
disable_masking | bool | False | If set to | True, False |
pretrained_non_distill_pl_model_path | string | None | Path to a non-distilled SSL-trained DINOv2 model used to initialize the teacher model. | |
backbone#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
teacher_type | string | vit_l | The teacher backbone type. | vit_l, vit_b, vit_s |
student_type | string | vit_l | The student backbone type. | vit_l, vit_b, vit_s |
num_register_tokens | int | 0 | Number of register tokens. | ≥ 0 |
drop_path_rate | float | 0.4 | Drop path rate for stochastic depth regularization. | [0, 1) |
patch_size | int | 14 | Size of input patches. | 14, 16 |
img_size | int | 518 | Input image size used in the backbone. | 224, 518 |
head#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
num_layers | int | 3 | Number of layers in the NVDINOv2 head. | ≥ 1 |
hidden_dim | int | 2048 | Dimension of the hidden layers in the NVDINOv2 head. | ≥ 1 |
bottleneck_dim | int | 384 | Dimension of the bottleneck layer in the NVDINOv2 head. | ≥ 1 |
dataset#
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
train_dataset | dict config | None | Configuration for the training dataset. | |
train_dataset.images_dir | string | /data | Path to images directory for training dataset. | |
test_dataset | dict config | None | Configuration for the testing dataset. | |
test_dataset.images_dir | string | /data | Path to images directory for testing dataset. | |
batch_size | int | 4 | The batch size for training. | ≥ 1 |
pin_memory | bool | True | If set to | True, False |
workers | int | 8 | Number of parallel workers used in data loading. | ≥ 1 |
transform | dict config | – | Configuration parameters for data transformation. | |
transform.n_global_crops | int | 2 | Number of global crops to generate. | ≥ 1 |
transform.global_crops_scale | list[float] | [0.32, 1.0] | Scale range for global crops. | Range: (0.0, 1.0] |
transform.global_crops_size | int | 224 | Size (in pixels) of global crops. | ≥ 1 |
transform.n_local_crops | int | 8 | Number of local crops to generate. | ≥ 1 |
transform.local_crops_scale | list[float] | [0.05, 0.32] | Scale range for local crops. | Range: (0.0, 1.0] |
transform.local_crops_size | int | 98 | Size (in pixels) of local crops. | ≥ 1 |
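The crop settings determine how many patch tokens the backbone processes for each image. The following sketch works through that arithmetic for the default transform values and a patch size of 14. The helper name is hypothetical, class and register tokens are not counted, and rejecting crop sizes that are not a multiple of the patch size is an assumption of this sketch rather than documented behavior.

```python
def num_patches(crop_size, patch_size=14):
    """Number of non-overlapping patch tokens in a square crop."""
    if crop_size % patch_size != 0:
        # Assumption: crop sizes are expected to be multiples of the patch size.
        raise ValueError(f"{crop_size} is not a multiple of {patch_size}")
    side = crop_size // patch_size
    return side * side


# Defaults from the transform configuration above.
global_tokens = num_patches(224)  # 224 / 14 = 16 per side, 16 * 16 = 256
local_tokens = num_patches(98)    # 98 / 14 = 7 per side, 7 * 7 = 49

# 2 global crops and 8 local crops per image: 2 * 256 + 8 * 49 = 904 patch tokens.
tokens_per_image = 2 * global_tokens + 8 * local_tokens
```

Local crops are much cheaper than global crops, which is why a training step can afford eight of them per image.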
Example Spec File for Distillation from ViT Large to ViT Base#
encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: True
    disable_masking: False
    pretrained_non_distill_pl_model_path: /path/to/pretrained/model.pth
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_b"
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: /path/to/img_dir
  test_dataset:
    images_dir: "???"
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
train:
  resume_training_checkpoint_path: null
  pretrained_model_path: /path/to/pretrained/model.pth
  num_nodes: 1
  num_gpus: 1
  num_epochs: 10
  checkpoint_interval: 1
  layerwise_decay: 1.0
  clip_grad_norm: 3.0
  optim:
    optim: "adamw"
  schedulers:
    learning_rate:
      val_base: "${eval: '2e-4 * (${dataset.batch_size} * ${train.num_gpus} * ${train.num_nodes} / 1024) ** (1/2)'}"
      val_final: 1e-6
      warm_up_steps: 100000
    last_layer_learning_rate:
      val_base: "${eval: '2e-4 * (${dataset.batch_size} * ${train.num_gpus} * ${train.num_nodes} / 1024) ** (1/2)'}"
      val_final: 1e-6
      warm_up_steps: 100000
      freeze_steps: 1250
    weight_decay:
      val_base: 0.04
      val_final: 0.2
    momentum:
      val_base: 0.994
      val_final: 1.0
    teacher_temperature:
      val_base: 0.07
      val_final: 0.07
      val_start: 0.04
      warm_up_steps: 37500
  num_prototypes: 131072
  results_dir: "${results_dir}/train"
inference:
  checkpoint: "???"
export:
  gpu_id: 0
  checkpoint: "???"
  onnx_file: "???"
  input_width: 518
  input_height: 518
Training the Model#
Use the following command to run NV-DINOv2 training:
TRAIN_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Multi-Node Training with FTMS
Distributed training is supported through FTMS. For large models, training on a multi-node cluster can significantly reduce training time.
Verify that your cluster has multiple GPU-enabled nodes available for training by running the following command:
kubectl get nodes -o wide
You should see multiple nodes listed. If you do not, contact your cluster administrator to have more nodes added to your cluster.
To run a multi-node training job through FTMS, set the following fields in the training job spec. The num_gpus field is the number of GPUs per node, and the num_nodes field is the number of nodes to use for training:
{
  "train": {
    "num_gpus": 8,
    "num_nodes": 2
  }
}
If these fields are not specified, the defaults of 1 GPU per node and 1 node are used.
Note
The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster.
The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
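The two constraints in the note can be checked before a job is submitted. This is a minimal sketch with a hypothetical helper; the cluster capacity values (`gpus_per_node`, `nodes_in_cluster`) are inputs you would supply yourself, for example after inspecting the output of the `kubectl get nodes -o wide` command.

```python
def total_gpus_for_job(num_gpus, num_nodes, gpus_per_node, nodes_in_cluster):
    """Validate the multi-node settings and return the total GPU count of the job."""
    if num_gpus > gpus_per_node:
        raise ValueError(
            f"num_gpus={num_gpus} exceeds the {gpus_per_node} GPUs available per node"
        )
    if num_nodes > nodes_in_cluster:
        raise ValueError(
            f"num_nodes={num_nodes} exceeds the {nodes_in_cluster} nodes in the cluster"
        )
    return num_gpus * num_nodes


# The example above (8 GPUs per node on 2 nodes) uses 16 GPUs in total.
job_gpus = total_gpus_for_job(num_gpus=8, num_nodes=2,
                              gpus_per_node=8, nodes_in_cluster=4)
```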
tao model nvdinov2 train [-h] -e <experiment_spec>
                         [results_dir=<global_results_dir>]
                         [model.<model_option>=<model_option_value>]
                         [dataset.<dataset_option>=<dataset_option_value>]
                         [train.<train_option>=<train_option_value>]
                         [train.gpu_ids=<gpu indices>]
                         [train.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required:
-e, --experiment_spec_file: The path to the experiment spec file.
Optional Arguments
You can set optional arguments to override the option values in the experiment spec file.
-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
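The `key=value` override syntax shown above is easy to assemble programmatically. The following sketch builds the argument list for the train command from a dictionary of overrides; the helper name and the example paths are hypothetical, and the sketch only constructs the command line without running it.

```python
import shlex


def build_train_command(spec_path, overrides=None):
    """Assemble the train command with optional key=value overrides."""
    command = ["tao", "model", "nvdinov2", "train", "-e", spec_path]
    for key, value in (overrides or {}).items():
        command.append(f"{key}={value}")
    return command


command = build_train_command(
    "/path/to/spec.yaml",
    {"results_dir": "/path/to/results", "train.num_gpus": 2},
)
# Join the arguments into a shell-safe string for logging or copy and paste.
command_line = shlex.join(command)
```

Keeping the command as a list until the last moment avoids quoting mistakes if a path contains spaces.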
Distilling the Model#
To distill the model, use the following command:
DISTILL_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action distill --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model nvdinov2 distill [-h] -e <experiment_spec>
                           [results_dir=<global_results_dir>]
                           [model.<model_option>=<model_option_value>]
                           [dataset.<dataset_option>=<dataset_option_value>]
                           [train.<train_option>=<train_option_value>]
                           [train.gpu_ids=<gpu indices>]
                           [train.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required:
-e, --experiment_spec_file: The path to the experiment spec file.
Optional Arguments
You can set optional arguments to override the option values in the experiment spec file.
-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
Running Inference on the Model#
Creating an Inference Experiment Spec File#
Here is an example spec file for running inference with a ViT-Large NV-DINOv2 model:
encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: False
    disable_masking: False
    pretrained_non_distill_pl_model_path: null
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: "???"
  test_dataset:
    images_dir: /path/to/img_dir
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
inference:
  checkpoint: /path/to/model.pth
  results_dir: /path/to/experiment_results/inference
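Several fields in the example specs are set to "???". In OmegaConf-style configuration this string marks a mandatory value that has not been filled in; that the spec is resolved this way here is an assumption suggested by the `${...}` interpolation syntax used elsewhere on this page. The sketch below walks a spec that has already been loaded into nested dictionaries and lists the fields still left at "???". The helper name is hypothetical, and the sketch makes no claim about which of those fields a given action actually reads.

```python
MISSING = "???"


def find_missing_fields(spec, prefix=""):
    """Return the dotted paths of all fields whose value is the '???' marker."""
    missing = []
    for key, value in spec.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(find_missing_fields(value, path))
        elif value == MISSING:
            missing.append(path)
    return missing


# A fragment of the inference spec above, written as a nested dictionary.
spec_fragment = {
    "dataset": {
        "train_dataset": {"images_dir": "???"},
        "test_dataset": {"images_dir": "/path/to/img_dir"},
    },
    "inference": {"checkpoint": "/path/to/model.pth"},
}
unfilled = find_missing_fields(spec_fragment)
```

For the fragment shown, `unfilled` contains only `dataset.train_dataset.images_dir`.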
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
checkpoint | string | | The path to the PyTorch model to run inference or evaluation on. | |
trt_engine | string | | The path to the TensorRT model to run inference or evaluation on. Use this only with TAO Deploy. | |
num_gpus | unsigned int | 1 | The number of GPUs to use. | >0 |
gpu_ids | List[int] | [0] | The GPU IDs to use. | |
results_dir | string | | The path to a folder where the experiment outputs should be written. | |
batch_size | unsigned int | | The batch size for inference or evaluation. | |
Use the following command to run inference on NV-DINOv2:
INFERENCE_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model nvdinov2 inference [-h] -e <experiment_spec_file>
                             inference.checkpoint=<inference model>
                             [inference.<inference_option>=<inference_option_value>]
                             [inference.gpu_ids=<gpu indices>]
                             [inference.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required:
-e, --experiment_spec_file: The experiment spec file to set up the inference experiment.
inference.checkpoint: The .pth model to run inference on.
Optional Arguments
The following arguments are optional:
inference.<inference_option>: The inference options.
Exporting the Model#
Creating an Export Experiment Spec File#
Here is an example spec file for exporting a ViT-Large NV-DINOv2 model:
encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: False
    disable_masking: False
    pretrained_non_distill_pl_model_path: null
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: "???"
  test_dataset:
    images_dir: /path/to/img_dir
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
export:
  gpu_id: 0
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  input_width: 518
  input_height: 518
Parameter | Data Type | Default | Description | Supported Values |
---|---|---|---|---|
checkpoint | string | | The path to the PyTorch model to export. | |
onnx_file | string | | The path where the .onnx model is saved. | |
opset_version | unsigned int | 12 | The opset version of the exported ONNX. | >0 |
input_channel | unsigned int | 3 | The input channel size. Only the value 3 is supported. | 3 |
input_width | unsigned int | 128 | The input width. | >0 |
input_height | unsigned int | 512 | The input height. | >0 |
batch_size | unsigned int | -1 | The batch size of the ONNX model. If this value is set to -1, the export uses dynamic batch size. | >= -1 |
gpu_id | unsigned int | 0 | The GPU ID to use. | |
on_cpu | bool | False | If set to | |
verbose | bool | False | If set to | |
Use the following command to export the model:
EXPORT_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model nvdinov2 export [-h] -e <experiment_spec>
                          export.checkpoint=<model to export>
                          export.onnx_file=<onnx path>
                          [export.<export_option>=<export_option_value>]
Required Arguments
The following arguments are required:
-e, --experiment_spec: The path to an experiment spec file.
export.checkpoint: The .pth model to export.
export.onnx_file: The path where the .etlt or .onnx model is saved.
Optional Arguments
The following arguments are optional:
export.<export_option>: The export options.