Nv-DINOv2#

Introduction#

DINOv2 (Distillation with No Labels v2) is an advanced self-supervised learning method designed to train vision transformers without relying on labeled data. It builds on the original DINO framework by introducing architectural and training improvements that lead to more robust, transferable, and semantically rich feature representations.

At its core, DINOv2 uses a self-distillation strategy, where a student model learns from a momentum-updated teacher model. The two models receive different augmented views of the same image, and the student is trained to align its representations with those of the teacher. This encourages the model to develop a consistent understanding of the underlying image content, independent of transformations such as cropping, scaling, or color jitter.
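
To make the mechanism concrete, here is a minimal PyTorch-style sketch of the student/teacher update. It is purely illustrative and is not the Nv-DINOv2 implementation; the function names and temperature values are assumptions for the sketch.

import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float) -> None:
    # The teacher is a momentum (EMA) copy of the student; it receives no gradients.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def self_distillation_loss(student_logits, teacher_logits,
                           student_temp: float = 0.1, teacher_temp: float = 0.07):
    # The student is trained to match the sharpened teacher distribution
    # computed from a different augmented view of the same image.
    teacher_probs = torch.softmax(teacher_logits / teacher_temp, dim=-1).detach()
    student_log_probs = torch.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()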

DINOv2 has proven effective in capturing high-level semantic features, and has demonstrated strong performance across a wide range of downstream tasks without the need for task-specific supervision.

Benefits#

  • No labels required: Enables training on large, unlabeled image datasets with minimal human intervention.

  • High-quality features: Learns semantic-rich, transferable representations suitable for tasks like classification, detection, and segmentation.

  • Robust to augmentation: The use of multiple views during training enhances generalization and invariance to common image transformations.

Data Input for Nv-DINOv2#

Nv-DINOv2 expects input data to be RGB images stored in a single directory. Supported image formats include: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, and .webp.
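
Before training, it can be useful to confirm that a dataset directory contains only supported formats. The helper below is a hypothetical convenience script, not part of the toolkit:

from pathlib import Path

# Extensions accepted by Nv-DINOv2, per the list above.
SUPPORTED = {".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp"}

def check_dataset(images_dir: str) -> None:
    files = [p for p in Path(images_dir).iterdir() if p.is_file()]
    unsupported = [p.name for p in files if p.suffix.lower() not in SUPPORTED]
    print(f"{len(files) - len(unsupported)} supported images in {images_dir}")
    if unsupported:
        print(f"Files with unsupported extensions: {unsupported[:10]}")

check_dataset("/data")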

Creating a Training Experiment Spec File#

Configuring a Custom Dataset#

This section provides an example configuration and commands for training Nv-DINOv2 using the dataset format described above.

SPECS=$(tao-client nvdinov2 get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| model | dict config | | The configuration of the model architecture. | |
| dataset | dict config | | The configuration of the dataset. | |
| train | dict config | | The configuration of the training task. | |
| inference | dict config | | The configuration of the inference task. | |
| encryption_key | string | None | The encryption key to encrypt and decrypt model files. | |
| results_dir | string | /results | The directory where experiment results are saved. | |
| export | dict config | | The configuration of the ONNX export task. | |

train#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| pretrained_model_path | string | None | Path to a pretrained NVDINOv2 model used to initialize training. | |
| layerwise_decay | float | 1.0 | Layerwise learning rate decay factor. | (0, 1] |
| clip_grad_norm | float | 3.0 | Maximum gradient norm for gradient clipping. | > 0 |
| num_prototypes | int | 131072 | Number of prototypes used in training. | > 0 |
| precision | string | 16-mixed | Mixed-precision setting for training. | |
| use_custom_attention | bool | True | If set to True, uses a memory-efficient attention mechanism. | True, False |
| schedulers | dict config | | Configuration for learning rate and related schedulers. | |
| optim | dict config | | Configuration for the optimizer. Contains optim (string, default adamw): the optimizer type. | adamw |

schedulers#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| learning_rate | dict config | | Learning rate scheduler configuration. | |
| last_layer_learning_rate | dict config | | Last-layer learning rate scheduler configuration. | |
| weight_decay | dict config | | Weight decay scheduler configuration. | |
| momentum | dict config | | Momentum scheduler configuration. | |
| teacher_temperature | dict config | | Teacher temperature scheduler configuration. | |

learning_rate#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 7.07e-6 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 1e-6 | The final value of the schedule. | > 0 |
| val_start | float | 0.0 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 100000 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 2500000 | The maximum number of steps over which the value decays. | > 0 |
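
All of the scheduler configs in this section (learning_rate, last_layer_learning_rate, weight_decay, momentum, and teacher_temperature) share this schema: the value ramps from val_start to val_base over warm_up_steps, then decays toward val_final by max_decay_steps. The tables do not specify the decay shape; the sketch below assumes linear warm-up followed by cosine decay, the usual DINOv2-style convention, and is illustrative only.

import math

def scheduled_value(step, val_start, val_base, val_final, warm_up_steps, max_decay_steps):
    if step < warm_up_steps:
        # Linear warm-up from val_start to val_base.
        return val_start + (val_base - val_start) * step / warm_up_steps
    # Assumed cosine decay from val_base to val_final after warm-up.
    t = min((step - warm_up_steps) / max(max_decay_steps - warm_up_steps, 1), 1.0)
    return val_final + 0.5 * (val_base - val_final) * (1.0 + math.cos(math.pi * t))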

last_layer_learning_rate#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 7.07e-6 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 1e-6 | The final value of the schedule. | > 0 |
| val_start | float | 0.0 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 100000 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 2500000 | The maximum number of steps over which the value decays. | > 0 |
| freeze_steps | int | 1250 | The number of initial steps during which the last layer is frozen. | >= 0 |

weight_decay#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 0.04 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 0.2 | The final value of the schedule. | > 0 |
| val_start | float | 0.0 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 0 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 2500000 | The maximum number of steps over which the value decays. | > 0 |

momentum#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 0.994 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 1.0 | The final value of the schedule. | > 0 |
| val_start | float | 0.0 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 0 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 2500000 | The maximum number of steps over which the value decays. | > 0 |

teacher_temperature#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 0.07 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 0.07 | The final value of the schedule. | > 0 |
| val_start | float | 0.04 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 37500 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 37500 | The maximum number of steps over which the value decays. | > 0 |

model#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| distill | dict config | | Configuration for the NVDINOv2 distillation module. | |
| backbone | dict config | | Configuration for the NVDINOv2 backbone. | |
| head | dict config | | Configuration for the NVDINOv2 projection and prediction head. | |

distill#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| enable | bool | False | If set to True, runs distillation. | True, False |
| disable_masking | bool | False | If set to True, disables masking when distillation is enabled. | True, False |
| pretrained_non_distill_pl_model_path | string | None | Path to a non-distilled SSL-trained DINOv2 model used to initialize the teacher model. | |

backbone#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| teacher_type | string | vit_l | The teacher backbone type. | vit_l, vit_b, vit_s |
| student_type | string | vit_l | The student backbone type. | vit_l, vit_b, vit_s |
| num_register_tokens | int | 0 | The number of register tokens. | >= 0 |
| drop_path_rate | float | 0.4 | Drop path rate for stochastic depth regularization. | [0, 1) |
| patch_size | int | 14 | The size of input patches. | 14, 16 |
| img_size | int | 518 | The input image size used in the backbone. | 224, 518 |
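
As a quick sanity check on how img_size and patch_size interact: the backbone splits the image into (img_size / patch_size)^2 patch tokens, plus one class token and any register tokens. With the defaults above:

img_size, patch_size, num_register_tokens = 518, 14, 0  # defaults from the table above
patches_per_side = img_size // patch_size               # 518 // 14 = 37
num_tokens = patches_per_side ** 2 + 1 + num_register_tokens
print(num_tokens)  # 1369 patch tokens + 1 class token = 1370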

dataset#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| train_dataset | dict config | | Configuration for the training dataset. Contains images_dir (string, default /data): the path to the images directory for the training dataset. | |
| test_dataset | dict config | | Configuration for the testing dataset. Contains images_dir (string, default /data): the path to the images directory for the testing dataset. | |
| batch_size | int | 4 | The batch size for training. | >= 1 |
| pin_memory | bool | True | If set to True, enables page-locked memory for faster CPU-to-GPU data transfer. | True, False |
| workers | int | 8 | The number of parallel workers used for data loading. | >= 1 |
| transform | dict config | | Configuration parameters for data transformation; see the fields in the table below and the multi-crop sketch after it. | |

The transform config contains the following fields:

| Field | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| n_global_crops | int | 2 | The number of global crops to generate. | >= 1 |
| global_crops_scale | list[float] | [0.32, 1.0] | The scale range for global crops. | (0.0, 1.0] |
| global_crops_size | int | 224 | The size (in pixels) of global crops. | >= 1 |
| n_local_crops | int | 8 | The number of local crops to generate. | >= 1 |
| local_crops_scale | list[float] | [0.05, 0.32] | The scale range for local crops. | (0.0, 1.0] |
| local_crops_size | int | 98 | The size (in pixels) of local crops. | >= 1 |
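
These transform parameters describe a DINO-style multi-crop augmentation: a few large global crops and many small local crops per image. The following torchvision sketch uses the defaults above and is illustrative only; the actual pipeline typically adds photometric augmentations such as color jitter.

from torchvision import transforms as T

def multi_crop(image, n_global=2, global_scale=(0.32, 1.0), global_size=224,
               n_local=8, local_scale=(0.05, 0.32), local_size=98):
    # Global crops cover large regions; local crops are small, zoomed-in views.
    global_crop = T.RandomResizedCrop(global_size, scale=global_scale)
    local_crop = T.RandomResizedCrop(local_size, scale=local_scale)
    return ([global_crop(image) for _ in range(n_global)] +
            [local_crop(image) for _ in range(n_local)])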

Example Spec File for Distillation from ViT Large to ViT Base#

encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: True
    disable_masking: False
    pretrained_non_distill_pl_model_path: /path/to/pretrained/model.pth
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_b"  # ViT-Base student distilled from the ViT-Large teacher
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: /path/to/img_dir
  test_dataset:
    images_dir: "???"
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
train:
  resume_training_checkpoint_path: null
  pretrained_model_path: /path/to/pretrained/model.pth
  num_nodes: 1
  num_gpus: 1
  num_epochs: 10
  checkpoint_interval: 1
  layerwise_decay: 1.0
  clip_grad_norm: 3.0
  optim:
    optim: "adamw"
  schedulers:
    learning_rate:
      val_base: "${eval: '2e-4 * (${dataset.batch_size} * ${train.num_gpus} * ${train.num_nodes} / 1024) ** (1/2)'}"
      val_final: 1e-6
      warm_up_steps: 100000
    last_layer_learning_rate:
      val_base: "${eval: '2e-4 * (${dataset.batch_size} * ${train.num_gpus} * ${train.num_nodes} / 1024) ** (1/2)'}"
      val_final: 1e-6
      warm_up_steps: 100000
      freeze_steps: 1250
    weight_decay:
      val_base: 0.04
      val_final: 0.2
    momentum:
      val_base: 0.994
      val_final: 1.0
    teacher_temperature:
      val_base: 0.07
      val_final: 0.07
      val_start: 0.04
      warm_up_steps: 37500
  num_prototypes: 131072
  results_dir: "${results_dir}/train"
inference:
  checkpoint: "???"
export:
  gpu_id: 0
  checkpoint: "???"
  onnx_file: "???"
  input_width: 518
  input_height: 518
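
Note the ${eval: ...} expressions for val_base: they apply square-root scaling of a 2e-4 reference learning rate with the global batch size. For this spec (batch_size: 16, num_gpus: 1, num_nodes: 1), the expression resolves to:

batch_size, num_gpus, num_nodes = 16, 1, 1
val_base = 2e-4 * (batch_size * num_gpus * num_nodes / 1024) ** (1 / 2)
print(val_base)  # 2.5e-05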

Training the Model#

Use the following command to run Nv-DINOv2 training:

TRAIN_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Multi-Node Training with FTMS

Distributed training is supported through FTMS. For large models, training on a multi-node cluster can bring significant speedups.

Verify that your cluster has multiple GPU-enabled nodes available for training by running the following command:

kubectl get nodes -o wide

You should see multiple nodes listed. If you do not, contact your cluster administrator to add more nodes to the cluster.

To run a multi-node training job through FTMS, you can modify the following fields in the training job spec:

{
    "train": {
        "num_gpus": 8, // Number of GPUs per node
        "num_nodes": 2 // Number of nodes to use for training
    }
}

If these fields are not specified, the defaults of 1 GPU per node and 1 node are used.

Note

The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster. The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.

Distilling the Model#

To distill the model, use the following command:

DISTILL_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action distill --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Running Inference with the Model#

Creating an Inference Experiment Spec File#

The following is an example spec file for running inference with a ViT-Large Nv-DINOv2 model:

encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: False
    disable_masking: False
    pretrained_non_distill_pl_model_path: null
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: "???"
  test_dataset:
    images_dir: /path/to/img_dir
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
inference:
  checkpoint: /path/to/model.pth
  results_dir: /path/to/experiment_results/inference

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to run inference or evaluation with. | |
| trt_engine | string | | The path to the TensorRT engine used for inference or evaluation. Use only with TAO Deploy. | |
| num_gpus | unsigned int | 1 | The number of GPUs to use. | > 0 |
| gpu_ids | List[int] | [0] | The GPU IDs to use. | |
| results_dir | string | | The path to a folder where the experiment outputs should be written. | |
| batch_size | unsigned int | | The batch size for inference or evaluation. | |

Use the following command to run inference with Nv-DINOv2:

INFERENCE_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Exporting the Model#

Creating an Export Experiment Spec File#

The following is an example spec file for exporting a ViT-Large Nv-DINOv2 model:

encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: False
    disable_masking: False
    pretrained_non_distill_pl_model_path: null
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: "???"
  test_dataset:
    images_dir: /path/to/img_dir
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
export:
  gpu_id: 0
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  input_width: 518
  input_height: 518

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to export. | |
| onnx_file | string | | The path to the exported .onnx file. | |
| opset_version | unsigned int | 12 | The opset version of the exported ONNX model. | > 0 |
| input_channel | unsigned int | 3 | The input channel size. Only the value 3 is supported. | 3 |
| input_width | unsigned int | 128 | The input width. | > 0 |
| input_height | unsigned int | 512 | The input height. | > 0 |
| batch_size | unsigned int | -1 | The batch size of the ONNX model. If set to -1, the export uses a dynamic batch size. | >= -1 |
| gpu_id | unsigned int | 0 | The GPU ID to use. | |
| on_cpu | bool | False | If set to True, exports the model on the CPU. | True, False |
| verbose | bool | False | If set to True, prints a human-readable representation of the network. | True, False |

Use the following command to export the model:

EXPORT_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
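
After the export job completes, you can optionally sanity-check the resulting ONNX file with onnxruntime. This step is not part of the TAO workflow; the sketch assumes an input of shape (batch, 3, input_height, input_width) matching the spec above and reads the input name from the model rather than hard-coding it.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("/path/to/model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 518, 518).astype(np.float32)  # (batch, channels, height, width)
outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])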

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.