Nv-DINOv2#

Introduction#

DINOv2 (Distillation with No Labels v2) is an advanced self-supervised learning method designed to train vision transformers without relying on labeled data. It builds on the original DINO framework by introducing architectural and training improvements that lead to more robust, transferable, and semantically rich feature representations.

At its core, DINOv2 uses a self-distillation strategy, where a student model learns from a momentum-updated teacher model. The two models receive different augmented views of the same image, and the student is trained to align its representations with those of the teacher. This encourages the model to develop a consistent understanding of the underlying image content, independent of transformations such as cropping, scaling, or color jitter.
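
To make the mechanism concrete, here is a minimal PyTorch-style sketch of the student/teacher update. It is purely illustrative and is not the Nv-DINOv2 implementation; the function names and temperature values are assumptions for the sketch.

import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float) -> None:
    # The teacher is a momentum (EMA) copy of the student; it receives no gradients.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def self_distillation_loss(student_logits, teacher_logits,
                           student_temp: float = 0.1, teacher_temp: float = 0.07):
    # The student is trained to match the sharpened teacher distribution
    # computed from a different augmented view of the same image.
    teacher_probs = torch.softmax(teacher_logits / teacher_temp, dim=-1).detach()
    student_log_probs = torch.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()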

DINOv2 has proven effective in capturing high-level semantic features, and has demonstrated strong performance across a wide range of downstream tasks without the need for task-specific supervision.

Benefits#

  • No labels required: Enables training on large, unlabeled image datasets with minimal human intervention.

  • High-quality features: Learns semantic-rich, transferable representations suitable for tasks like classification, detection, and segmentation.

  • Robust to augmentation: The use of multiple views during training enhances generalization and invariance to common image transformations.

Data Input for Nv-DINOv2#

Nv-DINOv2 expects input data to be RGB images stored in a single directory. Supported image formats include: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, and .webp.
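
Before training, it can be useful to confirm that a dataset directory contains only supported formats. The helper below is a hypothetical convenience script, not part of the toolkit:

from pathlib import Path

# Extensions accepted by Nv-DINOv2, per the list above.
SUPPORTED = {".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp"}

def check_dataset(images_dir: str) -> None:
    files = [p for p in Path(images_dir).iterdir() if p.is_file()]
    unsupported = [p.name for p in files if p.suffix.lower() not in SUPPORTED]
    print(f"{len(files) - len(unsupported)} supported images in {images_dir}")
    if unsupported:
        print(f"Files with unsupported extensions: {unsupported[:10]}")

check_dataset("/data")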

Creating a Training Experiment Spec File#

Configuring a Custom Dataset#

This section provides an example configuration and commands for training Nv-DINOv2 using the dataset format described above.

SPECS=$(tao-client nvdinov2 get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| model | dict config | | The configuration of the model architecture. | |
| dataset | dict config | | The configuration of the dataset. | |
| train | dict config | | The configuration of the training task. | |
| inference | dict config | | The configuration of the inference task. | |
| encryption_key | string | None | The encryption key to encrypt and decrypt model files. | |
| results_dir | string | /results | The directory where experiment results are saved. | |
| export | dict config | | The configuration of the ONNX export task. | |

train#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| pretrained_model_path | string | None | Path to a pretrained NVDINOv2 model used to initialize training. | |
| layerwise_decay | float | 1.0 | Layerwise learning rate decay factor. | (0, 1] |
| clip_grad_norm | float | 3.0 | Maximum gradient norm for gradient clipping. | > 0 |
| num_prototypes | int | 131072 | Number of prototypes used in training. | > 0 |
| precision | string | 16-mixed | Mixed-precision setting for training. | |
| use_custom_attention | bool | True | If set to True, uses a memory-efficient attention mechanism. | True, False |
| schedulers | dict config | | Configuration for learning rate and related schedulers. | |
| optim | dict config | | Configuration for the optimizer. Contains optim (string, default adamw): the optimizer type. | adamw |

schedulers#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| learning_rate | dict config | | Learning rate scheduler configuration. | |
| last_layer_learning_rate | dict config | | Last-layer learning rate scheduler configuration. | |
| weight_decay | dict config | | Weight decay scheduler configuration. | |
| momentum | dict config | | Momentum scheduler configuration. | |
| teacher_temperature | dict config | | Teacher temperature scheduler configuration. | |

learning_rate#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 7.07e-6 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 1e-6 | The final value of the schedule. | > 0 |
| val_start | float | 0.0 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 100000 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 2500000 | The maximum number of steps over which the value decays. | > 0 |
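
All of the scheduler configs in this section (learning_rate, last_layer_learning_rate, weight_decay, momentum, and teacher_temperature) share this schema: the value ramps from val_start to val_base over warm_up_steps, then decays toward val_final by max_decay_steps. The tables do not specify the decay shape; the sketch below assumes linear warm-up followed by cosine decay, the usual DINOv2-style convention, and is illustrative only.

import math

def scheduled_value(step, val_start, val_base, val_final, warm_up_steps, max_decay_steps):
    if step < warm_up_steps:
        # Linear warm-up from val_start to val_base.
        return val_start + (val_base - val_start) * step / warm_up_steps
    # Assumed cosine decay from val_base to val_final after warm-up.
    t = min((step - warm_up_steps) / max(max_decay_steps - warm_up_steps, 1), 1.0)
    return val_final + 0.5 * (val_base - val_final) * (1.0 + math.cos(math.pi * t))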

last_layer_learning_rate#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 7.07e-6 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 1e-6 | The final value of the schedule. | > 0 |
| val_start | float | 0.0 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 100000 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 2500000 | The maximum number of steps over which the value decays. | > 0 |
| freeze_steps | int | 1250 | The number of initial steps during which the last layer is frozen. | >= 0 |

weight_decay#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 0.04 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 0.2 | The final value of the schedule. | > 0 |
| val_start | float | 0.0 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 0 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 2500000 | The maximum number of steps over which the value decays. | > 0 |

momentum#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 0.994 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 1.0 | The final value of the schedule. | > 0 |
| val_start | float | 0.0 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 0 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 2500000 | The maximum number of steps over which the value decays. | > 0 |

teacher_temperature#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| val_base | float | 0.07 | The value of the schedule after warm-up. | > 0 |
| val_final | float | 0.07 | The final value of the schedule. | > 0 |
| val_start | float | 0.04 | The starting value of the schedule. | >= 0 |
| warm_up_steps | int | 37500 | The number of warm-up steps. | >= 0 |
| max_decay_steps | int | 37500 | The maximum number of steps over which the value decays. | > 0 |

model#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| distill | dict config | | Configuration for the NVDINOv2 distillation module. | |
| backbone | dict config | | Configuration for the NVDINOv2 backbone. | |
| head | dict config | | Configuration for the NVDINOv2 projection and prediction head. | |

distill#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| enable | bool | False | If set to True, runs distillation. | True, False |
| disable_masking | bool | False | If set to True, disables masking when distillation is enabled. | True, False |
| pretrained_non_distill_pl_model_path | string | None | Path to a non-distilled SSL-trained DINOv2 model used to initialize the teacher model. | |

backbone#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| teacher_type | string | vit_l | The teacher backbone type. | vit_l, vit_b, vit_s |
| student_type | string | vit_l | The student backbone type. | vit_l, vit_b, vit_s |
| num_register_tokens | int | 0 | The number of register tokens. | >= 0 |
| drop_path_rate | float | 0.4 | Drop path rate for stochastic depth regularization. | [0, 1) |
| patch_size | int | 14 | The size of input patches. | 14, 16 |
| img_size | int | 518 | The input image size used in the backbone. | 224, 518 |
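
As a quick sanity check on how img_size and patch_size interact: the backbone splits the image into (img_size / patch_size)^2 patch tokens, plus one class token and any register tokens. With the defaults above:

img_size, patch_size, num_register_tokens = 518, 14, 0  # defaults from the table above
patches_per_side = img_size // patch_size               # 518 // 14 = 37
num_tokens = patches_per_side ** 2 + 1 + num_register_tokens
print(num_tokens)  # 1369 patch tokens + 1 class token = 1370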

dataset#

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| train_dataset | dict config | | Configuration for the training dataset. Contains images_dir (string, default /data): the path to the images directory for the training dataset. | |
| test_dataset | dict config | | Configuration for the testing dataset. Contains images_dir (string, default /data): the path to the images directory for the testing dataset. | |
| batch_size | int | 4 | The batch size for training. | >= 1 |
| pin_memory | bool | True | If set to True, enables page-locked memory for faster CPU-to-GPU data transfer. | True, False |
| workers | int | 8 | The number of parallel workers used for data loading. | >= 1 |
| transform | dict config | | Configuration parameters for data transformation; see the fields in the table below and the multi-crop sketch after it. | |

The transform config contains the following fields:

| Field | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| n_global_crops | int | 2 | The number of global crops to generate. | >= 1 |
| global_crops_scale | list[float] | [0.32, 1.0] | The scale range for global crops. | (0.0, 1.0] |
| global_crops_size | int | 224 | The size (in pixels) of global crops. | >= 1 |
| n_local_crops | int | 8 | The number of local crops to generate. | >= 1 |
| local_crops_scale | list[float] | [0.05, 0.32] | The scale range for local crops. | (0.0, 1.0] |
| local_crops_size | int | 98 | The size (in pixels) of local crops. | >= 1 |
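
These transform parameters describe a DINO-style multi-crop augmentation: a few large global crops and many small local crops per image. The following torchvision sketch uses the defaults above and is illustrative only; the actual pipeline typically adds photometric augmentations such as color jitter.

from torchvision import transforms as T

def multi_crop(image, n_global=2, global_scale=(0.32, 1.0), global_size=224,
               n_local=8, local_scale=(0.05, 0.32), local_size=98):
    # Global crops cover large regions; local crops are small, zoomed-in views.
    global_crop = T.RandomResizedCrop(global_size, scale=global_scale)
    local_crop = T.RandomResizedCrop(local_size, scale=local_scale)
    return ([global_crop(image) for _ in range(n_global)] +
            [local_crop(image) for _ in range(n_local)])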

Example Spec File for Distillation from ViT Large to ViT Base#

encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: True
    disable_masking: False
    pretrained_non_distill_pl_model_path: /path/to/pretrained/model.pth
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_b"  # ViT-Base student distilled from the ViT-Large teacher
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: /path/to/img_dir
  test_dataset:
    images_dir: "???"
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
train:
  resume_training_checkpoint_path: null
  pretrained_model_path: /path/to/pretrained/model.pth
  num_nodes: 1
  num_gpus: 1
  num_epochs: 10
  checkpoint_interval: 1
  layerwise_decay: 1.0
  clip_grad_norm: 3.0
  optim:
    optim: "adamw"
  schedulers:
    learning_rate:
      val_base: "${eval: '2e-4 * (${dataset.batch_size} * ${train.num_gpus} * ${train.num_nodes} / 1024) ** (1/2)'}"
      val_final: 1e-6
      warm_up_steps: 100000
    last_layer_learning_rate:
      val_base: "${eval: '2e-4 * (${dataset.batch_size} * ${train.num_gpus} * ${train.num_nodes} / 1024) ** (1/2)'}"
      val_final: 1e-6
      warm_up_steps: 100000
      freeze_steps: 1250
    weight_decay:
      val_base: 0.04
      val_final: 0.2
    momentum:
      val_base: 0.994
      val_final: 1.0
    teacher_temperature:
      val_base: 0.07
      val_final: 0.07
      val_start: 0.04
      warm_up_steps: 37500
  num_prototypes: 131072
  results_dir: "${results_dir}/train"
inference:
  checkpoint: "???"
export:
  gpu_id: 0
  checkpoint: "???"
  onnx_file: "???"
  input_width: 518
  input_height: 518
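
Note the ${eval: ...} expressions for val_base: they apply square-root scaling of a 2e-4 reference learning rate with the global batch size. For this spec (batch_size: 16, num_gpus: 1, num_nodes: 1), the expression resolves to:

batch_size, num_gpus, num_nodes = 16, 1, 1
val_base = 2e-4 * (batch_size * num_gpus * num_nodes / 1024) ** (1 / 2)
print(val_base)  # 2.5e-05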

Training the Model#

Use the following command to run Nv-DINOv2 training:

TRAIN_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Multi-Node Training with FTMS

Distributed training is supported through FTMS. For large models, training on a multi-node cluster can bring significant speedups.

Verify that your cluster has multiple GPU-enabled nodes available for training by running the following command:

kubectl get nodes -o wide

You should see multiple nodes listed. If you do not, contact your cluster administrator to add more nodes to the cluster.

To run a multi-node training job through FTMS, you can modify the following fields in the training job spec:

{
    "train": {
        "num_gpus": 8, // Number of GPUs per node
        "num_nodes": 2 // Number of nodes to use for training
    }
}

If these fields are not specified, the defaults of 1 GPU per node and 1 node are used.

Note

The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster. The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.

Distilling the Model#

To distill the model, use the following command:

DISTILL_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action distill --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Running Inference with the Model#

Creating an Inference Experiment Spec File#

The following is an example spec file for running inference with a ViT-Large Nv-DINOv2 model:

encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: False
    disable_masking: False
    pretrained_non_distill_pl_model_path: null
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: "???"
  test_dataset:
    images_dir: /path/to/img_dir
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
inference:
  checkpoint: /path/to/model.pth
  results_dir: /path/to/experiment_results/inference

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to run inference or evaluation with. | |
| trt_engine | string | | The path to the TensorRT engine used for inference or evaluation. Use only with TAO Deploy. | |
| num_gpus | unsigned int | 1 | The number of GPUs to use. | > 0 |
| gpu_ids | List[int] | [0] | The GPU IDs to use. | |
| results_dir | string | | The path to a folder where the experiment outputs should be written. | |
| batch_size | unsigned int | | The batch size for inference or evaluation. | |

Use the following command to run inference with Nv-DINOv2:

INFERENCE_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Exporting the Model#

Creating an Export Experiment Spec File#

The following is an example spec file for exporting a ViT-Large Nv-DINOv2 model:

encryption_key: tlt_encode
results_dir: /path/to/experiment_results
model:
  distill:
    enable: False
    disable_masking: False
    pretrained_non_distill_pl_model_path: null
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    drop_path_rate: 0.4
    patch_size: 14
    img_size: 518
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: "???"
  test_dataset:
    images_dir: /path/to/img_dir
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
export:
  gpu_id: 0
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  input_width: 518
  input_height: 518

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to export. | |
| onnx_file | string | | The path to the exported .onnx file. | |
| opset_version | unsigned int | 12 | The opset version of the exported ONNX model. | > 0 |
| input_channel | unsigned int | 3 | The input channel size. Only the value 3 is supported. | 3 |
| input_width | unsigned int | 128 | The input width. | > 0 |
| input_height | unsigned int | 512 | The input height. | > 0 |
| batch_size | unsigned int | -1 | The batch size of the ONNX model. If set to -1, the export uses a dynamic batch size. | >= -1 |
| gpu_id | unsigned int | 0 | The GPU ID to use. | |
| on_cpu | bool | False | If set to True, exports the model on the CPU. | True, False |
| verbose | bool | False | If set to True, prints a human-readable representation of the network. | True, False |

Use the following command to export the model:

EXPORT_JOB_ID=$(tao-client nvdinov2 experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
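
After the export job completes, you can optionally sanity-check the resulting ONNX file with onnxruntime. This step is not part of the TAO workflow; the sketch assumes an input of shape (batch, 3, input_height, input_width) matching the spec above and reads the input name from the model rather than hard-coding it.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("/path/to/model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 518, 518).astype(np.float32)  # (batch, channels, height, width)
outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])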

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.