Visual ChangeNet-Classification#

Visual ChangeNet-Classification is an NVIDIA-developed change detection model for classification that is included in TAO. Visual ChangeNet supports the following tasks:

train
evaluate
inference
export
quantize

Each task is explained in detail in the following sections.

Data Input for VisualChangeNet#

Single Golden Data Format#

VisualChangeNet-Classification requires the data to be provided as image and CSV files. Refer to the Data Annotation Format page for more information about the input data format for VisualChangeNet-Classification, which follows the same input data format as Optical Inspection.

Multiple Golden Data Format#

To enable Multiple Golden mode, set num_golden > 1 in the Dataset Configuration. This mode requires a different data format to support multiple golden reference images per sample. Refer to the Data Annotation Format page for more information about the input data format for Multiple-Golden-VisualChangeNet-Classification.

Creating a Training Experiment Specification File#

Configuring a Custom Dataset#

This section provides an example configuration, along with commands to retrieve the configuration, for training VisualChangeNet-Classification with the dataset format described above.

Note

Make sure to set task=classify in SPECS for all task specs.

Parameter	Data Type	Default	Description	Supported Values
model	dict config	–	The configuration of the model architecture.
dataset	dict config	–	The configuration of the dataset.
train	dict config	–	The configuration of the training task.
evaluate	dict config	–	The configuration of the evaluation task.
inference	dict config	–	The configuration of the inference task.
encryption_key	string	None	The encryption key to encrypt and decrypt model files.
results_dir	string	/results	The directory where experiment results are saved.
export	dict config	–	The configuration of the ONNX export task.
task	str	classify	A flag to indicate the change detection task. Supports two tasks: ‘segment’ and ‘classify’ for segmentation and classification.	classify, segment
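The top-level parameters above can be pictured as the skeleton below. This is only a sketch assembled from the table: the empty `{}` placeholders stand in for the sections described later on this page, and whether an empty section falls back to its defaults is an assumption to verify against your TAO version.

```yaml
# Skeleton of an experiment specification (sketch, not a complete spec).
task: classify          # 'classify' or 'segment'; set 'classify' in every task spec
results_dir: /results   # default from the table above
model: {}               # model architecture, see the Model section
dataset: {}             # dataset source and pre-processing, see the Dataset section
train: {}               # training task settings
evaluate: {}            # evaluation task settings
inference: {}           # inference task settings
export: {}              # ONNX export task settings
```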

train#

Parameter	Datatype	Default	Description	Supported Values
num_gpus	unsigned int	1	The number of GPUs to use for distributed training.	>0
gpu_ids	List[int]	[0]	The indices of the GPUs to use for distributed training.
seed	unsigned int	1234	The random seed for random, NumPy, and torch.	>0
num_epochs	unsigned int	10	The total number of epochs to run the experiment.	>0
checkpoint_interval	unsigned int	1	The epoch interval at which the checkpoints are saved.	>0
validation_interval	unsigned int	1	The epoch interval at which the validation is run.	>0
resume_training_checkpoint_path	string		The intermediate PyTorch Lightning checkpoint from which to resume training.
results_dir	string	/results/train	The directory in which to save training results.
classify	Dict; str; list	None; ce	The classify dict contains configurable parameters for the VisualChangeNet Classification pipeline with the following parameters: * loss: The loss function used for classification training. * cls_weights: Weights for Cross-Entropy Loss for unbalanced dataset distributions.
segment	Dict; str; list	None; ce; [0.5, 0.5, 0.5, 0.8, 1.0]	The segment dict contains configurable parameters for the VisualChangeNet Segmentation pipeline with the following parameters: * loss: The loss function used for segmentation training.
num_nodes	unsigned int	1	The number of nodes. If larger than 1, multi-node is enabled.
pretrained_model_path	string	–	The path to the pretrained model checkpoint used to initialize the end-to-end model weights.
optim	dict config	None	Contains the configurable parameters for the VisualChangeNet optimizer detailed in the optim section.
tensorboard	dict config; bool	None; True	Enables TensorBoard visualization using a dict with configurable parameters: * enabled: If set to `True`, enables TensorBoard.
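As a rough illustration, the `train` parameters in the table could be combined as follows. The values are the table defaults, except `cls_weights`, whose two weights are hypothetical and would need to reflect your own class balance.

```yaml
# Sketch of a train section built from the table defaults.
train:
  num_gpus: 1
  gpu_ids: [0]
  num_nodes: 1
  seed: 1234
  num_epochs: 10
  checkpoint_interval: 1      # save a checkpoint every epoch
  validation_interval: 1      # run validation every epoch
  results_dir: /results/train
  classify:
    loss: "ce"
    cls_weights: [1.0, 10.0]  # hypothetical weights for an unbalanced two-class dataset
  tensorboard:
    enabled: True
```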

optim#

optim:
  lr: 0.0001
  optim: "adamw"
  policy: "linear"
  momentum: 0.9
  weight_decay: 0.01

Parameter	Datatype	Default	Description	Supported Values
lr	float	0.0005	The learning rate.	>=0.0
optim	str	adamw	The optimizer.
policy	str	linear	The learning rate scheduler: * linear: LambdaLR decreases the lr by a multiplicative factor. * step: StepLR decays the lr by a factor of 0.1 every `num_epochs // 3` steps.	linear, step
momentum	float	0.9	The momentum for the AdamW optimizer.
weight_decay	float	0.1	The weight decay coefficient.
monitor_name	str	val_loss	The name of the monitor used for saving the top-k checkpoints.
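To make the `step` policy concrete, here is a sketch that uses the table defaults with a hypothetical `num_epochs` of 30. It assumes that `optim` is nested under `train`, as the train table indicates, and that one scheduler step corresponds to one epoch.

```yaml
# Sketch: step policy with the table defaults for the optimizer.
train:
  num_epochs: 30            # hypothetical value chosen for easy arithmetic
  optim:
    lr: 0.0005
    optim: "adamw"
    policy: "step"          # lr is multiplied by 0.1 every 30 // 3 = 10 steps
    momentum: 0.9
    weight_decay: 0.1
    monitor_name: "val_loss"
```

Under those assumptions the learning rate would be 0.0005 for the first 10 epochs, 0.00005 for the next 10, and 0.000005 for the last 10.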

Model#

The following example model config provides options to change the VisualChangeNet-Classification architecture for training. VisualChangeNet-Classification supports two model architectures. Architecture 1 (difference_module = euclidean) leverages only the last feature maps from the FAN backbone using Euclidean difference to perform contrastive learning. Architecture 2 (difference_module = learnable) leverages the VisualChangeNet-Classification learnable difference modules for 4 different features at 3 feature resolutions to minimize Cross-Entropy loss.

model:
  backbone:
    type: "fan_small_12_p4_hybrid"
    pretrained_backbone_path: null
    freeze_backbone: False
  decode_head:
    feature_strides: [4, 8, 16, 16]
    align_corners: False
    use_summary_token: True
  classify:
    train_margin_euclid: 2.0
    eval_margin: 0.005
    embedding_vectors: 5
    embed_dec: 30
    difference_module: 'learnable'
    learnable_difference_modules: 4

Parameter	Datatype	Default	Description	Supported Values
backbone	Dict; string; bool; bool	None; None; False; False	A dictionary containing the following configurable parameters for the VisualChangeNet-Classification backbone: * type: The name of the backbone to be used. * pretrained_backbone_path: The path to the pre-trained backbone weights file. * freeze_backbone: If set to `True`, freezes the backbone weights during training. * feat_downsample: If set to `True`, downsamples the last feature map in FAN backbone configurations. This parameter is not propagated to other backbones.	fan_tiny_8_p4_hybrid, fan_large_16_p4_hybrid, fan_small_12_p4_hybrid, fan_base_16_p4_hybrid, vit_large_nvdinov2, c_radio_p1_vit_huge_patch16_224_mlpnorm, c_radio_p2_vit_huge_patch16_224_mlpnorm, c_radio_p3_vit_huge_patch16_224_mlpnorm, c_radio_v2_vit_huge_patch16_224, c_radio_v2_vit_large_patch16_224, c_radio_v2_vit_base_patch16_224
decode_head	Dict; bool; bool; list; Dict; int	None; False; True; [4, 8, 16, 16]; 256	A dictionary containing the following configurable parameters for the decoder: * align_corners: If set to `True`, the input and output tensors are aligned by the center points of their corner pixels, preserving the values at the corner pixels. * use_summary_token: If set to `True`, uses the summary token of the backbone. * feature_strides: The downsampling feature strides for different backbones. * decoder_params: Contains the following network parameters: – embed_dims: The embedding dimensions.	True, False; True, False; >0
classify	Dict; string	None; 2.0; 5; 30; learnable; 4	A dictionary containing the following configurable parameters for the VisualChangeNet-Classification model: * train_margin_euclid: The training margin threshold for contrastive learning (applicable to Architecture 1). * eval_margin: The evaluation margin threshold. * embedding_vectors: The output embedding dimension for each input image before computing Euclidean distance (applicable to Architecture 1). * embed_dec: The transformer decoder MLP embedding dimension (applicable to Architecture 2). * difference_module: The type of difference module used (applicable to both architectures). * learnable_difference_modules: The number of learnable difference modules (applicable to Architecture 2).	>0; >0; >0; >0; euclidean, learnable; <=4
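The example model config above selects Architecture 2. For comparison, a sketch of an Architecture 1 configuration might look like the following. It reuses only parameters and values that appear on this page, and it leaves out the Architecture 2 parameters (`embed_dec`, `learnable_difference_modules`) on the assumption that they are not needed when `difference_module` is `euclidean`.

```yaml
# Sketch: Architecture 1 (Euclidean difference, contrastive learning).
model:
  backbone:
    type: "fan_small_12_p4_hybrid"
    pretrained_backbone_path: null
    freeze_backbone: False
  classify:
    difference_module: 'euclidean'
    train_margin_euclid: 2.0   # training margin for contrastive learning
    eval_margin: 0.005         # evaluation margin threshold
    embedding_vectors: 5       # output embedding dimension per input image
```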

Dataset#

The dataset parameter defines the dataset source, training batch size, augmentation, and pre-processing. An example dataset configuration is provided below.

dataset:
  classify:
    train_dataset:
      csv_path: /path/to/train.csv
      images_dir: /path/to/img_dir
    validation_dataset:
      csv_path: /path/to/val.csv
      images_dir: /path/to/img_dir
    test_dataset:
      csv_path: /path/to/test.csv
      images_dir: /path/to/img_dir
    infer_dataset:
      csv_path: /path/to/infer.csv
      images_dir: /path/to/img_dir
    image_ext: .jpg
    batch_size: 16
    workers: 2
    fpratio_sampling: 0.2
    num_input: 4
    input_map:
      LowAngleLight: 0
      SolderLight: 1
      UniformLight: 2
      WhiteLight: 3
    concat_type: linear
    grid_map:
      x: 2
      y: 2
    image_width: 128
    image_height: 128
    augmentation_config:
      rgb_input_mean: [0.485, 0.456, 0.406]
      rgb_input_std: [0.229, 0.224, 0.225]
    num_classes: 2

* Refer to the Dataset Annotation Format definition for more information about specifying lighting conditions.

Parameter	Datatype	Default	Description	Supported Values
segment	Dict	–	The `segment` contains dataset config for the segmentation dataloader.
classify	Dict	–	The `classify` contains dataset config for the classification dataloader detailed in the classify section.

classify#

Parameter	Datatype	Default	Description	Supported Values
train_dataset	Dict	–	The paths to the image directory and CSV files for the training dataset.
validation_dataset	Dict	–	The paths to the image directory and CSV files for the validation dataset.
test_dataset	Dict	–	The paths to the image directory and CSV files for the test dataset.
infer_dataset	Dict	–	The paths to the image directory and CSV files for the inference dataset.
image_ext	str	.jpg	The file extension of the images in the dataset.	string
batch_size	int	32	The number of samples per batch.	>0
workers	int	8	The number of worker processes for data loading.
fpratio_sampling	float	0.1	The ratio of false-positive examples to sample.	>0
num_input	int	4	The number of lighting conditions for each input image*.	>0
input_map	Dict	–	The mapping of lighting conditions to indices specifying concatenation ordering*.
concat_type	string	linear	Type of concatenation to use for different image lighting conditions.	linear, grid
grid_map	Dict; Dict; Dict	None; None; None	The parameters that define the grid dimensions used to concatenate images as a grid: * x: The number of images along the x-axis. * y: The number of images along the y-axis.	Dict
input_width	int	100	The width of the input image.	>0
input_height	int	100	The height of the input image.	>0
num_classes	int	2	The number of classes in the dataset.	>1
augmentation_config	Dict	None	Dictionary containing various data augmentation settings, which is detailed in the augmentation section.
num_golden	int	1	Number of golden images to use per input image. Setting this value greater than 1 enables Multiple Golden mode. Multiple Golden mode is only supported with ViT backbones, using `input_width = input_height = 224` and `input_map = None`. In Multiple Golden mode, the dataset must follow the multiple golden data format.	>0
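The Multiple Golden constraints in the last row can be sketched as the fragment below. The `num_golden` value of 3 is purely illustrative. The width and height keys follow the names used in this table (`input_width`, `input_height`), whereas the example dataset configuration above uses `image_width` and `image_height`, so check which spelling your TAO version expects.

```yaml
# Sketch: enabling Multiple Golden mode (requires a ViT backbone).
model:
  backbone:
    type: "vit_large_nvdinov2"   # one of the ViT backbones listed in the Model section
dataset:
  classify:
    num_golden: 3        # illustrative; any value > 1 enables Multiple Golden mode
    input_map: null      # input_map must be None in this mode
    input_width: 224
    input_height: 224
```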

augmentation_config#

Parameter	Datatype	Default	Description	Supported Values
random_flip	Dict; float; float; bool	None; 0.5; 0.5; True	Random vertical and horizontal flipping augmentation settings. * vflip_probability: Probability of vertical flipping. * hflip_probability: Probability of horizontal flipping. * enable: If set to `True`, enables random flipping augmentation.	>=0.0; >=0.0
random_rotate	Dict; float; list; bool	None; 0.5; [90, 180, 270]; True	Random rotation augmentation settings. * rotate_probability: Probability of applying random rotation. * angle_list: List of rotation angles to choose from. * enable: If set to `True`, enables random rotation augmentation.	>=0.0; >=0.0
random_color	Dict; float; float; float; float; bool; float	None; 0.3; 0.3; 0.3; 0.3; True; 0.5	Random color augmentation settings. * brightness: Maximum brightness change factor. * contrast: Maximum contrast change factor. * saturation: Maximum saturation change factor. * hue: Maximum hue change factor. * enabled: If set to `True`, enables random color augmentation. * color_probability: Probability of applying color augmentation.	>=0.0; >=0.0; >=0.0; >=0.0; >=0.0
with_random_crop	bool	True	If set to `True`, applies random crop augmentation.	True, False
with_random_blur	bool	True	If set to `True`, applies random blurring augmentation.	True, False
rgb_input_mean	List[float]	[0.485, 0.456, 0.406]	The mean to be subtracted for pre-processing.
rgb_input_std	List[float]	[0.229, 0.224, 0.225]	The standard deviation to divide the image by.
augment	bool	False	If set to `True`, applies data augmentations.	True, False
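For orientation, the augmentation settings above could be written out as the following fragment. All values are the table defaults except `augment`, which is switched to `True` here on the assumption that it is the overall toggle for the augmentations. The key names, including `enable` for flip and rotate and `enabled` for color, are copied from the table as written.

```yaml
# Sketch of an augmentation_config built from the table defaults.
dataset:
  classify:
    augmentation_config:
      augment: True              # table default is False
      random_flip:
        vflip_probability: 0.5
        hflip_probability: 0.5
        enable: True
      random_rotate:
        rotate_probability: 0.5
        angle_list: [90, 180, 270]
        enable: True
      random_color:
        brightness: 0.3
        contrast: 0.3
        saturation: 0.3
        hue: 0.3
        enabled: True
        color_probability: 0.5
      with_random_crop: True
      with_random_blur: True
      rgb_input_mean: [0.485, 0.456, 0.406]
      rgb_input_std: [0.229, 0.224, 0.225]
```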

Example Specification File for ViT Backbones#

Note

The following specification file is only relevant for TAO versions 5.3 and later.

Creating a Testing Experiment Specification File#

The following parameters configure evaluation and inference of a trained VisualChangeNet-Classification model.

Inference/Evaluate#

Parameter	Datatype	Default	Description	Supported Values
checkpoint	string		The path to the PyTorch model to use for evaluation or inference.
trt_engine	string		The path to the TensorRT engine to use for evaluation or inference.
num_gpus	unsigned int	1	The number of GPUs to use.	>0
gpu_ids	List[int]	[0]	The GPU IDs to use.
results_dir	string		The path to a folder where the experiment outputs should be written.
vis_after_n_batches	unsigned int	1	The number of batches after which to save visualization results for evaluation or inference.	>0
batch_size	unsigned int		The batch size for evaluation or inference.
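A minimal sketch of `evaluate` and `inference` sections using the parameters in this table is shown below. The checkpoint path, results directories, and batch size are hypothetical placeholders.

```yaml
# Sketch: evaluate and inference sections for a trained PyTorch checkpoint.
evaluate:
  checkpoint: /path/to/model.pth     # hypothetical path to the trained model
  num_gpus: 1
  gpu_ids: [0]
  results_dir: /results/evaluate     # hypothetical output folder
  vis_after_n_batches: 1
  batch_size: 16                     # hypothetical value
inference:
  checkpoint: /path/to/model.pth     # hypothetical path to the trained model
  num_gpus: 1
  gpu_ids: [0]
  results_dir: /results/inference    # hypothetical output folder
  vis_after_n_batches: 1
  batch_size: 16                     # hypothetical value
```

To run against a TensorRT engine instead, the table suggests setting `trt_engine` in place of `checkpoint`.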

Evaluating the Model#

Multi-GPU evaluation is currently not supported for Visual ChangeNet Classify.

Exporting the Model#

The following parameters configure the export of a trained VisualChangeNet model:

Parameter	Datatype	Default	Description	Supported Values
checkpoint	string		The path to the PyTorch model to export.
onnx_file	string		The path to the `.onnx` file.
opset_version	unsigned int	12	The opset version of the exported ONNX.	>0
input_channel	unsigned int	3	The input channel size. Only the value 3 is supported.	3
input_width	unsigned int	128	The input width.	>0
input_height	unsigned int	512	The input height.	>0
batch_size	int	-1	The batch size of the ONNX model. If this value is set to -1, the export uses dynamic batch size.	>=-1
gpu_id	unsigned int	0	The GPU ID to use.
on_cpu	bool	False	If set to `True`, exports the model on CPU.
verbose	bool	False	If set to `True`, prints a human-readable representation of the network.
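Put together, an `export` section based on the table might look like this sketch. The two file paths are hypothetical and all other values are the table defaults.

```yaml
# Sketch of an export section built from the table defaults.
export:
  checkpoint: /path/to/model.pth    # hypothetical path to the trained model
  onnx_file: /path/to/model.onnx    # hypothetical output path
  opset_version: 12
  input_channel: 3                  # only 3 is supported
  input_width: 128
  input_height: 512
  batch_size: -1                    # -1 exports with a dynamic batch size
  gpu_id: 0
  on_cpu: False
  verbose: False
```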

Quantization#

Visual ChangeNet-Classification supports post-training quantization (PTQ) via TAO Quant using either the torchao (weight-only) or modelopt (static PTQ) backend.

Add a quantize section to your experiment specification (see TAO Quant documentation for schema and backend options).
Use the quantized checkpoint by setting evaluate.is_quantized: true or inference.is_quantized: true and pointing to the artifact saved under results_dir (for example, quantized_model_torchao.pth or quantized_model_modelopt.pth). For ModelOpt artifacts, the model weights are stored under model_state_dict.
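The second step could look like the sketch below for a torchao artifact. It assumes that the `checkpoint` key from the Inference/Evaluate table is what points to the quantized artifact, and the directory shown is a hypothetical `results_dir`.

```yaml
# Sketch: evaluating and running inference with a quantized checkpoint.
evaluate:
  is_quantized: true
  checkpoint: /results/quantized_model_torchao.pth   # hypothetical location under results_dir
inference:
  is_quantized: true
  checkpoint: /results/quantized_model_torchao.pth   # hypothetical location under results_dir
```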

Notes#

For modelopt static PTQ, ensure that your dataset configuration provides a representative calibration loader.
For torchao, activation settings in the configuration are ignored.

Calibration Dataset (ModelOpt)#

When you use the modelopt backend (static PTQ), provide a calibration dataset via dataset.classify.quant_calibration_dataset.

Minimal example:

quantize:
  backend: "modelopt"
  mode: "static_ptq"
  algorithm: "minmax"
dataset:
  classify:
    quant_calibration_dataset:
      images_dir: "/path/to/calib/images"

See also: TAO Quant overview and its Configuration and backend pages.