OCDNet#
OCDNet is an optical-character detection model included in TAO. It supports the following tasks:
train
evaluate
inference
prune
export
quantize
Each task is explained in detail in the following sections.
Preparing the Dataset#
The dataset for OCDNet contains images and the corresponding label files.
Both the training dataset and test dataset must follow the same structure.
The directory structure should be organized as follows, where the directory name for images is
img and the directory name for label files is gt. By default, each label file name is
expected to be the name of the corresponding image file (without its extension) prefixed with gt_.
The exact directory names train and test are not required but are preferred by convention.
/train
    /img
        img_0.jpg
        img_1.jpg
        ...
    /gt
        gt_img_0.txt
        gt_img_1.txt
        ...
/test
    /img
        img_0.jpg
        img_1.jpg
        ...
    /gt
        gt_img_0.txt
        gt_img_1.txt
        ...
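The naming convention above can be checked before training. The following is a minimal sketch, not part of TAO: the helper names `label_path_for` and `missing_labels` are hypothetical, and it assumes the default gt_ prefix and the img/gt layout shown above.

```python
from pathlib import Path


def label_path_for(image_path, prefix="gt_"):
    """Return the label path expected for an image stored under <split>/img."""
    image_path = Path(image_path)
    split_dir = image_path.parent.parent  # e.g. /train
    return split_dir / "gt" / f"{prefix}{image_path.stem}.txt"


def missing_labels(split_dir, prefix="gt_"):
    """List images in <split_dir>/img that have no matching label file."""
    img_dir = Path(split_dir) / "img"
    return sorted(
        p.name
        for p in img_dir.iterdir()
        if p.is_file() and not label_path_for(p, prefix).exists()
    )


print(label_path_for("/train/img/img_0.jpg").as_posix())
```

Running `missing_labels` on both the training and test directories is a quick way to catch images without labels before a long training job starts.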
Below is an example label file from the public ICDAR2015 dataset:
$ cat ICDAR2015/test/gt/gt_img_14.txt
268,82,335,93,332,164,267,164,the
344,94,433,112,427,159,336,163,Future
208,191,374,184,371,213,208,241,Communications
370,176,420,176,416,204,373,213,###
1,57,261,76,261,187,0,190,venting
1,208,203,200,203,241,3,294,ntelligence.
Note
Each line of the label file contains the coordinates of the four corner points of a text region (x1,y1,x2,y2,x3,y3,x4,y4), followed by the text itself.
If the text is ### and the training specification file sets ignore_tags to ['###'], then those lines are ignored during training.
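As an illustration of this label format, the sketch below parses one label line into four (x, y) points plus the text, and flags lines whose text is listed in ignore_tags. The function name `parse_label_line` is hypothetical, and integer coordinates are assumed, as in the ICDAR2015 sample above; this is not the TAO data loader.

```python
def parse_label_line(line, ignore_tags=("###",)):
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,text' into (points, text, ignored)."""
    # Split at most 8 times so that commas inside the text are preserved.
    parts = line.strip().split(",", 8)
    if len(parts) != 9:
        raise ValueError(f"expected 8 coordinates and a text field: {line!r}")
    coords = [int(v) for v in parts[:8]]
    points = list(zip(coords[0::2], coords[1::2]))
    text = parts[8]
    return points, text, text in ignore_tags


sample = """268,82,335,93,332,164,267,164,the
370,176,420,176,416,204,373,213,###"""
for row in sample.splitlines():
    points, text, ignored = parse_label_line(row)
    print(points, repr(text), "ignored" if ignored else "used")
```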
Creating an Experiment Specification File#
The specification file for OCDNet includes model, train, dataset, and evaluate, as well as
other global parameters. Below is an example specification file for training an OCDNet model with a FAN-tiny backbone
on an ICDAR2015 dataset.
The top level description of the specification file is provided in the table below.
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| model | dict config | – | The configuration of the model architecture | |
| dataset | dict config | – | The configuration of the dataset | |
| train | dict config | – | The configuration of the training task | |
| evaluate | dict config | – | The configuration of the evaluation task | |
| inference | dict config | – | The configuration of the inference task | |
| | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |
| export | dict config | – | The configuration of the ONNX export task | |
| | dict config | – | The configuration of the TensorRT generation task | |
| prune | dict config | – | The configuration of the pruning task | |
| | str | – | | |
Model#
The model parameter provides the list of parameters for the model.
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| load_pruned_graph | bool | | A flag specifying whether to load the pruned graph. Set to True if train/evaluate/export/inference is being performed against a pruned model. | true/false |
| pruned_graph_path | string | – | The path to the pruned graph model (if …) | unix path |
| | string | – | The path to the pretrained model | unix path |
| backbone | string | deformable_resnet18 | The backbone of the model | deformable_resnet18, deformable_resnet50, fan_tiny_8_p4_hybrid |
| enlarge_feature_map_size | bool | | A flag specifying whether to enlarge the output feature map size of the FAN-tiny backbone. This flag has no effect when using a … | true/false |
| | bool | | A flag specifying whether to use activation checkpoints to save GPU memory. This flag has no effect when using a … | true/false |
Train#
The train parameter provides the parameters for training.
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| | unsigned int | 10 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0 |
| | unsigned int | 1 | The epoch interval at which the validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| optimizer | dict config | – | The configuration for the optimizer | – |
| lr_scheduler | dict config | – | The configuration for the lr_scheduler | – |
| post_processing | dict config | – | The configuration for post_processing | – |
| metric | dict config | – | The configuration for metric computing. QuadMetric is supported. If … | – |
| | bool | | If this flag is True, only one batch will run. This flag is only recommended for debugging purposes. | true/false |
| | string | fp32 | The precision that the model will be trained on. If this value is set to ‘fp16’, AMP training will be enabled. | fp32/fp16 |
| | bool | False | A flag to enable model EMA. The default value is False. If the value is True, model EMA will be enabled during training. | true/false |
| | float | 0.999 | The decay of model EMA. The default value is 0.999. This value is only used when … | (0, 1] |
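To make the role of the EMA decay concrete, here is a minimal sketch of the standard exponential-moving-average update. It is an assumption that TAO applies this textbook form; the function name `ema_update` and the plain-list parameters are illustrative only.

```python
def ema_update(ema_params, model_params, decay=0.999):
    """Standard EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, model_params)]


# Toy example with two scalar "parameters" and a small decay so the
# effect is visible after only two steps.
ema = [0.0, 1.0]
for current in ([1.0, 1.0], [1.0, 1.0]):
    ema = ema_update(ema, current, decay=0.9)
print(ema)
```

With a decay close to 1 (such as the default 0.999), the averaged weights follow the trained weights slowly, which smooths out step-to-step noise.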
optimizer#
optimizer:
  type: Adam
  args:
    lr: 0.001
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| type | string | Adam | The optimizer type | Adam |
| lr | float | – | The initial learning rate | >=0.0 |
lr_scheduler#
lr_scheduler:
  type: WarmupPolyLR
  args:
    warmup_epoch: 3
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| type | string | WarmupPolyLR | Decays the learning rate via a polynomial function. The learning rate increases to the initial value during the warmup stage and is reduced from the initial value to zero during the training stage. | WarmupPolyLR |
| warmup_epoch | unsigned int | 3 | The number of warmup epochs, during which the learning rate increases to the initial value (i.e. …) | >=0 |
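The schedule described in the table can be sketched as a single function of the epoch. This is only an illustration: the linear warmup shape, the starting value of zero, and the polynomial power of 0.9 are assumptions, since the document states only that the rate rises to the initial value during warmup and then decays polynomially to zero. The name `warmup_poly_lr` is hypothetical.

```python
def warmup_poly_lr(epoch, base_lr, total_epochs, warmup_epoch=3, power=0.9):
    """Learning rate at a (possibly fractional) epoch.

    Assumed shape: linear warmup from 0 to base_lr, then polynomial
    decay from base_lr to 0 over the remaining epochs.
    """
    if warmup_epoch > 0 and epoch < warmup_epoch:
        return base_lr * epoch / warmup_epoch
    remaining = total_epochs - warmup_epoch
    if remaining <= 0:
        return base_lr
    progress = min(max((epoch - warmup_epoch) / remaining, 0.0), 1.0)
    return base_lr * (1.0 - progress) ** power


for e in range(0, 11):
    print(e, warmup_poly_lr(e, base_lr=0.001, total_epochs=10))
```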
post_processing#
post_processing:
  type: SegDetectorRepresenter
  args:
    thresh: 0.3
    box_thresh: 0.55
    max_candidates: 1000
    unclip_ratio: 1.5
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| type | string | SegDetectorRepresenter | The name of the post_processing. The post_processing will generate BBox or polygon. | SegDetectorRepresenter |
| thresh | float | 0.3 | The threshold for binarization, which is used in generating an approximate binary map. | 0.0 ~ 1.0 |
| box_thresh | float | 0.7 | The BBox threshold. If the effective area is lower than this threshold, the prediction will be ignored, which means no text is detected. | 0.0 ~ 1.0 |
| max_candidates | unsigned int | 1000 | The maximum candidate output. Enlarge this parameter if characters are detected in one area but obviously not in the other area of the image. | > 1 |
| unclip_ratio | float | 1.5 | The unclip ratio using the Vatti clipping algorithm in the probability map. The BBox will look larger if this ratio is set larger. | >0.0 |
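The interplay of thresh, box_thresh, and unclip_ratio can be illustrated with a toy example. This is a simplified sketch, not the SegDetectorRepresenter implementation: it scores an axis-aligned box by its mean probability, and it assumes the commonly used differentiable-binarization rule in which the unclip offset distance is area × unclip_ratio / perimeter. All function names are hypothetical.

```python
import math


def binarize(prob_map, thresh=0.3):
    """Approximate binary map: 1 where the probability exceeds thresh."""
    return [[1 if p > thresh else 0 for p in row] for row in prob_map]


def box_score(prob_map, box):
    """Mean probability inside an axis-aligned box (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    values = [prob_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)


def unclip_distance(points, unclip_ratio=1.5):
    """Offset distance = area * unclip_ratio / perimeter (assumed rule)."""
    n = len(points)
    edges = [(points[i], points[(i + 1) % n]) for i in range(n)]
    # Shoelace formula for the polygon area.
    area = abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in edges)) / 2.0
    perimeter = sum(math.dist(a, b) for a, b in edges)
    return area * unclip_ratio / perimeter


prob_map = [[0.1, 0.8],
            [0.2, 0.9]]
print(binarize(prob_map, thresh=0.3))
# A candidate box is kept only if its score reaches box_thresh.
print(box_score(prob_map, (1, 0, 2, 2)) >= 0.55)
print(unclip_distance([(0, 0), (100, 0), (100, 20), (0, 20)], unclip_ratio=1.5))
```

Under this assumed rule, a larger unclip_ratio yields a proportionally larger offset, which matches the table's statement that the box looks larger when the ratio is increased.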
Dataset#
The dataset is defined by two sections: train_dataset and validate_dataset.
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| train_dataset | dict config | – | The configuration for the training dataset | – |
| validate_dataset | dict config | – | The configuration for the validation dataset | – |
The parameters for train_dataset are provided below.
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| data_name | string | ICDAR2015Dataset | The dataset name. For “ICDAR2015Dataset”, the label file is expected to use gt_ as a prefix. For “UberDataset”, the label file is expected to use truth_ as a prefix. | ICDAR2015Dataset, UberDataset |
| data_path | string list | – | The list of paths that contain images used for training. For example, … | – |
| pre_processes | dict | – | The pre-processing configuration (see train_preprocess for more details) | – |
| img_mode | string | BGR | The image mode | BGR, RGB, GRAY |
| filter_keys | string list | | The keys to ignore | – |
| ignore_tags | string list | | The labels that are not used to train | – |
| batch_size | unsigned int | False | The batch size. Set to a lower value if you encounter out-of-memory errors. | >0 |
| pin_memory | bool | False | A flag specifying whether to enable pinned memory | true/false |
| num_workers | unsigned int | 1 | The number of threads used to load data | >=0 |
train_preprocess#
pre_processes:
  - type: IaaAugment
    args:
      - {'type':Fliplr, 'args':{'p':0.5}}
      - {'type': Affine, 'args':{'rotate':[-45,45]}}
      - {'type':Sometimes,'args':{'p':0.2, 'then_list':{'type': GaussianBlur, 'args':{'sigma':[1.5,2.5]}}}}
      - {'type':Resize,'args':{'size':[0.5,3]}}
  - type: EastRandomCropData
    args:
      size: [640,640]
      max_tries: 50
      keep_ratio: true
  - type: MakeBorderMap
    args:
      shrink_ratio: 0.4
      thresh_min: 0.3
      thresh_max: 0.7
  - type: MakeShrinkMap
    args:
      shrink_ratio: 0.4
      min_text_size: 8
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| IaaAugment | dict list | {'type':Fliplr, 'args':{'p':0.5}}, {'type': Affine, 'args':{'rotate':[-10,10]}}, {'type':Sometimes,'args':{'p':1.0, 'then_list':{'type': GaussianBlur, 'args':{'sigma':[1.5,2.5]}}}}, {'type':Resize,'args':{'size':[0.5,3]}} | Uses imgaug to perform augmentation. “Fliplr”, “Affine”, “Sometimes”, “GaussianBlur”, and “Resize” are used by default. p defines the probability of each image being flipped. rotate defines the degree range when rotating images by a random value. Sometimes applies one or more augmenters to only p percent of all images. then_list defines the augmenter(s) to apply to p percent of all images. GaussianBlur blurs images using Gaussian kernels. sigma defines the standard deviation of the Gaussian kernel. size defines the range when resizing each image compared to its original size. | |
| EastRandomCropData | dict config | – | The random crop after augmentation. | |
| MakeBorderMap | dict config | – | Defines the parameter when generating a threshold map. | 0.0 ~ 1.0 |
| MakeShrinkMap | dict config | – | Defines the parameter when generating a probability map. | 0.0 ~ 1.0 |
The parameters for validate_dataset are similar to those for train_dataset, except for the validation_preprocess parameters described below.
validation_preprocess#
pre_processes:
  - type: Resize2D
    args:
      short_size:
        - 1280
        - 736
      resize_text_polys: true
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| type | string | Resize2D | Resize the images and labels before evaluation. | Resize2D |
| short_size | list | – | Resize the image to (width x height). | >0, >0, and multiples of 32. |
| resize_text_polys | bool | – | A flag specifying whether to resize the text coordinates | true/false |
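Because both entries of short_size must be positive multiples of 32, it can help to validate or round a desired resolution before writing it into the specification. The sketch below is illustrative only; `round_up_to_multiple` and `is_valid_short_size` are hypothetical helpers, not TAO functions.

```python
def round_up_to_multiple(value, base=32):
    """Smallest multiple of base that is >= value."""
    return ((value + base - 1) // base) * base


def is_valid_short_size(short_size, base=32):
    """True if short_size is [width, height] with positive multiples of base."""
    return len(short_size) == 2 and all(v > 0 and v % base == 0 for v in short_size)


print(is_valid_short_size([1280, 736]))
print(is_valid_short_size([1280, 720]))
print(round_up_to_multiple(720))
```

For example, a 1280 x 720 target is not valid as written, and rounding the height up gives the 1280 x 736 size used in the example above.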
Evaluate#
The following is an example specification file for evaluating on the ICDAR2015 dataset:
model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  backbone: deformable_resnet18
evaluate:
  results_dir: /results/evaluate
  checkpoint: /results/train/model_best.pth
  gpu_id: 0
  post_processing:
    type: SegDetectorRepresenter
    args:
      box_thresh: 0.55
      max_candidates: 1000
      unclip_ratio: 1.5
  metric:
    type: QuadMetric
    args:
      is_output_polygon: false
dataset:
  validate_dataset:
    data_path: ['/data/ocdnet/test']
    args:
      pre_processes:
        - type: Resize2D
          args:
            short_size:
              - 1280
              - 736
            resize_text_polys: true
      img_mode: BGR
      filter_keys: []
      ignore_tags: ['*', '###']
    loader:
      batch_size: 1
      shuffle: false
      pin_memory: false
      num_workers: 4
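Evaluation quality is referred to elsewhere on this page as hmean. In text-detection benchmarks this conventionally denotes the harmonic mean of precision and recall; assuming QuadMetric follows that convention, the value can be computed as in this small sketch (the function name `hmean` is illustrative).

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


print(hmean(0.5, 0.5))
print(hmean(1.0, 0.5))
```

The harmonic mean is pulled toward the lower of the two values, so a detector cannot reach a high hmean by maximizing precision or recall alone.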
Inference#
The following is an example specification file for running inference:
model:
  load_pruned_graph: false
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  backbone: deformable_resnet18
inference:
  checkpoint: '/results/train/model_best.pth'
  input_folder: /data/ocdnet/test/img
  width: 1280
  height: 736
  img_mode: BGR
  polygon: false
  results_dir: /results/inference
  post_processing:
    type: SegDetectorRepresenter
    args:
      thresh: 0.3
      box_thresh: 0.55
      max_candidates: 1000
      unclip_ratio: 1.5
The inference parameter defines the hyperparameters of the inference process. Inference
draws bounding boxes or polygons and visualizes them on the images.
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | – | The path to the pth model | Unix path |
| results_dir | string | /results/inference | The directory to save inference results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0 |
| | List[int] | [0] | The indices of the GPUs to use for distributed inference | |
| input_folder | string | – | The path to the input folder for inference | Unix path |
| width | unsigned int | – | The input width | >=1 |
| height | unsigned int | – | The input height | >=1 |
| img_mode | string | – | The image mode | BGR/RGB/GRAY |
| polygon | bool | – | A True value specifies BBox, while a False value specifies polygon. | true, false |
Training the Model#
Checkpointing and Resuming Training
At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth.
Checkpoints are saved in train.results_dir, like this:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
The latest checkpoint is also saved as ocd_model_latest.pth.
Training automatically resumes from ocd_model_latest.pth, if it exists in train.results_dir.
This is superseded by train.resume_training_checkpoint_path, if it is provided.
The major implication of this logic is that, if you wish to trigger fresh training from scratch, either:
Specify a new, empty results directory (Recommended)
Remove the latest checkpoint from the results directory
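The resume rules above can be summarized as a small decision function. This is a sketch of the documented behavior, not TAO source code; the function name `checkpoint_to_resume` is hypothetical.

```python
import os


def checkpoint_to_resume(results_dir, resume_training_checkpoint_path=None,
                         latest_name="ocd_model_latest.pth"):
    """Return the checkpoint training would resume from, or None for a fresh run."""
    # An explicitly provided path supersedes the automatic lookup.
    if resume_training_checkpoint_path:
        return resume_training_checkpoint_path
    latest = os.path.join(results_dir, latest_name)
    return latest if os.path.isfile(latest) else None


print(checkpoint_to_resume("/results/train_new_empty_dir"))
```

A return value of None corresponds to training from scratch, which is why pointing train.results_dir at a new, empty directory is the simplest way to force a fresh run.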
Note
By default, training uses the DDP (Distributed Data Parallel) strategy.
When training with multiple GPUs, the hmean result during training will be the same as a standalone evaluation
only if the number of evaluation images is a multiple of num_gpus * evaluate_batch_size.
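The condition in this note is easy to check ahead of time. The helper below is a hypothetical convenience function that simply encodes the stated rule.

```python
def matches_standalone_eval(num_images, num_gpus, evaluate_batch_size):
    """True if num_images is a multiple of num_gpus * evaluate_batch_size."""
    return num_images % (num_gpus * evaluate_batch_size) == 0


print(matches_standalone_eval(num_images=500, num_gpus=4, evaluate_batch_size=1))
print(matches_standalone_eval(num_images=500, num_gpus=3, evaluate_batch_size=1))
```

When the check fails, a standalone evaluation run gives the figure to rely on.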
Running Inference on the OCDNet Model#
Note
Inference expects existing label files in the gt folder. If there are no label files,
generate dummy labels under the gt folder. Use the following script for reference:
#!/bin/bash
folder_path=/workspace/datasets/ICDAR2015/datasets/test
mkdir -p "${folder_path}/gt"
# Write one dummy label (with the ### text) for every image.
for image in "${folder_path}"/img/*; do
    [ -e "${image}" ] || continue   # skip if the img folder is empty
    filename=$(basename "${image}")
    echo "10,10,10,20,20,10,20,20,###" > "${folder_path}/gt/gt_${filename%.*}.txt"
done
Pruning and Retraining an OCDNet Model#
Model pruning reduces model parameters to improve inference frames per second (FPS) while maintaining nearly the same hmean.
Pruning is applied to an already trained OCDNet model. Pruning generates a pruned graph model, which is a new model with fewer parameters. You must then retrain this pruned graph model on the same dataset to recover the hmean. During retraining, you need to enable loading of the pruned graph model and set the path to it.
The prune parameter defines the hyperparameters of the pruning process.
prune:
  checkpoint: /results/train/model_best.pth
  ch_sparsity: 0.2
  round_to: 32
  p: 2
  results_dir: /results/prune
  verbose: True
model:
  backbone: fan_tiny_8_p4_hybrid
  enlarge_feature_map_size: True
  fuse_qkv_proj: False
dataset:
  validate_dataset:
    data_path: ['/data/ocdnet_vit/test']
    args:
      pre_processes:
        - type: Resize2D
          args:
            short_size:
              - 640
              - 640
            resize_text_polys: true
      img_mode: BGR
      filter_keys: []
      ignore_tags: ['*', '###']
    loader:
      batch_size: 1
      shuffle: false
      pin_memory: false
      num_workers: 1
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to prune | unix path |
| ch_sparsity | float | 0.1 | The pruning threshold | 0.0 ~ 1.0 |
| results_dir | string | | The path to the results directory | Unix path |
| round_to | unsigned int | | Round channels to the nearest multiple of round_to. For example, round_to=8 means channels will be rounded to a multiple of 8. | >0 |
| p | unsigned int | 2 | The norm degree used to estimate the importance of channels | >0 |
| verbose | bool | True | A flag specifying whether to print pruning information | true/false |
| fuse_qkv_proj | bool | True | A flag specifying whether to fuse the QKV projection. It only needs to be set to True when using the FAN-tiny backbone. | true/false |
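Two of these parameters lend themselves to a small numeric illustration: p, the norm degree used for channel importance, and round_to, the channel rounding granularity. The sketch below shows only those two ideas in isolation. How TAO combines them with ch_sparsity during pruning is not described here, so the function names and the rounding details (including the floor of one round_to block) are assumptions.

```python
def channel_importance(channel_weights, p=2):
    """L_p norm of one channel's weights, used as its importance score."""
    return sum(abs(w) ** p for w in channel_weights) ** (1.0 / p)


def round_channels(channels, round_to):
    """Round a channel count to the nearest multiple of round_to (at least round_to)."""
    return max(round_to, int(round(channels / round_to)) * round_to)


# Channels with a smaller norm would be considered less important.
weights = {"ch0": [3.0, 4.0], "ch1": [0.1, 0.2]}
ranked = sorted(weights, key=lambda name: channel_importance(weights[name], p=2))
print(ranked)
print(round_channels(100, round_to=32))
```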
After pruning, the pruned model can be used for retraining (that is, fine-tuning). To start the retraining, you need to set
the load_pruned_graph parameter to true and set the pruned_graph_path parameter to point to the
model that is generated from pruning.
Note
When retraining, evaluating, performing inference on, or exporting a model that has a pruned structure, you need
to set load_pruned_graph to true so that the newly pruned model structure is imported.
Exporting the Model#
The export parameter defines the hyperparameters of the export process.
model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  backbone: deformable_resnet18
export:
  results_dir: /results/export
  checkpoint: '/results/train/model_best.pth'
  onnx_file: '/results/export/model_best.onnx'
  width: 1280
  height: 736
dataset:
  validate_dataset:
    data_path: ['/data/ocdnet/test']
| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to export | Unix path |
| onnx_file | string | | The path to the ONNX file | Unix path |
| | unsigned int | 11 | The opset version of the exported ONNX | >0 |
| width | unsigned int | 1280 | The input width | >0 |
| height | unsigned int | 736 | The input height | >0 |
Quantization#
OCDNet supports PTQ via TAO Quant using either the torchao (weight-only) or modelopt (static PTQ) backends.
Add a quantize section to your experiment specification (see TAO Quant documentation for schema and backend options).
Use the quantized checkpoint by setting evaluate.is_quantized: true or inference.is_quantized: true and pointing to the artifact saved under results_dir (for example, quantized_model_torchao.pth or quantized_model_modelopt.pth). For ModelOpt artifacts, the model weights are stored under model_state_dict.
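If you inspect a quantized artifact yourself, the model_state_dict key matters. The following is a minimal sketch that assumes the checkpoint has already been loaded into a Python dictionary (for example with torch.load). The helper name `extract_state_dict` is hypothetical, and returning the object unchanged when the key is absent is an assumption, since the layout of other artifacts is not described here.

```python
def extract_state_dict(checkpoint):
    """Return the weights mapping from an already-loaded checkpoint object."""
    # ModelOpt artifacts keep the weights under the model_state_dict key.
    if isinstance(checkpoint, dict) and "model_state_dict" in checkpoint:
        return checkpoint["model_state_dict"]
    # Assumption: anything else is treated as the weights mapping itself.
    return checkpoint


modelopt_like = {"model_state_dict": {"layer.weight": [0.1, 0.2]}}
print(list(extract_state_dict(modelopt_like)))
```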
Notes#
For modelopt static PTQ, ensure that your dataset configuration provides a representative calibration loader.
For torchao, activation settings in the configuration are ignored.
Calibration Dataset (ModelOpt)#
When you use the modelopt backend (static PTQ), provide a calibration dataset via dataset.quant_calibration_dataset.
Minimal example:
quantize:
  backend: "modelopt"
  mode: "static_ptq"
  algorithm: "minmax"
dataset:
  quant_calibration_dataset:
    images_dir: "/path/to/calib/images"
See also: TAO Quant overview and its Configuration and backend pages.
Deploying to DeepStream#
Refer to the nvOCDR page for more information about deploying an OCDNet model to DeepStream.
You can run nvOCDR with the DeepStream sample or Triton Inference Server. Specifically, nvOCDR Triton supports inference
on high-resolution images: it resizes the image while keeping the aspect ratio, tiles the image into small patches,
runs OCDNet on each patch, and then merges the results. This is useful for improving hmean when a model is trained at a smaller
resolution but runs inference on higher-resolution images. For images that are not high resolution, you can also set
resize_keep_aspect_ratio:true, which can improve hmean because the images are resized without distortion.
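To illustrate what resizing without distortion means, the sketch below computes the largest size that fits inside a target resolution while preserving the aspect ratio. It is not nvOCDR code: the function name is hypothetical, and how nvOCDR pads or tiles the remaining area is not covered here.

```python
def fit_keep_aspect_ratio(width, height, target_width, target_height):
    """Largest (width, height) that fits in the target box with the same aspect ratio."""
    scale = min(target_width / width, target_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# A 1920x1080 image fitted into the 1280x736 input used in the examples above.
print(fit_keep_aspect_ratio(1920, 1080, 1280, 736))
```

A plain resize to 1280 x 736 would instead stretch the image slightly in the vertical direction, which is the distortion that the aspect-ratio option avoids.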