NVIDIA TAO Toolkit v4.0.1

Data Annotation Format

This page describes the dataset formats for computer-vision apps supported by TAO Toolkit.

Image classification expects a directory of images with the following structure, where each class has its own directory named after the class. The train/val/test directory names can differ from those shown here, because the path of each set is specified individually in the spec file. See the Specification File for Classification section for more information.

|--dataset_root:
    |--train
        |--audi:
            |--1.jpg
            |--2.jpg
        |--bmw:
            |--01.jpg
            |--02.jpg
    |--val
        |--audi:
            |--3.jpg
            |--4.jpg
        |--bmw:
            |--03.jpg
            |--04.jpg
    |--test
        |--audi:
            |--5.jpg
            |--6.jpg
        |--bmw:
            |--05.jpg
            |--06.jpg
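As a quick sanity check of this layout, the sketch below walks one split directory and derives each image's class from its parent directory name. It is only an illustration: the helper name `list_labeled_images` and the set of accepted extensions are assumptions, not part of TAO Toolkit.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_labeled_images(split_dir):
    """Return sorted (image_path, class_name) pairs for one split directory."""
    pairs = []
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class directories
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                # The class label is the name of the containing directory.
                pairs.append((image_path, class_dir.name))
    return pairs
```

Calling the helper once per split (for example on `dataset_root/train`) gives a listing that can be compared against the class names you expect.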

Using the KITTI format requires data to be organized in this structure:

.
|--dataset root
    |-- images
        |-- 000000.jpg
        |-- 000001.jpg
        .
        .
        |-- xxxxxx.jpg
    |-- labels
        |-- 000000.txt
        |-- 000001.txt
        .
        .
        |-- xxxxxx.txt
    |-- kitti_seq_to_map.json

Here’s a description of the structure:

  • The images directory contains the images to train on.

  • The labels directory contains the labels for the corresponding images. Details of these files are included in the Label Files section.

    Note

    The images and labels have the same file IDs before the extension. The image to label correspondence is maintained using this file name.

  • The kitti_seq_to_map.json file contains a sequence-to-frame-ID mapping for the frames in the images directory. This file is optional and is useful when the data needs to be split into N folds sequence-wise. If the data is instead split randomly into an 80:20 train:val split, this file can be ignored.
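Because images and labels are matched purely by file ID, it can help to check for unpaired files before training. The following is a minimal sketch of such a check. The function name `find_unpaired` is hypothetical, and the sketch assumes label files use the `.txt` extension as in the layout above.

```python
from pathlib import Path


def find_unpaired(images_dir, labels_dir):
    """Return (image IDs without a label, label IDs without an image)."""
    # The file ID is the file name without its extension.
    image_ids = {p.stem for p in Path(images_dir).iterdir() if p.is_file()}
    label_ids = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_ids - label_ids), sorted(label_ids - image_ids)
```

Two empty lists mean that every image has a label and every label has an image.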

Label Files

A KITTI format label file is a simple text file containing one line per object. Each line has multiple fields. Here is a description of these fields:

| Num elements | Parameter name | Description | Type | Range | Example |
|---|---|---|---|---|---|
| 1 | Class names | The class to which the object belongs. | String | N/A | Person, car, Road_Sign |
| 1 | Truncation | How much of the object has left image boundaries. | Float | [0.0, 1.0] | 0.0 |
| 1 | Occlusion | Occlusion state [0 = fully visible, 1 = partly visible, 2 = largely occluded, 3 = unknown]. | Integer | [0, 3] | 2 |
| 1 | Alpha | Observation angle of the object | Float | [-pi, pi] | 0.146 |
| 4 | Bounding box coordinates: [xmin, ymin, xmax, ymax] | Location of the object in the image | Float (0-based index) | xmin: [0, image_width], ymin: [0, image_height], xmax: [xmin, image_width], ymax: [ymin, image_height] | 100 120 180 160 |
| 3 | 3-D dimension | Height, width, length of the object (in meters) | Float | N/A | 1.65, 1.67, 3.64 |
| 3 | Location | 3-D object location x, y, z in camera coordinates (in meters) | Float | N/A | -0.65, 1.71, 46.7 |
| 1 | Rotation_y | Rotation ry around the Y-axis in camera coordinates | Float | [-pi, pi] | -1.59 |

Each object line therefore contains 15 elements in total. Here is a sample text file:

car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03

This file describes an image containing three objects with the parameters listed above. For detection, the toolkit currently requires only the class name and bounding-box coordinate fields to be populated, because the TAO training pipeline supports training only on class and bounding-box coordinates. The remaining fields may be set to 0. Here is a sample file for a custom annotated dataset:

car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
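To make the 15-element layout concrete, here is a minimal parsing sketch for a single label line. The `KittiObject` type and `parse_kitti_line` function are illustrative names rather than TAO Toolkit APIs, and the sketch assumes that class names contain no whitespace.

```python
from typing import List, NamedTuple


class KittiObject(NamedTuple):
    class_name: str
    truncation: float
    occlusion: int
    alpha: float
    bbox: List[float]        # [xmin, ymin, xmax, ymax]
    dimensions: List[float]  # height, width, length (meters)
    location: List[float]    # x, y, z in camera coordinates (meters)
    rotation_y: float


def parse_kitti_line(line):
    """Parse one line of a KITTI-format label file into a KittiObject."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    values = [float(v) for v in fields[1:]]  # the 14 numeric fields
    return KittiObject(
        class_name=fields[0],
        truncation=values[0],
        occlusion=int(values[1]),
        alpha=values[2],
        bbox=values[3:7],
        dimensions=values[7:10],
        location=values[10:13],
        rotation_y=values[13],
    )
```

For a custom dataset where only the class and bounding box are populated, the other fields simply parse as zeros.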


Sequence Mapping File

This is an optional JSON file that captures the mapping between the frames in the images directory and the names of the video sequences from which these frames were extracted. This information is needed when doing an N-fold split of the dataset: frames from one sequence are kept within a single fold, so that one of the folds can be used for validation. Here is the structure of the JSON dictionary file:

{
    "video_sequence_name": [list of strings(frame idx)]
}

Here’s an example of a kitti_seq_to_map.json file for a sample dataset with six sequences:

{
    "2011_09_28_drive_0165_sync": ["003193", "003185", "002857", "001864", "003838", "007320", "003476", "007308", "000337", "004165", "006573"],
    "2011_09_28_drive_0191_sync": ["005724", "002529", "004136", "005746"],
    "2011_09_28_drive_0179_sync": ["005107", "002485", "006089", "000695"],
    "2011_09_26_drive_0079_sync": ["005421", "000673", "002064", "000783", "003068"],
    "2011_09_28_drive_0035_sync": ["005540", "002424", "004949", "004996", "003969"],
    "2011_09_28_drive_0117_sync": ["007150", "003797", "002554", "001509"]
}
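The idea of a sequence-wise split can be sketched as follows: whole sequences are assigned to folds, so frames from one sequence never appear in two folds. The greedy balancing strategy and the function name `split_sequences_into_folds` are assumptions made for illustration. They are not necessarily how the TAO dataset converter partitions the data.

```python
def split_sequences_into_folds(seq_to_frames, num_folds):
    """Assign whole sequences to folds; returns a list of frame-ID lists."""
    folds = [[] for _ in range(num_folds)]
    # Place the longest sequences first (ties broken by name) so the
    # folds end up with roughly similar frame counts.
    ordered = sorted(seq_to_frames, key=lambda name: (-len(seq_to_frames[name]), name))
    for name in ordered:
        smallest = min(folds, key=len)  # fold with the fewest frames so far
        smallest.extend(seq_to_frames[name])
    return folds
```

Any one of the returned folds can then serve as the validation set while the remaining folds are used for training.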


Since TAO Toolkit 3.0-22.05, all object detection models support COCO format. Using the COCO format requires data to be organized in this structure:

annotation{
    "id": int,
    "image_id": int,
    "category_id": int,
    "bbox": [x,y,width,height],
    "area": float,
    "iscrowd": 0 or 1,
}

image{
    "id": int,
    "width": int,
    "height": int,
    "file_name": str,
    "license": int,
    "flickr_url": str,
    "coco_url": str,
    "date_captured": datetime,
}

categories[{
    "id": int,
    "name": str,
    "supercategory": str,
}]

An example COCO annotation file is shown below:

"annotations": [{"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768}],
"images": [{"license": 1,"file_name": "000000407646.jpg","coco_url": "http://images.cocodataset.org/val2017/000000407646.jpg","height": 400,"width": 500,"date_captured": "2013-11-23 03:58:53","flickr_url": "http://farm4.staticflickr.com/3110/2855627782_17b93a684e_z.jpg","id": 407646}],
"categories": [{"supercategory": "person","id": 1,"name": "person"},{"supercategory": "vehicle","id": 2,"name": "bicycle"},{"supercategory": "vehicle","id": 3,"name": "car"},{"supercategory": "vehicle","id": 4,"name": "motorcycle"}]

See the COCO website for a description of the COCO format.

Important

The id in categories should start from 1.
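Two details of this format are easy to get wrong: the bbox is stored as [x, y, width, height] rather than as two corners, and category IDs are expected to start from 1. The sketch below illustrates both. The helper names are hypothetical, and the uniqueness check on IDs is an extra assumption beyond what this page states.

```python
def coco_bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to [xmin, ymin, xmax, ymax]."""
    x, y, width, height = bbox
    return [x, y, x + width, y + height]


def check_category_ids(categories):
    """Return True if category IDs are unique and the smallest ID is 1."""
    ids = sorted(category["id"] for category in categories)
    return bool(ids) and ids[0] == 1 and len(ids) == len(set(ids))
```

The corner form matches the [xmin, ymin, xmax, ymax] ordering used by the KITTI label files described earlier.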

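For models that also need segmentation masks, each COCO annotation carries an additional segmentation field. Using the COCO format requires data to be organized in this structure: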

annotation{
    "id": int,
    "image_id": int,
    "category_id": int,
    "segmentation": RLE or [polygon],
    "area": float,
    "bbox": [x,y,width,height],
    "iscrowd": 0 or 1,
}

image{
    "id": int,
    "width": int,
    "height": int,
    "file_name": str,
    "license": int,
    "flickr_url": str,
    "coco_url": str,
    "date_captured": datetime,
}

categories[{
    "id": int,
    "name": str,
    "supercategory": str,
}]

An example COCO annotation file is shown below:


"annotations": [{"segmentation": [[510.66,423.01,511.72,420.03,510.45,416.0,510.34,413.02,510.77,410.26,510.77,407.5,510.34,405.16,511.51,402.83,511.41,400.49,510.24,398.16,509.39,397.31,504.61,399.22,502.17,399.64,500.89,401.66,500.47,402.08,499.09,401.87,495.79,401.98,490.59,401.77,488.79,401.77,485.39,398.58,483.9,397.31,481.56,396.35,478.48,395.93,476.68,396.03,475.4,396.77,473.92,398.79,473.28,399.96,473.49,401.87,474.56,403.47,473.07,405.59,473.39,407.71,476.68,409.41,479.23,409.73,481.56,410.69,480.4,411.85,481.35,414.93,479.86,418.65,477.32,420.03,476.04,422.58,479.02,422.58,480.29,423.01,483.79,419.93,486.66,416.21,490.06,415.57,492.18,416.85,491.65,420.24,492.82,422.9,493.56,424.39,496.43,424.6,498.02,423.01,498.13,421.31,497.07,420.03,497.07,415.15,496.33,414.51,501.1,411.96,502.06,411.32,503.02,415.04,503.33,418.12,501.1,420.24,498.98,421.63,500.47,424.39,505.03,423.32,506.2,421.31,507.69,419.5,506.31,423.32,510.03,423.01,510.45,423.01]],"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768}], "images": [{"license": 1,"file_name": "000000407646.jpg","coco_url": "http://images.cocodataset.org/val2017/000000407646.jpg","height": 400,"width": 500,"date_captured": "2013-11-23 03:58:53","flickr_url": "http://farm4.staticflickr.com/3110/2855627782_17b93a684e_z.jpg","id": 407646}], "categories": [{"supercategory": "person","id": 1,"name": "person"},{"supercategory": "vehicle","id": 2,"name": "bicycle"},{"supercategory": "vehicle","id": 3,"name": "car"},{"supercategory": "vehicle","id": 4,"name": "motorcycle"}]

See the COCO website for a description of the COCO format.

Important

The id in categories should start from 1.
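When the segmentation field is given as a polygon, it is a flat list of alternating x and y coordinates, as in the example above. The following sketch derives a COCO-style bbox and an area from such a list. The function names are illustrative, and the area uses the standard shoelace formula, which may differ slightly from the area value that a given annotation tool stores.

```python
def polygon_points(flat):
    """Turn a flat [x1, y1, x2, y2, ...] list into (x, y) pairs."""
    if len(flat) % 2 != 0 or len(flat) < 6:
        raise ValueError("a polygon needs at least 3 (x, y) pairs")
    return list(zip(flat[0::2], flat[1::2]))


def polygon_bbox(flat):
    """Return the enclosing box as COCO-style [x, y, width, height]."""
    points = polygon_points(flat)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


def polygon_area(flat):
    """Return the polygon area using the shoelace formula."""
    points = polygon_points(flat)
    total = 0.0
    # Pair each vertex with the next one, wrapping around at the end.
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0
```

Comparing `polygon_bbox` against the stored bbox field is a cheap consistency check on converted annotations.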

This section describes the dataset formats supported by UNet and SegFormer for loading images and masks.

Note

If you have the masks saved in COCO format, refer to the Dataset Converter section for UNet to convert COCO format to UNet mask image format.

Semantic Segmentation Mask Format

This section describes the format of the mask images for the different values of input_image_type or input_type. Refer to dataset_config_unet for more information on configuring input_image_type for UNet, and to dataset_config_segformer for more information on configuring input_type for SegFormer.

Color/RGB Input Image Type

For color/RGB input images, each mask image is a single-channel or three-channel image with the same size as the input image. Every pixel in the mask should have an integer value that represents the segmentation class label_id, as per the mapping provided in dataset_config_unet or dataset_config_segformer. Ensure that the pixel values in the mask image are within the range of the label_id values provided in dataset_config_unet or dataset_config_segformer.

For a reference example, refer to the _labelIds.png images format in the Cityscapes Dataset.

Grayscale Input Image Type

For grayscale input images, the mask is a single-channel image with the same size as the input image. Every pixel has a value of 255 or 0, which corresponds to a label_id of 1 or 0, respectively, in dataset_config_unet or dataset_config_segformer. For reference, see the ISBI dataset Jupyter notebook example provided in the NGC resources.
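The grayscale convention can be illustrated with a small check that maps mask pixel values to label IDs and rejects anything other than 0 or 255. This is a plain-Python sketch operating on nested lists of pixel values. The function name is hypothetical, and loading the mask from an image file is left to whichever image library you use.

```python
def grayscale_mask_to_label_ids(mask_rows):
    """Map a 0/255 mask (list of pixel rows) to label IDs 0/1."""
    mapping = {0: 0, 255: 1}
    converted = []
    for row in mask_rows:
        new_row = []
        for pixel in row:
            if pixel not in mapping:
                raise ValueError(f"unexpected mask value {pixel}; expected 0 or 255")
            new_row.append(mapping[pixel])
        converted.append(new_row)
    return converted
```

A mask that raises here typically contains anti-aliased edges or was saved with lossy compression.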

Image and Mask Loading Format

SegFormer

For SegFormer, the paths to the image and mask folders can be provided directly in dataset_config_segformer. Ensure that each image and its corresponding mask have the same name; their file extensions do not need to match.

UNet

Structured Images and Masks Folders for UNet

The data folder structure for images and masks must be in the following format for UNet:

/Dataset_01
    /images
        /train
            0000.png
            0001.png
            ...
            ...
            N.png
        /val
            0000.png
            0001.png
            ...
            ...
            N.png
        /test
            0000.png
            0001.png
            ...
            ...
            N.png
    /masks
        /train
            0000.png
            0001.png
            ...
            ...
            N.png
        /val
            0000.png
            0001.png
            ...
            ...
            N.png

  • See the Folders based Dataset Config section for further details about configuring these image and mask folder paths in the experiment spec.

  • Each image and label has the same file ID before the extension. The image-to-label correspondence is maintained using this filename. The test folder in the above directory structure is optional; any folder can be used for inference.

Image and Mask Text files for UNet

Provide an image text file containing the paths to all the images and a mask text file containing the paths to the corresponding mask files. The image and mask entries should be full absolute Unix paths.

The contents of an example image text file, images_source1.txt, are shown below:

/home/user/workspace/exports/images_final/00001.jpg
/home/user/workspace/exports/images_final/00002.jpg

The contents of the corresponding example mask text file, labels_source1.txt, are shown below:

/home/user/workspace/exports/masks_final/00001.png
/home/user/workspace/exports/masks_final/00002.png

  • The text file method additionally allows you to specify multiple sequences.

  • The paths to these text files should be provided in the spec file.

See the Text files based Dataset Config section for further details about configuring multiple data sources using text files in the dataset config.
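One way to produce such a pair of text files is sketched below. It assumes that an image and its mask share the same file name before the extension (as in the 00001.jpg and 00001.png example) and that line i of the image file corresponds to line i of the mask file. The function name `write_path_lists` is hypothetical and not part of TAO Toolkit.

```python
from pathlib import Path


def write_path_lists(images_dir, masks_dir, images_txt, masks_txt):
    """Write absolute image and mask paths, one per line, in matching order.

    Returns the number of image/mask pairs written.
    """
    masks_by_id = {p.stem: p for p in Path(masks_dir).iterdir() if p.is_file()}
    image_lines, mask_lines = [], []
    for image_path in sorted(Path(images_dir).iterdir()):
        mask_path = masks_by_id.get(image_path.stem)
        if image_path.is_file() and mask_path is not None:
            # resolve() yields the full absolute path required by the format.
            image_lines.append(str(image_path.resolve()))
            mask_lines.append(str(mask_path.resolve()))
    Path(images_txt).write_text("\n".join(image_lines) + "\n")
    Path(masks_txt).write_text("\n".join(mask_lines) + "\n")
    return len(image_lines)
```

Images without a matching mask are skipped, so the two files always have the same number of lines.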

Note

The size of the images need not necessarily be equal to the model input dimensions. The images are resized internally to model input dimensions.

A gesture recognition model should perform well on users outside the training dataset. Model training therefore requires user segregation when splitting the data into train, validation, and test datasets. To enable this, each subject needs a unique identifier, user_id. In addition, each subject might record multiple videos.

The dataset should be organized in the following format:

.
|-- original dataset root
    |-- uid_1
        |-- session_1
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
        |-- session_2
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
    |-- uid_2
        |-- session_1
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
        |-- session_2
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
    |-- uid_3
        |-- session_1
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png

For each set we also prepare a metadata file that captures fields that can be used for dataset sampling.

{
    "set": "data",
    "users": {
        "uid_1": {
            "location": "outdoor",
            "illumination": "good",
            "class_fps": {
                "session_1": 30,
                "session_2": 30
            }
        },
        "uid_2": {
            "location": "indoor",
            "illumination": "good",
            "class_fps": {
                "session_1": 10,
                "session_2": 15
            }
        },
        "uid_3": {
            "location": "indoor",
            "illumination": "poor",
            "class_fps": {
                "session_1": 10
            }
        }
    }
}
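User segregation means that the split is made over user IDs rather than over individual images, so no subject appears in more than one set. The sketch below shows the idea on the metadata structure above. The function name, the split fractions, and the seeded shuffle are illustrative assumptions and do not describe the sampling performed by the TAO dataset_convert tool.

```python
import random


def split_users(metadata, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Split the user IDs in a metadata dict into disjoint train/val/test lists."""
    user_ids = sorted(metadata["users"])
    random.Random(seed).shuffle(user_ids)  # seeded for reproducibility
    n_val = int(len(user_ids) * val_fraction)
    n_test = int(len(user_ids) * test_fraction)
    return {
        "val": user_ids[:n_val],
        "test": user_ids[n_val:n_val + n_test],
        "train": user_ids[n_val + n_test:],
    }
```

All sessions of a given user then follow that user into the same set.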

Label Format

Each image corresponds to a subject performing a gesture. The image requires a corresponding label JSON which contains a bounding box for the hand of interest and gesture label. We follow the Label Studio format. A sample label for an image is:

{
    "completions": [
        {
            "result": [
                {
                    "type": "rectanglelabels",
                    "original_width": 320,
                    "original_height": 240,
                    "value": {
                        "x": 58.1,
                        "y": 18.3,
                        "width": 18.8,
                        "height": 49.5
                    }
                },
                {
                    "type": "choices",
                    "value": {
                        "choices": [
                            "Thumbs-up"
                        ]
                    }
                }
            ]
        }
    ],
    "task_path": "/workspace/tao-experiments/gesturenet/data/uid_1/session_1/image_0001.png"
}

  • task_path: specifies the full path to the image.

  • completions: This chunk contains the labels under result.

The bounding box and gesture class are separate entries with the following types:

  • rectanglelabels: specifies the label corresponding to hand bounding box.

| Parameter name | Description | Type | Range |
|---|---|---|---|
| type | The type of label | String | rectanglelabels |
| original_width | Width of image being labelled (in pixels) | Integer | [1, inf) |
| original_height | Height of image being labelled (in pixels) | Integer | [1, inf) |
| value["x"] | x coordinate of top left corner of hand bounding box (as a percentage of image width) | Float | [0, 100] |
| value["y"] | y coordinate of top left corner of hand bounding box (as a percentage of image height) | Float | [0, 100] |
| value["width"] | Width of the hand bounding box (as a percentage of image width) | Float | [0, 100] |
| value["height"] | Height of the hand bounding box (as a percentage of image height) | Float | [0, 100] |

  • choices: specifies the label corresponding to gesture class.

| Parameter name | Description | Type | Range |
|---|---|---|---|
| type | The type of label | String | choices |
| value["choices"] | List of attributes. For GestureNet app this will be a single entry with gesture class name | List of strings | Valid gesture classes |
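Because the bounding box values are percentages of the image size, they need to be scaled by original_width and original_height to obtain pixel coordinates. The following sketch reads one label dictionary in the format above. The function names are hypothetical, and the sketch assumes a single completion with exactly one rectanglelabels entry and one choices entry.

```python
def rectangle_to_pixels(result_entry):
    """Convert a rectanglelabels entry to (xmin, ymin, xmax, ymax) in pixels."""
    value = result_entry["value"]
    image_width = result_entry["original_width"]
    image_height = result_entry["original_height"]
    xmin = value["x"] / 100.0 * image_width
    ymin = value["y"] / 100.0 * image_height
    xmax = xmin + value["width"] / 100.0 * image_width
    ymax = ymin + value["height"] / 100.0 * image_height
    return xmin, ymin, xmax, ymax


def read_label(label):
    """Return (pixel bounding box, gesture class name) from one label dict."""
    results = label["completions"][0]["result"]
    box = next(r for r in results if r["type"] == "rectanglelabels")
    choice = next(r for r in results if r["type"] == "choices")
    return rectangle_to_pixels(box), choice["value"]["choices"][0]
```

For the sample label above, x = 58.1 on a 320-pixel-wide image corresponds to roughly 186 pixels from the left edge.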

The dataset_convert tool requires extraction and experiment configuration spec files as input. The details of the configuration files and sample usage examples are included on the Gesture Recognition page.

HeartRateNet expects directories of images in the format shown below. The images and ground truth labels are then converted to TFRecords for training.

Subject_001/
    ground_truth.csv
    image_timestamps.csv
    images/
        0000.bmp
        0001.bmp
        .
        .
        N.bmp
.
.
Subject_M/
    ground_truth.csv
    image_timestamps.csv
    images/
        0000.bmp
        0001.bmp
        .
        .
        Y.bmp

EmotionNet, FPENet, and GazeNet use the same JSON data format, labeled by the NVIDIA data factory team. These apps expect data in this JSON format for training and evaluation, and the data is converted to TFRecords for training. TFRecords help iterate faster through the data. Refer to the corresponding section for the JSON data format descriptions.

Using the JSON label data format requires data to be organized in a JSON file with the following structure:

{
    "filename": "data/001_01_02_200_06.png",
    "class": "image",
    "annotations": [
        {
            "class": "FaceBbox",
            "tool-version": "1.0",
            "Occlusion": 0,
            "face_outer_bboxx": 269.0082935424086,
            "face_outer_bboxy": 44.33839032556304,
            "face_outer_bboxwidth": 182.97858097042064,
            "face_outer_bboxheight": 276.28773076003836,
            "face_tight_bboxx": 269.211755426433,
            "face_tight_bboxy": 147.9049289218409,
            "face_tight_bboxwidth": 182.58110482105968,
            "face_tight_bboxheight": 172.5088694283426
        },
        {
            "class": "FiducialPoints",
            "tool-version": "1.0",
            "P1x": 304.8502837500011,
            "P1y": 217.10946645000078,
            "P2x": 311.0173699500011,
            "P2y": 237.15249660000086,
            .
            .
            "P26occluded": true,
            "P46occluded": true,
            .
            .
            "P68x": 419.5885050000024,
            "P68y": 267.6976650000015,
            .
            .
            "P104x": 429.6,
            "P104y": 189.5
        },
        {
            "class": "eyes",
            "tool-version": "1.0",
            "l_eyex": 389.1221901922325,
            "l_eyey": 197.94528259092206,
            "r_eyex": 633.489814294182,
            "r_eyey": 10.52527209626886,
            "l_status": "open",
            "r_status": "occluded"
        }
    ]
}

Here’s a description of the structure:

  • filename field: specifies the path to the images to train on.

  • class field: category of the labels for the respective section.

  • annotations field: the annotation chunks.

There are three supported chunks in the annotations field: FaceBbox, FiducialPoints, and eyes.

  • FaceBbox chunk: This chunk describes the face bounding box labeling information.

| Parameter name | Description | Type | Range | Example |
|---|---|---|---|---|
| class | The class for the annotation chunk | String | N/A | FaceBbox |
| tool-version | Version of the labeling tool for this chunk | Float | N/A | 1.0 |
| Occlusion | Occlusion state [0 = not occluded, 1 = occluded] | Integer | 0 or 1 | 0 |
| face_outer_bboxx | x coordinate of top left corner of outer face bounding box | Float | [0, image_width] | 269.00 |
| face_outer_bboxy | y coordinate of top left corner of outer face bounding box | Float | [0, image_height] | 44.33 |
| face_outer_bboxwidth | Width of the outer face bounding box | Float | [0, image_width] | 182.97 |
| face_outer_bboxheight | Height of the outer face bounding box | Float | [0, image_height] | 276.28 |
| face_tight_bboxx | x coordinate of top left corner of tight face bounding box | Float | [0, image_width] | 269.21 |
| face_tight_bboxy | y coordinate of top left corner of tight face bounding box | Float | [0, image_height] | 147.90 |
| face_tight_bboxwidth | Width of the tight face bounding box | Float | [0, image_width] | 182.58 |
| face_tight_bboxheight | Height of the tight face bounding box | Float | [0, image_height] | 172.50 |

  • FiducialPoints chunk: This chunk describes the fiducial point labeling information.

| Parameter name | Description | Type | Range | Example |
|---|---|---|---|---|
| class | The class for the annotation chunk | String | N/A | FiducialPoints |
| tool-version | Version of the labeling tool for this chunk | Float | N/A | 1.0 |
| Occlusion | Occlusion status [0 = not occluded, 1 = occluded] | Integer | 0 or 1 | 0 |
| Pix | x coordinate of the ith landmark point | Float | [0, image_width] | 304.85 |
| Piy | y coordinate of the ith landmark point | Float | [0, image_height] | 217.10 |
| Pioccluded | Whether the ith landmark point is occluded | Boolean | N/A | true |

  • eyes chunk: This is a chunk that describes eyes labeling information. This chunk is not required.

| Parameter name | Description | Type | Range | Example |
|---|---|---|---|---|
| class | The class for the annotation chunk | String | N/A | eyes |
| tool-version | Version of the labeling tool for this chunk | Float | N/A | 1.0 |
| l_eyex | x coordinate of left eye center | Float | [0, image_width] | 389.12 |
| l_eyey | y coordinate of left eye center | Float | [0, image_height] | 197.94 |
| r_eyex | x coordinate of right eye center | Float | [0, image_width] | 633.48 |
| r_eyey | y coordinate of right eye center | Float | [0, image_height] | 10.52 |
| l_status | Status of the left eye | String | open/close/barely open/half open/occluded | open |
| r_status | Status of the right eye | String | open/close/barely open/half open/occluded | occluded |
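Since each chunk is identified by its class field and landmark keys follow the Pix / Piy / Pioccluded pattern, a frame entry can be unpacked with a few lines of code. The sketch below is one possible reading of the format, with hypothetical helper names. In particular, treating a missing Pioccluded key as "not occluded" is an assumption based on the sample data, not a documented rule.

```python
import re

# Matches keys such as "P1x" or "P104y" and captures the index and the axis.
_POINT_KEY = re.compile(r"^P(\d+)([xy])$")


def chunk_by_class(frame, class_name):
    """Return the first annotation chunk with the given class, or None."""
    for chunk in frame.get("annotations", []):
        if chunk.get("class") == class_name:
            return chunk
    return None


def fiducial_points(frame):
    """Return {index: {"x": ..., "y": ..., "occluded": bool}} for one frame."""
    chunk = chunk_by_class(frame, "FiducialPoints") or {}
    points = {}
    for key, value in chunk.items():
        match = _POINT_KEY.match(key)
        if not match:
            continue  # skip class, tool-version, occlusion flags, etc.
        index, axis = int(match.group(1)), match.group(2)
        points.setdefault(index, {"occluded": False})[axis] = value
    for index, point in points.items():
        # Assumption: a missing "P<i>occluded" key means the point is visible.
        point["occluded"] = bool(chunk.get(f"P{index}occluded", False))
    return points
```

The same `chunk_by_class` helper can be used to fetch the FaceBbox or eyes chunk of a frame.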

Here’s an example of a JSON file for a sample dataset with two image frames:

[ { "filename": "data/001_01_02_200_06.png", "class": "image", "annotations": [ { "face_outer_bboxy": 44.33839032556304, "face_outer_bboxx": 269.0082935424086, "face_tight_bboxx": 269.211755426433, "face_tight_bboxy": 147.9049289218409, "tool-version": "1.0", "face_tight_bboxwidth": 182.58110482105968, "face_tight_bboxheight": 172.5088694283426, "face_outer_bboxwidth": 182.97858097042064, "Occlusionx": 0, "class": "FaceBbox", "face_outer_bboxheight": 276.28773076003836 }, { "P91x": 395.3500000000004, "P91y": 196.6500000000002, "P74occluded": true, "P28x": 436.44144340908053, "P28y": 174.67157210032852, "P52y": 252.53100000000143, "P52x": 428.9925000000024, "P32y": 236.48449500000103, "P32x": 416.6063550000018, "P44x": 427.65443026467267, "P44y": 186.9615161604129, "P99x": 425.75, "P36occluded": true, "P75x": 428.85, "P75y": 190.95000000000002, "P20x": 389.46879000000166, "P20y": 178.13376000000076, "P8y": 313.8318038340011, "P8x": 407.70466707150143, "P81y": 192.2500000000002, "P94x": 427.70000000000005, "P81x": 393.5500000000004, "P12y": 268.179948238501, "P12x": 408.69280247400155, "P65y": 260.04348000000147, "P65x": 429.0319800000024, "P84x": 396.8500000000004, "P84y": 194.4500000000002, "P93occluded": true, "P46occluded": true, "P43y": 193.31428917697824, "P43x": 421.12354211680173, "P14occluded": true, "P92y": 187.5, "P54occluded": true, "P53x": 433.50450000000245, "P53y": 251.9670000000014, "P45occluded": true, "P33x": 426.3480450000019, "P33y": 238.67140500000104, "P60x": 413.82301500000233, "P100occluded": true, "P60y": 272.07148500000153, "P23y": 174.7903155211989, "P23x": 428.12940394815394, "P90y": 194.9000000000002, "P13x": 399.2067026100015, "P13y": 257.903340052501, "P7x": 388.1395861020014, "P7y": 304.93858521150105, "P61y": 262.1309850000015, "P104x": 429.6, "P104y": 189.5, "P83y": 193.2500000000002, "P83x": 395.0000000000004, "P61x": 404.5783500000023, "P50y": 254.6756100000014, "P50x": 414.2206350000023, "P100x": 424.8, "P100y": 191.3, "P34y": 
240.46069500000107, "P34x": 435.9903300000019, "P18y": 188.2730700000008, "P18x": 366.50623500000154, "P25occluded": true, "P102occluded": true, "P46x": 436.0852131464696, "P46y": 191.82999641609848, "P58y": 275.0536350000016, "P58x": 429.2307900000024, "P77x": 306.5418228495726, "P77y": 258.61884245799524, "P97occluded": true, "P99y": 192.9, "P10y": 293.87146870350114, "P10x": 434.97720418050164, "P48occluded": true, "P26x": 436.0258414360342, "P26y": 171.99984513074497, "version": "v1", "P27occluded": true, "P86x": 397.8000000000004, "P86y": 198.45000000000022, "P73occluded": true, "P98occluded": true, "P2y": 237.15249660000086, "P90x": 393.3500000000004, "P29y": 203.3826300000009, "P29x": 433.6046100000019, "P101y": 188.85000000000002, "P101x": 425.65000000000003, "P51x": 423.6641100000023, "P51y": 252.5881050000014, "P35x": 436.78557000000194, "P35y": 239.26783500000104, "P66x": 433.70401500000247, "P66y": 268.0952850000015, "P19x": 378.4348350000016, "P19y": 181.61293500000076, "P98y": 193.45000000000002, "P98x": 427.85, "P45y": 187.0802595812833, "P45x": 433.2353710455805, "P21y": 176.44387500000076, "P21x": 398.1170250000017, "P59x": 422.1730350000024, "P59y": 274.25839500000154, "P9x": 431.0246625705015, "P9y": 312.25078719000106, "P17occluded": true, "P11x": 422.7243251895016, "P11y": 281.81621679300105, "P70y": 195.95000000000002, "P79occluded": true, "P95occluded": true, "P70x": 395.20000000000005, "P1x": 304.8502837500011, "P13occluded": true, "P85y": 196.6500000000002, "P85x": 398.1000000000004, "P69y": 196.95000000000002, "P24x": 433.0572559142747, "P36y": 236.88211500000105, "P36x": 427.5409050000019, "P94occluded": true, "P104occluded": true, "P47occluded": true, "P40x": 401.35650000000186, "P40y": 197.40000000000092, "P71x": 396.40000000000003, "P71y": 196.8, "P65occluded": true, "P26occluded": true, "P56y": 273.06553500000155, "P56x": 433.0081800000024, "P16occluded": true, "P89y": 196.2500000000002, "P89x": 392.4500000000004, "P48x": 
428.54500592120047, "P48y": 195.45167075264504, "P16y": 216.4016531475008, "P16x": 360.47179483200136, "P15occluded": true, "P24y": 170.63429579073562, "P78x": 276.3975906000002, "class": "FiducialPoints", "P74y": 190.10000000000002, "P4y": 270.1562190435009, "P4x": 329.2467161130011, "P96y": 191.10000000000002, "P74x": 427.85, "P103y": 195.00000000000003, "P103x": 396.4500000000001, "P80x": 330.41417158035716, "P80y": 178.5832276794402, "P37x": 381.05250000000177, "P37y": 200.64300000000094, "P47y": 195.09544049003392, "P47x": 433.47285788732125, "P64x": 432.80937000000245, "P64y": 255.47085000000143, "P76y": 191.60000000000002, "P57y": 271.77327000000156, "P99occluded": true, "P43occluded": true, "P88x": 392.8500000000004, "P88y": 198.45000000000022, "P17x": 335.9660368500013, "P17y": 206.7179262030008, "P96x": 431.05, "P67y": 268.3935000000015, "P27y": 173.42476618118954, "P27x": 436.38207169864535, "P87y": 199.45000000000022, "P87x": 395.1000000000004, "P3x": 316.76397300000116, "P67x": 426.8450700000024, "P96occluded": true, "P12occluded": true, "P97x": 430.35, "P97y": 193.05, "P101occluded": true, "P55occluded": true, "P93x": 429.05, "P93y": 195.4, "P42x": 388.6665000000018, "P42y": 200.64300000000094, "P79y": 238.89320075909347, "P54y": 252.24900000000142, "P54x": 431.5305000000024, "P73x": 427.05, "P73y": 191, "P68y": 267.6976650000015, "P30y": 214.61539500000094, "P30x": 440.86117500000194, "P14y": 243.47656317600092, "P14x": 384.18704449200146, "P63y": 254.87442000000144, "P76occluded": true, "P22x": 406.8646650000017, "P22y": 176.94090000000077, "P28occluded": true, "P6y": 296.24299366950106, "P6x": 367.5863697300013, "P92x": 428.85, "P38y": 193.3815000000009, "P38x": 388.5255000000018, "P94y": 188.5, "P72y": 197.70000000000002, "P72x": 395.65000000000003, "P78y": 210.5218971000002, "P63x": 427.8391200000024, "P35occluded": true, "P82x": 393.8000000000004, "P82y": 200.95000000000022, "P11occluded": true, "tool-version": "1.0", "P41y": 200.99550000000093, 
"P41x": 396.5625000000018, "P56occluded": true, "P55x": 425.0508679558401, "P55y": 259.9172483306748, "P31x": 449.410005000002, "P31y": 225.351135000001, "P1y": 217.10946645000078, "P75occluded": true, "P62x": 420.38374500000236, "P62y": 256.06728000000146, "P15x": 373.5151821450014, "P15y": 228.45690505800087, "P49y": 261.4140000000014, "P49x": 400.0875000000022, "P25y": 170.87178263247637, "P25x": 435.25400920037674, "P2x": 311.0173699500011, "P80occluded": true, "P3y": 251.86940685000093, "P39x": 397.33800000000184, "P39y": 192.1830000000009, "P69x": 394.6, "P5x": 347.3103508088991, "P5y": 287.4697160411496, "P95x": 430, "P95y": 189.25, "P79x": 368.8999131564783, "P57x": 434.7974700000025, "P102x": 428.1, "P102y": 190.85000000000002, "P76x": 428.25 }, { "l_eyex": 389.1221901922325, "l_eyey": 197.94528259092206, "tool-version": "1.0", "l_status": "open", "r_status": "occluded", "r_eyex": 633.489814294182, "r_eyey": 10.52527209626886, "class": "eyes" } ] }, { "filename": "data/001_03_01_130_05.png", "class": "image", "annotations": [ { "face_outer_bboxy": 36.21548211860577, "face_outer_bboxx": 259.54428851667467, "face_tight_bboxx": 265.58020220310897, "face_tight_bboxy": 116.19133846386018, "tool-version": "1.0", "face_tight_bboxwidth": 191.64025954428882, "face_tight_bboxheight": 192.64624515869457, "face_outer_bboxwidth": 198.68215884512887, "Occlusionx": 0, "class": "FaceBbox", "face_outer_bboxheight": 273.62808711835464 }, { "P91x": 283.35, "P91y": 179.55, "P28x": 304.14947850000084, "P28y": 176.3226009000005, "P5occluded": true, "P52y": 244.28250000000094, "P52x": 305.0535000000012, "P32y": 220.38088500000066, "P32x": 289.76557500000087, "P44x": 334.8750000000012, "P44y": 168.63600000000062, "P99x": 340.20000000000005, "P99y": 174.75, "P75x": 343.90000000000003, "P75y": 171.70000000000002, "P20x": 269.9839800000006, "P20y": 158.94859500000035, "P8y": 299.437842994699, "P8x": 301.7845345542186, "P94x": 342.70000000000005, "P12y": 272.68555921617576, "P12x": 
389.08146056834715, "P65y": 249.500000000001, "P65x": 321.9500000000013, "P84x": 285.8, "P84y": 175.5, "P43y": 176.03850000000065, "P43x": 329.9400000000012, "P68x": 302.05, "P68y": 252.55, "P92y": 165.70000000000002, "P92x": 343.40000000000003, "P53x": 311.11650000000117, "P53y": 241.18050000000093, "P33x": 295.53106500000086, "P33y": 224.95351500000066, "P60x": 297.5100000000011, "P60y": 258.382500000001, "P23y": 149.55274915302633, "P23x": 325.5457633496816, "P90y": 177.15, "P13x": 406.681647264744, "P13y": 256.7280566114426, "P7x": 292.6324374720922, "P90x": 280.25, "P58x": 309.7065000000012, "P61y": 253.51800000000097, "P104x": 346.0500000000002, "P104y": 171.35000000000008, "P83y": 174.70000000000002, "P83x": 282.15000000000003, "P61x": 296.31150000000116, "P50y": 249.21750000000097, "P50x": 294.19650000000115, "P100x": 339.45000000000005, "P100y": 171.75, "P34y": 224.85411000000067, "P34x": 300.8989350000009, "P18y": 170.97660000000036, "P18x": 268.59231000000057, "P46x": 357.7170000000013, "P46y": 172.86600000000064, "P58y": 258.664500000001, "P4occluded": true, "P77x": 300.22496910000007, "P77y": 221.17413690000006, "tool-version": "1.0", "P10y": 298.73383552684317, "P10x": 341.67829106605154, "P26x": 361.5470170921228, "P26y": 148.3723801778643, "version": "v1", "P86x": 286.6, "P86y": 181.9, "P2y": 204.16216567820388, "P2x": 300.6111887744588, "P29y": 189.63790065000055, "P29x": 301.90690170000084, "P101y": 168.8, "P101x": 340.40000000000003, "P51x": 298.49700000000115, "P51y": 243.57750000000092, "P35x": 310.83943500000095, "P35y": 223.36303500000068, "P66x": 313.85, "P66y": 251.05, "P19x": 267.8964750000006, "P19y": 165.11170500000037, "P98y": 176.55, "P98x": 342.85, "P45y": 165.8865000000006, "P45x": 347.2125000000013, "P21y": 156.26466000000033, "P21x": 276.4453050000006, "P59x": 303.2910000000012, "P59y": 258.664500000001, "P9x": 316.33402222324, "P9y": 303.89655695778623, "P17occluded": true, "P11x": 366.0838832850552, "P11y": 286.0617011054374, 
"P70y": 178.35000000000002, "P70x": 283.15000000000003, "P1x": 307.6512634530176, "P1y": 189.14333969727852, "P85y": 178.60000000000002, "P85x": 287.5, "P69y": 179.3, "P69x": 282.65000000000003, "P36y": 219.78445500000066, "P36x": 326.446020000001, "P77occluded": true, "P81y": 173.25, "P81x": 281.95, "P40x": 298.00350000000105, "P40y": 178.85850000000062, "P71x": 283.95, "P71y": 178.9, "P56y": 254.505000000001, "P56x": 323.24250000000126, "P7y": 284.1843478578217, "P89y": 179.85000000000002, "P89x": 279.90000000000003, "P48x": 338.96400000000125, "P48y": 177.51900000000066, "P16y": 205.1008423020117, "P16x": 420.9964657778135, "P24x": 338.1757113839151, "P24y": 146.9559374076699, "class": "FiducialPoints", "P74y": 170.85000000000002, "P4y": 234.6691559519585, "P4x": 290.9897533804285, "P96y": 172, "P74x": 342.95000000000005, "P3occluded": true, "P78occluded": true, "P103y": 179.4, "P103x": 285.55, "P80x": 444.78450000000055, "P80y": 173.78250000000023, "P37x": 275.44350000000094, "P37y": 182.80650000000063, "P47y": 176.95500000000064, "P47x": 350.5965000000013, "P64x": 313.45000000000124, "P64y": 249.95000000000098, "P76y": 172.65, "P57y": 257.113500000001, "P6occluded": true, "P88x": 281.1, "P88y": 182.60000000000002, "P17x": 420.2924583099576, "P17y": 187.96999391751874, "P96x": 347.70000000000005, "P67y": 252.45000000000002, "P27y": 153.68404056609333, "P27x": 370.04567371328926, "P87y": 183.5, "P87x": 283.95, "P3x": 295.44846734351574, "P67x": 307.95000000000005, "P2occluded": true, "P97x": 346.25, "P97y": 175.35000000000002, "P93x": 344.35, "P93y": 177.8, "P42x": 280.30800000000096, "P42y": 185.34450000000064, "P54y": 243.78900000000093, "P54x": 321.0570000000012, "P73x": 342.5, "P73y": 171.95000000000002, "P30y": 201.83191200000059, "P30x": 298.26271440000085, "P14y": 239.83187738290155, "P14x": 416.77242097067824, "P63y": 251.20000000000098, "P63x": 307.7500000000012, "P22x": 284.2983000000006, "P22y": 157.95454500000034, "P1occluded": true, "P6y": 
270.10419850070423, "P6x": 289.34706928876477, "P38y": 176.17950000000062, "P38x": 277.06500000000096, "P94y": 166.65, "P72y": 180.35000000000002, "P72x": 283.65000000000003, "P78y": 176.32260090000005, "P78x": 318.5860666500001, "P82x": 285.6, "P82y": 185.35000000000002, "P32occluded": true, "P41y": 183.37050000000062, "P41x": 290.460000000001, "P55x": 334.9455000000013, "P55y": 251.89650000000097, "P31x": 295.59965445000086, "P31y": 213.04479600000062, "P79y": 232.39037411700184, "P62x": 301.6000000000012, "P62y": 251.30000000000098, "P15x": 420.05778915400566, "P15y": 222.93569815436055, "P49y": 256.690500000001, "P49x": 291.65850000000114, "P25y": 145.8936053300241, "P25x": 350.45154872559993, "P3y": 218.94632250317727, "P39x": 286.30050000000097, "P39y": 173.50050000000059, "P5x": 288.8777309768609, "P5y": 253.44268842811516, "P95x": 346.6, "P95y": 168.25, "P79x": 431.0439612134159, "P57x": 315.4875000000012, "P102x": 343.1, "P102y": 172, "P76x": 343.20000000000005 }, { "l_eyex": 289.90000000000003, "l_eyey": 179.60000000000002, "tool-version": "1.0", "l_status": "open", "r_status": "open", "r_eyex": 337.4000000000001, "r_eyey": 173.35000000000005, "class": "eyes" } ] }

Using the COCO format requires data to be organized in this structure:

    |--dataset root
        |-- train2017
            |-- 000000001000.jpg
            |-- 000000001001.jpg
            .
            .
            |-- xxxxxxxxxxxx.jpg
        |-- val2017
            |-- 000000002000.jpg
            |-- 000000002001.jpg
            .
            .
            |-- xxxxxxxxxxxx.jpg
        |-- annotations
            |-- person_keypoints_train2017.json
            |-- person_keypoints_val2017.json

As long as you have a dataset root and the file names are adjusted accordingly in the images -> file_name field of the annotations, you can use a nested directory structure for the train and test images.
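The following is a minimal sketch of how nested file names could be resolved and checked before training. It assumes that each file_name is stored relative to the image directory you pass in; the helper names resolve_image_paths and missing_images and the batch_a subdirectory are hypothetical and are not part of TAO Toolkit.

```python
from pathlib import Path


def resolve_image_paths(images, image_dir):
    """Map each image id to its full path under image_dir.

    Assumes file_name is relative to image_dir and may contain
    subdirectories (for example "batch_a/000000001001.jpg").
    """
    image_dir = Path(image_dir)
    return {img["id"]: image_dir / img["file_name"] for img in images}


def missing_images(images, image_dir):
    """Return the sorted ids whose resolved file does not exist on disk."""
    paths = resolve_image_paths(images, image_dir)
    return sorted(i for i, p in paths.items() if not p.is_file())


# Hypothetical entries: one flat file name and one nested file name.
images = [
    {"file_name": "000000001000.jpg", "height": 480, "width": 640, "id": 1000},
    {"file_name": "batch_a/000000001001.jpg", "height": 480, "width": 640, "id": 1001},
]
paths = resolve_image_paths(images, "dataset_root/train2017")
```

Running missing_images against the real image directory gives a quick list of annotation entries that point at files that are not there.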

Label Files

This section outlines the COCO annotation format that the data must follow for BodyPoseNet. Although COCO annotations have more fields, only the attributes needed by BodyPoseNet are described here. You may use the exact same format as COCO. The dataset should use the following overall structure (in a .json file):

    "images": [
        {
            "file_name": "000000001000.jpg",
            "height": 480,
            "width": 640,
            "id": 1000
        },
        {
            "file_name": "000000580197.jpg",
            "height": 480,
            "width": 640,
            "id": 580197
        },
        ...
    ],
    "annotations": [
        {
            "segmentation": [[162.46,152.13,150.73,...173.92,156.23]],
            "num_keypoints": 17,
            "area": 8720.28915,
            "iscrowd": 0,
            "keypoints": [162,174,2,...,149,352,2],
            "image_id": 1000,
            "bbox": [115.16,152.13,83.23,228.41],
            "category_id": 1,
            "id": 1234574
        },
        ...
    ],
    "categories": [
        {
            "supercategory": "person",
            "id": 1,
            "name": "person",
            "keypoints": [
                "nose","left_eye","right_eye","left_ear","right_ear",
                "left_shoulder","right_shoulder","left_elbow","right_elbow",
                "left_wrist","right_wrist","left_hip","right_hip",
                "left_knee","right_knee","left_ankle","right_ankle"
            ],
            "skeleton": [
                [16,14],[14,12],[17,15],[15,13],[12,13],[6,12],[7,13],[6,7],
                [6,8],[7,9],[8,10],[9,11],[2,3],[1,2],[1,3],[2,4],[3,5],[4,6],[5,7]
            ]
        }
    ]

  • The images section contains the complete list of images in the dataset with some metadata.

Note

Each image ID must be unique within the dataset.

| Parameter name | Description | Type | Range |
|---|---|---|---|
| file_name | The path to the image | String | N/A |
| height | The height of the image | Integer | N/A |
| width | The width of the image | Integer | N/A |
| id | The unique ID of the image | Integer | N/A |
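The constraints on the images section can be checked with a few lines of Python before training. This is only a sketch under stated assumptions: check_images is a hypothetical helper, not a TAO Toolkit function, and it accepts any positive number for height and width rather than enforcing a specific numeric type.

```python
def check_images(images):
    """Return a list of human-readable problems found in the "images" section."""
    problems = []
    seen_ids = set()
    for idx, img in enumerate(images):
        if not isinstance(img.get("file_name"), str):
            problems.append(f"images[{idx}]: file_name must be a string")
        for key in ("height", "width"):
            value = img.get(key)
            if not isinstance(value, (int, float)) or value <= 0:
                problems.append(f"images[{idx}]: {key} must be a positive number")
        img_id = img.get("id")
        if not isinstance(img_id, int):
            problems.append(f"images[{idx}]: id must be an integer")
        elif img_id in seen_ids:
            # IDs must be unique across the whole "images" list.
            problems.append(f"images[{idx}]: duplicate id {img_id}")
        else:
            seen_ids.add(img_id)
    return problems


# Hypothetical data: the second entry reuses the first entry's id.
images = [
    {"file_name": "000000001000.jpg", "height": 480, "width": 640, "id": 1000},
    {"file_name": "000000580197.jpg", "height": 480, "width": 640, "id": 1000},
]
problems = check_images(images)
```

An empty list means that every entry passed these basic checks; it does not guarantee that the referenced files exist.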

  • The annotations section contains the labels for the images. Each entry is one annotation, and each image can have multiple annotations.

| Parameter name | Description | Type | Range |
|---|---|---|---|
| segmentation | A list of polygons, each of which is a list of vertices for a given person/group. | List | N/A |
| num_keypoints | The number of keypoints that are labeled | Integer | [0, total_keypoints] |
| area | The area of the segmentation/bbox | Float | N/A |
| iscrowd | If 1, indicates that the annotation mask is for multiple people | Integer | [0, 1] |
| keypoints | A list of keypoints with the following format: [x1, y1, v1, x2, y2, v2 ...], where x and y are pixel locations, and v is the visibility/occlusion flag. | List | N/A |
| bbox | The bbox of the object/person, in the COCO order [x, y, width, height] | List | N/A |
| image_id | The unique ID of the associated image | Integer | N/A |
| category_id | The object category (always 1 for person) | Integer | 1 |
| id | The unique ID of the annotation | Integer | N/A |
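The flat keypoints layout in the table above can be unpacked into (x, y, v) triplets, which also makes it easy to cross-check num_keypoints. The sketch below assumes the COCO visibility convention (0 for not labeled, 1 for occluded, 2 for visible) and treats any non-zero flag as labeled; split_keypoints and count_labeled are hypothetical helper names, and the sample annotation is made up for illustration.

```python
def split_keypoints(flat):
    """Group a flat [x1, y1, v1, x2, y2, v2, ...] list into (x, y, v) triplets."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def count_labeled(flat):
    """Count keypoints whose visibility flag is non-zero (occluded or visible)."""
    return sum(1 for _, _, v in split_keypoints(flat) if v > 0)


# Hypothetical annotation with three keypoints: visible, occluded, not labeled.
annotation = {
    "num_keypoints": 2,
    "keypoints": [162, 174, 2, 150, 180, 1, 0, 0, 0],
}
triplets = split_keypoints(annotation["keypoints"])
consistent = count_labeled(annotation["keypoints"]) == annotation["num_keypoints"]
```

A mismatch between the counted value and num_keypoints usually points at a conversion error in the labeling pipeline.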

  • The COCO dataset uses the following occlusion flag labeling format: [visible: 2, occluded: 1, not_labeled: 0]

  • The categories section contains the keypoint convention that is followed in the dataset.

| Parameter name | Description | Type | Range |
|---|---|---|---|
| supercategory | The supercategory | String | person |
| id | The ID of the category | Integer | 1 |
| name | The name of the category | String | person |
| keypoints | The keypoint names and ordering convention as used in labeling | List | N/A |
| skeleton | A list of skeleton edges with the following format: [[j1, j2], [j2, j3] ...], where j is the keypoint/joint index. | List | N/A |
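To make the skeleton field easier to read, the joint indices can be translated into keypoint names. The sketch below assumes that the indices are 1-based, which matches the sample above, where the 17 keypoints are referenced by indices 1 through 17. named_edges is a hypothetical helper, and the category shown is a shortened version of the sample that keeps only the five face keypoints.

```python
def named_edges(category):
    """Translate skeleton index pairs into pairs of keypoint names."""
    names = category["keypoints"]
    edges = []
    for a, b in category["skeleton"]:
        for j in (a, b):
            if not 1 <= j <= len(names):
                raise ValueError(f"joint index {j} is outside 1..{len(names)}")
        # Subtract 1 because the skeleton indices are assumed to be 1-based.
        edges.append((names[a - 1], names[b - 1]))
    return edges


# Shortened, illustrative category: only the face keypoints and their edges.
category = {
    "keypoints": ["nose", "left_eye", "right_eye", "left_ear", "right_ear"],
    "skeleton": [[2, 3], [1, 2], [1, 3], [2, 4], [3, 5]],
}
edges = named_edges(category)
```

The range check catches skeletons that were written with 0-based indices or that reference keypoints missing from the keypoints list.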

For more details, please refer to the COCO keypoint annotations file and COCO Keypoint Detection Task.

© Copyright 2023, NVIDIA. Last updated on Aug 2, 2023.