Data Annotation Format

This page describes the dataset formats for computer-vision apps supported by TAO Toolkit.

Image Classification Format

Image classification expects a directory of images with the following structure, where each class has its own directory with the class name. The naming convention for train/val/test can be different because the path of each set is individually specified in the spec file. See the Specification File for Classification section for more information.

|--dataset_root:
    |--train
        |--audi:
            |--1.jpg
            |--2.jpg
        |--bmw:
            |--01.jpg
            |--02.jpg
    |--val
        |--audi:
            |--3.jpg
            |--4.jpg
        |--bmw:
            |--03.jpg
            |--04.jpg
    |--test
        |--audi:
            |--5.jpg
            |--6.jpg
        |--bmw:
            |--05.jpg
            |--06.jpg
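
As an illustration of how this layout maps to labeled samples, here is a minimal sketch that walks one split directory and pairs each image with a class index. It is not TAO's loader: the function name `list_samples` and the alphabetical assignment of class indices are assumptions made for this example.

```python
from pathlib import Path


def list_samples(split_dir):
    """Return ([(image_path, class_index), ...], {class_name: class_index})."""
    # Each sub-directory of the split (for example train/audi) is one class.
    class_dirs = sorted(p for p in Path(split_dir).iterdir() if p.is_dir())
    # Assumption: class indices follow the alphabetical order of directory names.
    class_to_index = {d.name: i for i, d in enumerate(class_dirs)}
    samples = [
        (image, class_to_index[d.name])
        for d in class_dirs
        for image in sorted(d.iterdir())
        if image.is_file()
    ]
    return samples, class_to_index
```

Calling `list_samples("dataset_root/train")` on the tree above would return the four training images together with a mapping from `audi` and `bmw` to their indices.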

Optical Inspection Format

Optical Inspection expects directories of images and CSV files in the dataset root directory. The image directory consists of golden images (non-defective reference images) and test images to compare with the golden images for PCB defect classification.

|--dataset_root:
    |--images
         |--input:
            |--C1.jpg
            |--C2.jpg
         |--golden:
            |--C1.jpg
            |--C2.jpg
         |--input1:
            |--C1.jpg
            |--C3.jpg
         |--golden1:
            |--C1.jpg
            |--C3.jpg
    |--labels
        |-- train.csv
        |-- validation.csv

Here’s a description of the structure:

The images directory contains the following:
- input: Contains input images to be compared with golden images.
- golden: Contains golden reference images.
The labels directory contains the CSV files for pair-wise image input to the SiameseOI model with corresponding class labels, as described in the Label Files section below.

Label Files

A SiameseOI/VisualChangeNet-Classification label file is a CSV file containing the following fields:

`input_path`	`golden_path`	`label`	`object_name`
`/path/to/input/image/directory`	`/path/to/golden/image/directory`	The class to which the object belongs	`/component/name`

input_path: The path to the directory containing the input image to compare.
golden_path: The path to the directory containing the corresponding golden reference image.
label: The labels for the pair-wise images (Use PASS for non-defective components, and any other specific defect type label for defective components).
object_name: The name of the component to be compared. The object name is the same for input and golden images and represents the image name without the file extension.
- For each object_name, TAO supports combining multiple LED intensities, camera angles, or different sensory inputs for each of the input and golden images to be compared within the SiameseOI/VisualChangeNet-Classification models. For more details, refer to the Input Mapping section below.

Here is a sample label file corresponding to the sample directory structure described in the Optical Inspection Format section:

`input_path`	`golden_path`	`label`	`object_name`
/dataset_root/images/input/	/dataset_root/images/golden/	PASS	C1
/dataset_root/images/input/	/dataset_root/images/golden/	PASS	C2
/dataset_root/images/input1/	/dataset_root/images/golden1/	MISSING	C1
/dataset_root/images/input1/	/dataset_root/images/golden1/	PASS	C3

Note

In the label file, ensure that non-defective samples are consistently labeled as PASS, while defective samples can be assigned any specific defect type label. The model is designed to treat all defects collectively and train for binary defect classification.
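
The PASS-versus-defect convention can be sketched in a few lines of Python. This is only an illustration, not the TAO dataloader: the function name `load_pairs`, the 0/1 encoding, and the assumption that the file is comma-delimited with a header row are all choices made for this example.

```python
import csv
import io


def load_pairs(csv_text):
    """Parse label-file text into (input_path, golden_path, object_name, is_defect) rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Any label other than PASS is treated as a defect (assumed 0/1 encoding).
        is_defect = 0 if row["label"] == "PASS" else 1
        rows.append((row["input_path"], row["golden_path"], row["object_name"], is_defect))
    return rows


sample = """input_path,golden_path,label,object_name
/dataset_root/images/input/,/dataset_root/images/golden/,PASS,C1
/dataset_root/images/input1/,/dataset_root/images/golden1/,MISSING,C1
"""
pairs = load_pairs(sample)
```

Collapsing every non-PASS label to a single class mirrors the binary defect classification described in the note above, while the original defect names remain available in the CSV for later analysis.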

Input Mapping

For comparison within the Siamese Network, SiameseOI and VisualChangeNet-Classification models support combining several lighting conditions (1…N) for each component specified under object_name for both input and golden images. The following concat_type modes are supported:

linear: Linear concat (1 x N)
grid: Grid concat (M x N)

The SiameseOI/VisualChangeNet-Classification dataloader appends the name of each lighting condition, as specified in the experiment spec under input_map, to each of the components specified using object_name in the CSV file. This is done for both input and golden images. Here is an example of the dataset experiment spec changes for combining four input lighting conditions as a 1x4 linear concatenation for each component specified under object_name. The dataloader appends each of the four lighting conditions specified under input_map to each object_name to get the full image paths. These are then merged as a 1x4 grid for both the input and the golden images.

dataset:
    num_input: 4
    concat_type: linear
    input_map:
        LowAngleLight: 0
        SolderLight: 1
        UniformLight: 2
        WhiteLight: 3
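
To make the path expansion concrete, the following sketch builds one image path per lighting condition for a single component. The exact file naming is not stated on this page, so the pattern `<object_name>_<lighting><ext>`, the `.jpg` default, and the function name `expand_lighting_paths` are assumptions for illustration only; check the file names in your own dataset.

```python
import os


def expand_lighting_paths(image_dir, object_name, input_map, image_ext=".jpg"):
    """Build one path per lighting condition, ordered by its index in input_map.

    ASSUMPTION: files are named <object_name>_<lighting><ext>. The separator
    and the extension are illustrative and may differ in a real dataset.
    """
    ordered = sorted(input_map, key=input_map.get)
    return [
        os.path.join(image_dir, f"{object_name}_{light}{image_ext}")
        for light in ordered
    ]


input_map = {"LowAngleLight": 0, "SolderLight": 1, "UniformLight": 2, "WhiteLight": 3}
paths = expand_lighting_paths("/dataset_root/images/input", "C1", input_map)
```

With `concat_type: linear`, the four images found at these paths would then be placed side by side in index order.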

The dataset also supports a single lighting condition per component specified under object_name in the CSV file as input to the SiameseOI/VisualChangeNet-Classification Model. In this case, the SiameseOI/VisualChangeNet-Classification dataloader does not append anything to the object_name. Here is an example of the dataset experiment spec changes for a single lighting condition, where object_name represents the image name:

dataset:
    num_input: 1
    input_map: null

Change Detection (Segmentation) Format

VisualChangeNet-Segmentation expects directories of images and mask files in the dataset root directory. The image directories consist of a golden image directory (pre-change images) and a test image directory (post-change images). Each pair of images is accompanied by a mask image that marks the changes at the pixel level.

|--dataset_root:
    |--A
         |--image1.jpg
         |--image2.jpg
    |--B
         |--image1.jpg
         |--image2.jpg
    |--label
         |--image1.jpg
         |--image2.jpg
    |--list
        |-- train.txt
        |-- val.txt
        |-- test.txt
        |-- predict.txt

Here’s a description of the structure:

The dataset_root directory contains the following:
- A: Contains post-change test images.
- B: Contains pre-change golden reference images.
- label: Contains ground truth segmentation change masks.
- list: Contains .txt files for each dataset split, as described in the List Files section below.

List Files

The VisualChangeNet-Segmentation dataloader expects the list directory to contain a .txt file for each dataset split (train, validation, test, inference). A VisualChangeNet-Segmentation list file is a simple .txt file containing the file names of all images in that split.

`image_names`
`file_name.png`

image_names: The names of images. Image names should be the same for test images and their corresponding reference and mask images.

Here is a sample list file corresponding to the sample directory structure described in the Change Detection (Segmentation) Format section:

`image_names`
`image1.png`
`image3.png`
`image2.png`

Note

To map them correctly, each test image (inside directory A) must have a reference image (inside directory B) and a segmentation change map (inside directory label) with the same name for the dataloader.
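
A quick consistency check along these lines can catch missing files before training. The sketch below is an assumption-laden helper, not part of TAO: the name `missing_triplets` is hypothetical, and it assumes the list file holds one file name per line with no header row.

```python
from pathlib import Path


def missing_triplets(dataset_root, list_file):
    """Return (file_name, directory) pairs for list entries missing from A, B, or label."""
    root = Path(dataset_root)
    names = [
        line.strip()
        for line in Path(list_file).read_text().splitlines()
        if line.strip()
    ]
    return [
        (name, sub)
        for name in names
        for sub in ("A", "B", "label")
        if not (root / sub / name).is_file()
    ]
```

An empty result means every name in the list file has a test image, a reference image, and a change mask with the same file name.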

CenterPose Format

CenterPose expects directories of images and JSON files in the dataset root directory. CenterPose is a category-level object pose estimation method that trains and evaluates on a single object category.

The training directory consists of images and their related JSON files, which share the same file names. The testing/evaluation directory also consists of images and their related JSON files, whose ground truth is used to calculate the accuracy. The inference directory only needs the inference images, without JSON files. The calibration information must be provided in the .yaml file.

|--dataset_root_category:
    |--train
         |--image1.jpg
         |--image1.json
         |--image2.jpg
         |--image2.json
    |--test/val
         |--image1.jpg
         |--image1.json
         |--image2.jpg
         |--image2.json
    |--inference
         |--image1.jpg
         |--image2.jpg

The following is a description of the structure:

The dataset_root_category directory contains the following folders for the specific category:
- train: Contains the training images and the JSON files.
- test/val: Contains the testing/validation images and the JSON files.
- inference: Contains inference images.

List Files

The CenterPose dataloader expects .json files for each dataset split (train, validation, test); these files provide the calibration information and the ground truth.

Here is a sample directory structure for the dataset split [train, validation, test].

`image_names`
`file_name.png`
`file_name.json`

image_names: The names of images and JSON files. Each image must have the same name as its corresponding ground truth JSON file.

For the inference dataset split, the CenterPose dataloader only expects the inference images, and the intrinsic matrix is loaded from the configuration .yaml file.

The following is a sample directory structure for the dataset split [inference]:

`image_names`
`image1.png`
`image2.png`

Note

To calculate the accuracy correctly, make sure the calibration information is provided and has been verified.
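
The image-to-JSON pairing for the train and test/val directories can be verified with a short helper such as the one below. It is a sketch only: the function name and the set of image extensions are assumptions, and it works on a plain list of file names rather than reading the directory itself.

```python
from pathlib import PurePath

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def images_without_json(file_names):
    """Return image file names that have no JSON file with the same stem."""
    json_stems = {
        PurePath(name).stem
        for name in file_names
        if PurePath(name).suffix.lower() == ".json"
    }
    return sorted(
        name
        for name in file_names
        if PurePath(name).suffix.lower() in IMAGE_EXTS
        and PurePath(name).stem not in json_stems
    )
```

For the inference directory a non-empty result is expected, since that split contains images only.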

Image Classification Format PyTorch

Image classification expects a directory of images with the following structure, where each class has its own directory with a class name. The naming convention for train/val/test can be different because the path of each set is individually specified in the spec file. See the Specification File for Classification PyT section for more information. In the following example, the respective paths to train, evaluate, and test are in the data_prefix:

|--data_root:
    |--train
        |--audi:
            |--1.jpg
            |--2.jpg
        |--bmw:
            |--01.jpg
            |--02.jpg
    |--val
        |--audi:
            |--3.jpg
            |--4.jpg
        |--bmw:
            |--03.jpg
            |--04.jpg
    |--test
        |--audi:
            |--5.jpg
            |--6.jpg
        |--bmw:
            |--05.jpg
            |--06.jpg

Optionally, if the images are not in the above structure, an additional annotation file can be provided. For an image structure like this:

train/
├── folder_1
│   ├── xxx.png
│   ├── xxy.png
│   └── ...
├── 123.png
├── nsdf3.png
└── ...

An annotation file records all sample paths and the corresponding category index. The first column is the image path relative to the folder (in this example, “train”), and the second column is the category index:

folder_1/xxx.png 0
folder_1/xxy.png 1
123.png 1
nsdf3.png 2
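
Reading such an annotation file amounts to splitting each line into a relative path and an integer index. The sketch below shows one way to do that; the function name `parse_annotation_lines` is hypothetical, and splitting on the last run of whitespace is a choice made here so that paths containing spaces still parse.

```python
def parse_annotation_lines(lines):
    """Split each 'relative/path index' line into (path, category_index)."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last whitespace run: the final token is the category index.
        path, index = line.rsplit(maxsplit=1)
        samples.append((path, int(index)))
    return samples


samples = parse_annotation_lines(["folder_1/xxx.png 0", "123.png 1", "nsdf3.png 2"])
```

The paths stay relative, so they still need to be joined with the image folder (in this example, "train") before loading.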

Note

For more details, see the MMPretrain dataset structure documentation.

Object Detection – KITTI Format

Using the KITTI format requires data to be organized in this structure:

.
|--dataset root
  |-- images
      |-- 000000.jpg
      |-- 000001.jpg
            .
            .
      |-- xxxxxx.jpg
  |-- labels
      |-- 000000.txt
      |-- 000001.txt
            .
            .
      |-- xxxxxx.txt
  |-- kitti_seq_to_map.json

Here’s a description of the structure:

The images directory contains the images to train on.
The labels directory contains the labels to the corresponding images. Details of this file are included in the Label Files section.

Note

The images and labels have the same file IDs before the extension. The image to label correspondence is maintained using this file name.
The kitti_seq_to_map.json file contains a sequence to frame ID mapping for the frames in the images directory. This is an optional file and is useful if the data needs to be split into N folds sequence wise. In case the data is to be split into a random 80:20 train:val split, then this file may be ignored.

Note

For DetectNet_v2, the train tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size and the corresponding bounding boxes must be scaled accordingly. Online resizing is supported for other detection model architectures.

Label Files

A KITTI format label file is a text file containing one line per object. Each line has multiple fields. Here is a description of these fields:

Num elements	Parameter name	Description	Type	Range	Example
1	Class names	The class to which the object belongs.	String	N/A	Person, car, Road_Sign
1	Truncation	How much of the object has left image boundaries.	Float	[0.0, 1.0]	0.0
1	Occlusion	Occlusion state [ 0 = fully visible, 1 = partly visible, 2 = largely occluded, 3 = unknown].	Integer	[0,3]	2
1	Alpha	Observation Angle of object	Float	[-pi, pi]	0.146
4	Bounding box coordinates: [xmin, ymin, xmax, ymax]	Location of the object in the image	Float (0-based index)	xmin in [0, image_width], ymin in [0, image_height], xmax in [xmin, image_width], ymax in [ymin, image_height]	100 120 180 160
3	3-D dimension	Height, width, length of the object (in meters)	Float	N/A	1.65, 1.67, 3.64
3	Location	3-D object location x, y, z in camera coordinates (in meters)	Float	N/A	-0.65,1.71, 46.7
1	Rotation_y	Rotation ry around the Y-axis in camera coordinates	Float	[-pi, pi]	-1.59

The sum of the total number of elements per object is 15. Here is a sample text file:

car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03

This indicates that the image contains three objects with the parameters listed above. For detection, TAO Toolkit requires only the class name and bbox coordinates fields to be populated, because the TAO training pipeline supports training only for class and bbox coordinates. The remaining fields may be set to 0. Here is a sample file for a custom annotated dataset:

car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
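
The 15-field layout can be read with a small parser like the one sketched here. The type and function names (`KittiObject`, `parse_kitti_line`) are illustrative and not part of TAO, and the parser assumes class names contain no whitespace, since fields are split on spaces.

```python
from typing import NamedTuple, Tuple


class KittiObject(NamedTuple):
    class_name: str
    truncation: float
    occlusion: int
    alpha: float
    bbox: Tuple[float, ...]        # xmin, ymin, xmax, ymax
    dimensions: Tuple[float, ...]  # height, width, length
    location: Tuple[float, ...]    # x, y, z in camera coordinates
    rotation_y: float


def parse_kitti_line(line):
    """Parse one KITTI label line (15 whitespace-separated fields)."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    nums = [float(v) for v in fields[1:]]  # the 14 numeric fields after the class name
    return KittiObject(
        fields[0], nums[0], int(nums[1]), nums[2],
        tuple(nums[3:7]), tuple(nums[7:10]), tuple(nums[10:13]), nums[13],
    )


obj = parse_kitti_line(
    "car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
)
```

For detection-only datasets such as the zero-filled sample above, only `class_name` and `bbox` carry information.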

Sequence Mapping File

This is an optional JSON file that captures the mapping between the frames in the images directory and the names of video sequences from which these frames were extracted. This information is needed while doing an N-fold split of the dataset. This way frames from one sequence don’t repeat in other folds and one of the folds could be used for validation. Here’s an example of the JSON dictionary file.

{
  "video_sequence_name": [list of strings(frame idx)]
}

Here’s an example of a kitti_seq_to_map.json file for a sample dataset with six sequences:

{
  "2011_09_28_drive_0165_sync": ["003193", "003185", "002857", "001864", "003838",
  "007320", "003476", "007308", "000337", "004165", "006573"],
  "2011_09_28_drive_0191_sync": ["005724", "002529", "004136", "005746"],
  "2011_09_28_drive_0179_sync": ["005107", "002485", "006089", "000695"],
  "2011_09_26_drive_0079_sync": ["005421", "000673", "002064", "000783", "003068"],
  "2011_09_28_drive_0035_sync": ["005540", "002424", "004949", "004996", "003969"],
  "2011_09_28_drive_0117_sync": ["007150", "003797", "002554", "001509"]
}
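
To illustrate why the mapping matters for an N-fold split, the sketch below assigns whole sequences to folds so that frames from one sequence never land in two folds. The round-robin assignment and the function name are assumptions chosen for simplicity; TAO's own partitioning may differ.

```python
def split_sequences_into_folds(seq_to_frames, num_folds):
    """Assign whole sequences to folds so no sequence spans two folds."""
    folds = [[] for _ in range(num_folds)]
    # Round-robin over sorted sequence names (an illustrative choice).
    for i, seq in enumerate(sorted(seq_to_frames)):
        folds[i % num_folds].extend(seq_to_frames[seq])
    return folds


mapping = {"seq_a": ["000001", "000002"], "seq_b": ["000003"], "seq_c": ["000004"]}
folds = split_sequences_into_folds(mapping, 2)
```

Any one of the resulting folds can then be held out for validation while the rest are used for training.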

Object Detection – COCO Format

Since TAO Toolkit 3.0-22.05, all object detection models support COCO format. Using the COCO format requires data to be organized in this structure:

annotation{
"id": int,
"image_id": int,
"category_id": int,
"bbox": [x,y,width,height],
"area": float,
"iscrowd": 0 or 1,
}

image{
"id": int,
"width": int,
"height": int,
"file_name": str,
"license": int,
"flickr_url": str,
"coco_url": str,
"date_captured": datetime,
}

categories[{
"id": int,
"name": str,
"supercategory": str,
}]

An example COCO annotation file is shown below:

            "annotations": [{"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768}],
"images": [{"license": 1,"file_name": "000000407646.jpg","coco_url": "http://images.cocodataset.org/val2017/000000407646.jpg","height": 400,"width": 500,"date_captured": "2013-11-23 03:58:53","flickr_url": "http://farm4.staticflickr.com/3110/2855627782_17b93a684e_z.jpg","id": 407646}],
"categories": [{"supercategory": "person","id": 1,"name": "person"},{"supercategory": "vehicle","id": 2,"name": "bicycle"},{"supercategory": "vehicle","id": 3,"name": "car"},{"supercategory": "vehicle","id": 4,"name": "motorcycle"}]

See the COCO website for a description of the COCO format.

Important

The id in categories should start from 1.
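
Two details of this format are easy to get wrong: COCO boxes are stored as [x, y, width, height] rather than as corner coordinates, and category IDs should start from 1. The helpers below are a minimal sketch of both checks; their names are hypothetical and they are not part of TAO.

```python
def coco_bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to [xmin, ymin, xmax, ymax]."""
    x, y, width, height = bbox
    return [x, y, x + width, y + height]


def invalid_category_ids(categories):
    """Return category IDs smaller than 1, which break the start-from-1 rule."""
    return [c["id"] for c in categories if c["id"] < 1]


corners = coco_bbox_to_corners([473.07, 395.93, 38.65, 28.67])
bad_ids = invalid_category_ids([{"id": 0, "name": "person"}, {"id": 1, "name": "car"}])
```

Here `bad_ids` contains the ID 0, which signals that the categories need to be renumbered.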

Instance Segmentation – COCO Format

Using the COCO format requires data to be organized in this structure:

annotation{
"id": int,
"image_id": int,
"category_id": int,
"segmentation": RLE or [polygon],
"area": float,
"bbox": [x,y,width,height],
"iscrowd": 0 or 1,
}

image{
"id": int,
"width": int,
"height": int,
"file_name": str,
"license": int,
"flickr_url": str,
"coco_url": str,
"date_captured": datetime,
}

categories[{
"id": int,
"name": str,
"supercategory": str,
}]

An example COCO annotation file is shown below:

            "annotations": [{"segmentation": [[510.66,423.01,511.72,420.03,510.45,416.0,510.34,413.02,510.77,410.26,510.77,407.5,510.34,405.16,511.51,402.83,511.41,400.49,510.24,398.16,509.39,397.31,504.61,399.22,502.17,399.64,500.89,401.66,500.47,402.08,499.09,401.87,495.79,401.98,490.59,401.77,488.79,401.77,485.39,398.58,483.9,397.31,481.56,396.35,478.48,395.93,476.68,396.03,475.4,396.77,473.92,398.79,473.28,399.96,473.49,401.87,474.56,403.47,473.07,405.59,473.39,407.71,476.68,409.41,479.23,409.73,481.56,410.69,480.4,411.85,481.35,414.93,479.86,418.65,477.32,420.03,476.04,422.58,479.02,422.58,480.29,423.01,483.79,419.93,486.66,416.21,490.06,415.57,492.18,416.85,491.65,420.24,492.82,422.9,493.56,424.39,496.43,424.6,498.02,423.01,498.13,421.31,497.07,420.03,497.07,415.15,496.33,414.51,501.1,411.96,502.06,411.32,503.02,415.04,503.33,418.12,501.1,420.24,498.98,421.63,500.47,424.39,505.03,423.32,506.2,421.31,507.69,419.5,506.31,423.32,510.03,423.01,510.45,423.01]],"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768}],
"images": [{"license": 1,"file_name": "000000407646.jpg","coco_url": "http://images.cocodataset.org/val2017/000000407646.jpg","height": 400,"width": 500,"date_captured": "2013-11-23 03:58:53","flickr_url": "http://farm4.staticflickr.com/3110/2855627782_17b93a684e_z.jpg","id": 407646}],
"categories": [{"supercategory": "person","id": 1,"name": "person"},{"supercategory": "vehicle","id": 2,"name": "bicycle"},{"supercategory": "vehicle","id": 3,"name": "car"},{"supercategory": "vehicle","id": 4,"name": "motorcycle"}]

See the COCO website for a description of the COCO format.

Important

The id in categories should start from 1.
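
A polygon segmentation is a flat list of alternating x and y coordinates, and the bbox field should enclose it. The following sketch derives a COCO-style box from one polygon; the function name is an assumption, and RLE segmentations are not handled.

```python
def polygon_to_bbox(polygon):
    """Return the COCO [x, y, width, height] box enclosing a flat [x1, y1, x2, y2, ...] polygon."""
    xs = polygon[0::2]  # even positions hold x coordinates
    ys = polygon[1::2]  # odd positions hold y coordinates
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


box = polygon_to_bbox([1, 2, 4, 2, 4, 6, 1, 6])
```

Comparing the derived box with the stored bbox is a cheap sanity check when converting annotations from another tool.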

Semantic Segmentation - PNG Mask Format

This section describes the dataset formats supported by UNet and SegFormer for loading images and masks.

Note

If you have the masks saved in COCO format, refer to the Dataset Converter section for UNet to convert COCO format to UNet mask image format.

Semantic Segmentation Mask Format

This section describes the format of the mask images for different types of input_image_type or input_type. Refer to dataset_config_unet for more information on configuring the input_image_type for UNet, and to dataset_config_segformer for more information on configuring the input_type for SegFormer.

Color/RGB Input Image Type

For color/RGB input images, each mask image is a single-channel or three-channel image with the same size as the input image. Every pixel in the mask should have an integer value that represents the segmentation class label_id, as per the mapping provided in the dataset_config_unet and dataset_config_segformer. Ensure that the pixel values in the mask image are within the range of the label_id values provided in the dataset_config and dataset_config_segformer.

For a reference example, refer to the _labelIds.png images format in the Cityscapes Dataset.

Grayscale Input Image Type

For grayscale input images, the mask is a single-channel image with the same size as the input image. Every pixel has a value of 255 or 0, which corresponds to a label_id of 1 or 0, respectively, in the dataset_config and dataset_config_segformer. For reference, refer to the ISBI dataset Jupyter Notebook example provided in NGC resources.

Image and Mask Loading Format

SegFormer

For SegFormer, the paths to the image and mask folders can be provided directly in the dataset_config_segformer. Ensure that each image and its corresponding mask have the same name. The image and mask extensions don’t have to be the same.

UNet

Structured Images and Masks Folders for UNet

The data folder structure for images and masks must be in the following format for UNet:

/Dataset_01
    /images
      /train
        0000.png
        0001.png
        ...
        ...
        N.png
      /val
        0000.png
        0001.png
        ...
        ...
        N.png
      /test
        0000.png
        0001.png
        ...
        ...
        N.png
    /masks
      /train
        0000.png
        0001.png
        ...
        ...
        N.png
      /val
        0000.png
        0001.png
        ...
        ...
        N.png

See the Folders based Dataset Config section for further details about configuring these image and mask folder paths in the experiment spec.
Each image and label has the same file ID before the extension. The image-to-label correspondence is maintained using this filename. The test folder in the above directory structure is optional; any folder can be used for inference.

Image and Mask Text Files for UNet

An image text file must contain the full absolute UNIX paths to all the images. A mask text file must contain the full absolute UNIX paths to the corresponding mask files.

The contents of an example image text file, images_source1.txt, follow:

/home/user/workspace/exports/images_final/00001.jpg
/home/user/workspace/exports/images_final/00002.jpg

The contents of the corresponding example mask text file, labels_source1.txt, are shown below. It contains the paths to the corresponding masks:

/home/user/workspace/exports/masks_final/00001.png
/home/user/workspace/exports/masks_final/00002.png

The text file method additionally allows you to specify multiple sequences.
These text file paths should be provided in a spec file.

See the Text files based Dataset Config section for further details about configuring multiple data sources using text files in the dataset config.
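
Because the two text files are matched by position, a small check can confirm that they line up. The sketch below assumes that line i of the image file corresponds to line i of the mask file and that matching files share a file name stem, as in the example above; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def pair_images_and_masks(image_lines, mask_lines):
    """Pair image and mask paths by line order and report stem mismatches."""
    images = [line.strip() for line in image_lines if line.strip()]
    masks = [line.strip() for line in mask_lines if line.strip()]
    if len(images) != len(masks):
        raise ValueError(f"{len(images)} images but {len(masks)} masks")
    pairs = list(zip(images, masks))
    # Assumption: an image and its mask share the file name before the extension.
    mismatched = [
        (image, mask)
        for image, mask in pairs
        if PurePosixPath(image).stem != PurePosixPath(mask).stem
    ]
    return pairs, mismatched


pairs, mismatched = pair_images_and_masks(
    ["/home/user/workspace/exports/images_final/00001.jpg",
     "/home/user/workspace/exports/images_final/00002.jpg"],
    ["/home/user/workspace/exports/masks_final/00001.png",
     "/home/user/workspace/exports/masks_final/00002.png"],
)
```

An empty `mismatched` list indicates that every image lines up with a mask of the same name.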

Note

The size of the images doesn’t need to be equal to the model input dimensions. The images are resized internally to the model input dimensions.

Gesture Recognition – Custom Format

A gesture recognition model should perform well on users outside the training dataset. Thus, model training requires user segregation when splitting the data into train, validation, and test datasets. To enable this, a unique identifier, user_id, is required for each subject. In addition, record multiple videos for each subject.

Organize the dataset in the following format:

.
|-- original dataset root
  |-- uid_1
      |-- session_1
          |-- 000000.png
          |-- 000001.png
                .
                .
          |-- xxxxxx.png
      |-- session_2
          |-- 000000.png
          |-- 000001.png
                .
                .
          |-- xxxxxx.png
  |-- uid_2
      |-- session_1
          |-- 000000.png
          |-- 000001.png
                .
                .
          |-- xxxxxx.png
      |-- session_2
          |-- 000000.png
          |-- 000001.png
                .
                .
          |-- xxxxxx.png
  |-- uid_3
      |-- session_1
          |-- 000000.png
          |-- 000001.png
                .
                .
          |-- xxxxxx.png

For each set, prepare a metadata file that captures fields that can be used for dataset sampling.

{
    "set": "data",
    "users": {
        "uid_1": {
            "location": "outdoor",
            "illumination": "good",
            "class_fps": {
                "session_1": 30,
                "session_2": 30
            }
        },
        "uid_2": {
            "location": "indoor",
            "illumination": "good",
            "class_fps": {
                "session_1": 10,
                "session_2": 15
            }
        },
        "uid_3": {
            "location": "indoor",
            "illumination": "poor",
            "class_fps": {
                "session_1": 10
            }
        }
    }
}

Label Format

Each image corresponds to a subject performing a gesture. The image requires a corresponding label JSON which contains a bounding box for the hand of interest and gesture label. These follow the Label Studio format. A sample label for an image is:

{
  "completions": [
    {
      "result": [
        {
          "type": "rectanglelabels",
          "original_width": 320,
          "original_height": 240,
          "value": {
            "x": 58.1,
            "y": 18.3,
            "width": 18.8,
            "height": 49.5
          }
        },
        {
          "type": "choices",
          "value": {
            "choices": [
              "Thumbs-up"
            ]
          }
        }
      ]
    }
  ],
  "task_path": "/workspace/tao-experiments/gesturenet/data/uid_1/session_1/image_0001.png"
}

task_path: Specifies the full path to the image.
completions: This is a chunk that contains the labels under result.

The bounding box and gesture class are separate entries with the following types.

rectanglelabels: specifies the label corresponding to hand bounding box.

Parameter name	Description	Type	Range
type	The type of label	String	rectanglelabels
original_width	Width of image being labelled (in pixels)	Integer	[1, inf)
original_height	Height of image being labelled (in pixels)	Integer	[1, inf)
value["x"]	x coordinate of top left corner of hand bounding box (as a percentage of image width)	Float	[0, 100]
value["y"]	y coordinate of top left corner of hand bounding box (as a percentage of image height)	Float	[0, 100]
value["width"]	Width of the hand bounding box (as a percentage of image width)	Float	[0, 100]
value["height"]	Height of the hand bounding box (as a percentage of image height)	Float	[0, 100]

choices: specifies the label corresponding to gesture class.

Parameter name	Description	Type	Range
type	The type of label	String	choices
value["choices"]	List of attributes. For GestureNet app this will be a single entry with gesture class name	List of strings	Valid gesture classes
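
Since the box values are percentages of the image size, converting them to pixels needs original_width and original_height. The sketch below extracts the pixel box and the gesture name from one label; it assumes a single completion with one box and one choice, and the function name is hypothetical.

```python
def parse_gesture_label(label):
    """Return ((x, y, width, height) in pixels, gesture_name) from one label dict."""
    bbox, gesture = None, None
    # Assumption: the first completion holds the labels of interest.
    for entry in label["completions"][0]["result"]:
        if entry["type"] == "rectanglelabels":
            value = entry["value"]
            width, height = entry["original_width"], entry["original_height"]
            # Percentages are relative to the original image size.
            bbox = (
                value["x"] * width / 100,
                value["y"] * height / 100,
                value["width"] * width / 100,
                value["height"] * height / 100,
            )
        elif entry["type"] == "choices":
            gesture = entry["value"]["choices"][0]
    return bbox, gesture


label = {
    "completions": [{"result": [
        {"type": "rectanglelabels", "original_width": 320, "original_height": 240,
         "value": {"x": 58.1, "y": 18.3, "width": 18.8, "height": 49.5}},
        {"type": "choices", "value": {"choices": ["Thumbs-up"]}},
    ]}],
}
bbox, gesture = parse_gesture_label(label)
```

For the sample label above, the box starts roughly 186 pixels from the left edge and 44 pixels from the top of the 320x240 image.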

The dataset_convert tool requires extraction and experiment configuration spec files as input. The details of the configuration files and sample usage examples are included on the Gesture Recognition page.

Heart Rate Estimation – Custom Format

HeartRateNet expects directories of images in the format shown below. The images and ground truth labels are then converted to TFRecords for training.

Subject_001/
    ground_truth.csv
    image_timestamps.csv
    images/
        0000.bmp
        0001.bmp
            .
            .
        N.bmp
.
.
Subject_M/
    ground_truth.csv
    image_timestamps.csv
    images/
        0000.bmp
        0001.bmp
            .
            .
        Y.bmp

EmotionNet, FPENet, GazeNet – JSON Label Data Format

EmotionNet, FPENet, and GazeNet use the same JSON data format labeled by the NVIDIA data factory team. These apps expect data in this JSON data format for training and evaluation. For EmotionNet, FPENet, and GazeNet, this data is converted to TFRecords for training. TFRecords help iterate faster through the data. See the corresponding section for the JSON data format descriptions.

Using the JSON Label data format requires that data be organized in a JSON file with the following structure:

.
{
     "filename": "data/001_01_02_200_06.png",
     "class": "image",
     "annotations": [
         {
             "class": "FaceBbox",
             "tool-version": "1.0",
             "Occlusion": 0,
             "face_outer_bboxx": 269.0082935424086,
             "face_outer_bboxy": 44.33839032556304,
             "face_outer_bboxwidth": 182.97858097042064,
             "face_outer_bboxheight": 276.28773076003836,
             "face_tight_bboxx": 269.211755426433,
             "face_tight_bboxy": 147.9049289218409,
             "face_tight_bboxwidth": 182.58110482105968,
             "face_tight_bboxheight": 172.5088694283426
         },
         {
             "class": "FiducialPoints",
             "tool-version": "1.0",
             "P1x": 304.8502837500011,
             "P1y": 217.10946645000078,
             "P2x": 311.0173699500011,
             "P2y": 237.15249660000086,
             .
             .
             "P26occluded": true,
             "P46occluded": true,
             .
             .
             "P68x": 419.5885050000024,
             "P68y": 267.6976650000015,
             .
             .
             "P104x": 429.6,
             "P104y": 189.5
         },
         {
             "class": "eyes",
             "tool-version": "1.0",
             "l_eyex": 389.1221901922325,
             "l_eyey": 197.94528259092206,
             "r_eyex": 633.489814294182,
             "r_eyey": 10.52527209626886,
             "l_status": "open",
             "r_status": "occluded"
         }
     ]
 }

Here’s a description of the structure:

filename field: specifies the path to the images to train on.
class field: category of the labels for the respective section.
annotations field: annotation chunk.

There are three supported chunk types in the annotation including FaceBbox, FiducialPoints, and eyes.

FaceBbox chunk: This is a chunk that describes face bounding box labeling information.

Parameter name	Description	Type	Range	Example
class	The class for the annotation chunk	String	N/A	FaceBbox
`tool-version`	Version of the labeling tool for this chunk	Float	N/A	`1.0`
`Occlusion`	Occlusion state [ 0 = not occluded, 1 = occluded ]	Integer	0 or 1	`0`
`face_outer_bboxx`	x coordinate of top left corner of outer face bounding box	Float	[0, image_width]	`269.00`
`face_outer_bboxy`	y coordinate of top left corner of outer face bounding box	Float	[0, image_height]	`44.33`
`face_outer_bboxwidth`	Width of the outer face bounding box	Float	[0, image_width]	`182.97`
`face_outer_bboxheight`	Height of the outer face bounding box	Float	[0, image_height]	`276.28`
`face_tight_bboxx`	x coordinate of top left corner of tight face bounding box	Float	[0, image_width]	`269.21`
`face_tight_bboxy`	y coordinate of top left corner of tight face bounding box	Float	[0, image_height]	`147.90`
`face_tight_bboxwidth`	Width of the tight face bounding box	Float	[0, image_width]	`182.58`
`face_tight_bboxheight`	Height of the tight face bounding box	Float	[0, image_height]	`172.50`

FiducialPoints chunk: This is a chunk that describes fiducial point labeling information.

Parameter name	Description	Type	Range	Example
class	The class for the annotation chunk	String	N/A	FiducialPoints
`tool-version`	Version of the labeling tool for this chunk	Float	N/A	`1.0`
`Occlusion`	Occlusion status [ 0 = not occluded, 1 = occluded ]	Integer	0 or 1	`0`
`Pix`	x coordinate of the ith landmark point	Float	[0, image_width]	`304.85`
`Piy`	y coordinate of the ith landmark point	Float	[0, image_height]	`217.10`
`Pioccluded`	Whether the ith landmark point is occluded	Boolean	N/A	`true`

eyes chunk: This chunk describes the eye labeling information. This chunk is optional.

Parameter name	Description	Type	Range	Example
class	The class for the annotation chunk	String	N/A	eyes
`tool-version`	Version of the labeling tool for this chunk	Float	N/A	`1.0`
`l_eyex`	x coordinate of left eye center	Float	[0, image_width]	`389.12`
`l_eyey`	y coordinate of left eye center	Float	[0, image_height]	`197.94`
`r_eyex`	x coordinate of right eye center	Float	[0, image_width]	`633.48`
`r_eyey`	y coordinate of right eye center	Float	[0, image_height]	`10.52`
`l_status`	Status of the left eye	String	open/close/barely open/half open/occluded	`open`
`r_status`	Status of the right eye	String	open/close/barely open/half open/occluded	`occluded`
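
The chunk layout described in the tables above can be read with a short script. The following is a minimal sketch, not part of TAO Toolkit: the helper names `collect_landmarks` and `split_chunks`, the regular expression, and the one-frame sample (including its file name) are assumptions made for illustration. It groups the `Pix`, `Piy`, and `Pioccluded` keys by landmark index and treats the eyes chunk as optional.

```python
import json
import re

# Matches landmark keys such as "P1x", "P104y", or "P74occluded".
LANDMARK_KEY = re.compile(r"^P(\d+)(x|y|occluded)$")


def collect_landmarks(chunk):
    """Group the Pix / Piy / Pioccluded keys of a FiducialPoints chunk by index i."""
    points = {}
    for key, value in chunk.items():
        match = LANDMARK_KEY.match(key)
        if match is None:
            continue  # skip "class", "tool-version", and other metadata keys
        index, field = int(match.group(1)), match.group(2)
        points.setdefault(index, {})[field] = value
    return points


def split_chunks(frame):
    """Return the annotation chunks of one image entry, keyed by their class."""
    return {chunk["class"]: chunk for chunk in frame["annotations"]}


# Hypothetical one-frame sample that follows the layout described above.
sample = json.loads("""
[{"filename": "data/sample.png", "class": "image", "annotations": [
    {"class": "FaceBbox", "tool-version": "1.0",
     "face_tight_bboxx": 269.21, "face_tight_bboxy": 147.90,
     "face_tight_bboxwidth": 182.58, "face_tight_bboxheight": 172.50},
    {"class": "FiducialPoints", "tool-version": "1.0",
     "P1x": 304.85, "P1y": 217.10,
     "P2x": 311.01, "P2y": 237.15, "P2occluded": true}
]}]
""")

for frame in sample:
    chunks = split_chunks(frame)
    landmarks = collect_landmarks(chunks["FiducialPoints"])
    has_eyes = "eyes" in chunks  # the eyes chunk is optional
    print(frame["filename"], sorted(landmarks), has_eyes)
```

Keying the landmarks by their numeric index avoids depending on the key order inside the chunk, which is not sorted in the sample files.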

Here’s an example of a JSON file with a sample dataset with two image frames:

            
            [
    {
        "filename": "data/001_01_02_200_06.png",
        "class": "image",
        "annotations": [
            {
                "face_outer_bboxy": 44.33839032556304,
                "face_outer_bboxx": 269.0082935424086,
                "face_tight_bboxx": 269.211755426433,
                "face_tight_bboxy": 147.9049289218409,
                "tool-version": "1.0",
                "face_tight_bboxwidth": 182.58110482105968,
                "face_tight_bboxheight": 172.5088694283426,
                "face_outer_bboxwidth": 182.97858097042064,
                "Occlusionx": 0,
                "class": "FaceBbox",
                "face_outer_bboxheight": 276.28773076003836
            },
            {
                "P91x": 395.3500000000004,
                "P91y": 196.6500000000002,
                "P74occluded": true,
                "P28x": 436.44144340908053,
                "P28y": 174.67157210032852,
                "P52y": 252.53100000000143,
                "P52x": 428.9925000000024,
                "P32y": 236.48449500000103,
                "P32x": 416.6063550000018,
                "P44x": 427.65443026467267,
                "P44y": 186.9615161604129,
                "P99x": 425.75,
                "P36occluded": true,
                "P75x": 428.85,
                "P75y": 190.95000000000002,
                "P20x": 389.46879000000166,
                "P20y": 178.13376000000076,
                "P8y": 313.8318038340011,
                "P8x": 407.70466707150143,
                "P81y": 192.2500000000002,
                "P94x": 427.70000000000005,
                "P81x": 393.5500000000004,
                "P12y": 268.179948238501,
                "P12x": 408.69280247400155,
                "P65y": 260.04348000000147,
                "P65x": 429.0319800000024,
                "P84x": 396.8500000000004,
                "P84y": 194.4500000000002,
                "P93occluded": true,
                "P46occluded": true,
                "P43y": 193.31428917697824,
                "P43x": 421.12354211680173,
                "P14occluded": true,
                "P92y": 187.5,
                "P54occluded": true,
                "P53x": 433.50450000000245,
                "P53y": 251.9670000000014,
                "P45occluded": true,
                "P33x": 426.3480450000019,
                "P33y": 238.67140500000104,
                "P60x": 413.82301500000233,
                "P100occluded": true,
                "P60y": 272.07148500000153,
                "P23y": 174.7903155211989,
                "P23x": 428.12940394815394,
                "P90y": 194.9000000000002,
                "P13x": 399.2067026100015,
                "P13y": 257.903340052501,
                "P7x": 388.1395861020014,
                "P7y": 304.93858521150105,
                "P61y": 262.1309850000015,
                "P104x": 429.6,
                "P104y": 189.5,
                "P83y": 193.2500000000002,
                "P83x": 395.0000000000004,
                "P61x": 404.5783500000023,
                "P50y": 254.6756100000014,
                "P50x": 414.2206350000023,
                "P100x": 424.8,
                "P100y": 191.3,
                "P34y": 240.46069500000107,
                "P34x": 435.9903300000019,
                "P18y": 188.2730700000008,
                "P18x": 366.50623500000154,
                "P25occluded": true,
                "P102occluded": true,
                "P46x": 436.0852131464696,
                "P46y": 191.82999641609848,
                "P58y": 275.0536350000016,
                "P58x": 429.2307900000024,
                "P77x": 306.5418228495726,
                "P77y": 258.61884245799524,
                "P97occluded": true,
                "P99y": 192.9,
                "P10y": 293.87146870350114,
                "P10x": 434.97720418050164,
                "P48occluded": true,
                "P26x": 436.0258414360342,
                "P26y": 171.99984513074497,
                "version": "v1",
                "P27occluded": true,
                "P86x": 397.8000000000004,
                "P86y": 198.45000000000022,
                "P73occluded": true,
                "P98occluded": true,
                "P2y": 237.15249660000086,
                "P90x": 393.3500000000004,
                "P29y": 203.3826300000009,
                "P29x": 433.6046100000019,
                "P101y": 188.85000000000002,
                "P101x": 425.65000000000003,
                "P51x": 423.6641100000023,
                "P51y": 252.5881050000014,
                "P35x": 436.78557000000194,
                "P35y": 239.26783500000104,
                "P66x": 433.70401500000247,
                "P66y": 268.0952850000015,
                "P19x": 378.4348350000016,
                "P19y": 181.61293500000076,
                "P98y": 193.45000000000002,
                "P98x": 427.85,
                "P45y": 187.0802595812833,
                "P45x": 433.2353710455805,
                "P21y": 176.44387500000076,
                "P21x": 398.1170250000017,
                "P59x": 422.1730350000024,
                "P59y": 274.25839500000154,
                "P9x": 431.0246625705015,
                "P9y": 312.25078719000106,
                "P17occluded": true,
                "P11x": 422.7243251895016,
                "P11y": 281.81621679300105,
                "P70y": 195.95000000000002,
                "P79occluded": true,
                "P95occluded": true,
                "P70x": 395.20000000000005,
                "P1x": 304.8502837500011,
                "P13occluded": true,
                "P85y": 196.6500000000002,
                "P85x": 398.1000000000004,
                "P69y": 196.95000000000002,
                "P24x": 433.0572559142747,
                "P36y": 236.88211500000105,
                "P36x": 427.5409050000019,
                "P94occluded": true,
                "P104occluded": true,
                "P47occluded": true,
                "P40x": 401.35650000000186,
                "P40y": 197.40000000000092,
                "P71x": 396.40000000000003,
                "P71y": 196.8,
                "P65occluded": true,
                "P26occluded": true,
                "P56y": 273.06553500000155,
                "P56x": 433.0081800000024,
                "P16occluded": true,
                "P89y": 196.2500000000002,
                "P89x": 392.4500000000004,
                "P48x": 428.54500592120047,
                "P48y": 195.45167075264504,
                "P16y": 216.4016531475008,
                "P16x": 360.47179483200136,
                "P15occluded": true,
                "P24y": 170.63429579073562,
                "P78x": 276.3975906000002,
                "class": "FiducialPoints",
                "P74y": 190.10000000000002,
                "P4y": 270.1562190435009,
                "P4x": 329.2467161130011,
                "P96y": 191.10000000000002,
                "P74x": 427.85,
                "P103y": 195.00000000000003,
                "P103x": 396.4500000000001,
                "P80x": 330.41417158035716,
                "P80y": 178.5832276794402,
                "P37x": 381.05250000000177,
                "P37y": 200.64300000000094,
                "P47y": 195.09544049003392,
                "P47x": 433.47285788732125,
                "P64x": 432.80937000000245,
                "P64y": 255.47085000000143,
                "P76y": 191.60000000000002,
                "P57y": 271.77327000000156,
                "P99occluded": true,
                "P43occluded": true,
                "P88x": 392.8500000000004,
                "P88y": 198.45000000000022,
                "P17x": 335.9660368500013,
                "P17y": 206.7179262030008,
                "P96x": 431.05,
                "P67y": 268.3935000000015,
                "P27y": 173.42476618118954,
                "P27x": 436.38207169864535,
                "P87y": 199.45000000000022,
                "P87x": 395.1000000000004,
                "P3x": 316.76397300000116,
                "P67x": 426.8450700000024,
                "P96occluded": true,
                "P12occluded": true,
                "P97x": 430.35,
                "P97y": 193.05,
                "P101occluded": true,
                "P55occluded": true,
                "P93x": 429.05,
                "P93y": 195.4,
                "P42x": 388.6665000000018,
                "P42y": 200.64300000000094,
                "P79y": 238.89320075909347,
                "P54y": 252.24900000000142,
                "P54x": 431.5305000000024,
                "P73x": 427.05,
                "P73y": 191,
                "P68y": 267.6976650000015,
                "P30y": 214.61539500000094,
                "P30x": 440.86117500000194,
                "P14y": 243.47656317600092,
                "P14x": 384.18704449200146,
                "P63y": 254.87442000000144,
                "P76occluded": true,
                "P22x": 406.8646650000017,
                "P22y": 176.94090000000077,
                "P28occluded": true,
                "P6y": 296.24299366950106,
                "P6x": 367.5863697300013,
                "P92x": 428.85,
                "P38y": 193.3815000000009,
                "P38x": 388.5255000000018,
                "P94y": 188.5,
                "P72y": 197.70000000000002,
                "P72x": 395.65000000000003,
                "P78y": 210.5218971000002,
                "P63x": 427.8391200000024,
                "P35occluded": true,
                "P82x": 393.8000000000004,
                "P82y": 200.95000000000022,
                "P11occluded": true,
                "tool-version": "1.0",
                "P41y": 200.99550000000093,
                "P41x": 396.5625000000018,
                "P56occluded": true,
                "P55x": 425.0508679558401,
                "P55y": 259.9172483306748,
                "P31x": 449.410005000002,
                "P31y": 225.351135000001,
                "P1y": 217.10946645000078,
                "P75occluded": true,
                "P62x": 420.38374500000236,
                "P62y": 256.06728000000146,
                "P15x": 373.5151821450014,
                "P15y": 228.45690505800087,
                "P49y": 261.4140000000014,
                "P49x": 400.0875000000022,
                "P25y": 170.87178263247637,
                "P25x": 435.25400920037674,
                "P2x": 311.0173699500011,
                "P80occluded": true,
                "P3y": 251.86940685000093,
                "P39x": 397.33800000000184,
                "P39y": 192.1830000000009,
                "P69x": 394.6,
                "P5x": 347.3103508088991,
                "P5y": 287.4697160411496,
                "P95x": 430,
                "P95y": 189.25,
                "P79x": 368.8999131564783,
                "P57x": 434.7974700000025,
                "P102x": 428.1,
                "P102y": 190.85000000000002,
                "P76x": 428.25
            },
            {
                "l_eyex": 389.1221901922325,
                "l_eyey": 197.94528259092206,
                "tool-version": "1.0",
                "l_status": "open",
                "r_status": "occluded",
                "r_eyex": 633.489814294182,
                "r_eyey": 10.52527209626886,
                "class": "eyes"
            }
        ]
    },
    {
        "filename": "data/001_03_01_130_05.png",
        "class": "image",
        "annotations": [
            {
                "face_outer_bboxy": 36.21548211860577,
                "face_outer_bboxx": 259.54428851667467,
                "face_tight_bboxx": 265.58020220310897,
                "face_tight_bboxy": 116.19133846386018,
                "tool-version": "1.0",
                "face_tight_bboxwidth": 191.64025954428882,
                "face_tight_bboxheight": 192.64624515869457,
                "face_outer_bboxwidth": 198.68215884512887,
                "Occlusionx": 0,
                "class": "FaceBbox",
                "face_outer_bboxheight": 273.62808711835464
            },
            {
                "P91x": 283.35,
                "P91y": 179.55,
                "P28x": 304.14947850000084,
                "P28y": 176.3226009000005,
                "P5occluded": true,
                "P52y": 244.28250000000094,
                "P52x": 305.0535000000012,
                "P32y": 220.38088500000066,
                "P32x": 289.76557500000087,
                "P44x": 334.8750000000012,
                "P44y": 168.63600000000062,
                "P99x": 340.20000000000005,
                "P99y": 174.75,
                "P75x": 343.90000000000003,
                "P75y": 171.70000000000002,
                "P20x": 269.9839800000006,
                "P20y": 158.94859500000035,
                "P8y": 299.437842994699,
                "P8x": 301.7845345542186,
                "P94x": 342.70000000000005,
                "P12y": 272.68555921617576,
                "P12x": 389.08146056834715,
                "P65y": 249.500000000001,
                "P65x": 321.9500000000013,
                "P84x": 285.8,
                "P84y": 175.5,
                "P43y": 176.03850000000065,
                "P43x": 329.9400000000012,
                "P68x": 302.05,
                "P68y": 252.55,
                "P92y": 165.70000000000002,
                "P92x": 343.40000000000003,
                "P53x": 311.11650000000117,
                "P53y": 241.18050000000093,
                "P33x": 295.53106500000086,
                "P33y": 224.95351500000066,
                "P60x": 297.5100000000011,
                "P60y": 258.382500000001,
                "P23y": 149.55274915302633,
                "P23x": 325.5457633496816,
                "P90y": 177.15,
                "P13x": 406.681647264744,
                "P13y": 256.7280566114426,
                "P7x": 292.6324374720922,
                "P90x": 280.25,
                "P58x": 309.7065000000012,
                "P61y": 253.51800000000097,
                "P104x": 346.0500000000002,
                "P104y": 171.35000000000008,
                "P83y": 174.70000000000002,
                "P83x": 282.15000000000003,
                "P61x": 296.31150000000116,
                "P50y": 249.21750000000097,
                "P50x": 294.19650000000115,
                "P100x": 339.45000000000005,
                "P100y": 171.75,
                "P34y": 224.85411000000067,
                "P34x": 300.8989350000009,
                "P18y": 170.97660000000036,
                "P18x": 268.59231000000057,
                "P46x": 357.7170000000013,
                "P46y": 172.86600000000064,
                "P58y": 258.664500000001,
                "P4occluded": true,
                "P77x": 300.22496910000007,
                "P77y": 221.17413690000006,
                "tool-version": "1.0",
                "P10y": 298.73383552684317,
                "P10x": 341.67829106605154,
                "P26x": 361.5470170921228,
                "P26y": 148.3723801778643,
                "version": "v1",
                "P86x": 286.6,
                "P86y": 181.9,
                "P2y": 204.16216567820388,
                "P2x": 300.6111887744588,
                "P29y": 189.63790065000055,
                "P29x": 301.90690170000084,
                "P101y": 168.8,
                "P101x": 340.40000000000003,
                "P51x": 298.49700000000115,
                "P51y": 243.57750000000092,
                "P35x": 310.83943500000095,
                "P35y": 223.36303500000068,
                "P66x": 313.85,
                "P66y": 251.05,
                "P19x": 267.8964750000006,
                "P19y": 165.11170500000037,
                "P98y": 176.55,
                "P98x": 342.85,
                "P45y": 165.8865000000006,
                "P45x": 347.2125000000013,
                "P21y": 156.26466000000033,
                "P21x": 276.4453050000006,
                "P59x": 303.2910000000012,
                "P59y": 258.664500000001,
                "P9x": 316.33402222324,
                "P9y": 303.89655695778623,
                "P17occluded": true,
                "P11x": 366.0838832850552,
                "P11y": 286.0617011054374,
                "P70y": 178.35000000000002,
                "P70x": 283.15000000000003,
                "P1x": 307.6512634530176,
                "P1y": 189.14333969727852,
                "P85y": 178.60000000000002,
                "P85x": 287.5,
                "P69y": 179.3,
                "P69x": 282.65000000000003,
                "P36y": 219.78445500000066,
                "P36x": 326.446020000001,
                "P77occluded": true,
                "P81y": 173.25,
                "P81x": 281.95,
                "P40x": 298.00350000000105,
                "P40y": 178.85850000000062,
                "P71x": 283.95,
                "P71y": 178.9,
                "P56y": 254.505000000001,
                "P56x": 323.24250000000126,
                "P7y": 284.1843478578217,
                "P89y": 179.85000000000002,
                "P89x": 279.90000000000003,
                "P48x": 338.96400000000125,
                "P48y": 177.51900000000066,
                "P16y": 205.1008423020117,
                "P16x": 420.9964657778135,
                "P24x": 338.1757113839151,
                "P24y": 146.9559374076699,
                "class": "FiducialPoints",
                "P74y": 170.85000000000002,
                "P4y": 234.6691559519585,
                "P4x": 290.9897533804285,
                "P96y": 172,
                "P74x": 342.95000000000005,
                "P3occluded": true,
                "P78occluded": true,
                "P103y": 179.4,
                "P103x": 285.55,
                "P80x": 444.78450000000055,
                "P80y": 173.78250000000023,
                "P37x": 275.44350000000094,
                "P37y": 182.80650000000063,
                "P47y": 176.95500000000064,
                "P47x": 350.5965000000013,
                "P64x": 313.45000000000124,
                "P64y": 249.95000000000098,
                "P76y": 172.65,
                "P57y": 257.113500000001,
                "P6occluded": true,
                "P88x": 281.1,
                "P88y": 182.60000000000002,
                "P17x": 420.2924583099576,
                "P17y": 187.96999391751874,
                "P96x": 347.70000000000005,
                "P67y": 252.45000000000002,
                "P27y": 153.68404056609333,
                "P27x": 370.04567371328926,
                "P87y": 183.5,
                "P87x": 283.95,
                "P3x": 295.44846734351574,
                "P67x": 307.95000000000005,
                "P2occluded": true,
                "P97x": 346.25,
                "P97y": 175.35000000000002,
                "P93x": 344.35,
                "P93y": 177.8,
                "P42x": 280.30800000000096,
                "P42y": 185.34450000000064,
                "P54y": 243.78900000000093,
                "P54x": 321.0570000000012,
                "P73x": 342.5,
                "P73y": 171.95000000000002,
                "P30y": 201.83191200000059,
                "P30x": 298.26271440000085,
                "P14y": 239.83187738290155,
                "P14x": 416.77242097067824,
                "P63y": 251.20000000000098,
                "P63x": 307.7500000000012,
                "P22x": 284.2983000000006,
                "P22y": 157.95454500000034,
                "P1occluded": true,
                "P6y": 270.10419850070423,
                "P6x": 289.34706928876477,
                "P38y": 176.17950000000062,
                "P38x": 277.06500000000096,
                "P94y": 166.65,
                "P72y": 180.35000000000002,
                "P72x": 283.65000000000003,
                "P78y": 176.32260090000005,
                "P78x": 318.5860666500001,
                "P82x": 285.6,
                "P82y": 185.35000000000002,
                "P32occluded": true,
                "P41y": 183.37050000000062,
                "P41x": 290.460000000001,
                "P55x": 334.9455000000013,
                "P55y": 251.89650000000097,
                "P31x": 295.59965445000086,
                "P31y": 213.04479600000062,
                "P79y": 232.39037411700184,
                "P62x": 301.6000000000012,
                "P62y": 251.30000000000098,
                "P15x": 420.05778915400566,
                "P15y": 222.93569815436055,
                "P49y": 256.690500000001,
                "P49x": 291.65850000000114,
                "P25y": 145.8936053300241,
                "P25x": 350.45154872559993,
                "P3y": 218.94632250317727,
                "P39x": 286.30050000000097,
                "P39y": 173.50050000000059,
                "P5x": 288.8777309768609,
                "P5y": 253.44268842811516,
                "P95x": 346.6,
                "P95y": 168.25,
                "P79x": 431.0439612134159,
                "P57x": 315.4875000000012,
                "P102x": 343.1,
                "P102y": 172,
                "P76x": 343.20000000000005
            },
            {
                "l_eyex": 289.90000000000003,
                "l_eyey": 179.60000000000002,
                "tool-version": "1.0",
                "l_status": "open",
                "r_status": "open",
                "r_eyex": 337.4000000000001,
                "r_eyey": 173.35000000000005,
                "class": "eyes"
            }
        ]
    }
]

BodyPoseNet – COCO Format

Using the COCO format requires data to be organized in this structure:

            
            |-- dataset root
    |-- train2017
        |-- 000000001000.jpg
        |-- 000000001001.jpg
            .
            .
        |-- xxxxxxxxxxxx.jpg
    |-- val2017
        |-- 000000002000.jpg
        |-- 000000002001.jpg
            .
            .
        |-- xxxxxxxxxxxx.jpg
    |-- annotations
        |-- person_keypoints_train2017.json
        |-- person_keypoints_val2017.json

You can use a nested directory structure for the train and validation images, as long as there is a single dataset root and the paths in the images->file_name field of the annotations are adjusted accordingly.

Label Files

This section outlines the COCO annotations dataset format that the data must be in for BodyPoseNet. Although COCO annotations have more fields, only the attributes that are needed by BodyPoseNet are mentioned here. You can use the exact same format as COCO. The dataset should use the following overall structure (in a .json file):

            
            "images": [
    {
        "file_name": "000000001000.jpg",
        "height": 480,
        "width": 640,
        "id": 1000
    },
    {
        "file_name": "000000580197.jpg",
        "height": 480,
        "width": 640,
        "id": 580197
    },
    ...
],
"annotations": [
    {
        "segmentation": [[162.46,152.13,150.73,...173.92,156.23]],
        "num_keypoints": 17,
        "area": 8720.28915,
        "iscrowd": 0,
        "keypoints": [162,174,2,...,149,352,2],
        "image_id": 1000,
        "bbox": [115.16,152.13,83.23,228.41],
        "category_id": 1,
        "id": 1234574
    },
    ...
],
"categories": [
    {
        "supercategory": "person",
        "id": 1,
        "name": "person",
        "keypoints": [
            "nose","left_eye","right_eye","left_ear","right_ear",
            "left_shoulder","right_shoulder","left_elbow","right_elbow",
            "left_wrist","right_wrist","left_hip","right_hip",
            "left_knee","right_knee","left_ankle","right_ankle"
        ],
        "skeleton": [
            [16,14],[14,12],[17,15],[15,13],[12,13],[6,12],[7,13],[6,7],
            [6,8],[7,9],[8,10],[9,11],[2,3],[1,2],[1,3],[2,4],[3,5],[4,6],[5,7]
        ]
    }
]

The images section contains the complete list of images in the dataset with some metadata.

Note

Each image ID must be unique within the dataset.

Parameter name	Description	Type	Range
`file_name`	The path to the image	String	N/A
`height`	The height of the image	Integer	N/A
`width`	The width of the image	Integer	N/A
`id`	The unique ID of the image	Integer	N/A
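
The uniqueness requirement in the note above can be checked before training. The following is a minimal sketch under stated assumptions: `check_image_ids` is a hypothetical helper, and the miniature dataset is invented for illustration (it deliberately reuses ID 1000 and includes an annotation that points at an unlisted image).

```python
from collections import Counter


def check_image_ids(coco):
    """Return (duplicate image IDs, IDs of annotations that reference a missing image)."""
    counts = Counter(image["id"] for image in coco["images"])
    duplicates = sorted(image_id for image_id, n in counts.items() if n > 1)
    orphans = sorted(
        ann["id"]
        for ann in coco.get("annotations", [])
        if ann["image_id"] not in counts
    )
    return duplicates, orphans


# Hypothetical miniature dataset: ID 1000 is used twice, and annotation 3
# points at an image ID that is not listed under "images".
coco = {
    "images": [
        {"file_name": "000000001000.jpg", "height": 480, "width": 640, "id": 1000},
        {"file_name": "000000001001.jpg", "height": 480, "width": 640, "id": 1000},
        {"file_name": "000000580197.jpg", "height": 480, "width": 640, "id": 580197},
    ],
    "annotations": [
        {"id": 1, "image_id": 1000},
        {"id": 2, "image_id": 580197},
        {"id": 3, "image_id": 42},
    ],
}

duplicates, orphans = check_image_ids(coco)
print("duplicate image IDs:", duplicates)
print("annotations without an image:", orphans)
```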

The annotations section contains the labels for the images. Each entry is one annotation, and each image can have multiple annotations.

Parameter name	Description	Type	Range
`segmentation`	A list of polygons, which has a list of vertices for a given person/group.	List	N/A
`num_keypoints`	The number of keypoints that are labeled	Integer	[0, total_keypoints]
`area`	The area of the segmentation/bbox	Float	N/A
`iscrowd`	If 1, indicates that the annotation mask is for multiple people	Integer	[0, 1]
`keypoints`	A list of keypoints with the following format: `[x1, y1, v1, x2, y2, v2 ...]`, where x and y are pixel locations, and v is the visibility/occlusion flag.	List	N/A
`bbox`	The bounding box of the object/person, as `[x, y, width, height]`	List	N/A
`image_id`	The unique ID of the associated image	Integer	N/A
`category_id`	The object category (always `1` for person)	Integer	1
`id`	The unique ID of the annotation	Integer	N/A

The COCO dataset uses the following occlusion flag values: [visible: 2, occluded: 1, not_labeled: 0].

The categories section contains the keypoint convention that is followed in the dataset.

Parameter name	Description	Type	Range
`supercategory`	The supercategory	String	person
`id`	The ID of the category	Integer	1
`name`	The name of the category	String	person
`keypoints`	The keypoint names and ordering convention as used in labeling	List	N/A
`skeleton`	A list of skeleton edges with the following format: `[[j1, j2], [j2, j3] ...]`, where j is the keypoint/joint index.	List	N/A
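
The flat `keypoints` list and the `skeleton` edges described above can be decoded as follows. This is a minimal sketch rather than BodyPoseNet code: the helper names and the shortened three-keypoint category are assumptions made for illustration, and the skeleton indices are assumed to be 1-based because index 17 appears in the sample alongside 17 keypoint names.

```python
def split_keypoints(flat):
    """Turn [x1, y1, v1, x2, y2, v2, ...] into a list of (x, y, v) triples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def named_keypoints(annotation, category):
    """Pair each (x, y, v) triple with its name from the category's keypoints list."""
    triples = split_keypoints(annotation["keypoints"])
    return dict(zip(category["keypoints"], triples))


def labeled_edges(annotation, category):
    """List skeleton edges whose two joints are both labeled (v > 0).

    Assumes 1-based joint indices in "skeleton".
    """
    names = category["keypoints"]
    triples = split_keypoints(annotation["keypoints"])
    edges = []
    for a, b in category["skeleton"]:
        if triples[a - 1][2] > 0 and triples[b - 1][2] > 0:
            edges.append((names[a - 1], names[b - 1]))
    return edges


# Hypothetical three-keypoint category, shortened to keep the sample readable.
category = {
    "keypoints": ["nose", "left_eye", "right_eye"],
    "skeleton": [[2, 3], [1, 2], [1, 3]],
}
annotation = {
    "keypoints": [162, 174, 2, 170, 165, 1, 0, 0, 0],
    "num_keypoints": 2,
}

points = named_keypoints(annotation, category)
labeled = sum(1 for _, _, v in points.values() if v > 0)
edges = labeled_edges(annotation, category)
print(points)
print("num_keypoints consistent:", labeled == annotation["num_keypoints"])
print("edges with both joints labeled:", edges)
```

Counting the triples with `v > 0` gives a quick consistency check against `num_keypoints`, since both visible (2) and occluded (1) points count as labeled.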

For more details, see the COCO keypoint annotations file and COCO Keypoint Detection Task.

Re-Identification – Market-1501 Format

Using the Market-1501 format requires data to be organized in this structure:

            
            |-- dataset root
    |-- bounding_box_train
        |-- 0002_c1s1_000451_03.jpg
        |-- 0002_c1s1_000551_01.jpg
              .
              .
        |-- 1500_c6s3_086567_01.jpg
    |-- bounding_box_test
        |-- 0000_c1s1_000151_01.jpg
        |-- 0000_c1s1_000376_03.jpg
              .
              .
        |-- 1501_c6s4_001902_01.jpg
    |-- query
        |-- 0001_c1s1_001051_00.jpg
        |-- 0001_c2s1_000301_00.jpg
              .
              .
        |-- 1501_c6s4_001877_00.jpg

The root directory of the dataset contains sub-directories for training, testing, and query images. Each sub-directory holds cropped images of different identities. For example, the image 0001_c1s1_01_00.jpg is from the first sequence (s1) of camera c1; 01 indicates the first frame in sequence c1s1, and 0001 is the unique ID assigned to the object. Anything after the third underscore is ignored. No label file is required.
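
The naming convention described above can be decoded with a regular expression. The following is a minimal sketch, not the loader used by the toolkit: the pattern and the `parse_market1501_name` helper are assumptions based only on the file names shown on this page, so other naming variants may need a different pattern.

```python
import re

# <identity>_c<camera>s<sequence>_<frame>[_<ignored>].jpg
MARKET_NAME = re.compile(r"^(\d+)_c(\d+)s(\d+)_(\d+)(?:_.*)?\.jpg$")


def parse_market1501_name(filename):
    """Extract identity, camera, sequence, and frame from a Market-1501 style name."""
    match = MARKET_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a Market-1501 style file name: {filename!r}")
    identity, camera, sequence, frame = (int(group) for group in match.groups())
    return {
        "identity": identity,
        "camera": camera,
        "sequence": sequence,
        "frame": frame,
    }


first = parse_market1501_name("0001_c1s1_01_00.jpg")
second = parse_market1501_name("0002_c1s1_000451_03.jpg")
print(first)
print(second)
```

Everything after the third underscore is matched but discarded, mirroring the rule that this part of the name is ignored.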

For more details, please refer to the Market-1501 Dataset.