Data Annotation Format
This page describes the dataset formats for computer-vision apps supported by TAO.
Image classification expects a directory of images with the following structure, where each class
has its own directory named after the class. The naming convention for train/val/test can
be different because the path of each set is individually specified in the spec file. See the
Specification File for Classification section for more information.
|--dataset_root:
    |--train
        |--audi:
            |--1.jpg
            |--2.jpg
        |--bmw:
            |--01.jpg
            |--02.jpg
    |--val
        |--audi:
            |--3.jpg
            |--4.jpg
        |--bmw:
            |--03.jpg
            |--04.jpg
    |--test
        |--audi:
            |--5.jpg
            |--6.jpg
        |--bmw:
            |--05.jpg
            |--06.jpg
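As a rough illustration of the layout above, the following Python sketch walks one split directory and collects the image paths for each class. The helper name list_classification_split and the set of accepted extensions are assumptions made for this example; they are not part of TAO.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def list_classification_split(split_dir):
    """Map each class directory name to its sorted list of image paths."""
    samples = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class directories
        samples[class_dir.name] = sorted(
            p for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS
        )
    return samples
```

Calling list_classification_split on the train directory from the example would be expected to return one entry for audi and one for bmw.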
Optical Inspection expects directories of images and CSV files in the dataset root directory. The image directory consists of golden images (non-defective reference images) and test images to compare with the golden images for PCB defect classification.
|--dataset_root:
    |--images
        |--input:
            |--C1.jpg
            |--C2.jpg
        |--golden:
            |--C1.jpg
            |--C2.jpg
        |--input1:
            |--C1.jpg
            |--C3.jpg
        |--golden1:
            |--C1.jpg
            |--C3.jpg
    |--labels
        |-- train.csv
        |-- validation.csv
Here’s a description of the structure:
- The images directory contains the following:
  - input: Contains input images to be compared with golden images.
  - golden: Contains golden reference images.
- The labels directory contains the CSV files for pair-wise image input to the SiameseOI model with corresponding class labels, as described in the Label Files section below.
Label Files
A SiameseOI/VisualChangeNet-Classification label file is a CSV file containing the following fields:
input_path | golden_path | label | object_name |
---|---|---|---|
/path/to/input/image/directory | /path/to/golden/image/directory | The class to which the object belongs | /component/name |
- input_path: The path to the directory containing the input image to compare.
- golden_path: The path to the directory containing the corresponding golden reference image.
- label: The label for the image pair. Use PASS for non-defective components and a specific defect type label for defective components.
- object_name: The name of the component to be compared. The object name is the same for input and golden images and represents the image name without the file extension.
For each object_name, TAO supports combining multiple LED intensities, camera angles, or different sensory inputs for each of the input and golden images to be compared within the SiameseOI/VisualChangeNet-Classification models. For more details, refer to the Input Mapping section below.
Here is a sample label file corresponding to the sample directory structure described in the Optical Inspection Format section:
input_path | golden_path | label | object_name |
---|---|---|---|
/dataset_root/images/input/ | /dataset_root/images/golden/ | PASS | C1 |
/dataset_root/images/input/ | /dataset_root/images/golden/ | PASS | C2 |
/dataset_root/images/input1/ | /dataset_root/images/golden1/ | MISSING | C1 |
/dataset_root/images/input1/ | /dataset_root/images/golden1/ | PASS | C3 |
In the label file, ensure that non-defective samples are consistently labeled as PASS, while defective samples can be assigned any specific defect type label.
The model treats all defects collectively and trains for binary defect classification.
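To make the binary labeling rule concrete, here is a minimal Python sketch that reads a label file and flags each pair as defective or not. It assumes the CSV has a header row with the four column names listed above; the function name read_pair_labels is hypothetical and not a TAO API.

```python
import csv
import io


def read_pair_labels(csv_file):
    """Yield (input_path, golden_path, object_name, is_defect) per CSV row."""
    for row in csv.DictReader(csv_file):
        # PASS marks a non-defective pair; any other label counts as a defect.
        is_defect = row["label"].strip() != "PASS"
        yield row["input_path"], row["golden_path"], row["object_name"], is_defect


# In-memory stand-in for a label file such as labels/train.csv.
sample = io.StringIO(
    "input_path,golden_path,label,object_name\n"
    "/dataset_root/images/input/,/dataset_root/images/golden/,PASS,C1\n"
    "/dataset_root/images/input1/,/dataset_root/images/golden1/,MISSING,C1\n"
)
rows = list(read_pair_labels(sample))
```

In this sketch the MISSING row is the only one flagged as a defect, which mirrors the way every non-PASS label collapses into a single defect class.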
Input Mapping
For comparison within the Siamese Network, SiameseOI and VisualChangeNet-Classification models support combining several lighting conditions (1…N) for each component specified under object_name for both input and golden images.
The following concat_type modes are supported:
- linear: Linear concat (1 x N)
- grid: Grid concat (M x N)
The SiameseOI/VisualChangeNet-Classification dataloader appends the name of each lighting condition, as specified in the experiment spec under input_map, to each of the components specified using object_name in the CSV file. This is done for both input and golden images. Here is an example of the dataset experiment spec changes for combining four input lighting conditions as a 1x4 linear grid for each component inside object_name. The dataloader appends each of the four lighting conditions specified under input_map to each object_name to get the full image paths. These are then merged as a 1x4 grid for both the input and the golden images.
dataset:
  num_input: 4
  concat_type: linear
  input_map:
    LowAngleLight: 0
    SolderLight: 1
    UniformLight: 2
    WhiteLight: 3
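The path expansion described above can be sketched in Python as follows. This page does not state how the lighting name is joined to object_name, so the underscore separator and the .jpg extension used here are assumptions for illustration only, and expand_lighting_paths is a hypothetical helper rather than the actual dataloader code.

```python
import os


def expand_lighting_paths(directory, object_name, input_map, sep="_", ext=".jpg"):
    """Return one image path per lighting condition, ordered by its index.

    sep and ext are assumptions for this sketch; check how your files are named.
    """
    ordered = sorted(input_map, key=input_map.get)
    return [os.path.join(directory, f"{object_name}{sep}{name}{ext}") for name in ordered]


input_map = {"LowAngleLight": 0, "SolderLight": 1, "UniformLight": 2, "WhiteLight": 3}
paths = expand_lighting_paths("/dataset_root/images/input", "C1", input_map)
```

The four resulting paths follow the index order given in input_map, which is the order in which a 1x4 linear concat would place them.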
The dataset also supports a single lighting condition per component specified under object_name in the CSV file as input to the SiameseOI/VisualChangeNet-Classification model. In this case, the SiameseOI/VisualChangeNet-Classification dataloader does not append anything to the object_name.
Here is an example of the dataset experiment spec changes for a single lighting condition, where object_name represents the image name:
dataset:
  num_inputs: 1
  input_map: null
VisualChangeNet-Segmentation expects directories of images and mask files in the dataset root directory. The image directories consist of a golden image directory (pre-change images) and a test image directory (post-change images), which are compared against pixel-level change masks.
|--dataset_root:
    |--A
        |--image1.jpg
        |--image2.jpg
    |--B
        |--image1.jpg
        |--image2.jpg
    |--label
        |--image1.jpg
        |--image2.jpg
    |--list
        |-- train.txt
        |-- val.txt
        |-- test.txt
        |-- predict.txt
Here’s a description of the structure:
- The dataset_root directory contains the following:
  - A: Contains post-change test images.
  - B: Contains pre-change golden reference images.
  - label: Contains ground truth segmentation change masks.
  - list: Contains .txt files for each dataset split, as described in the List Files section below.
List Files
The VisualChangeNet-Segmentation dataloader expects the list directory to contain a .txt file for each dataset split [train, validation, test, inference].
A VisualChangeNet-Segmentation label file is a simple .txt file containing all file names for the particular split.
image_names |
---|
file_name.png |
- image_names: The names of images. Use the same image names for test images and their corresponding reference and mask images.
Here is a sample label file corresponding to the sample directory structure described in the Change Detection (Segmentation) Format section:
image_names |
---|
image1.png |
image3.png |
image2.png |
For the dataloader to map them correctly, each test image (inside directory A) must have a reference image (inside directory B) and a segmentation change map (inside directory label) with the same name.
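The same-name requirement can be checked before training with a short Python sketch like the one below. It assumes the directory names A, B, label, and list from the structure above; the function name resolve_change_samples is hypothetical and is not the dataloader's own code.

```python
from pathlib import Path


def resolve_change_samples(dataset_root, split):
    """Return (test, reference, mask) path triples for one split list file."""
    root = Path(dataset_root)
    text = (root / "list" / f"{split}.txt").read_text()
    names = [line.strip() for line in text.splitlines() if line.strip()]
    triples = []
    for name in names:
        triple = (root / "A" / name, root / "B" / name, root / "label" / name)
        missing = [str(p) for p in triple if not p.exists()]
        if missing:
            raise FileNotFoundError(f"missing files for {name}: {missing}")
        triples.append(triple)
    return triples
```

For example, resolve_change_samples(dataset_root, "train") reads list/train.txt and raises an error as soon as one listed name is absent from A, B, or label.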
CenterPose expects directories of images and JSON files in the dataset root directory. CenterPose is a category-level object pose estimation method, so training and evaluation operate on one object category.
The training directory consists of images and their related JSON files, which share the same file names. The testing/evaluation directory also consists of images and their related JSON files, whose ground truth is used to calculate the accuracy. The inference directory can contain only inference images, without JSON files. The calibration information needs to be provided in the .yaml file.
|--dataset_root_category:
    |--train
        |--image1.jpg
        |--image1.json
        |--image2.jpg
        |--image2.json
    |--test/val
        |--image1.jpg
        |--image1.json
        |--image2.jpg
        |--image2.json
    |--inference
        |--image1.jpg
        |--image2.jpg
The following is a description of the structure:
- The dataset_root_category directory contains the following folders for the specific category:
  - train: Contains the training images and the JSON files.
  - test/val: Contains the testing/validation images and the JSON files.
  - inference: Contains inference images.
List Files
The CenterPose dataloader expects .json files for each dataset split [train, validation, test], which provide the calibration information and the ground truth.
Here is a sample directory structure for the dataset split [train, validation, test].
image_names |
---|
file_name.png |
file_name.json |
- image_names: The names of images and JSON files. Image names must be the same as their corresponding ground truth JSON files.
For the inference dataset split, the CenterPose dataloader only expects the inference images, and the intrinsic matrix is loaded from the configuration .yaml file.
The following is a sample directory structure for the dataset split [inference]:
image_names |
---|
image1.png |
image2.png |
To calculate the accuracy correctly, make sure the calibration information is provided and has been verified.
Image classification expects a directory of images with the following structure, where each class
has its own directory named after the class. The naming convention for train/val/test can
be different because the path of each set is individually specified in the spec file. See the
Specification File for Classification PyT section for more
information. In the following example, the respective paths to train, evaluate, and test are set in data_prefix:
|--data_root:
    |--train
        |--audi:
            |--1.jpg
            |--2.jpg
        |--bmw:
            |--01.jpg
            |--02.jpg
    |--val
        |--audi:
            |--3.jpg
            |--4.jpg
        |--bmw:
            |--03.jpg
            |--04.jpg
    |--test
        |--audi:
            |--5.jpg
            |--6.jpg
        |--bmw:
            |--05.jpg
            |--06.jpg
Optionally, if the images are not in the above structure, an additional annotation file can be provided. For an image structure like this:
train/
├── folder_1
│ ├── xxx.png
│ ├── xxy.png
│ └── ...
├── 123.png
├── nsdf3.png
└── ...
An annotation file records all sample paths and the corresponding category index. The first column is the image path relative to the folder (in this example, “train”), and the second column is the category index:
folder_1/xxx.png 0
folder_1/xxy.png 1
123.png 1
nsdf3.png 2
For more details, see the MMPretrain dataset structure documentation.
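As an illustration of the two-column annotation format above, the following Python sketch parses annotation lines into (relative path, category index) pairs. The function name parse_annotation_file is hypothetical; this is not how TAO or MMPretrain necessarily reads the file.

```python
def parse_annotation_file(lines):
    """Return (relative_path, category_index) pairs from annotation lines."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace so paths containing spaces still work.
        rel_path, index = line.rsplit(maxsplit=1)
        samples.append((rel_path, int(index)))
    return samples


samples = parse_annotation_file(["folder_1/xxx.png 0", "123.png 1", "nsdf3.png 2"])
```

Each relative path is interpreted against the image folder (train in the example above).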
Using the KITTI format requires data to be organized in this structure:
.
|--dataset root
    |-- images
        |-- 000000.jpg
        |-- 000001.jpg
        .
        .
        |-- xxxxxx.jpg
    |-- labels
        |-- 000000.txt
        |-- 000001.txt
        .
        .
        |-- xxxxxx.txt
    |-- kitti_seq_to_map.json
Here’s a description of the structure:
- The images directory contains the images to train on.
- The labels directory contains the labels for the corresponding images. Details of this file are included in the Label Files section.
Note: The images and labels have the same file IDs before the extension. The image-to-label correspondence is maintained using this file name.
- The kitti_seq_to_map.json file contains a sequence-to-frame ID mapping for the frames in the images directory. This is an optional file and is useful if the data needs to be split into N folds sequence-wise. If the data is to be split into a random 80:20 train:val split, this file may be ignored.
For DetectNet_v2, the train tool does not support training on images of multiple resolutions or resizing images during training. All of the
images must be resized offline to the final training size, and the corresponding bounding boxes
must be scaled accordingly. Online resizing is supported for other detection model architectures.
Label Files
A KITTI format label file is a text file containing one line per object. Each line has multiple fields. Here is a description of these fields:
Num elements | Parameter name | Description | Type | Range | Example |
---|---|---|---|---|---|
1 | Class names | The class to which the object belongs. | String | N/A | Person, car, Road_Sign |
1 | Truncation | How much of the object extends beyond the image boundaries. | Float | [0.0, 1.0] | 0.0 |
1 | Occlusion | Occlusion state [ 0 = fully visible, 1 = partly visible, 2 = largely occluded, 3 = unknown]. | Integer | [0,3] | 2 |
1 | Alpha | Observation Angle of object | Float | [-pi, pi] | 0.146 |
4 | Bounding box coordinates: [xmin, ymin, xmax, ymax] | Location of the object in the image | Float(0 based index) | [0 to image width],[0 to image_height], [top_left, image_width], [bottom_right, image_height] | 100 120 180 160 |
3 | 3-D dimension | Height, width, length of the object (in meters) | Float | N/A | 1.65, 1.67, 3.64 |
3 | Location | 3-D object location x, y, z in camera coordinates (in meters) | Float | N/A | -0.65,1.71, 46.7 |
1 | Rotation_y | Rotation ry around the Y-axis in camera coordinates | Float | [-pi, pi] | -1.59 |
The sum of the total number of elements per object is 15. Here is a sample text file:
car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03
This indicates that the image contains 3 objects with the parameters listed above. For detection, the Toolkit only requires the class name and bbox coordinates fields to be populated, because the TAO training pipeline supports training only for class and bbox coordinates. The remaining fields may be set to 0. Here is a sample file for a custom annotated dataset:
car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
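The 15-field layout described in the table can be parsed with a short Python sketch such as the one below. The KittiObject dataclass and parse_kitti_line function are hypothetical names introduced for this example, and the sketch assumes class names contain no spaces.

```python
from dataclasses import dataclass


@dataclass
class KittiObject:
    class_name: str
    truncation: float
    occlusion: int
    alpha: float
    bbox: tuple        # (xmin, ymin, xmax, ymax)
    dimensions: tuple  # (height, width, length)
    location: tuple    # (x, y, z)
    rotation_y: float


def parse_kitti_line(line):
    """Parse one 15-field KITTI label line into a KittiObject."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    values = [float(v) for v in fields[1:]]
    return KittiObject(
        class_name=fields[0],
        truncation=values[0],
        occlusion=int(values[1]),
        alpha=values[2],
        bbox=tuple(values[3:7]),
        dimensions=tuple(values[7:10]),
        location=tuple(values[10:13]),
        rotation_y=values[13],
    )


obj = parse_kitti_line(
    "car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
)
```

Rejecting lines that do not have exactly 15 fields is a design choice of this sketch; it catches truncated rows early when labels are written by a custom annotation tool.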
Sequence Mapping File
This is an optional JSON file that captures the mapping between the frames in the images
directory and the names of video sequences from which these frames were extracted. This
information is needed while doing an N-fold split of the dataset. This way frames from one
sequence don’t repeat in other folds and one of the folds could be used for validation. Here’s
an example of the JSON dictionary file.
{
"video_sequence_name": [list of strings(frame idx)]
}
Here’s an example of a kitti_seq_to_map.json file for a sample dataset with six sequences:
{
"2011_09_28_drive_0165_sync": ["003193", "003185", "002857", "001864", "003838",
"007320", "003476", "007308", "000337", "004165", "006573"],
"2011_09_28_drive_0191_sync": ["005724", "002529", "004136", "005746"],
"2011_09_28_drive_0179_sync": ["005107", "002485", "006089", "000695"],
"2011_09_26_drive_0079_sync": ["005421", "000673", "002064", "000783", "003068"],
"2011_09_28_drive_0035_sync": ["005540", "002424", "004949", "004996", "003969"],
"2011_09_28_drive_0117_sync": ["007150", "003797", "002554", "001509"]
}
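To show why the sequence mapping matters for an N-fold split, here is a minimal Python sketch that assigns whole sequences to folds so that frames from one sequence never land in two folds. The greedy balancing strategy and the name split_sequences_into_folds are assumptions for illustration; this page does not describe how TAO itself distributes sequences.

```python
def split_sequences_into_folds(seq_to_frames, num_folds):
    """Assign whole sequences to folds, largest first, to balance frame counts."""
    folds = [[] for _ in range(num_folds)]
    # Sort by frame count (descending), then name, for a deterministic result.
    ordered = sorted(seq_to_frames.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, frames in ordered:
        smallest = min(folds, key=len)  # fold with the fewest frames so far
        smallest.extend(frames)
    return folds
```

Passing the dictionary loaded from the mapping file together with the desired number of folds returns one list of frame IDs per fold, and any one of them can be held out for validation.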
Since TAO 3.0-22.05, all object detection models support COCO format. Using the COCO format requires data to be organized in this structure:
annotation{
"id": int,
"image_id": int,
"category_id": int,
"bbox": [x,y,width,height],
"area": float,
"iscrowd": 0 or 1,
}
image{
"id": int,
"width": int,
"height": int,
"file_name": str,
"license": int,
"flickr_url": str,
"coco_url": str,
"date_captured": datetime,
}
categories[{
"id": int,
"name": str,
"supercategory": str,
}]
An example COCO annotation file is shown below:
"annotations": [{"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768}],
"images": [{"license": 1,"file_name": "000000407646.jpg","coco_url": "http://images.cocodataset.org/val2017/000000407646.jpg","height": 400,"width": 500,"date_captured": "2013-11-23 03:58:53","flickr_url": "http://farm4.staticflickr.com/3110/2855627782_17b93a684e_z.jpg","id": 407646}],
"categories": [{"supercategory": "person","id": 1,"name": "person"},{"supercategory": "vehicle","id": 2,"name": "bicycle"},{"supercategory": "vehicle","id": 3,"name": "car"},{"supercategory": "vehicle","id": 4,"name": "motorcycle"}]
See the COCO website for a description of the COCO format.
Start the id in categories from 1.
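Two details of the format above are easy to get wrong: bbox stores a top-left corner plus width and height, and category ids must not start at 0. The Python sketch below illustrates both. The function names are hypothetical, and the check only enforces that every id is at least 1 and unique, which is one possible reading of the rule above.

```python
def coco_bbox_to_corners(bbox):
    """Convert COCO [x, y, width, height] to (xmin, ymin, xmax, ymax)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def check_category_ids(categories):
    """Raise if any category id is below 1 or if ids are duplicated."""
    ids = [c["id"] for c in categories]
    if min(ids) < 1 or len(set(ids)) != len(ids):
        raise ValueError(f"invalid category ids: {ids}")
```

For the sample annotation, coco_bbox_to_corners([473.07, 395.93, 38.65, 28.67]) gives corners at approximately (473.07, 395.93) and (511.72, 424.60).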
The ODVG file is in JSONL format, with one JSON object per line. ODVG supports either object detection/segmentation tasks or visual grounding tasks.
For the object detection/segmentation tasks, use the detection field to encode instances, which is a list of dictionaries consisting of bbox, label, category, and mask.
An additional label map file is usually required to map category to class IDs in the corresponding networks.
{"file_name": "000000391895.jpg",
"height": 360,
"width": 640,
"detection":
{"instances":
[
{"bbox": [359.17, 146.17, 471.62, 359.74],
"label": 3,
"category": "motorcycle",
"mask": [[376.97, 176.91, 398.81, 176.91, 401.24, 312.82, 370.49, 303.92, 391.53, 299.87, 391.73, 184.19]]},
{"bbox": [339.88, 22.16, 493.76, 322.89],
"label": 0,
"category": "person",
"mask": [[352.55, 146.82, 15, 376.15, 164.22, 377.2, 160.35, 378.61, 151.9, 377.55]]}
]}
}
An example of the label map file is shown below:
{"0": "person", "1": "bicycle", "2": "car", "3": "motorcycle"}
For the visual grounding tasks, use the grounding field to encode caption and regions. caption is the global caption for an image. regions is a list of dictionaries consisting of phrase and bbox:
{"file_name": "000000000072.jpg",
"height": 640,
"width": 427,
"grounding":
{"caption": "giraffe on left and RIGHT GIRAFFE",
"regions": [{"bbox": [50.45, 72.07, 283.96, 640.0], "phrase": "giraffe"},
{"bbox": [136.63, 129.44, 425.71, 628.49], "phrase": "giraffe"}]
}}
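A minimal Python sketch for reading an ODVG file is shown below. It parses one JSON object per line, separates detection records from grounding records, and optionally checks each instance against a label map. The function name read_odvg and the label/category consistency check are assumptions added for illustration; they are not part of the format definition.

```python
import json


def read_odvg(lines, label_map=None):
    """Parse ODVG JSONL lines into ("detection" | "grounding", record) pairs."""
    records = []
    for line in lines:
        if not line.strip():
            continue
        item = json.loads(line)
        if "detection" in item:
            for inst in item["detection"]["instances"]:
                # Label map keys are strings, as in the sample label map file.
                if label_map and label_map.get(str(inst["label"])) != inst["category"]:
                    raise ValueError(f"label/category mismatch in {item['file_name']}")
            records.append(("detection", item))
        elif "grounding" in item:
            records.append(("grounding", item))
    return records
```

Because the file is JSONL, an open file handle can be passed directly as lines, and each record must sit on a single line rather than being pretty-printed as in the examples above.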
Using the COCO format requires data to be organized in this structure:
annotation{
"id": int,
"image_id": int,
"category_id": int,
"segmentation": RLE or [polygon],
"area": float,
"bbox": [x,y,width,height],
"iscrowd": 0 or 1,
}
image{
"id": int,
"width": int,
"height": int,
"file_name": str,
"license": int,
"flickr_url": str,
"coco_url": str,
"date_captured": datetime,
}
categories[{
"id": int,
"name": str,
"supercategory": str,
}]
An example COCO annotation file is shown below:
"annotations": [{"segmentation": [[510.66,423.01,511.72,420.03,510.45,416.0,510.34,413.02,510.77,410.26,510.77,407.5,510.34,405.16,511.51,402.83,511.41,400.49,510.24,398.16,509.39,397.31,504.61,399.22,502.17,399.64,500.89,401.66,500.47,402.08,499.09,401.87,495.79,401.98,490.59,401.77,488.79,401.77,485.39,398.58,483.9,397.31,481.56,396.35,478.48,395.93,476.68,396.03,475.4,396.77,473.92,398.79,473.28,399.96,473.49,401.87,474.56,403.47,473.07,405.59,473.39,407.71,476.68,409.41,479.23,409.73,481.56,410.69,480.4,411.85,481.35,414.93,479.86,418.65,477.32,420.03,476.04,422.58,479.02,422.58,480.29,423.01,483.79,419.93,486.66,416.21,490.06,415.57,492.18,416.85,491.65,420.24,492.82,422.9,493.56,424.39,496.43,424.6,498.02,423.01,498.13,421.31,497.07,420.03,497.07,415.15,496.33,414.51,501.1,411.96,502.06,411.32,503.02,415.04,503.33,418.12,501.1,420.24,498.98,421.63,500.47,424.39,505.03,423.32,506.2,421.31,507.69,419.5,506.31,423.32,510.03,423.01,510.45,423.01]],"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768}],
"images": [{"license": 1,"file_name": "000000407646.jpg","coco_url": "http://images.cocodataset.org/val2017/000000407646.jpg","height": 400,"width": 500,"date_captured": "2013-11-23 03:58:53","flickr_url": "http://farm4.staticflickr.com/3110/2855627782_17b93a684e_z.jpg","id": 407646}],
"categories": [{"supercategory": "person","id": 1,"name": "person"},{"supercategory": "vehicle","id": 2,"name": "bicycle"},{"supercategory": "vehicle","id": 3,"name": "car"},{"supercategory": "vehicle","id": 4,"name": "motorcycle"}]
See the COCO website for a description of the COCO format.
Start the id in categories from 1.
This section describes the dataset formats supported by UNet and SegFormer for loading images and masks.
If you have the masks saved in COCO format, refer to the Dataset Converter section (sample_usage_of_the_dataset_converter_tool_unet) to convert COCO format to UNet mask image format.
Semantic Segmentation Mask Format
This section describes the format of the mask images for different types of input_image_type or input_type.
Refer to dataset_config_unet for more information on configuring the input_image_type for UNet.
Refer to dataset_config_segformer for more information on configuring the input_type for SegFormer.
Color/RGB Input Image Type
For color/RGB input images, each mask image is a single-channel or three-channel image with the same size
as the input image. Every pixel in the mask must have an integer value that represents the
segmentation class label_id, as per the mapping provided in the dataset_config_unet and dataset_config_segformer.
Ensure that the pixel values in the mask image are within the range of the label_id values provided in the dataset_config and dataset_config_segformer.
For a reference example, refer to the _labelIds.png image format in the Cityscapes Dataset.
Grayscale Input Image Type
For grayscale input images, the mask is a single-channel image with the same size as the input image.
Every pixel has a value of 255 or 0, which corresponds respectively to a label_id of 1 or 0
in the dataset_config and dataset_config_segformer. For reference, refer to the ISBI dataset Jupyter
Notebook example provided in NGC resources.
Image and Mask Loading Format
SegFormer
For SegFormer, the paths to the image and mask folders can be provided directly in the dataset_config_segformer. Ensure that each image and its corresponding mask have the same name. The image and mask extensions don’t have to be the same.
UNet
Structured Images and Masks Folders for UNet
The data folder structure for images and masks must be in the following format for UNet:
/Dataset_01
    /images
        /train
            0000.png
            0001.png
            ...
            ...
            N.png
        /val
            0000.png
            0001.png
            ...
            ...
            N.png
        /test
            0000.png
            0001.png
            ...
            ...
            N.png
    /masks
        /train
            0000.png
            0001.png
            ...
            ...
            N.png
        /val
            0000.png
            0001.png
            ...
            ...
            N.png
See the Folders based Dataset Config section for further details about configuring these image and mask folder paths in the experiment spec.
Each image and label has the same file ID before the extension. The image-to-label correspondence is maintained using this filename. The test folder in the above directory structure is optional; any folder can be used for inference.
Image and Mask Text Files for UNet
An image text file must contain the full absolute UNIX paths to all the images. A mask text file must contain the full absolute UNIX paths to the corresponding mask files.
The contents of an example image text file, images_source1.txt, are shown below:
/home/user/workspace/exports/images_final/00001.jpg
/home/user/workspace/exports/images_final/00002.jpg
The contents of the corresponding example mask text file, labels_source1.txt, are shown below. It contains the paths of the corresponding masks:
/home/user/workspace/exports/masks_final/00001.png
/home/user/workspace/exports/masks_final/00002.png
The text file method additionally allows you to specify multiple sequences.
Provide these text file paths in a spec file.
See the Text files based Dataset Config section for further details about configuring multiple data sources using text files in the dataset config.
The size of the images doesn’t need to be equal to the model input dimensions. The images are resized internally to the model input dimensions.
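When using the image and mask text files described above, the two files need to line up entry by entry. The Python sketch below pairs the paths and verifies that the file IDs match. The function name pair_images_and_masks is hypothetical, and the strict ID check is an assumption based on the sample files, where 00001.jpg corresponds to 00001.png.

```python
import os


def pair_images_and_masks(image_paths, mask_paths):
    """Pair image and mask paths line by line, checking that file IDs match."""
    if len(image_paths) != len(mask_paths):
        raise ValueError("image and mask lists differ in length")
    pairs = []
    for image, mask in zip(image_paths, mask_paths):
        image_id = os.path.splitext(os.path.basename(image))[0]
        mask_id = os.path.splitext(os.path.basename(mask))[0]
        if image_id != mask_id:
            raise ValueError(f"file ID mismatch: {image} vs {mask}")
        pairs.append((image, mask))
    return pairs
```

The inputs are the stripped lines of files such as images_source1.txt and labels_source1.txt.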
A gesture recognition model should perform well on users outside the training dataset. Thus, model training requires user segregation when splitting into train, validation, and test datasets. To enable this, a unique identifier, user_id, is required for each subject. In addition, record multiple videos for each subject.
Organize the dataset in the following format:
.
|-- original dataset root
    |-- uid_1
        |-- session_1
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
        |-- session_2
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
    |-- uid_2
        |-- session_1
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
        |-- session_2
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
    |-- uid_3
        |-- session_1
            |-- 000000.png
            |-- 000001.png
            .
            .
            |-- xxxxxx.png
For each set, prepare a metadata file that captures fields that can be used for dataset sampling.
{
"set": "data",
"users": {
"uid_1": {
"location": "outdoor",
"illumination": "good",
"class_fps": {
"session_1": 30,
"session_2": 30
}
},
"uid_2": {
"location": "indoor",
"illumination": "good",
"class_fps": {
"session_1": 10,
"session_2": 15
}
},
"uid_3": {
"location": "indoor",
"illumination": "poor",
"class_fps": {
"session_1": 10
}
}
}
}
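The user segregation requirement can be illustrated with a small Python sketch that splits the user IDs from a metadata file into disjoint train and validation sets. The function name split_users, the 20% validation fraction, and the seeded shuffle are assumptions for this example; the actual split is configured through the dataset conversion spec files.

```python
import random


def split_users(metadata, val_fraction=0.2, seed=0):
    """Split user IDs into train and validation lists with no user in both."""
    users = sorted(metadata["users"])
    random.Random(seed).shuffle(users)
    num_val = max(1, round(len(users) * val_fraction))
    return users[num_val:], users[:num_val]
```

Splitting by user rather than by image keeps every frame of a subject on one side of the split, which is what makes validation results representative of unseen users.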
Label Format
Each image corresponds to a subject performing a gesture. The image requires a corresponding label JSON which contains a bounding box for the hand of interest and gesture label. These follow the Label Studio format. A sample label for an image is:
{
"completions": [
{
"result": [
{
"type": "rectanglelabels",
"original_width": 320,
"original_height": 240,
"value": {
"x": 58.1,
"y": 18.3,
"width": 18.8,
"height": 49.5
}
},
{
"type": "choices",
"value": {
"choices": [
"Thumbs-up"
]
}
}
]
}
],
"task_path": "/workspace/tao-experiments/gesturenet/data/uid_1/session_1/image_0001.png"
}
- task_path: Specifies the full path to the image.
- completions: This is a chunk that contains the labels under result.
The bounding box and gesture class are separate entries with the following types.
rectanglelabels: Specifies the label corresponding to the hand bounding box.
Parameter name | Description | Type | Range |
---|---|---|---|
type | The type of label | String | rectanglelabels |
original_width | Width of image being labelled (in pixels) | Integer | [1, inf) |
original_height | Height of image being labelled (in pixels) | Integer | [1, inf) |
value["x"] | x coordinate of top left corner of hand bounding box (as a percentage of image width) | Float | [0, 100] |
value["y"] | y coordinate of top left corner of hand bounding box (as a percentage of image height) | Float | [0, 100] |
value["width"] | Width of the hand bounding box (as a percentage of image width) | Float | [0, 100] |
value["height"] | Height of the hand bounding box (as a percentage of image height) | Float | [0, 100] |
choices: Specifies the label corresponding to the gesture class.
Parameter name | Description | Type | Range |
---|---|---|---|
type | The type of label | String | choices |
value["choices"] | List of attributes. For the GestureNet app, this is a single entry with the gesture class name. | List of strings | Valid gesture classes |
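Because the rectanglelabels values are percentages of the image size, they need to be scaled before use as pixel coordinates. The Python sketch below shows the arithmetic; the function name rectangle_to_pixels is hypothetical.

```python
def rectangle_to_pixels(result):
    """Convert a rectanglelabels entry from percentages to pixel coordinates."""
    width, height = result["original_width"], result["original_height"]
    value = result["value"]
    xmin = value["x"] / 100.0 * width
    ymin = value["y"] / 100.0 * height
    xmax = xmin + value["width"] / 100.0 * width
    ymax = ymin + value["height"] / 100.0 * height
    return xmin, ymin, xmax, ymax
```

For the sample label above (a 320 x 240 image), this gives a box from roughly (185.92, 43.92) to (246.08, 162.72) in pixels.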
The dataset_convert tool requires extraction and experiment configuration spec files as input. The details of the
configuration files and sample usage examples are included on the Gesture Recognition page.
HeartRateNet expects directories of images in the format shown below. The images and ground truth labels are then converted to TFRecords for training.
Subject_001/
    ground_truth.csv
    image_timestamps.csv
    images/
        0000.bmp
        0001.bmp
        .
        .
        N.bmp
.
.
Subject_M/
    ground_truth.csv
    image_timestamps.csv
    images/
        0000.bmp
        0001.bmp
        .
        .
        Y.bmp
EmotionNet, FPENet, and GazeNet use the same JSON data format labeled by the NVIDIA data factory team. These apps expect data in this JSON data format for training and evaluation. For EmotionNet, FPENet, and GazeNet, this data is converted to TFRecords for training. TFRecords help iterate faster through the data. See the corresponding section for the JSON data format descriptions.
Using the JSON Label data format requires that data be organized in a JSON file with the following structure:
.
{
"filename": "data/001_01_02_200_06.png",
"class": "image",
"annotations": [
{
"class": "FaceBbox",
"tool-version": "1.0",
"Occlusion": 0,
"face_outer_bboxx": 269.0082935424086,
"face_outer_bboxy": 44.33839032556304,
"face_outer_bboxwidth": 182.97858097042064,
"face_outer_bboxheight": 276.28773076003836,
"face_tight_bboxx": 269.211755426433,
"face_tight_bboxy": 147.9049289218409,
"face_tight_bboxwidth": 182.58110482105968,
"face_tight_bboxheight": 172.5088694283426
},
{
"class": "FiducialPoints",
"tool-version": "1.0",
"P1x": 304.8502837500011,
"P1y": 217.10946645000078,
"P2x": 311.0173699500011,
"P2y": 237.15249660000086,
.
.
"P26occluded": true,
"P46occluded": true,
.
.
"P68x": 419.5885050000024,
"P68y": 267.6976650000015,
.
.
"P104x": 429.6,
"P104y": 189.5
},
{
"class": "eyes",
"tool-version": "1.0",
"l_eyex": 389.1221901922325,
"l_eyey": 197.94528259092206,
"r_eyex": 633.489814294182,
"r_eyey": 10.52527209626886,
"l_status": "open",
"r_status": "occluded"
}
]
}
Here’s a description of the structure:
- filename field: Specifies the path to the images to train on.
- class field: Category of the labels for the respective section.
- annotations field: Annotation chunk.
There are three supported chunk types in the annotations: FaceBbox, FiducialPoints, and eyes.
FaceBbox chunk: This is a chunk that describes face bounding box labeling information.
Parameter name | Description | Type | Range | Example |
---|---|---|---|---|
class | The class for the annotation chunk | String | N/A | FaceBbox |
tool-version | Version of the labeling tool for this chunk | Float | N/A | 1.0 |
Occlusion | Occlusion state [ 0 = not occluded, 1 = occluded ] | Integer | 0 or 1 | 0 |
face_outer_bboxx | x coordinate of top left corner of outer face bounding box | Float | [0, image_width] | 269.00 |
face_outer_bboxy | y coordinate of top left corner of outer face bounding box | Float | [0, image_height] | 44.33 |
face_outer_bboxwidth | Width of the outer face bounding box | Float | [0, image_width] | 182.97 |
face_outer_bboxheight | Height of the outer face bounding box | Float | [0, image_height] | 276.28 |
face_tight_bboxx | x coordinate of top left corner of tight face bounding box | Float | [0, image_width] | 269.21 |
face_tight_bboxy | y coordinate of top left corner of tight face bounding box | Float | [0, image_height] | 147.90 |
face_tight_bboxwidth | Width of the tight face bounding box | Float | [0, image_width] | 182.58 |
face_tight_bboxheight | Height of the tight face bounding box | Float | [0, image_height] | 172.50 |
FiducialPoints chunk: This is a chunk that describes fiducial point labeling information.
Parameter name | Description | Type | Range | Example |
---|---|---|---|---|
class | The class for the annotation chunk | String | N/A | FiducialPoints |
tool-version | Version of the labeling tool for this chunk | Float | N/A | 1.0 |
Occlusion | Occlusion status [ 0 = not occluded, 1 = occluded ] | Integer | 0 or 1 | 0 |
Pix | x coordinate of the ith landmark point | Float | [0, image_width] | 304.85 |
Piy | y coordinate of the ith landmark point | Float | [0, image_height] | 217.10 |
Pioccluded | Occlusion status of the ith landmark point | String | N/A | true |
eyes chunk: This is a chunk that describes eye labeling information. This chunk is not required.
Parameter name | Description | Type | Range | Example |
---|---|---|---|---|
class | The class for the annotation chunk | String | N/A | eyes |
tool-version | Version of the labeling tool for this chunk | Float | N/A | 1.0 |
l_eyex | x coordinate of left eye center | Float | [0, image_width] | 389.12 |
l_eyey | y coordinate of left eye center | Float | [0, image_height] | 197.94 |
r_eyex | x coordinate of right eye center | Float | [0, image_width] | 633.48 |
r_eyey | y coordinate of right eye center | Float | [0, image_height] | 10.52 |
l_status | Status of the left eye | String | open/close/barely open/half open/occluded | open |
r_status | Status of the right eye | String | open/close/barely open/half open/occluded | occluded |
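The Pix, Piy, and Pioccluded keys of a FiducialPoints chunk can be regrouped per landmark with a short Python sketch like the one below. The function name collect_fiducial_points and the output layout are assumptions for illustration, not the format consumed by the TAO converters.

```python
import re

# Matches keys such as "P1x", "P68y", or "P26occluded".
_POINT_KEY = re.compile(r"^P(\d+)(x|y|occluded)$")


def collect_fiducial_points(chunk):
    """Group a FiducialPoints chunk into {index: {"x", "y", "occluded"}}."""
    points = {}
    for key, value in chunk.items():
        match = _POINT_KEY.match(key)
        if match is None:
            continue  # skip "class", "tool-version", and similar keys
        index, field = int(match.group(1)), match.group(2)
        points.setdefault(index, {})[field] = value
    return points
```

Only the fields present in the chunk appear in the result, so a landmark that is listed solely as occluded ends up with just an occluded entry and no coordinates.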
Here’s an example of a JSON file with a sample dataset with two image frames:
[
{
"filename": "data/001_01_02_200_06.png",
"class": "image",
"annotations": [
{
"face_outer_bboxy": 44.33839032556304,
"face_outer_bboxx": 269.0082935424086,
"face_tight_bboxx": 269.211755426433,
"face_tight_bboxy": 147.9049289218409,
"tool-version": "1.0",
"face_tight_bboxwidth": 182.58110482105968,
"face_tight_bboxheight": 172.5088694283426,
"face_outer_bboxwidth": 182.97858097042064,
"Occlusionx": 0,
"class": "FaceBbox",
"face_outer_bboxheight": 276.28773076003836
},
{
"P91x": 395.3500000000004,
"P91y": 196.6500000000002,
"P74occluded": true,
"P28x": 436.44144340908053,
"P28y": 174.67157210032852,
"P52y": 252.53100000000143,
"P52x": 428.9925000000024,
"P32y": 236.48449500000103,
"P32x": 416.6063550000018,
"P44x": 427.65443026467267,
"P44y": 186.9615161604129,
"P99x": 425.75,
"P36occluded": true,
"P75x": 428.85,
"P75y": 190.95000000000002,
"P20x": 389.46879000000166,
"P20y": 178.13376000000076,
"P8y": 313.8318038340011,
"P8x": 407.70466707150143,
"P81y": 192.2500000000002,
"P94x": 427.70000000000005,
"P81x": 393.5500000000004,
"P12y": 268.179948238501,
"P12x": 408.69280247400155,
"P65y": 260.04348000000147,
"P65x": 429.0319800000024,
"P84x": 396.8500000000004,
"P84y": 194.4500000000002,
"P93occluded": true,
"P46occluded": true,
"P43y": 193.31428917697824,
"P43x": 421.12354211680173,
"P14occluded": true,
"P92y": 187.5,
"P54occluded": true,
"P53x": 433.50450000000245,
"P53y": 251.9670000000014,
"P45occluded": true,
"P33x": 426.3480450000019,
"P33y": 238.67140500000104,
"P60x": 413.82301500000233,
"P100occluded": true,
"P60y": 272.07148500000153,
"P23y": 174.7903155211989,
"P23x": 428.12940394815394,
"P90y": 194.9000000000002,
"P13x": 399.2067026100015,
"P13y": 257.903340052501,
"P7x": 388.1395861020014,
"P7y": 304.93858521150105,
"P61y": 262.1309850000015,
"P104x": 429.6,
"P104y": 189.5,
"P83y": 193.2500000000002,
"P83x": 395.0000000000004,
"P61x": 404.5783500000023,
"P50y": 254.6756100000014,
"P50x": 414.2206350000023,
"P100x": 424.8,
"P100y": 191.3,
"P34y": 240.46069500000107,
"P34x": 435.9903300000019,
"P18y": 188.2730700000008,
"P18x": 366.50623500000154,
"P25occluded": true,
"P102occluded": true,
"P46x": 436.0852131464696,
"P46y": 191.82999641609848,
"P58y": 275.0536350000016,
"P58x": 429.2307900000024,
"P77x": 306.5418228495726,
"P77y": 258.61884245799524,
"P97occluded": true,
"P99y": 192.9,
"P10y": 293.87146870350114,
"P10x": 434.97720418050164,
"P48occluded": true,
"P26x": 436.0258414360342,
"P26y": 171.99984513074497,
"version": "v1",
"P27occluded": true,
"P86x": 397.8000000000004,
"P86y": 198.45000000000022,
"P73occluded": true,
"P98occluded": true,
"P2y": 237.15249660000086,
"P90x": 393.3500000000004,
"P29y": 203.3826300000009,
"P29x": 433.6046100000019,
"P101y": 188.85000000000002,
"P101x": 425.65000000000003,
"P51x": 423.6641100000023,
"P51y": 252.5881050000014,
"P35x": 436.78557000000194,
"P35y": 239.26783500000104,
"P66x": 433.70401500000247,
"P66y": 268.0952850000015,
"P19x": 378.4348350000016,
"P19y": 181.61293500000076,
"P98y": 193.45000000000002,
"P98x": 427.85,
"P45y": 187.0802595812833,
"P45x": 433.2353710455805,
"P21y": 176.44387500000076,
"P21x": 398.1170250000017,
"P59x": 422.1730350000024,
"P59y": 274.25839500000154,
"P9x": 431.0246625705015,
"P9y": 312.25078719000106,
"P17occluded": true,
"P11x": 422.7243251895016,
"P11y": 281.81621679300105,
"P70y": 195.95000000000002,
"P79occluded": true,
"P95occluded": true,
"P70x": 395.20000000000005,
"P1x": 304.8502837500011,
"P13occluded": true,
"P85y": 196.6500000000002,
"P85x": 398.1000000000004,
"P69y": 196.95000000000002,
"P24x": 433.0572559142747,
"P36y": 236.88211500000105,
"P36x": 427.5409050000019,
"P94occluded": true,
"P104occluded": true,
"P47occluded": true,
"P40x": 401.35650000000186,
"P40y": 197.40000000000092,
"P71x": 396.40000000000003,
"P71y": 196.8,
"P65occluded": true,
"P26occluded": true,
"P56y": 273.06553500000155,
"P56x": 433.0081800000024,
"P16occluded": true,
"P89y": 196.2500000000002,
"P89x": 392.4500000000004,
"P48x": 428.54500592120047,
"P48y": 195.45167075264504,
"P16y": 216.4016531475008,
"P16x": 360.47179483200136,
"P15occluded": true,
"P24y": 170.63429579073562,
"P78x": 276.3975906000002,
"class": "FiducialPoints",
"P74y": 190.10000000000002,
"P4y": 270.1562190435009,
"P4x": 329.2467161130011,
"P96y": 191.10000000000002,
"P74x": 427.85,
"P103y": 195.00000000000003,
"P103x": 396.4500000000001,
"P80x": 330.41417158035716,
"P80y": 178.5832276794402,
"P37x": 381.05250000000177,
"P37y": 200.64300000000094,
"P47y": 195.09544049003392,
"P47x": 433.47285788732125,
"P64x": 432.80937000000245,
"P64y": 255.47085000000143,
"P76y": 191.60000000000002,
"P57y": 271.77327000000156,
"P99occluded": true,
"P43occluded": true,
"P88x": 392.8500000000004,
"P88y": 198.45000000000022,
"P17x": 335.9660368500013,
"P17y": 206.7179262030008,
"P96x": 431.05,
"P67y": 268.3935000000015,
"P27y": 173.42476618118954,
"P27x": 436.38207169864535,
"P87y": 199.45000000000022,
"P87x": 395.1000000000004,
"P3x": 316.76397300000116,
"P67x": 426.8450700000024,
"P96occluded": true,
"P12occluded": true,
"P97x": 430.35,
"P97y": 193.05,
"P101occluded": true,
"P55occluded": true,
"P93x": 429.05,
"P93y": 195.4,
"P42x": 388.6665000000018,
"P42y": 200.64300000000094,
"P79y": 238.89320075909347,
"P54y": 252.24900000000142,
"P54x": 431.5305000000024,
"P73x": 427.05,
"P73y": 191,
"P68y": 267.6976650000015,
"P30y": 214.61539500000094,
"P30x": 440.86117500000194,
"P14y": 243.47656317600092,
"P14x": 384.18704449200146,
"P63y": 254.87442000000144,
"P76occluded": true,
"P22x": 406.8646650000017,
"P22y": 176.94090000000077,
"P28occluded": true,
"P6y": 296.24299366950106,
"P6x": 367.5863697300013,
"P92x": 428.85,
"P38y": 193.3815000000009,
"P38x": 388.5255000000018,
"P94y": 188.5,
"P72y": 197.70000000000002,
"P72x": 395.65000000000003,
"P78y": 210.5218971000002,
"P63x": 427.8391200000024,
"P35occluded": true,
"P82x": 393.8000000000004,
"P82y": 200.95000000000022,
"P11occluded": true,
"tool-version": "1.0",
"P41y": 200.99550000000093,
"P41x": 396.5625000000018,
"P56occluded": true,
"P55x": 425.0508679558401,
"P55y": 259.9172483306748,
"P31x": 449.410005000002,
"P31y": 225.351135000001,
"P1y": 217.10946645000078,
"P75occluded": true,
"P62x": 420.38374500000236,
"P62y": 256.06728000000146,
"P15x": 373.5151821450014,
"P15y": 228.45690505800087,
"P49y": 261.4140000000014,
"P49x": 400.0875000000022,
"P25y": 170.87178263247637,
"P25x": 435.25400920037674,
"P2x": 311.0173699500011,
"P80occluded": true,
"P3y": 251.86940685000093,
"P39x": 397.33800000000184,
"P39y": 192.1830000000009,
"P69x": 394.6,
"P5x": 347.3103508088991,
"P5y": 287.4697160411496,
"P95x": 430,
"P95y": 189.25,
"P79x": 368.8999131564783,
"P57x": 434.7974700000025,
"P102x": 428.1,
"P102y": 190.85000000000002,
"P76x": 428.25
},
{
"l_eyex": 389.1221901922325,
"l_eyey": 197.94528259092206,
"tool-version": "1.0",
"l_status": "open",
"r_status": "occluded",
"r_eyex": 633.489814294182,
"r_eyey": 10.52527209626886,
"class": "eyes"
}
]
},
{
"filename": "data/001_03_01_130_05.png",
"class": "image",
"annotations": [
{
"face_outer_bboxy": 36.21548211860577,
"face_outer_bboxx": 259.54428851667467,
"face_tight_bboxx": 265.58020220310897,
"face_tight_bboxy": 116.19133846386018,
"tool-version": "1.0",
"face_tight_bboxwidth": 191.64025954428882,
"face_tight_bboxheight": 192.64624515869457,
"face_outer_bboxwidth": 198.68215884512887,
"Occlusionx": 0,
"class": "FaceBbox",
"face_outer_bboxheight": 273.62808711835464
},
{
"P91x": 283.35,
"P91y": 179.55,
"P28x": 304.14947850000084,
"P28y": 176.3226009000005,
"P5occluded": true,
"P52y": 244.28250000000094,
"P52x": 305.0535000000012,
"P32y": 220.38088500000066,
"P32x": 289.76557500000087,
"P44x": 334.8750000000012,
"P44y": 168.63600000000062,
"P99x": 340.20000000000005,
"P99y": 174.75,
"P75x": 343.90000000000003,
"P75y": 171.70000000000002,
"P20x": 269.9839800000006,
"P20y": 158.94859500000035,
"P8y": 299.437842994699,
"P8x": 301.7845345542186,
"P94x": 342.70000000000005,
"P12y": 272.68555921617576,
"P12x": 389.08146056834715,
"P65y": 249.500000000001,
"P65x": 321.9500000000013,
"P84x": 285.8,
"P84y": 175.5,
"P43y": 176.03850000000065,
"P43x": 329.9400000000012,
"P68x": 302.05,
"P68y": 252.55,
"P92y": 165.70000000000002,
"P92x": 343.40000000000003,
"P53x": 311.11650000000117,
"P53y": 241.18050000000093,
"P33x": 295.53106500000086,
"P33y": 224.95351500000066,
"P60x": 297.5100000000011,
"P60y": 258.382500000001,
"P23y": 149.55274915302633,
"P23x": 325.5457633496816,
"P90y": 177.15,
"P13x": 406.681647264744,
"P13y": 256.7280566114426,
"P7x": 292.6324374720922,
"P90x": 280.25,
"P58x": 309.7065000000012,
"P61y": 253.51800000000097,
"P104x": 346.0500000000002,
"P104y": 171.35000000000008,
"P83y": 174.70000000000002,
"P83x": 282.15000000000003,
"P61x": 296.31150000000116,
"P50y": 249.21750000000097,
"P50x": 294.19650000000115,
"P100x": 339.45000000000005,
"P100y": 171.75,
"P34y": 224.85411000000067,
"P34x": 300.8989350000009,
"P18y": 170.97660000000036,
"P18x": 268.59231000000057,
"P46x": 357.7170000000013,
"P46y": 172.86600000000064,
"P58y": 258.664500000001,
"P4occluded": true,
"P77x": 300.22496910000007,
"P77y": 221.17413690000006,
"tool-version": "1.0",
"P10y": 298.73383552684317,
"P10x": 341.67829106605154,
"P26x": 361.5470170921228,
"P26y": 148.3723801778643,
"version": "v1",
"P86x": 286.6,
"P86y": 181.9,
"P2y": 204.16216567820388,
"P2x": 300.6111887744588,
"P29y": 189.63790065000055,
"P29x": 301.90690170000084,
"P101y": 168.8,
"P101x": 340.40000000000003,
"P51x": 298.49700000000115,
"P51y": 243.57750000000092,
"P35x": 310.83943500000095,
"P35y": 223.36303500000068,
"P66x": 313.85,
"P66y": 251.05,
"P19x": 267.8964750000006,
"P19y": 165.11170500000037,
"P98y": 176.55,
"P98x": 342.85,
"P45y": 165.8865000000006,
"P45x": 347.2125000000013,
"P21y": 156.26466000000033,
"P21x": 276.4453050000006,
"P59x": 303.2910000000012,
"P59y": 258.664500000001,
"P9x": 316.33402222324,
"P9y": 303.89655695778623,
"P17occluded": true,
"P11x": 366.0838832850552,
"P11y": 286.0617011054374,
"P70y": 178.35000000000002,
"P70x": 283.15000000000003,
"P1x": 307.6512634530176,
"P1y": 189.14333969727852,
"P85y": 178.60000000000002,
"P85x": 287.5,
"P69y": 179.3,
"P69x": 282.65000000000003,
"P36y": 219.78445500000066,
"P36x": 326.446020000001,
"P77occluded": true,
"P81y": 173.25,
"P81x": 281.95,
"P40x": 298.00350000000105,
"P40y": 178.85850000000062,
"P71x": 283.95,
"P71y": 178.9,
"P56y": 254.505000000001,
"P56x": 323.24250000000126,
"P7y": 284.1843478578217,
"P89y": 179.85000000000002,
"P89x": 279.90000000000003,
"P48x": 338.96400000000125,
"P48y": 177.51900000000066,
"P16y": 205.1008423020117,
"P16x": 420.9964657778135,
"P24x": 338.1757113839151,
"P24y": 146.9559374076699,
"class": "FiducialPoints",
"P74y": 170.85000000000002,
"P4y": 234.6691559519585,
"P4x": 290.9897533804285,
"P96y": 172,
"P74x": 342.95000000000005,
"P3occluded": true,
"P78occluded": true,
"P103y": 179.4,
"P103x": 285.55,
"P80x": 444.78450000000055,
"P80y": 173.78250000000023,
"P37x": 275.44350000000094,
"P37y": 182.80650000000063,
"P47y": 176.95500000000064,
"P47x": 350.5965000000013,
"P64x": 313.45000000000124,
"P64y": 249.95000000000098,
"P76y": 172.65,
"P57y": 257.113500000001,
"P6occluded": true,
"P88x": 281.1,
"P88y": 182.60000000000002,
"P17x": 420.2924583099576,
"P17y": 187.96999391751874,
"P96x": 347.70000000000005,
"P67y": 252.45000000000002,
"P27y": 153.68404056609333,
"P27x": 370.04567371328926,
"P87y": 183.5,
"P87x": 283.95,
"P3x": 295.44846734351574,
"P67x": 307.95000000000005,
"P2occluded": true,
"P97x": 346.25,
"P97y": 175.35000000000002,
"P93x": 344.35,
"P93y": 177.8,
"P42x": 280.30800000000096,
"P42y": 185.34450000000064,
"P54y": 243.78900000000093,
"P54x": 321.0570000000012,
"P73x": 342.5,
"P73y": 171.95000000000002,
"P30y": 201.83191200000059,
"P30x": 298.26271440000085,
"P14y": 239.83187738290155,
"P14x": 416.77242097067824,
"P63y": 251.20000000000098,
"P63x": 307.7500000000012,
"P22x": 284.2983000000006,
"P22y": 157.95454500000034,
"P1occluded": true,
"P6y": 270.10419850070423,
"P6x": 289.34706928876477,
"P38y": 176.17950000000062,
"P38x": 277.06500000000096,
"P94y": 166.65,
"P72y": 180.35000000000002,
"P72x": 283.65000000000003,
"P78y": 176.32260090000005,
"P78x": 318.5860666500001,
"P82x": 285.6,
"P82y": 185.35000000000002,
"P32occluded": true,
"P41y": 183.37050000000062,
"P41x": 290.460000000001,
"P55x": 334.9455000000013,
"P55y": 251.89650000000097,
"P31x": 295.59965445000086,
"P31y": 213.04479600000062,
"P79y": 232.39037411700184,
"P62x": 301.6000000000012,
"P62y": 251.30000000000098,
"P15x": 420.05778915400566,
"P15y": 222.93569815436055,
"P49y": 256.690500000001,
"P49x": 291.65850000000114,
"P25y": 145.8936053300241,
"P25x": 350.45154872559993,
"P3y": 218.94632250317727,
"P39x": 286.30050000000097,
"P39y": 173.50050000000059,
"P5x": 288.8777309768609,
"P5y": 253.44268842811516,
"P95x": 346.6,
"P95y": 168.25,
"P79x": 431.0439612134159,
"P57x": 315.4875000000012,
"P102x": 343.1,
"P102y": 172,
"P76x": 343.20000000000005
},
{
"l_eyex": 289.90000000000003,
"l_eyey": 179.60000000000002,
"tool-version": "1.0",
"l_status": "open",
"r_status": "open",
"r_eyex": 337.4000000000001,
"r_eyey": 173.35000000000005,
"class": "eyes"
}
]
}
]
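The chunk layout in this example can be read back with a few lines of Python. The following is a minimal sketch, not a TAO utility: the helper names (`split_chunks`, `fiducial_points`, `eye_centers`) are hypothetical, the sample frame reuses a handful of shortened values from the example, and treating a missing `P<n>occluded` key as "not occluded" is an assumption based on the example, where only occluded points carry that key.

```python
import json
import re
from collections import defaultdict

# Matches keys such as "P91x", "P8y", or "P74occluded" from the example above.
POINT_KEY = re.compile(r"^P(\d+)(x|y|occluded)$")


def split_chunks(frame):
    """Group one frame's annotation chunks by their "class" value."""
    chunks = defaultdict(list)
    for chunk in frame.get("annotations", []):
        chunks[chunk.get("class")].append(chunk)
    return chunks


def fiducial_points(chunk):
    """Return {index: {"x", "y", "occluded"}} for a FiducialPoints chunk."""
    # Assumption: a point without a P<n>occluded key is not occluded.
    points = defaultdict(lambda: {"x": None, "y": None, "occluded": False})
    for key, value in chunk.items():
        match = POINT_KEY.match(key)
        if match:
            index, field = int(match.group(1)), match.group(2)
            points[index][field] = value
    return dict(points)


def eye_centers(chunk):
    """Return (x, y, status) for the left and right eye of an eyes chunk."""
    return {
        "left": (chunk["l_eyex"], chunk["l_eyey"], chunk["l_status"]),
        "right": (chunk["r_eyex"], chunk["r_eyey"], chunk["r_status"]),
    }


# A trimmed-down frame with shortened values, for illustration only.
sample = json.loads("""
[{"filename": "data/001_01_02_200_06.png", "class": "image", "annotations": [
  {"class": "FiducialPoints", "tool-version": "1.0", "version": "v1",
   "P91x": 395.35, "P91y": 196.65,
   "P74occluded": true, "P74x": 427.85, "P74y": 190.1},
  {"class": "eyes", "tool-version": "1.0",
   "l_eyex": 389.12, "l_eyey": 197.94, "l_status": "open",
   "r_eyex": 633.48, "r_eyey": 182.97, "r_status": "occluded"}
]}]
""")

for frame in sample:
    chunks = split_chunks(frame)
    points = fiducial_points(chunks["FiducialPoints"][0])
    eyes = eye_centers(chunks["eyes"][0])
    print(frame["filename"], sorted(points), eyes["left"])
```

Grouping by `class` first keeps the reader independent of chunk order, which varies between tools and is not guaranteed by the format.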
Using the COCO format requires data to be organized in this structure:
|-- dataset root
|-- train2017
|-- 000000001000.jpg
|-- 000000001001.jpg
.
.
|-- xxxxxxxxxxxx.jpg
|-- val2017
|-- 000000002000.jpg
|-- 000000002001.jpg
.
.
|-- xxxxxxxxxxxx.jpg
|-- annotations
|-- person_keypoints_train2017.json
|-- person_keypoints_val2017.json
You can use a nested directory structure for the train and test images, as long as you have a dataset root and the file names are adjusted accordingly in the images->file_name field of the annotations file.
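When a nested layout is used, it can help to confirm that every file name recorded in the annotations resolves to a real file. The sketch below is illustrative only: `missing_image_files` is a hypothetical helper, the `batch_a` subdirectory is an invented example, and it assumes that each `file_name` value is a path relative to a single root directory. How TAO itself joins the root and the file name is not specified here.

```python
import tempfile
from pathlib import Path


def missing_image_files(images, image_root):
    """Return the file_name values that do not resolve to a file under image_root."""
    root = Path(image_root)
    return [
        entry["file_name"]
        for entry in images
        if not (root / entry["file_name"]).is_file()
    ]


# Build a throwaway nested layout to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "train2017" / "batch_a"  # hypothetical nested layout
    nested.mkdir(parents=True)
    (nested / "000000001000.jpg").touch()

    images = [
        {"file_name": "train2017/batch_a/000000001000.jpg", "id": 1000},
        {"file_name": "train2017/batch_a/000000001001.jpg", "id": 1001},
    ]
    missing = missing_image_files(images, tmp)
    print(missing)
```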
Label Files
This section outlines the COCO annotation format that BodyPoseNet requires. COCO annotations have more fields, but only the attributes that BodyPoseNet needs are described here, so a standard COCO annotation file can be used unchanged. The annotation .json file has the following overall structure:
"images": [
{
"file_name": "000000001000.jpg",
"height": 480,
"width": 640,
"id": 1000
},
{
"file_name": "000000580197.jpg",
"height": 480,
"width": 640,
"id": 580197
},
...
],
"annotations": [
{
"segmentation": [[162.46,152.13,150.73,...173.92,156.23]],
"num_keypoints": 17,
"area": 8720.28915,
"iscrowd": 0,
"keypoints": [162,174,2,...,149,352,2],
"image_id": 1000,
"bbox": [115.16,152.13,83.23,228.41],
"category_id": 1,
"id": 1234574
},
...
],
"categories": [
{
"supercategory": "person",
"id": 1,
"name": "person",
"keypoints": [
"nose","left_eye","right_eye","left_ear","right_ear",
"left_shoulder","right_shoulder","left_elbow","right_elbow",
"left_wrist","right_wrist","left_hip","right_hip",
"left_knee","right_knee","left_ankle","right_ankle"
],
"skeleton": [
[16,14],[14,12],[17,15],[15,13],[12,13],[6,12],[7,13],[6,7],
[6,8],[7,9],[8,10],[9,11],[2,3],[1,2],[1,3],[2,4],[3,5],[4,6],[5,7]
]
}
]
The images section contains the complete list of images in the dataset with some metadata. Each image ID must be unique within the dataset.
Parameter name | Description | Type | Range |
---|---|---|---|
file_name | The path to the image | String | N/A |
height | The height of the image | Integer | N/A |
width | The width of the image | Integer | N/A |
id | The unique ID of the image | Integer | N/A |
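The uniqueness requirement on image IDs is easy to check before training. The following is a minimal sketch with a hypothetical helper name, `check_images`. It also checks that `height` and `width` are positive numbers, which is an extra sanity check and not a rule stated in this document.

```python
def check_images(images):
    """Return a list of human-readable problems found in an "images" list."""
    problems = []
    seen_ids = set()
    for entry in images:
        image_id = entry.get("id")
        if image_id in seen_ids:
            problems.append(f"duplicate image id: {image_id}")
        seen_ids.add(image_id)
        # Extra sanity check (an assumption, not a documented rule).
        for key in ("height", "width"):
            value = entry.get(key)
            if not isinstance(value, (int, float)) or value <= 0:
                problems.append(
                    f"{entry.get('file_name')}: {key} should be a positive number"
                )
    return problems


images = [
    {"file_name": "000000001000.jpg", "height": 480, "width": 640, "id": 1000},
    {"file_name": "000000580197.jpg", "height": 480, "width": 640, "id": 580197},
]
print(check_images(images))

# Re-using an ID is reported as a problem.
duplicated = images + [
    {"file_name": "copy.jpg", "height": 480, "width": 640, "id": 1000}
]
print(check_images(duplicated))
```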
The annotations section contains the labels for the images. Each entity is one annotation, and each image can have multiple annotations.
Parameter name | Description | Type | Range |
---|---|---|---|
segmentation | A list of polygons, which has a list of vertices for a given person/group. | List | N/A |
num_keypoints | The number of keypoints that are labeled | Integer | [0, total_keypoints] |
area | The area of the segmentation/bbox | Float | N/A |
iscrowd | If 1, indicates that the annotation mask is for multiple people | Integer | [0, 1] |
keypoints | A list of keypoints with the following format: [x1, y1, v1, x2, y2, v2 ...], where x and y are pixel locations, and v is the visibility/occlusion flag. | List | N/A |
bbox | The bbox of the object/person | List | N/A |
image_id | The unique ID of the associated image | Integer | N/A |
category_id | The object category (always 1 for person) | Integer | 1 |
id | The unique ID of the annotation | Integer | N/A |
The COCO dataset uses the following occlusion flag labeling format: [visible: 2, occluded: 1, not_labeled: 0].
The categories section contains the keypoint convention that is followed in the dataset.
Parameter name | Description | Type | Range |
---|---|---|---|
supercategory | The supercategory | String | person |
id | The ID of the category | Integer | 1 |
name | The name of the category | String | person |
keypoints | The keypoint names and ordering convention as used in labeling | List | N/A |
skeleton | A list of skeleton edges with the following format: [[j1, j2], [j2, j3] ...], where j is the keypoint/joint index. | List | N/A |
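The flat keypoints list and the categories convention fit together as sketched below. This is an illustration, not TAO code: `decode_keypoints` and `named_edges` are hypothetical helpers, the coordinate values are made up, and the skeleton indices are treated as 1-based, which matches the example category (17 keypoints, indices up to 17).

```python
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Occlusion flag convention: visible: 2, occluded: 1, not_labeled: 0.
VISIBILITY = {0: "not_labeled", 1: "occluded", 2: "visible"}


def decode_keypoints(flat, names):
    """Turn [x1, y1, v1, x2, y2, v2, ...] into {name: (x, y, visibility)}."""
    if len(flat) != 3 * len(names):
        raise ValueError("expected three values (x, y, v) per keypoint name")
    return {
        name: (flat[3 * i], flat[3 * i + 1], VISIBILITY[flat[3 * i + 2]])
        for i, name in enumerate(names)
    }


def named_edges(skeleton, names):
    """Map 1-based skeleton index pairs to pairs of keypoint names."""
    return [(names[a - 1], names[b - 1]) for a, b in skeleton]


# Made-up values for the first three keypoints only.
flat = [162, 174, 2, 0, 0, 0, 170, 168, 1]
decoded = decode_keypoints(flat, KEYPOINT_NAMES[:3])
labeled = sum(1 for v in flat[2::3] if v > 0)  # keypoints with v > 0

print(decoded)
print(labeled)
print(named_edges([[16, 14], [1, 2]], KEYPOINT_NAMES))
```

Counting the entries with a non-zero flag gives the value expected in `num_keypoints` for the same annotation.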
For more details, see the COCO keypoint annotations file and COCO Keypoint Detection Task.
Using the Market-1501 format requires data to be organized in this structure:
|-- dataset root
|-- bounding_box_train
|-- 0002_c1s1_000451_03.jpg
|-- 0002_c1s1_000551_01.jpg
.
.
|-- 1500_c6s3_086567_01.jpg
|-- bounding_box_test
|-- 0000_c1s1_000151_01.jpg
|-- 0000_c1s1_000376_03.jpg
.
.
|-- 1501_c6s4_001902_01.jpg
|-- query
|-- 0001_c1s1_001051_00.jpg
|-- 0001_c2s1_000301_00.jpg
.
.
|-- 1501_c6s4_001877_00.jpg
The root directory of the dataset contains sub-directories for training, testing, and query. Each sub-directory has the cropped images of different identities. For example, the image 0001_c1s1_01_00.jpg is from the first sequence s1 of camera c1. 01 indicates the first frame in the sequence c1s1. 0001 is the unique ID assigned to the object. The contents after the third _ are ignored. No label file is required.
For more details, please refer to the Market-1501 Dataset.
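Because the identity, camera, sequence, and frame are all encoded in the file name, they can be recovered with a small parser. The sketch below is a hypothetical helper (`parse_market1501_name`), not part of TAO. It assumes the naming pattern described above and keeps the ID as a string to preserve leading zeros.

```python
import re
from pathlib import Path
from typing import NamedTuple


class Market1501Name(NamedTuple):
    object_id: str  # kept as a string to preserve leading zeros
    camera: int
    sequence: int
    frame: int


# <id>_c<camera>s<sequence>_<frame>[_<ignored>]
NAME_PATTERN = re.compile(r"^(\d+)_c(\d+)s(\d+)_(\d+)(?:_.*)?$")


def parse_market1501_name(file_name):
    """Split a Market-1501 style file name into its documented fields."""
    match = NAME_PATTERN.match(Path(file_name).stem)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    object_id, camera, sequence, frame = match.groups()
    return Market1501Name(object_id, int(camera), int(sequence), int(frame))


parsed = parse_market1501_name("0002_c1s1_000451_03.jpg")
print(parsed)
```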