The model described in this card detects people in an image and outputs a semantic segmentation mask: the set of every pixel that belongs to a person in the image.
PeopleSemSegNet is based on UNet, which has an encoder-decoder architecture. Please refer to VanillaUnetDynamic for architecture details. Introduced in "U-Net: Convolutional Networks for Biomedical Image Segmentation", UNet is a widely adopted network for semantic segmentation, with applications in autonomous vehicles, industry, smart cities, and more. UNet is a fully convolutional network whose encoder consists of convolutional layers and whose decoder consists of transposed convolutions or upsampling layers. It predicts a class label for every pixel in the input image.
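The encoder-decoder data flow described above can be sketched as follows. This is a toy NumPy illustration of the shape transformations only (downsample, upsample, per-pixel prediction), not the actual TAO/UNet implementation; the pooling, nearest-neighbor upsampling, and threshold are stand-ins for learned layers.

```python
import numpy as np

def encoder(x):
    # Toy "encoder": 2x2 max pooling halves spatial resolution,
    # standing in for strided convolutional blocks.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def decoder(feat, out_shape):
    # Toy "decoder": nearest-neighbor upsampling back to input size,
    # standing in for transposed convolutions / upsampling layers.
    up = np.repeat(np.repeat(feat, 2, axis=0), 2, axis=1)
    return up[:out_shape[0], :out_shape[1]]

def predict_mask(image):
    # Per-pixel "class score"; a real UNet applies convolutions and a
    # softmax here. The 0.5 threshold is purely illustrative.
    feat = encoder(image)
    scores = decoder(feat, image.shape)
    return (scores > 0.5).astype(np.uint8)  # 1 = person, 0 = background

img = np.random.rand(8, 8).astype(np.float32)
mask = predict_mask(img)
assert mask.shape == img.shape  # one label per input pixel
```

The key property this preserves from UNet is that the output mask has the same spatial resolution as the input image, with one class label per pixel.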
The training algorithm optimizes the network to minimize the cross-entropy loss between the predicted and ground-truth class label of every pixel. This model was trained using the UNet training app in TAO Toolkit v3.0.
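The per-pixel cross-entropy objective can be written out concretely. The sketch below (NumPy, illustrative rather than the TAO Toolkit's implementation) takes softmax probabilities of shape (H, W, C) and integer ground-truth labels of shape (H, W), and averages the negative log-probability of the correct class over all pixels:

```python
import numpy as np

def pixelwise_cross_entropy(probs, labels, eps=1e-8):
    # probs: (H, W, C) softmax outputs; labels: (H, W) integer class ids.
    h, w, _ = probs.shape
    # Pick the predicted probability of the ground-truth class per pixel.
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    # Average negative log-likelihood over all pixels.
    return -np.log(picked + eps).mean()

# Two classes (background = 0, person = 1) on a 2x2 image.
probs = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.7, 0.3], [0.1, 0.9]]])
labels = np.array([[0, 1], [0, 1]])
loss = pixelwise_cross_entropy(probs, labels)
```

Because every pixel contributes a term, confident correct predictions drive the loss toward zero, while confident wrong predictions are penalized heavily.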
The primary use case for the PeopleSemSegNet model is segmenting people in a color (RGB) image. The model can be used to segment people from photos and videos using appropriate video or image decoding and pre-processing. Note that PeopleSemSegNet performs semantic segmentation (i.e., it generates a single mask covering all the people in an image) and does not distinguish between different person instances.
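The pre-processing step mentioned above typically converts a decoded RGB frame into a normalized float tensor in the layout the network expects. The sketch below is a generic NumPy example under assumed parameters (the mean/std values and NCHW layout are common conventions, not PeopleSemSegNet's documented pre-processing):

```python
import numpy as np

def preprocess(frame, mean=0.5, std=0.5):
    # frame: decoded RGB image as (H, W, 3) uint8.
    # Scale to [0, 1], normalize, and move channels first (NCHW),
    # a layout many segmentation networks expect. Values here are
    # illustrative assumptions, not PeopleSemSegNet's actual parameters.
    x = frame.astype(np.float32) / 255.0
    x = (x - mean) / std
    return np.transpose(x, (2, 0, 1))[None, ...]  # (1, 3, H, W)

frame = np.zeros((544, 960, 3), dtype=np.uint8)  # e.g. one decoded video frame
batch = preprocess(frame)
```

Resizing the frame to the network's input resolution would normally happen in the decode pipeline before this step.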
The datasheet for the model is captured in the model card hosted at NGC.