PeopleNet

The PeopleNet models detect one or more physical objects from three categories within an image and return a box around each object, as well as a category label for each object. Three categories of objects detected by these models are:

  • persons

  • bags

  • faces

These models are based on NVIDIA DetectNet_v2 detector with ResNet34 as the feature extractor. This architecture, also known as GridBox object detection, uses bounding-box regression on a uniform grid on the input image. Gridbox system divides an input image into a grid which predicts four normalized bounding-box parameters (xc, yc, w, h) and confidence value per output class.

The raw normalized bounding-box and confidence detections need to be post-processed by a clustering algorithm such as DBSCAN or NMS to produce the final bounding-box coordinates and category labels.

Training algorithm

The training algorithm optimizes the network to minimize the localization and confidence loss for the objects. This model was trained using the DetectNet_v2 training app in TLT v3.0. The training is carried out in two phases. In the first phase, the network is trained with regularization to facilitate pruning. Following the first phase, we prune the network removing channels whose kernel norms are below the pruning threshold. In the second phase the pruned network is retrained.

For a quantized INT8 model, a third quantization-aware training (QAT) phase is carried out. Regularization is not included in second and third phase.

Intended use case

The primary use case intended for these models is detecting people in a color (RGB) image. The model can be used to detect people from photos and videos by using appropriate video or image decoding and pre-processing. As a secondary use case, the model can also be used to detect bags and faces from images or videos. However, these additional classes are not the main intended use for these models.

The datasheet for the model is captured in its model card hosted at NGC.