Warning

The Few-Shot Product Recognition reference application, first introduced in v1.0, is slated for re-implementation. This update aims to significantly improve its configurability, customizability, and scalability. As a result, the application is provided “as-is” in its current form, without further support.

We encourage users to utilize the existing assets and documentation to understand our reference architecture. For a more detailed proof-of-concept or pilot evaluation of our few-shot product recognition approach, we highly recommend examining the two AI models powering the application. These models can be combined quickly with a vector database to create an effective few-shot recognition pipeline:

Quickstart

Overview

Few-Shot Product Recognition, also called Retail Item Recognition with Few-Shot Learning (FSL), is a reference application for video analytics in retail self-checkout. FSL detects and recognizes a retail item while it is being captured by a video camera and scanned by a barcode scanner. The application predicts the barcode of the item based on the video frame. If the prediction does not match the barcode scanner’s output, the application labels the item as unknown and allows the user to add the item’s image and the barcode scanner result to the application database. When the same item is captured again, the application has a higher chance of recognizing it. There are two main ideas behind this application:

  • A retail item is detected from the scene. The embedding of that detection is extracted and compared against a database of embeddings. The barcode associated with the closest embedding found is the predicted label of the item. A mismatch between the predicted label and the ground truth provided by the barcode scanner signals an alert (see the sketch after this list).

  • The application logs all predictions. There are APIs to examine these logs and to further improve the system with previous predictions.
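
To make the first idea concrete, here is a minimal Python sketch of the matching step: a detection embedding is compared against a database of embeddings by cosine similarity, and the barcode of the closest match becomes the prediction. The function name, threshold value, and toy data are illustrative assumptions, not the application's actual implementation.

import numpy as np

def predict_barcode(query_embedding, db_embeddings, db_barcodes, threshold=0.7):
    # Normalize so that dot products equal cosine similarities.
    query = query_embedding / np.linalg.norm(query_embedding)
    db = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
    similarities = db @ query
    best = int(np.argmax(similarities))
    if similarities[best] < threshold:
        return None  # nothing close enough: treat the item as unknown
    return db_barcodes[best]

# Toy data standing in for real 2048-float embeddings and their barcodes.
db_embeddings = np.random.rand(5, 2048)
db_barcodes = ["0001", "0002", "0003", "0004", "0005"]
detection_embedding = db_embeddings[2] + 0.01 * np.random.rand(2048)

predicted = predict_barcode(detection_embedding, db_embeddings, db_barcodes)
scanned = "0002"
if predicted != scanned:
    print("ALERT: visual prediction does not match the scanned barcode")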

Architecture

A visual representation of the FSL pipeline is provided above. The video feed enters the VST microservice, and the output from VST goes to the Perception (DeepStream) microservice. This microservice generates a class-agnostic bounding box and an embedding vector for each detected object. These metadata are pushed to the Redis Message Broker. The Similarity Search microservice reads the metadata from the Message Broker and predicts the class of each object by finding its closest match in the embedding database. This metadata is also sent to the Message Broker, from where it flows into the ELK Stack.

The Embedding Generation and Recognition Evaluator (Model Assessor) are on-demand microservices. Embedding Generation computes the visual representation of a given image, and the Recognition Evaluator retrieves the latest predictions and tags them. The “unseen” tag is important because such predictions are candidates for improving the system.


Quick Deployment

The Few-Shot Product Recognition app does not have a Docker Compose Quickstart like other apps. The application is packaged as a Helm chart and can be deployed on Kubernetes clusters. Refer to the Production Deployment section for more information on Kubernetes deployment.


Explore the Results

The Few-Shot Product Recognition app does not come with a reference UI. However, it has several REST API endpoints for interacting with the system. By leveraging those endpoints, you can build your own UI.

The following is a suggested architecture and mockup of a possible UI:

One can connect to the web socket endpoint /results after calling /results/start and parse its results.

../_images/fsl_result_sample.png
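
A minimal Python sketch of this flow is shown below. The endpoint paths come from the text above; the host, port, and HTTP method are assumptions to be checked against the Few-Shot Product Recognition API reference, and the requests and websockets packages are assumed to be installed.

import asyncio
import json

import requests
import websockets

API = "http://<host_IP_address>:30080"  # hypothetical Web API address; adjust to your deployment

async def stream_results():
    # Start the results stream; the exact HTTP method and parameters are in the API reference.
    requests.post(f"{API}/results/start")

    # Consume predictions from the /results web socket and print them.
    async with websockets.connect(API.replace("http", "ws") + "/results") as ws:
        async for message in ws:
            print(json.loads(message))

asyncio.run(stream_results())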

To operate on the active learning part of the system, the /tune/results and /tune/feedback endpoints can be used. The first returns the predictions that can be used to improve the system, and the second feeds the selected predictions back into the system. A mock design:

../_images/fsl_feedback_sample.png
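
A possible way to drive this loop from Python is sketched below. The endpoint paths come from the text above, while the host, port, payload shape, and the "tag" field name are assumptions to verify against the Few-Shot Product Recognition API reference.

import requests

API = "http://<host_IP_address>:30080"  # hypothetical Web API address

# Fetch the recent predictions that are candidates for improving the system.
candidates = requests.get(f"{API}/tune/results").json()

# After vetting the candidates, feed the selected ones back into the system.
selected = [c for c in candidates if c.get("tag") == "unseen"]  # assumed field name
requests.post(f"{API}/tune/feedback", json=selected)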

API Tutorial Notebook

The Few-Shot Product Recognition app is an example of how a self-checkout lane can be augmented by incorporating computer vision into the process. Barcode scanners are better when combined with computer vision; for example, computer vision enables visual recognition of the items that the barcode scanner scans.

We have provided a sample notebook that shows some basic functionality of our system. It shows how to add a video source to the system, such as a camera pointing at the barcode scanner. It also gives an example of initializing the similarity search database, and it demonstrates improving the system through the feedback mechanism.

Once the services have been deployed on K8s and are in a ready state, you can experiment with the content provided within the notebook.

To access the notebook:

  1. Download the metropolis-apps-standalone-deployment package by following the instructions in Quickstart Setup. Navigate to metropolis-apps-standalone-deployment/notebooks/v1.1_reference_apps. A simple script is provided to run a Docker container, or you can use your favorite container image that provides Python and Jupyter Notebook.

  2. Once the Docker container is running, open <host_IP_address>:12345 in your browser. Note that the sample script maps the notebook port to 12345; update the address if you map the notebook port to something else.

  3. This should take you to the Jupyter Notebook landing page. Get the token from the terminal you used to run the sample script (the default token is mdx) and enter it on the landing page. If you run your own Docker container instead, use its token.

  4. Click the work folder on the left and start the notebook.

Note that the sample notebook assumes you have a running RTSP server. If you have a video to test with, you can create an RTSP server streaming your test video and feed that into the system. A sample video is included at metropolis-apps-data/videos/retail_object_h264.mp4 (to download metropolis-apps-data, follow the instructions in Quickstart Setup).

NVStreamer is one way of creating an RTSP stream for a video. To install NVStreamer, use the command below:

helm install nvstreamer https://helm.ngc.nvidia.com/rxczgrvsg8nx/vst-1-0/charts/nvstreamer-0.2.5.tgz --username='$oauthtoken' --password=<YOUR API KEY>

Note

  • The NVStreamer app is deployed as a standalone chart with user-provided videos. Videos need to be uploaded in the NVStreamer UI at http://<K8s_node_IP_address>:31000/.

  • See the NVStreamer documentation for more information about NVStreamer.

  • The default value of the loop playback parameter is false. If you want to loop the same video continuously, override the nv_streamer_loop_playback parameter to true. You can find more information on how to override NVStreamer parameters in the Production Deployment section of this document: NVStreamer.


Components

To further understand this reference application, here is a brief description of the key components.

VST

This component is developed at NVIDIA, and our system uses many of its features. One of the first steps is to add a video source, in the form of an RTSP server, to the system. Even if an RTSP stream already exists, it has to go through this video media service. Refer to the Video Storage Toolkit (VST) for more information.

Perception (DeepStream)

The Perception (DeepStream) component generates streaming perception metadata that is consumed by Metropolis microservices via the Redis message broker. The perception pipeline involves two AI models: an object detector that produces class-agnostic bounding boxes and a model that generates an embedding vector for each detected object.

The resulting messages act as input data to downstream Metropolis microservices. The messages have the following format:

{
  "version":"5.0",
  "id":"286045",
  "@timestamp":"2021-09-02T10:08:59.493Z",
  "sensorId":"Camera D5",
  "objects":["10037|1506|801|1620|897|Product"]
  "embeddings":[<2048 floats>]
 }

The key attributes of the message are:

  • version: The version of the schema

  • id: The video frame-ID

  • @timestamp: The camera timestamp

  • sensorId: Unique sensor-ID

  • objects: List of objects detected in a video frame, with each object element having a unique object ID, bounding box coordinates and the object type.

  • embeddings: List of embedding vectors representing the detected objects.

For more information on the schema and contents of the sensor metadata, refer to the Protobuf Schema and JSON Schema sections.
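
As an illustration, the sketch below consumes these messages from Redis and unpacks the objects field. The stream key, the payload field name, and the use of Redis streams (rather than pub/sub) are assumptions; verify them against your deployment before relying on this.

import json

import redis

r = redis.Redis(host="<K8s_node_IP_address>", port=6379)

STREAM = "fsl-perception"  # assumed stream key for perception metadata
last_id = "$"              # only read messages that arrive after we start

while True:
    for _, entries in r.xread({STREAM: last_id}, block=5000) or []:
        for entry_id, fields in entries:
            last_id = entry_id
            msg = json.loads(fields[b"metadata"])  # assumed field holding the JSON payload
            for obj in msg["objects"]:
                obj_id, left, top, right, bottom, obj_type = obj.split("|")
                print(msg["sensorId"], obj_id, obj_type, (left, top, right, bottom))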

Recognition Evaluator

This component works as a tagger for the predictions. It runs periodically and evaluates the most recent predictions by comparing their metadata with their barcode signal. If the barcode signal and the system’s prediction match, this component tags the transaction as correct.

If they do not match, the similarity score is evaluated. If the score is below a certain threshold, the prediction is tagged as wrong; if it is greater, the item is treated as previously unseen. These unseen tags are of particular interest because they can be used to improve the system. You can analyze these transactions and vet which of them go back into the system. This way, the system actively learns from its mistakes with very little human interference. Once the system is improved, it will likely predict similar items correctly. This feedback loop continues so that the system continuously rectifies itself by actively learning new items.
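
The decision rule described above can be summarized in a few lines of Python; the threshold value and the tag names are placeholders rather than the evaluator's actual configuration.

def tag_prediction(predicted_barcode, scanned_barcode, similarity, threshold=0.7):
    # Mirror the evaluator's rule: correct, wrong, or previously unseen.
    if predicted_barcode == scanned_barcode:
        return "correct"
    if similarity < threshold:
        return "wrong"    # low similarity: the model simply misrecognized the item
    return "unseen"       # high similarity but barcode mismatch: likely a new item

print(tag_prediction("0001", "0002", similarity=0.91))  # -> unseen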

Embedding Generation

This component is the service that computes the visual representations (i.e., embedding vectors) of input images. It can be used in bulk mode, where several images or folders are provided and the service computes their representations and inserts them into the similarity search database; this mode is useful for initializing the database. It can also compute the representation of a single image, which is used in the feedback flow to improve the system.

Web API

This component is the entry gate to the system. It provides API endpoints to manage the system and to further tune it for better results using previous predictions. Refer to the Few-Shot Product Recognition API for details of the API endpoints. Also refer to the API Tutorial Notebook for sample usage of those endpoints.

Redis Message Broker

Redis is the message broker used in our system. It is used for internal messaging between services as well as for carrying the system’s outputs over to the ELK stack.

ELK Stack

The Elasticsearch, Logstash, and Kibana (ELK) stack is used in our system for persistence. Every detection and prediction is recorded. These records can be analyzed later and evaluated for other purposes. Kibana can be employed as an alternative way of monitoring the system output.
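
As one example of examining these records programmatically, the sketch below queries Elasticsearch directly. The index name is a placeholder; check which indices your deployment actually writes to (for example, on Kibana's index management page) before using it.

import requests

ES = "http://<K8s_node_IP_address>:9200"  # default Elasticsearch port; adjust to your deployment
INDEX = "fsl-predictions"                 # placeholder index name

# Fetch the ten most recent records from the assumed index.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    json={"size": 10, "sort": [{"@timestamp": "desc"}]},
)
for hit in resp.json().get("hits", {}).get("hits", []):
    print(hit["_source"])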

MongoDB

This component records the images inserted into the similarity search database. It is also used to keep track of the evaluation results from the evaluation service.


Conclusion

Congratulations! You have successfully learned about the architecture & key microservices of the Few-Shot Product Recognition application.

We encourage you to explore the remaining reference applications provided as part of Metropolis Microservices. Below are additional resources:

  • Quickstart Setup - Guide to deploying all reference applications in standalone mode via Docker Compose.

  • Production Deployment Setup - Guide to deploying Metropolis microservices in a Kubernetes environment.

  • FAQs - A list of commonly asked questions.