Metric Learning Recognition (MLRecogNet) is a classifier that encodes each input image as an embedding vector and predicts its label by comparing that vector with the embedding vectors in the reference space. MLRecogNet consists of two parts:

Trunk: A backbone network that encodes the input image to a feature vector.

Embedder: A fully connected layer that maps the feature vector to the embedding space.
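
The two-stage structure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_trunk` is a hypothetical stand-in (per-channel average pooling) for a real backbone, and the names `embedder`, `encode`, `weights`, and `bias` are invented for this example rather than taken from MLRecogNet.

```python
from typing import List

# Hypothetical image layout for this sketch: channels x height x width.
Image = List[List[List[float]]]


def toy_trunk(image: Image) -> List[float]:
    """Stand-in for the backbone: one feature per channel (global average pooling)."""
    features = []
    for channel in image:
        pixels = [p for row in channel for p in row]
        features.append(sum(pixels) / len(pixels))
    return features


def embedder(features: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """A single fully connected layer: embedding = W * features + b."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def encode(image: Image, weights: List[List[float]], bias: List[float]) -> List[float]:
    """Trunk followed by embedder: image -> feature vector -> embedding vector."""
    return embedder(toy_trunk(image), weights, bias)


image = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0, mean 4.0
    [[2.0, 2.0], [2.0, 2.0]],  # channel 1, mean 2.0
]
weights = [[1.0, 0.0], [0.5, 0.5]]  # illustrative values, not trained weights
bias = [0.0, 1.0]

embedding = encode(image, weights, bias)
print(embedding)
```

In a real model the trunk would be a trained convolutional network and the embedder weights would be learned; only the composition order (trunk first, embedder second) is taken from the description above.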

The embedding space is a high-dimensional space in which embedding vectors of the same class lie close together and embedding vectors of different classes lie far apart. The embedder is trained to minimize the distance between embedding vectors of the same class and to maximize the distance between embedding vectors of different classes. To predict the labels of the query images, their embedding vectors are compared with the embedding vectors of the reference images.
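
The training objective described above can be illustrated with a triplet margin loss. This is an assumption made for illustration: the text does not state which loss MLRecogNet uses, and triplet loss is just one common metric-learning objective. The function names and the `margin` value are hypothetical.

```python
import math
from typing import List


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor: List[float], positive: List[float],
                        negative: List[float], margin: float = 1.0) -> float:
    """Zero once the same-class pair is closer than the different-class pair
    by at least `margin`; positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # same class as the anchor, distance 1
negative = [3.0, 4.0]  # different class, distance 5

# Same-class vector is already much closer: nothing left to penalize.
well_separated = triplet_margin_loss(anchor, positive, negative)

# Swapping the roles simulates a badly arranged embedding space.
violating = triplet_margin_loss(anchor, negative, positive)

print(well_separated, violating)
```

Minimizing such a loss pulls same-class embeddings together and pushes different-class embeddings apart, which is the behavior the paragraph above describes.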

The currently supported trunk is ResNet, one of the most commonly used baselines for vision classification. The currently supported embedder is a one-layer MLP.

During training, evaluation, and inference, MLRecogNet requires a reference set and a query set for validation or testing. The reference set is a collection of labeled images, while the query set is a group of unlabeled images. The goal is to predict the labels of the query images by comparing their embedding vectors with the embedding vectors of the reference set, as generated by the trained MLRecogNet.
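
The reference/query comparison can be sketched as a nearest-neighbor lookup. The sketch assumes cosine similarity and a single nearest neighbor; the text only says the embeddings are compared by similarity, so both choices, along with the function names and the toy embeddings, are illustrative assumptions rather than MLRecogNet's documented behavior.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_labels(query_embeddings: List[List[float]],
                   reference_embeddings: List[List[float]],
                   reference_labels: List[str]) -> List[str]:
    """Give each query the label of its most similar reference embedding."""
    predictions = []
    for query in query_embeddings:
        similarities = [cosine_similarity(query, ref) for ref in reference_embeddings]
        best = max(range(len(similarities)), key=similarities.__getitem__)
        predictions.append(reference_labels[best])
    return predictions


# Toy embeddings standing in for the output of a trained model.
reference_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
reference_labels = ["cat", "cat", "dog"]
query_embeddings = [[0.8, 0.2], [0.1, 0.9]]

predicted = predict_labels(query_embeddings, reference_embeddings, reference_labels)
print(predicted)
```

The reference labels are never used to compute similarity; they are only read off once the closest reference embedding has been found, which is why the query set can stay unlabeled.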