1. Overview
In this tutorial, we’ll explain the definitions of several object recognition tasks and illustrate their main differences. Furthermore, we’ll present the state-of-the-art machine learning models designed to address these computer vision tasks.
2. What Is Object Recognition?
Object recognition is a generic term to indicate a set of computer vision tasks for identifying objects in digital images. We can distinguish several tasks that belong to the object recognition field: image classification, object localization, object detection, semantic segmentation, and instance segmentation. Nowadays, machine learning and deep learning techniques are the best way to perform object recognition tasks.
3. Object Recognition Tasks
Let’s now analyze in detail the main object recognition tasks.
3.1. Image Classification
Image classification deals with assigning a single label to an image. Hence, an image classification system takes as input an image with one or more objects and returns a single label. For example, given an image depicting a room, an image classification model would return one of the following labels: “kitchen”, “bathroom”, “living room”.
Convolutional Neural Networks (CNNs) are the most performing solution for image classification. The most common CNN architectures are AlexNet, VGG-16, Inception-v3, Xception, and EfficientNet.
3.2. Object Localization
Object localization deals with localizing objects by indicating their bounding boxes. Hence, an object localization algorithm takes as input an image and returns a set of bounding boxes, each one representing the location of a detected object. The bounding box is generally defined by a point (the centre or the top-left coordinate), the width and the height of the rectangle. An object localization system doesn’t return any label related to the category of the object.
An example of object localization is shown in the following figure:
3.3. Object Detection
Object detection is the task of locating objects in an image with bounding boxes and assigning a class label for each detection. Hence, an object detection system takes as input an image, with one or more objects, and returns a set of bounding boxes and the corresponding class labels:
Nowadays, the most performing CNN architectures for object detection are Faster R-CNN, YOLO, EfficientDet and Single Shot Detector (SSD).
3.4. Semantic Segmentation
Semantic segmentation deals with assigning a class label for each pixel of an image. It is worth noting that semantic segmentation cannot separate different instances of the same class. In other words, if there are two or more objects of the same class in the image, the semantic segmentation system will return a single mask including all the objects of the same class:
The main deep learning models for semantic segmentation are Fully Convolutional Network (FCN), U-Net and DeepLab.
3.5. Instance Segmentation
Instance segmentation deals with classifying each pixel of an image and, unlike semantic segmentation, it can distinguish different instances of the same class. An instance segmentation system generates a mask for each instance of a particular class. Hence, it can be considered as the combination of semantic segmentation and object detection:
The most important deep learning models for instance segmentation are Mask R-CNN, MaskLab, and TensorMask.
4. Conclusion
In this article, we reviewed the main tasks of object recognition. We illustrated the differences among them and we mentioned the machine learning models designed to address these computer vision tasks.