1. Introduction

In this tutorial, we’ll explain how neural networks process and recognize images. Neural networks can solve various types of problems with images; for instance, two of the most popular are image classification and object detection. We’ll explore how neural networks solve these problems, explaining the process and its mechanics.

First, we’ll introduce all the problems, and after that, we’ll explain each of them in more detail. We’ll also present some of the most popular applications and examples.

2. Neural Networks and Images

Neural networks have revolutionized the field of computer vision by enabling machines to recognize and analyze images. They have become increasingly popular due to their ability to learn complex patterns and features. In particular, convolutional neural networks (CNNs) are the most popular type of neural network used in image processing.

Vision transformers (ViTs) have also become more and more popular recently, due to the breakthrough achievements of generative pre-trained transformers (GPT) and other transformer-based architectures in natural language processing.

Generally speaking, neural networks process and recognize images in different ways, depending on the network architecture and the problem we must solve. Some of the most common problems that neural networks solve with images include:

  • Image classification – assigning a label or category to an image, for example, deciding whether an image contains a cat or a dog
  • Object detection – identifying and locating objects within an image
  • Image segmentation – converting an image into a collection of pixel regions, represented by a mask or a labelled image
  • Image generation – generating new images based on certain criteria or characteristics

There are some other problems that neural networks solve with images, including image captioning, image restoration, landmark detection, human pose estimation, and style transfer, but we won’t cover them in this article.

3. Image Classification

Image classification is the most popular task in computer vision, where we train a neural network to assign a label or category to an input image. This can be accomplished using various techniques, but the most common are convolutional neural networks (CNNs).

3.1. Convolutional Neural Networks

CNNs comprise multiple layers, including convolutional layers, pooling layers, and fully connected layers. The convolutional layers are the heart of the network and are responsible for learning features from the input image. Specifically, they apply a series of filters to the image, each capturing a particular pattern or feature, such as edges, textures, or shapes.

For example, in the image below, we apply convolution to the matrix I with the filter K. The filter K slides over the whole matrix I, and at each position, element-wise multiplication is applied between the corresponding elements of the matrix I and the filter K. After that, we sum the results of this element-wise multiplication into one number:

[Image: convolution of a matrix I with a filter K]
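
To make this concrete, here’s a minimal NumPy sketch of the sliding-window operation described above. Strictly speaking, deep learning libraries implement cross-correlation (the filter isn’t flipped), and the matrices I and K below are toy values chosen purely for illustration:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over the image,
    multiply element-wise, and sum each window into one number."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h, out_w = ih - kh + 1, iw - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(window * kernel)
    return output

# Toy example: a 5x5 input matrix I and a 3x3 filter K
I = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
])
K = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])
print(convolve2d(I, K))  # a 3x3 feature map
```

Each output element is the sum of one window’s element-wise products, so a 5x5 input and a 3x3 filter yield a 3x3 feature map.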

The ReLU activation function is commonly applied after the convolutional layer, followed by a pooling layer. The pooling layer slides a window over the input in the same way as the convolutional layer, but instead of computing a convolution, it takes only the maximum or average value inside each window. In the image below, we can see an example of a convolutional layer, ReLU, and max pooling:

[Image: convolutional layer, ReLU, and max pooling]
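
As a rough sketch, ReLU and max pooling can be expressed in a few lines of NumPy; the feature map below is a made-up example:

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: negative activations become zero."""
    return np.maximum(0, x)

def max_pool2d(feature_map, size=2, stride=2):
    """Slide a size x size window over the feature map and keep
    only the maximum value inside each window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            output[i, j] = window.max()
    return output

feature_map = np.array([
    [1., -2.,  3.,  0.],
    [-1., 5., -3.,  2.],
    [4.,  0.,  1., -1.],
    [0.,  2., -2.,  3.],
])
pooled = max_pool2d(relu(feature_map))
print(pooled)  # 2x2 map keeping the strongest activation per window
```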

Over the years, several CNN architectures have been developed, each with its own unique features and advantages. Some of the most popular are:

  • VGG16
  • InceptionNet
  • ResNets
  • NFNets
  • EfficientNets and others
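
In practice, we rarely train these architectures from scratch. As a minimal sketch, here’s how we might classify an image with a VGG16 pretrained on ImageNet using torchvision (assuming torchvision 0.13+ for the weights API; the file name cat.jpg is a placeholder):

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load a VGG16 pretrained on ImageNet
model = models.vgg16(weights="DEFAULT")
model.eval()

# Standard ImageNet preprocessing: resize, center-crop, normalize
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("cat.jpg").convert("RGB")  # placeholder file name
batch = preprocess(image).unsqueeze(0)        # add a batch dimension

with torch.no_grad():
    logits = model(batch)                     # (1, 1000) class scores

print(logits.softmax(dim=1).argmax(dim=1))    # index of the predicted ImageNet class
```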

3.2. Vision Transformers

The key idea behind vision transformers is to apply the transformer architecture, originally designed for natural language processing tasks, to image processing tasks. The transformer architecture consists of self-attention mechanisms, which allow the model to attend to different parts of the input sequence when making predictions.
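
At its core, self-attention is a short formula. Here’s a minimal sketch of single-head scaled dot-product attention; a real transformer adds learned projections for the queries, keys, and values, multiple heads, and residual connections:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V: each output token is a weighted
    average of all value vectors, so every token can attend to every other."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # pairwise similarities
    weights = F.softmax(scores, dim=-1)          # attention weights per token
    return weights @ v

# Toy sequence: 196 patch tokens with dimension 64
tokens = torch.randn(1, 196, 64)
out = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)  # torch.Size([1, 196, 64])
```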

In image processing, the input to the transformer model is a sequence of image patches rather than the entire image. These patches are then processed by a series of transformer blocks, enabling the model to capture local and global information:

[Image: vision transformer (ViT) architecture]
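
The patch-extraction step is easy to illustrate. Here’s a small PyTorch sketch that splits a 224x224 image into 16x16 patches and flattens each one into a token vector, as in ViT-Base (in a real ViT, a learned linear projection and position embeddings follow):

```python
import torch

# Toy batch: one 224x224 RGB image
images = torch.randn(1, 3, 224, 224)
patch_size = 16  # ViT-Base uses 16x16 patches

# unfold splits the height and width dimensions into non-overlapping patches
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# shape: (1, 3, 14, 14, 16, 16) -> a 14x14 grid of patches

# Flatten each patch into a vector, giving a sequence of 196 "tokens"
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768])
```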

Vision transformers have achieved state-of-the-art performance on benchmark datasets, including ImageNet and COCO. However, they typically require significantly more computational resources than traditional CNNs, which can make them less practical for certain applications.

4. Object Detection

Object detection is the task of detecting objects within an image or video by assigning each a class label and a bounding box. In other words, it takes an image as input and generates one or more bounding boxes, each with a class label attached.

Object detection is a combination of two tasks:

  • Object localization
  • Image classification

Algorithms for object localization identify an object’s location in an image and indicate its position by drawing a box around it. These algorithms take an image containing one or more objects as input and provide the location of the objects by specifying the position, height, and width of the bounding boxes:

[Image: bounding box drawn around a detected cat]
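
A standard way to measure how well a predicted bounding box matches a ground-truth box is Intersection over Union (IoU). Here’s a minimal sketch, assuming boxes given as (x_min, y_min, x_max, y_max) with made-up coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14: weak overlap
```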

4.1. Object Detection with Neural Networks

Similar to image classification, CNNs are commonly used for this task. We can train the CNN on a dataset of labelled images, each with bounding boxes and class labels identifying the objects in the image. During training, the network learns to identify and classify objects in the image and locate them using bounding boxes.

The most popular neural network architectures for object detection are:

  • You Only Look Once (YOLO)
  • Region-Based Convolutional Neural Networks (R-CNN, Fast R-CNN, etc.)
  • Single Shot Detector (SSD)
  • RetinaNet
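
As an illustration, torchvision ships pretrained versions of some of these detectors. Here’s a sketch that runs a Faster R-CNN pretrained on COCO over an image (assuming torchvision 0.13+; street.jpg is a placeholder file name):

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Faster R-CNN with a ResNet-50 FPN backbone, pretrained on COCO
model = models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("street.jpg").convert("RGB")  # placeholder file name
tensor = transforms.ToTensor()(image)            # detection models take 0-1 tensors

with torch.no_grad():
    prediction = model([tensor])[0]  # one dict of detections per input image

# Each detection comes with a bounding box, a COCO label, and a confidence score
for box, label, score in zip(prediction["boxes"],
                             prediction["labels"],
                             prediction["scores"]):
    if score > 0.8:
        print(label.item(), round(score.item(), 2), box.tolist())
```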

4.2. You Only Look Once (YOLO)

YOLO is one of the most popular neural network architectures and object detection algorithms. The YOLO algorithm divides the input image into a grid and predicts bounding boxes and class probabilities for each grid cell. It predicts the class probabilities and locations of multiple objects in a single pass through the network, making it faster and more efficient than other object detection algorithms.

To filter out overlapping bounding boxes and select the most accurate one, it uses a technique called non-max suppression. Non-max suppression works by selecting the bounding box with the highest confidence score. After that, it removes any other boxes that overlap with it by more than a certain threshold:

[Image: non-max suppression keeping the best of several overlapping boxes]
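
Here’s a minimal NumPy sketch of this greedy procedure; the boxes, scores, and the 0.5 threshold are illustrative values rather than YOLO’s exact implementation, and in practice NMS is usually applied per class:

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop every
    remaining box that overlaps it by more than the threshold."""
    order = np.argsort(scores)[::-1]  # indices, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = np.array([iou(boxes[best], boxes[i]) for i in rest])
        order = rest[overlaps <= iou_threshold]  # discard heavy overlaps
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]])
scores = np.array([0.9, 0.8, 0.75])
print(non_max_suppression(boxes, scores))  # [0, 2]: the near-duplicate box 1 is removed
```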

YOLO has multiple versions, with each version improving upon the previous one. More information on YOLO algorithms can be found in our article here.

5. Image Segmentation

Neural networks are a popular tool for image segmentation, and there are several types of image segmentation that we can perform with them. Some of the most common are:

  • Semantic segmentation
  • Instance segmentation
  • Boundary detection
  • Panoptic segmentation

5.1. Semantic Segmentation

Semantic segmentation involves assigning a class label to every pixel in the image. This means that if there are two or more objects of the same class in the image, semantic segmentation returns a single mask covering all the objects of that class:

[Image: semantic segmentation mask]
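
To illustrate, here’s a sketch using torchvision’s pretrained DeepLabV3 (assuming torchvision 0.13+; dogs.jpg is a placeholder file name): the model outputs per-class scores for every pixel, and taking the argmax over the class dimension yields a single labelled mask:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# DeepLabV3 with a ResNet-50 backbone, pretrained for semantic segmentation
model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("dogs.jpg").convert("RGB")  # placeholder file name
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"]  # (1, num_classes, H, W) per-pixel class scores

# The most likely class per pixel: all objects of the same class share one label
mask = logits.argmax(dim=1).squeeze(0)
print(mask.shape, mask.unique())
```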