1. Introduction
Many deep-learning frameworks provide us with intuitive interfaces to set the layers, tune the hyper-parameters and evaluate our models. But to discuss the results properly, and more importantly, to understand how the networks work, we need to be familiar with fundamental concepts.
In this tutorial, we’ll talk about Backpropagation (or Backprop) and Feedforward Neural Networks.
2. Feedforward Neural Networks
Feedforward networks are the quintessential deep learning models. They’re made up of artificial neurons that are organized in layers.
2.1. How Does an Artificial Neuron Work?
A neuron is a rule that transforms an input vector $x = (x_1, x_2, \ldots, x_n)$ into the output signal we call the activation value $a$:
$$a = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right) \tag{1}$$
where $b$ is the neuron’s bias, the $w_i$ are the weights specific to the neuron, and $\sigma$ is the activation function such as ReLU or sigmoid. For example, here’s how the $j$-th neuron in a layer computes its output after receiving two input values: $a_j = \sigma\left(w_{j1} x_1 + w_{j2} x_2 + b_j\right)$.
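As a minimal sketch of Equation (1), the snippet below computes a single neuron's activation value in plain Python. The function names (`neuron_output`, `sigmoid`, `relu`) and all numeric values are illustrative assumptions, not part of the tutorial.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical neuron with two inputs (values chosen only for illustration).
relu_activation = neuron_output([0.5, -1.0], [2.0, 1.0], 0.5, relu)
sigmoid_activation = neuron_output([0.0, 0.0], [1.0, 1.0], 0.0, sigmoid)
```

Passing the activation as an argument keeps the weighted-sum step separate from the choice of non-linearity, which mirrors how the formula is written.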
2.2. Propagating Forward
A layer is an array of neurons. A network can have any number of layers between the input and the output ones. For instance:
In the image, $x_1$ and $x_2$ denote the input, $a^{(1)}_1$ and $a^{(1)}_2$ the hidden neurons’ outputs, and $a^{(2)}_1$ and $a^{(2)}_2$ are the output values of the network as a whole. The values of the weights and biases will be adjusted during the training phase.
The defining characteristic of feedforward networks is that they don’t have feedback connections at all. All the signals go only forward, from the input to the output layers.
If we had even a single feedback connection (directing the signal back to a neuron in a previous layer), we would have a Recurrent Neural Network.
2.3. Example
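Let’s suppose we want to develop a classifier for detecting if there’s a dog in an image. To keep things simple, we’ll pretend that we can do that by inspecting only the values of two grey-scale pixels $x_1$ and $x_2$.
Let’s say that the network has only one hidden layer and that the inputs are $x_1$ and $x_2$. Also, let’s suppose we use the identity function as the activation function: $\sigma(z) = z$.
To calculate the activation value $a^{(1)}_1$ of the first hidden neuron, we apply the formula (1), where $w^{(1)}_{1k}$ is the weight connecting the input $x_k$ to that neuron and $b^{(1)}_1$ is its bias:
$$a^{(1)}_1 = \sigma\left(w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + b^{(1)}_1\right) \tag{2}$$
Since we’re using the identity function as $\sigma$:
$$a^{(1)}_1 = w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + b^{(1)}_1 \tag{3}$$
We do the same for $a^{(1)}_2$, $a^{(2)}_1$, and $a^{(2)}_2$. For the latter two, the inputs are the values of $a^{(1)}_1$ and $a^{(1)}_2$.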
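The forward pass described in this example can be sketched as follows for a network with two inputs, two hidden neurons, and two outputs, all using the identity activation. The weights, biases, and input values are hypothetical and chosen only to make the arithmetic easy to follow; they aren't the values from the tutorial's figure, and the helper names are assumptions.

```python
def identity(z):
    """Identity activation: returns its input unchanged."""
    return z

def layer_forward(inputs, weights, biases, activation):
    """Compute one layer's outputs; weights[j][k] connects input k to neuron j."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers, activation=identity):
    """Propagate the signal forward, layer by layer, from input to output."""
    signal = inputs
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases, activation)
    return signal

# Hypothetical parameters: (weights, biases) for the hidden and output layers.
layers = [
    ([[0.5, -0.5], [1.0, 0.0]], [0.0, 1.0]),  # hidden layer
    ([[1.0, 1.0], [2.0, -1.0]], [0.5, 0.0]),  # output layer
]
outputs = forward([2.0, 4.0], layers)
```

With these numbers, the hidden activations are -1.0 and 3.0, so the two output values come out as 2.5 and -5.0. Note that the signal only ever moves from one layer to the next, never backward.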
3. Backpropagation
When training a neural network, the cost value quantifies the network’s error, i.e., its output’s deviation from the ground truth. We calculate it as the average error over all the objects in the training set, and our goal is to minimize it.
3.1. Cost Function
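For example, let’s say we have a network that classifies animals either as cats or dogs. It has two output neurons $a^{(2)}_1$ and $a^{(2)}_2$, where the former represents the probability that the animal is a cat and the latter that it’s a dog. Given an image of a cat, we expect $a^{(2)}_1 = 1$ and $a^{(2)}_2 = 0$.
However, if the network outputs other values of $a^{(2)}_1$ and $a^{(2)}_2$, we can quantify our error on that image as the squared distance:
$$\left(a^{(2)}_1 - 1\right)^2 + \left(a^{(2)}_2 - 0\right)^2 \tag{4}$$
We compute $C$, the cost for the entire dataset, as the average error over individual samples. So, if $y_i$ is the ground truth for the $i$-th training sample ($i = 1, 2, \ldots, n$), and $\hat{y}_i$ is our network’s output, the total cost is:
$$C = \frac{1}{n} \sum_{i=1}^{n} \left\lVert \hat{y}_i - y_i \right\rVert^2 \tag{5}$$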
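A minimal sketch of this cost computation, assuming each ground truth and each network output is a plain list of numbers. The function names and the two-sample dataset are hypothetical.

```python
def sample_error(expected, predicted):
    """Squared distance between the desired and the actual output vectors."""
    return sum((a - y) ** 2 for y, a in zip(expected, predicted))

def total_cost(targets, outputs):
    """Average of the per-sample errors over the whole dataset."""
    errors = [sample_error(y, a) for y, a in zip(targets, outputs)]
    return sum(errors) / len(errors)

# Hypothetical dataset: a cat image (target [1, 0]) and a dog image (target [0, 1]).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]
cost = total_cost(targets, outputs)
```

Here the first sample contributes an error of 0.5 and the second is classified perfectly, so the average cost is 0.25.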
We use the cost to update the weights and biases so that the actual outputs get as close as possible to the desired values. To decide whether to increase or decrease a coefficient, we calculate the partial derivative of the cost with respect to that coefficient using backpropagation. Let’s explain it with an example.
3.2. Partial Derivatives
Let’s say we have only one neuron in the input, hidden, and output layers:
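where $\sigma$ is the identity function.
To update the weights and biases, we need to see how $C$ reacts to small changes in those parameters. We can do that by computing the partial derivatives of $C$ with respect to them. But before that, let’s recap how the variables in our problem are related, with $x$ as the input and $y$ as the desired output:
$$z^{(1)} = w^{(1)} x + b^{(1)}, \quad a^{(1)} = \sigma\left(z^{(1)}\right), \quad z^{(2)} = w^{(2)} a^{(1)} + b^{(2)}, \quad a^{(2)} = \sigma\left(z^{(2)}\right), \quad C = \left(a^{(2)} - y\right)^2$$
So, if we want to see how changing $w^{(2)}$ affects the cost function, we should compute the partial derivative by applying the chain rule of calculus:
$$\frac{\partial C}{\partial w^{(2)}} = \frac{\partial C}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}} \tag{6}$$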
3.3. Example: Computing Partial Derivatives
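In this example, we’ll solve Equation (6), focusing only on the weight $w^{(2)}$ to show how the calculation goes, but the method is the same for the other weight and the biases.
First, we calculate how the cost function varies with the output. Since $C = \left(a^{(2)} - y\right)^2$:
$$\frac{\partial C}{\partial a^{(2)}} = 2\left(a^{(2)} - y\right) \tag{7}$$
Then, we need to calculate how the output value is affected by small changes in $z^{(2)}$. For this, we find the derivative of the activation function. Since we chose the identity function as the activation function, the derivative is 1:
$$\frac{\partial a^{(2)}}{\partial z^{(2)}} = \sigma'\left(z^{(2)}\right) = 1 \tag{8}$$
Now, the only term missing is the partial derivative of $z^{(2)}$ with respect to the weight. Since $z^{(2)} = w^{(2)} a^{(1)} + b^{(2)}$, the partial derivative will be:
$$\frac{\partial z^{(2)}}{\partial w^{(2)}} = a^{(1)} \tag{9}$$
Now we have all the terms and we can calculate how the cost function is affected by a change in the weight $w^{(2)}$:
$$\frac{\partial C}{\partial w^{(2)}} = 2\left(a^{(2)} - y\right) \cdot 1 \cdot a^{(1)} = 2\left(a^{(2)} - y\right) a^{(1)} \tag{10}$$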
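The three chain-rule factors above can be sketched in code for the one-neuron-per-layer network with the identity activation. As a sanity check, the snippet also compares the result with a central finite-difference estimate. All names and numeric values are hypothetical.

```python
def forward_chain(x, w1, b1, w2, b2):
    """Forward pass with identity activations: returns (a1, a2)."""
    a1 = w1 * x + b1
    a2 = w2 * a1 + b2
    return a1, a2

def cost(a2, y):
    """Squared error for a single sample."""
    return (a2 - y) ** 2

def grad_w2(x, y, w1, b1, w2, b2):
    """dC/dw2 as the product of the three chain-rule factors."""
    a1, a2 = forward_chain(x, w1, b1, w2, b2)
    dC_da2 = 2 * (a2 - y)   # how the cost varies with the output
    da2_dz2 = 1.0           # derivative of the identity activation
    dz2_dw2 = a1            # z2 = w2 * a1 + b2
    return dC_da2 * da2_dz2 * dz2_dw2

# Hypothetical input, target, and parameters.
x, y = 1.0, 2.0
w1, b1, w2, b2 = 0.5, 0.5, 3.0, 0.0

analytic = grad_w2(x, y, w1, b1, w2, b2)

# Central finite difference: nudge w2 slightly and watch how the cost reacts.
eps = 1e-6
cost_plus = cost(forward_chain(x, w1, b1, w2 + eps, b2)[1], y)
cost_minus = cost(forward_chain(x, w1, b1, w2 - eps, b2)[1], y)
numeric = (cost_plus - cost_minus) / (2 * eps)
```

With these values, the hidden activation is 1.0 and the output is 3.0, so the analytic derivative is 2.0, and the numerical estimate agrees up to rounding error.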
3.4. Backpropagation During Training
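Let’s suppose that, at a certain moment during training, we know the hidden activation $a^{(1)}$, the desired output $y$, and the current output $a^{(2)}$. Using backpropagation, we compute the partial derivative of $C$ with respect to $w^{(2)}$:
$$\frac{\partial C}{\partial w^{(2)}} = 2\left(a^{(2)} - y\right) a^{(1)} \tag{11}$$
Now, the last step is to update the weight by subtracting the calculated value multiplied by the learning rate $\eta$:
$$w^{(2)} \leftarrow w^{(2)} - \eta \, \frac{\partial C}{\partial w^{(2)}} \tag{12}$$
This is the partial derivative for only one sample. To get the derivative for the whole dataset, we should average the individual derivatives, where $C_i$ is the cost on the $i$-th of the $n$ training samples:
$$\frac{\partial C}{\partial w^{(2)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial C_i}{\partial w^{(2)}} \tag{13}$$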
We can imagine the computational cost of doing all these calculations for thousands of parameters and samples, and only after that updating the weights and biases.
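To make the update step concrete, here is a minimal sketch that averages the per-sample derivatives of the output weight and then takes one gradient-descent step. It assumes the one-neuron-per-layer network with the identity activation from this section. The samples, the starting weight, and the learning rate of 0.1 are hypothetical values picked for illustration.

```python
def grad_w2_single(a1, y, w2, b2):
    """Per-sample dC/dw2 = 2 * (a2 - y) * a1, with a2 = w2 * a1 + b2."""
    a2 = w2 * a1 + b2
    return 2 * (a2 - y) * a1

def gradient_descent_step(w2, b2, samples, learning_rate):
    """Average the per-sample derivatives, then move w2 against the gradient."""
    grads = [grad_w2_single(a1, y, w2, b2) for a1, y in samples]
    average_grad = sum(grads) / len(grads)
    return w2 - learning_rate * average_grad, average_grad

# Hypothetical samples: (hidden activation a1, desired output y).
samples = [(1.0, 2.0), (2.0, 4.0)]
new_w2, average_grad = gradient_descent_step(3.0, 0.0, samples, learning_rate=0.1)
```

For these numbers, the two per-sample derivatives are 2.0 and 8.0, which average to 5.0. Since the derivative is positive, the step lowers the weight from 3.0 toward 2.0, the value that fits both samples exactly.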
3.5. Training Variations
Instead of averaging the error over the entire training set before each update, we can calculate the error for a smaller group, let’s say 1000 samples, and update the weights based on the cost for that group only. This is mini-batch gradient descent.
There’s also stochastic gradient descent, in which we update the weights and biases after calculating the error for each individual sample in the training set.
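The three variants differ only in how many samples contribute to each update, which the following sketch makes explicit through a single `batch_size` parameter. It fits a hypothetical one-weight model $a = w x$ with squared error; the data, learning rate, number of epochs, and function names are all assumptions made for illustration.

```python
import random

def iterate_batches(samples, batch_size, rng):
    """Yield shuffled groups of at most batch_size samples."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def train(samples, batch_size, epochs=50, learning_rate=0.05, seed=0):
    """Fit the one-weight model a = w * x by gradient descent."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for batch in iterate_batches(samples, batch_size, rng):
            # Average derivative of (w * x - y)^2 over the current group only.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

# Hypothetical noise-free data generated by y = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_full_batch = train(samples, batch_size=len(samples))  # one update per epoch
w_mini_batch = train(samples, batch_size=2)             # mini-batch
w_stochastic = train(samples, batch_size=1)             # one update per sample
```

On this toy data, all three settings converge to a weight close to 2. On real datasets, smaller batches give cheaper but noisier updates, so the choice is a trade-off rather than a matter of correctness.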
3.6. General Formula
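We can define a general formula to calculate the derivative of a weight $w^{(l)}_{jk}$ that connects the neuron $k$ in the layer $l-1$ to the neuron $j$ in the layer $l$.
First, let’s remember that the activation value of a neuron $j$ in the layer $l$ is:
$$a^{(l)}_j = \sigma\left(z^{(l)}_j\right), \qquad z^{(l)}_j = \sum_{k} w^{(l)}_{jk} \, a^{(l-1)}_k + b^{(l)}_j \tag{14}$$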
Visually:
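For the general case, with $z^{(l)}_j$ denoting the weighted input of neuron $j$ in layer $l$ (the argument of $\sigma$ in Equation (14)), Equation (6) becomes:
$$\frac{\partial C}{\partial w^{(l)}_{jk}} = \frac{\partial C}{\partial a^{(l)}_j} \cdot \frac{\partial a^{(l)}_j}{\partial z^{(l)}_j} \cdot \frac{\partial z^{(l)}_j}{\partial w^{(l)}_{jk}} = \frac{\partial C}{\partial a^{(l)}_j} \, \sigma'\left(z^{(l)}_j\right) a^{(l-1)}_k \tag{15}$$
Following the same approach, we can derive the general formula for the partial derivative of the bias units:
$$\frac{\partial C}{\partial b^{(l)}_j} = \frac{\partial C}{\partial a^{(l)}_j} \cdot \frac{\partial a^{(l)}_j}{\partial z^{(l)}_j} \cdot \frac{\partial z^{(l)}_j}{\partial b^{(l)}_j} = \frac{\partial C}{\partial a^{(l)}_j} \, \sigma'\left(z^{(l)}_j\right) \tag{16}$$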
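The general formulas for the weights and biases can be sketched for a fully connected network of any depth as shown below. The sketch assumes a squared-error cost on a single sample. It also includes one step the text doesn't spell out: to reach earlier layers, the derivative of the cost with respect to a layer's inputs is obtained by summing the contributions of all the neurons those inputs feed into. The function name `backprop` and the data layout are assumptions.

```python
def backprop(x, y, layers, act, act_prime):
    """Return [(dC/dW, dC/db), ...] per layer for C = sum_j (a_j - y_j)^2.

    layers is a list of (weights, biases); weights[j][k] connects neuron k
    of the previous layer to neuron j of the current one.
    """
    # Forward pass, remembering every weighted input z and activation a.
    activations = [x]
    zs = []
    for weights, biases in layers:
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b
             for row, b in zip(weights, biases)]
        zs.append(z)
        activations.append([act(v) for v in z])

    # Derivative of the cost with respect to the network's outputs.
    dC_da = [2 * (a - t) for a, t in zip(activations[-1], y)]

    grads = [None] * len(layers)
    for l in reversed(range(len(layers))):
        weights, _ = layers[l]
        # Bias derivative: dC/da_j * sigma'(z_j).
        delta = [d * act_prime(z) for d, z in zip(dC_da, zs[l])]
        # Weight derivative: the bias derivative times the incoming activation.
        dW = [[dj * ak for ak in activations[l]] for dj in delta]
        grads[l] = (dW, delta)
        # Pass the derivative back to the previous layer's activations.
        dC_da = [sum(delta[j] * weights[j][k] for j in range(len(weights)))
                 for k in range(len(activations[l]))]
    return grads

# Hypothetical one-neuron-per-layer network with the identity activation.
layers = [([[0.5]], [0.5]), ([[3.0]], [0.0])]
grads = backprop([1.0], [2.0], layers, act=lambda z: z, act_prime=lambda z: 1.0)
```

For this tiny network, the output-layer derivatives are 2.0 for both the weight and the bias, and the hidden-layer derivatives are 6.0, which matches differentiating the cost by hand.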
3.7. Backpropagation = Calculating Derivatives ≠ Training
We shouldn’t confuse the backpropagation algorithm with the training algorithms. Backpropagation is a strategy to compute the gradient in a neural network. The method that uses that gradient to update the parameters is the training algorithm. Examples include Gradient Descent, Stochastic Gradient Descent, and Adaptive Moment Estimation (Adam).
Lastly, since backpropagation is a general technique for calculating the gradients, we can use it for any differentiable function built by composing simpler operations, not just neural networks. Additionally, backpropagation isn’t restricted to feedforward networks. We can apply it to recurrent neural networks as well.
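As a small illustration that the same backward pass works outside neural networks, the sketch below differentiates the hypothetical function $f(x, y) = x y + \sin(x)$. It first computes and stores the intermediate values, then applies the chain rule from the output back to the inputs. The function and its name are assumptions chosen for the example.

```python
import math

def f_and_gradient(x, y):
    """Evaluate f(x, y) = x * y + sin(x) and its gradient with a backward pass."""
    # Forward pass: compute the intermediate values.
    u = x * y
    v = math.sin(x)
    f = u + v

    # Backward pass: start at the output and apply the chain rule.
    df_du = 1.0                              # f = u + v
    df_dv = 1.0
    df_dx = df_du * y + df_dv * math.cos(x)  # x feeds into both u and v
    df_dy = df_du * x                        # y feeds into u only
    return f, df_dx, df_dy

value, df_dx, df_dy = f_and_gradient(0.0, 2.0)
```

At the point (0, 2), the derivative with respect to $x$ is $y + \cos(x) = 3$ and the derivative with respect to $y$ is $x = 0$, which the backward pass reproduces.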
4. Conclusion
In this article, we explained the difference between Feedforward Neural Networks and Backpropagation. The former term refers to a type of network without feedback connections forming closed loops. The latter is a way of computing the partial derivatives during training.
When using a model after training, the input “flows” forward through the layers from the input to the output. But, while training the network using backpropagation, we compute the derivatives needed to update the parameters in the opposite direction: from the output layer to the input one.