1. Introduction

Many deep-learning frameworks provide us with intuitive interfaces to set the layers, tune the hyper-parameters and evaluate our models. But to discuss the results properly, and more importantly, to understand how the networks work, we need to be familiar with fundamental concepts.

In this tutorial, we’ll talk about Backpropagation (or Backprop) and Feedforward Neural Networks.

2. Feedforward Neural Networks

Feedforward networks are the quintessential deep learning models. They’re made up of artificial neurons that are organized in layers.

2.1. How Does an Artificial Neuron Work?

A neuron is a rule that transforms an input vector x_1, x_2, \ldots, x_n into the output signal we call the activation value h:

(1)   \begin{equation*}  h = \sigma(z) = \sigma \left(b + \sum_{i=1}^{n}w_i\cdot x_{i}\right) \end{equation*}

where b is the neuron’s bias, the w_i are the weights specific to the neuron, and \sigma is an activation function, such as ReLU or sigmoid. For example, here’s how the j-th neuron in a layer computes its output after receiving two input values:

[Figure: Artificial Neuron]
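As a minimal sketch, Equation (1) can be written in a few lines of Python. The function names and the numeric values below are illustrative assumptions, not part of the original example:

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Compute h = activation(b + sum_i w_i * x_i), as in Equation (1)."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(z)

# Hypothetical inputs and parameters, chosen only for illustration.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.1

h_relu = neuron(x, w, b, relu)        # z = 0.1 + 0.5 - 0.5 = 0.1
h_sigmoid = neuron(x, w, b, sigmoid)  # sigmoid of the same z
```

The same weighted sum z feeds both calls; only the activation function changes the resulting h.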

2.2. Propagating Forward

A layer is an array of neurons. A network can have any number of layers between the input and the output ones. For instance:

[Figure: forward propagation]

In the image, x_1 and x_2 denote the input, h_1 and h_2 the hidden neurons’ outputs, and y_1 and y_2 are the output values of the network as a whole. The values of the weights and of the biases b_{1} and b_{2} will be adjusted during the training phase.

The defining characteristic of feedforward networks is that they don’t have feedback connections at all. All the signals go only forward, from the input to the output layers.

If we had even a single feedback connection (directing the signal back to a neuron in a previous layer), we would have a Recurrent Neural Network.

2.3. Example

Let’s suppose we want to develop a classifier for detecting if there’s a dog in an image. To keep things simple, we’ll pretend that we can do that by inspecting only the values of two grey-scale pixels x_{1} and x_{2} (x_1, x_2 \in [0, 255]).

Let’s say that the network has only one hidden layer and that the inputs are x_{1} =150 and x_{2}=34.  Also, let’s suppose we use the identity function x \mapsto x as the activation function:

[Figure: forward propagation: example]

To calculate the activation value h_{1}, we apply the formula (1):

(2)   \begin{equation*} h_{1}= \sigma(z_1) = \sigma(w_{1}x_{1}+w_{2}x_{2}+b_{1}) = \sigma(0.2 \cdot 150 + 0.5 \cdot 34 + 3) = \sigma(50) \end{equation*}

Since we’re using the identity function as \sigma:

(3)   \begin{equation*} h_{1} = 50 \end{equation*}

We do the same for h_{2}, y_{1}, and y_{2}. For the latter two, the inputs are the values of h_1 and h_2.
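The forward pass above can be sketched in Python as follows. Only the inputs, the weights 0.2 and 0.5, and the bias 3 used for h_1 come from Equation (2). Every other weight and bias is a hypothetical placeholder, and sharing one bias across each layer is an assumption based on the notation b_1 and b_2:

```python
def identity(z):
    """Identity activation: returns its input unchanged."""
    return z

def layer(inputs, weight_rows, bias, activation):
    """Return one activation per weight row; the whole layer shares `bias`."""
    return [activation(bias + sum(w * x for w, x in zip(row, inputs)))
            for row in weight_rows]

x = [150, 34]

# First row and b1 reproduce Equation (2); the second row is hypothetical.
W1 = [[0.2, 0.5],
      [0.1, 0.3]]
b1 = 3

# Output-layer parameters are hypothetical.
W2 = [[0.5, 0.1],
      [0.2, 0.4]]
b2 = 1

h = layer(x, W1, b1, identity)  # hidden activations h_1, h_2
y = layer(h, W2, b2, identity)  # network outputs y_1, y_2
```

The outputs of the hidden layer become the inputs of the output layer, which is all that "propagating forward" means here.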

3. Backpropagation

When training a neural network, the cost value J quantifies the network’s error, i.e., its output’s deviation from the ground truth. We calculate it as the average error over all the objects in the training set, and our goal is to minimize it.

3.1. Cost Function

For example, let’s say we have a network that classifies animals either as cats or dogs. It has two output neurons y_1 and y_2, where the former represents the probability that the animal is a cat and the latter that it’s a dog. Given an image of a cat, we expect y_1 = 1 and y_2 = 0.

However, if the network outputs y_{1}=0.25  and y_{2}=0.65, we can quantify our error on that image as the squared distance:

(4)   \begin{equation*} (0.25-1)^{2} + (0.65-0)^{2} = 0.5625 + 0.4225 = 0.985 \end{equation*}

We compute J, the cost for the entire dataset, as the average error over individual samples. So, if \widehat{\mathbf{y}}_i=[\hat{y}_{i, 1}, \hat{y}_{i, 2}, \ldots, \hat{y}_{i, m}] is the ground truth for the i-th training sample (i=1,2,\ldots,n), and \mathbf{y}_i=[y_{i,1}, y_{i,2}, \ldots, y_{i,m}] is our network’s output, the total cost J is:

(5)   \begin{equation*} J = \frac{1}{n} \sum_{i=1}^{n} || \mathbf{y}_i - \widehat{\mathbf{y}}_i ||^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{i,j} - \hat{y}_{i,j})^2 \end{equation*}
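To make Equation (5) concrete, here is a small Python sketch. The first sample is the cat image from the text (expected [1, 0], network output [0.25, 0.65]); the second sample and the function names are assumptions added for illustration:

```python
def squared_error(output, truth):
    """Squared distance between one network output and its ground truth."""
    return sum((y - t) ** 2 for y, t in zip(output, truth))

def cost(outputs, truths):
    """Equation (5): average squared distance over all samples."""
    n = len(outputs)
    return sum(squared_error(y, t) for y, t in zip(outputs, truths)) / n

# Error on the cat image alone.
single = squared_error([0.25, 0.65], [1, 0])

# Two-sample dataset; the second sample is hypothetical.
outputs = [[0.25, 0.65], [0.1, 0.9]]
truths = [[1, 0], [0, 1]]
J = cost(outputs, truths)
```

A confident wrong answer on the first sample dominates the average, which is exactly the signal the training procedure uses.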

We use the cost to update the weights and biases so that the actual outputs get as close as possible to the desired values. To decide whether to increase or decrease a coefficient, we calculate the partial derivative of the cost with respect to it using backpropagation. Let’s explain it with an example.

3.2. Partial Derivatives

Let’s say we have only one neuron in the input, hidden, and output layers:

[Figure: backpropagation: example]

where \sigma is the identity function.

To update the weights and biases, we need to see how J reacts to small changes in those parameters. We can do that by computing the partial derivatives of J with respect to them. But before that, let’s recap how the variables in our problem are related:

[Figure: backpropagation: dependence tree]

So, if we want to see how changing w_2 affects the cost function, we should compute the partial derivative by applying the chain rule of Calculus:

(6)   \begin{equation*}  \frac{\partial{J}}{\partial{w_{2}}} = \frac{\partial{z_{2}}}{\partial{w_{2}}} \;\frac{\partial{y}}{\partial{z_{2}}} \;\frac{\partial{J}}{\partial{y}} \end{equation*}

3.3. Example: Computing Partial Derivatives

In this example, we’ll solve Equation (6), focusing only on the weight w_{2} to show how the calculation goes, but the method is the same for the other weight and the biases.

First, we calculate how the cost function varies with the output:

(7)   \begin{equation*} \frac{\partial{J}}{\partial{y}}= \frac{\partial{(y-\hat{y}})^{2}}{\partial{y}}=2(y-\hat{y}) \end{equation*}

Then, we need to calculate how the output value is affected by small changes in z_{2}. For this, we find the derivative of the activation function. Since we chose the identity function as activation function, the derivative is 1:

(8)   \begin{equation*} \frac{\partial{y}}{\partial{z_{2}}} = \frac{\partial{\sigma}}{\partial{z_{2}}} = \frac{\partial z_2}{ \partial z_2} = 1 \end{equation*}

Now, the only term missing is the partial derivative of z_{2} with respect to the weight. Since z_{2} = w_{2} h + b_{2}, the partial derivative will be:

(9)   \begin{equation*} \frac{\partial{z_{2}}}{\partial{w_{2}}} = \frac{\partial({w_{2}h+b_{2})}}{\partial{w_{2}}} = h \end{equation*}

Now we have all the terms and we can calculate how the cost function is affected by a change in the weight w_{2}:

(10)   \begin{equation*} \frac{\partial{J}}{\partial{w_{2}}} = h \, \sigma'(z_{2}) \, 2(y-\hat{y}) = 2h(y-\hat{y}) \end{equation*}
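One way to sanity-check a derivative like this is to compare it with a finite-difference estimate. The sketch below assumes the hidden neuron computes h = w_1 x + b_1 (the identity activation applied to a weighted input) and uses hypothetical parameter values:

```python
def forward(x, w1, b1, w2, b2):
    """One-neuron-per-layer network with identity activations."""
    h = w1 * x + b1   # hidden activation (assumed form)
    y = w2 * h + b2   # output, y = z_2 because sigma is the identity
    return h, y

def loss(y, y_hat):
    """Squared error for a single sample."""
    return (y - y_hat) ** 2

def grad_w2(x, w1, b1, w2, b2, y_hat):
    """Chain rule: dJ/dw2 = h * sigma'(z2) * 2 * (y - y_hat), with sigma' = 1."""
    h, y = forward(x, w1, b1, w2, b2)
    return h * 1.0 * 2.0 * (y - y_hat)

# Hypothetical values, chosen only for illustration.
x, w1, b1, w2, b2, y_hat = 0.5, 0.3, 0.1, 0.1, 0.2, 1.0

analytic = grad_w2(x, w1, b1, w2, b2, y_hat)

# Central finite difference in w2.
eps = 1e-6
_, y_plus = forward(x, w1, b1, w2 + eps, b2)
_, y_minus = forward(x, w1, b1, w2 - eps, b2)
numeric = (loss(y_plus, y_hat) - loss(y_minus, y_hat)) / (2 * eps)
```

If the two values disagree by more than rounding error, the analytic derivative is wrong somewhere along the chain.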

3.4. Backpropagation During Training

Let’s suppose that, at a certain moment during training, we have h=0.0125, the desired output \hat{y}=1, and the current output y=1.2. Using backpropagation, we compute the partial derivative of J with respect to w_2:

(11)   \begin{equation*} \frac{\partial{J}}{\partial{w_{2}}} = 0.0125 \cdot 2 \;(1.2 -1) = 0.005 \end{equation*}

Now, the last step is to update the weight: we multiply the calculated derivative by the learning rate \eta, which we’ll set to 0.01 in this example, and subtract the result from the current weight, which we’ll assume to be w_{2} = 0.1:

(12)   \begin{equation*} w_{2} = w_{2} - \eta \frac{\partial{J}}{\partial{w_{2}}} =  0.1 - 0.01 \cdot 0.005 = 0.09995 \end{equation*}
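The same arithmetic, written as a short Python sketch with the values from this section (the variable names are our own):

```python
# Values taken from the worked example.
h, y, y_hat = 0.0125, 1.2, 1.0
w2, eta = 0.1, 0.01

# Equation (11): derivative of the cost with respect to w2 (identity activation).
grad = h * 2 * (y - y_hat)

# Equation (12): gradient descent step.
w2_new = w2 - eta * grad
```

Because the derivative is positive, the step decreases the weight, which pulls the output 1.2 toward the desired value of 1.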

This is the partial derivative for only one sample. To get the derivative for the whole dataset, we should average the individual derivatives:

(13)   \begin{equation*} \frac{\partial{J}}{\partial{w_{2}}} = \frac{1}{n} \sum_{k=1}^{n}\frac{\partial{J_{k}}}{\partial{w_{2}}} \end{equation*}

Doing all these calculations for thousands of parameters and samples before making a single update to the weights and biases is computationally expensive.

3.5. Training Variations

Instead of calculating the error for each sample in the training set and then averaging (batch gradient descent), we can calculate the error for a smaller group, let’s say 1000 samples, and update the weights based on the cost for that group only. This is mini-batch gradient descent.

There’s also stochastic gradient descent, in which we update the weights and biases after calculating the error for each individual sample in the training set.
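The three variants differ only in how many samples contribute to each update. The sketch below illustrates this with a hypothetical one-weight model y = w x and a made-up dataset. It omits the shuffling that practical implementations usually perform:

```python
def grad_w(w, x, y_hat):
    """Per-sample derivative of (w * x - y_hat)^2 with respect to w."""
    return 2.0 * (w * x - y_hat) * x

def run_epoch(w, samples, batch_size, eta):
    """One pass over `samples`, updating w once per batch with the averaged gradient."""
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        avg_grad = sum(grad_w(w, x, t) for x, t in batch) / len(batch)
        w = w - eta * avg_grad
    return w

# Hypothetical data generated by y_hat = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
eta = 0.01

w_batch = run_epoch(0.0, samples, batch_size=len(samples), eta=eta)  # 1 update
w_mini = run_epoch(0.0, samples, batch_size=2, eta=eta)              # 2 updates
w_sgd = run_epoch(0.0, samples, batch_size=1, eta=eta)               # 4 updates
```

With the same data and learning rate, smaller batches make more updates per pass, so after one epoch the weight has moved further toward 2 in this toy setting.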

3.6. General Formula

We can define a general formula to calculate the derivative of a weight w_{i,j}^{k} that connects the neuron j in the layer k to the neuron i in the layer k+1.

First, let’s remember that the activation value of a neuron in the layer k+1 is:

(14)   \begin{equation*} h_{i}^{k+1}=\sigma (z_{i}^{k+1})= \sigma \left(b_{k+1} + \sum_{j=1}^{n}w_{i,j}^{k}\cdot h_{j}^{k} \right) \end{equation*}

Visually:

[Figure: backpropagation: general case]

For the general case, Equation (6) becomes:

(15)   \begin{equation*} \frac{\partial{J}}{\partial{w_{i,j}^{k}}} = \frac{\partial{z_{i}^{k+1}}}{\partial{w_{i,j}^{k}}} \;\frac{\partial{h_{i}^{k+1}}}{\partial{z_{i}^{k+1}}} \;\frac{\partial{J}}{\partial{h_{i}^{k+1}}} \end{equation*}

Following the same approach, we can derive the general formula for the partial derivative of the bias. Since all the neurons in the layer k+1 share the bias b_{k+1} in Equation (14), we sum the contributions of the individual neurons i:

(16)   \begin{equation*} \frac{\partial{J}}{\partial{b_{k+1}}} = \sum_{i} \frac{\partial{z_{i}^{k+1}}}{\partial{b_{k+1}}} \;\frac{\partial{h_{i}^{k+1}}}{\partial{z_{i}^{k+1}}} \;\frac{\partial{J}}{\partial{h_{i}^{k+1}}} \end{equation*}
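As a rough sketch, the general formulas translate into the following Python function for a single layer. It assumes the derivatives of J with respect to the layer’s activations are already known (they come from the layers closer to the output), and that the layer shares one bias, as the notation b_{k+1} in Equation (14) suggests. All names and values are illustrative:

```python
def layer_backward(h_prev, z, upstream, sigma_prime):
    """
    h_prev:      activations h_j^k of layer k
    z:           pre-activations z_i^{k+1} of layer k+1
    upstream:    derivatives dJ/dh_i^{k+1}
    sigma_prime: derivative of the activation function
    Returns (matrix of dJ/dw_{i,j}^k, dJ/db_{k+1}).
    """
    # dh_i/dz_i * dJ/dh_i for every neuron i in layer k+1.
    delta = [sigma_prime(z_i) * g_i for z_i, g_i in zip(z, upstream)]
    # dz_i/dw_{i,j} = h_j, so each weight gradient is h_j * delta_i.
    grad_w = [[h_j * d_i for h_j in h_prev] for d_i in delta]
    # dz_i/db = 1; the shared bias collects the contribution of every neuron.
    grad_b = sum(delta)
    return grad_w, grad_b

# Hypothetical values with the identity activation (derivative 1).
h_prev = [1.0, 2.0]
z = [0.5, -0.5]
upstream = [0.2, -0.4]

grad_w, grad_b = layer_backward(h_prev, z, upstream, lambda _z: 1.0)
```

If each neuron had its own bias instead, the last step would return the list delta rather than its sum.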

3.7. Backpropagation = Calculating Derivatives \neq Training

We shouldn’t confuse the backpropagation algorithm with the training algorithms. Backpropagation is a strategy to compute the gradient in a neural network. The method that uses the gradient to update the parameters is the training algorithm, for example, Gradient Descent, Stochastic Gradient Descent, or Adaptive Moment Estimation (Adam).

Lastly, since backpropagation is a general technique for calculating the gradients, we can use it for any differentiable function, not just neural networks. Additionally, backpropagation isn’t restricted to feedforward networks. We can apply it to recurrent neural networks as well.
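To illustrate that nothing here is specific to neural networks, the following sketch applies the same forward-then-backward pattern to the function f(x) = sin(x^2). The function and the evaluation point were chosen arbitrarily for this illustration:

```python
import math

def f_and_grad(x):
    """Return f(x) = sin(x^2) and df/dx, computed output-to-input."""
    # Forward pass: keep the intermediate value.
    u = x * x
    f = math.sin(u)
    # Backward pass: multiply local derivatives, starting from the output.
    df_du = math.cos(u)
    du_dx = 2.0 * x
    return f, df_du * du_dx

x = 1.5
value, analytic = f_and_grad(x)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (math.sin((x + eps) ** 2) - math.sin((x - eps) ** 2)) / (2 * eps)
```

No parameters are updated in this example, which underlines the point of this section: backpropagation only produces derivatives.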

4. Conclusion

In this article, we explained the difference between Feedforward Neural Networks and Backpropagation. The former term refers to a type of network without feedback connections forming closed loops. The latter is a way of computing the partial derivatives during training.

When using a model after training, the input “flows” forward through the layers from the input to the output. But, while training the network using backpropagation, we compute the derivatives in the opposite direction: from the output layer to the input one.