1. Introduction

Artificial neural networks are powerful methods for mapping unknown relationships in data and making predictions. One of their main areas of application is pattern recognition, which includes classification and function interpolation problems in general, as well as extrapolation problems, such as time series prediction.

Neural networks, like statistical methods in general, are rarely applied directly to the raw data of a dataset. Normally, we need a preparation step that facilitates the network optimization process and maximizes the probability of obtaining good results.

In this tutorial, we’ll take a look at some of these methods. They include normalization techniques, explicitly mentioned in the title of this tutorial, but also others such as standardization and rescaling.

We’ll use all these concepts in a more or less interchangeable way, and we’ll consider them collectively as normalization or preprocessing techniques.

2. Definitions

The different forms of preprocessing that we mentioned in the introduction have different advantages and purposes.

2.1. Normalization

Normalizing a vector (for example, a column in a dataset) consists of dividing its data by the vector norm, typically to make the Euclidean length of the vector equal to a certain predetermined value. A related and very common form maps the data into a predetermined range through the transformation below, called min-max normalization:

    [x'=\frac{x-x_{\min}}{x_{\max}-x_{\min}}(u-l)+l]

where:

  • x is the original data.
  • x' is the normalized data.
  • x_{\max}, x_{\min} are respectively the maximum and minimum values of the original vector.
  • u, l are respectively the upper and lower bounds of the new range for the normalized data, x' \in [l, u]. Typical values are [u = 1, l = 0] or [u = 1, l = -1].

The above equation is a linear transformation that maintains all the distance ratios of the original vector after normalization.
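As a minimal sketch, the min-max transformation above can be written directly in numpy (the function name and example values are illustrative, not taken from the text):

```python
import numpy as np

def min_max_normalize(x, l=0.0, u=1.0):
    """Linearly map a vector into the range [l, u] (min-max normalization)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (u - l) + l

data = np.array([3.0, 7.0, 11.0, 15.0])
scaled_01 = min_max_normalize(data)              # maps into [0, 1]
scaled_pm1 = min_max_normalize(data, l=-1, u=1)  # maps into [-1, 1]
```

Because the transformation is linear, equally spaced values stay equally spaced after normalization, as the text notes.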

Some authors make a distinction between normalization and rescaling. The latter transformation is associated with changes in the unit of data, but we’ll consider it a form of normalization.

2.2. Standardization

Standardization consists of subtracting a measure of location (such as the mean) and dividing by a measure of scale (such as the standard deviation). The best-known example is perhaps the so-called z-score or standard score:

    [x'=\frac{x-\mu}{\sigma}]

where:

  • \mu is the mean of the population.
  • \sigma is the standard deviation of the population.

The z-score transforms the original data to obtain a new distribution with mean 0 and standard deviation 1.

Since generally we don’t know the values of these parameters for the whole population, we must use their sample counterparts:

    [\hat{\mu}=\frac{1}{N}\sum_{i=1}^{N}x_{i}]

    [\hat{\sigma}=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_{i}-\hat{\mu})^{2}}]

where N is the size of the vector \mathbf{x}.
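The z-score with the sample estimators above can be sketched as follows (the function name is ours; note that `ddof=1` gives the N-1 denominator of the sample standard deviation):

```python
import numpy as np

def z_score(x):
    """Standardize using the sample mean and sample standard deviation (N-1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

x = np.array([2.0, 4.0, 6.0, 8.0])
z = z_score(x)
# The standardized vector has mean 0 and sample standard deviation 1.
```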

2.3. Batch Normalization

Another technique widely used in deep learning is batch normalization. Instead of normalizing the data only once before applying the neural network, the output of each layer is normalized over each mini-batch and used as input to the next layer. This speeds up the convergence of the training process.

2.4. A Note on Usage

The application of the most suitable preprocessing technique implies a thorough study of the problem data. For example, if some feature of the dataset does not follow an approximately normal distribution, the z-score may not be the most suitable method.

The nature of the problem may recommend applying more than one preprocessing technique.

3. A Review on Normalization

Is it always necessary to apply normalization, or some form of data preprocessing in general, before applying a neural network? We can give two answers to this question.

From a theoretical-formal point of view, the answer is: it depends. Depending on the data structure and the nature of the network we want to use, it may not be necessary.

Let’s take an example. Suppose we want to apply a linear rescaling, like the one seen in the previous section, and to use a network with linear form activation functions:

    [y = w_{0} + \sum_{i=1}^{N} w_{i} x_{i}]

where y is the output of the network, \mathbf{x} is the input vector with N components x_{i}, and w_{i} are the components of the weight vector, with w_{0} the bias. In this case, normalization is not strictly necessary.

The reason lies in the fact that, in the case of linear activation functions, a change of scale of the input vector can be undone by choosing appropriate values of the vector \mathbf{w}. If the training algorithm of the network is sufficiently efficient, it should theoretically find the optimal weights without the need for data normalization.
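We can verify this numerically. In the sketch below (all values illustrative), the inputs undergo an arbitrary linear rescaling x' = ax + b, and the adjusted weights w_i' = w_i / a and w_0' = w_0 - \sum_i w_i b / a reproduce exactly the original output:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)   # input vector
w = rng.normal(size=5)   # weights of the linear unit
w0 = 0.7                 # bias

a, b = 2.5, -1.0         # an arbitrary linear rescaling x' = a*x + b
x_scaled = a * x + b

# Weights adjusted to absorb the rescaling:
w_adj = w / a
w0_adj = w0 - np.sum(w * b / a)

y_original = w0 + np.dot(w, x)
y_rescaled = w0_adj + np.dot(w_adj, x_scaled)
# Both outputs coincide: the change of scale is undone by the weights.
```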

The second answer to the initial question comes from a practical point of view. In this case, the answer is: always normalize. The reasons are many and we’ll analyze them in the next sections.

3.1. Target Normalization

The numerical results before and after applying two power transformations to the target, Box-Cox and Yeo-Johnson, are in the table below. The reference for normality is skewness = 0 and kurtosis = 0:

    [\begin{array}{lcc} \hline & \mathrm{Skewness} & \mathrm{Kurtosis}\\ \hline\hline \text{Original data} & 1.1137 & 5.3265\\ \text{Box-Cox} & 0.0183 & 0.9733\\ \text{Yeo-Johnson} & 0.0044 & 0.8648 \\\hline \end{array}]
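To illustrate the idea behind the table (the data and λ value below are illustrative, not those used for the table), we can implement the skewness and excess kurtosis estimators and a basic Box-Cox transform directly, and check that the transform moves a skewed distribution toward normality:

```python
import numpy as np

def skewness(x):
    """Sample skewness (0 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / np.std(x) ** 3

def excess_kurtosis(x):
    """Sample excess kurtosis (0 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 4) / np.std(x) ** 4 - 3.0

def box_cox(x, lmbda):
    """Box-Cox power transform for positive data, with a caller-chosen lambda."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lmbda == 0 else (x ** lmbda - 1.0) / lmbda

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=0.6, size=10_000)  # right-skewed sample
transformed = box_cox(data, lmbda=0.0)  # lambda = 0 is the log transform
# After the transform, skewness and excess kurtosis are much closer to 0.
```

In practice, λ is not chosen by hand but estimated from the data, for example by maximum likelihood.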