梯度下降法与牛顿梯度下降法

1. Overview

In this tutorial, we’ll study the differences between two renowned methods for finding the minimum of a cost function.

These methods are the gradient descent, well-used in machine learning, and Newton’s method, more common in numerical analysis.

At the end of this tutorial, we’ll know under what conditions we can use one or the other for solving optimization problems.

2. Gradient Descent

2.1. A Gradual Decrease

In our article on the Java implementation of gradient descent, we studied how this algorithm helps us find the optimal parameters in a machine learning model. We also discussed how gradient descent, or its cousin gradient ascent, can iteratively approximate the local minimum of a function with an arbitrary degree of precision.

We start with a continuous function that we want to minimize, and that’s differentiable in the interval of our interest. Then, we identify a starting point that is sufficient proximity to the function’s local minimum, in this case, $\theta_0$ :

Rendered by QuickLaTeX.com

Then, iteratively, we move towards the closest local minimum by exploiting the gradient of the function around $\theta_2$ .

2.2. Conditions for Application

In order to apply gradient descent, we have to satisfy some relatively accessible conditions for application.

The primary condition is that the function must be continuous, differentiable, and convex. The function’s gradient also needs to satisfy the requirements for Lipschitz continuity. This means that there should be a value , such that the derivative f'(x) in its absolute value should always be lower than .

If we meet these conditions, we can then compute the gradient $\nabla f(\theta)$ of the function. Then, according to its sign, we update the parameters in the direction that decreases the cost function. In the case of backpropagation for neural networks and their weights, for example, we apply this formula:

$\theta_{n+1} = \theta_n - \alpha \nabla f(\theta_n)$

In the case of logistic regression, analogously, we use a cost function that contains a logarithmic expression and we apply gradient descent on it.

For an appropriate choice of the learning rate $\alpha$ , the n+1 iteration returns a better approximation of the minimum of in terms of $\theta$ . Thus, if we repeat the process iteratively, we can reach a point that’s arbitrarily close to the function’s minimum.

3. Newton’s Method

3.1. Description of Newton’s Method

Newton’s method works in a different manner. This is because it’s a method for finding the root of a function, rather than its maxima or minima.

This means that, if the problem satisfies the constraints of Newton’s method, we can find for which f(x)=0 . Not f'(x)=0 , as was the case for gradient descent.

We, therefore, apply Newton’s method on the derivative $\textbf{f'(x)}$ of the cost function, not on the cost function itself. This is important because Newton’s method requires the analytical form of the derivative of any input function we use, as we’ll see shortly. Therefore, this means that the cost function we use must be differentiable twice, not just once, as was the case for gradient descent.

3.2. Definition

Let’s define f(x) as the cost function of a model on which we apply Newton’s method. The terms f'(x) and f''(x) thus indicate, respectively, the first and second-order derivatives of f(x) . If we start from a point x_n that’s sufficiently close to the minimum of , we can then get a better approximation by computing this formula:

$x_{n+1} = x_n + \frac {f'(x_n)} {f''(x_n)}$

The term $\frac {f'(x)} {f''(x)}$ , here, indicates that we’re approximating the function f'(x) with a linear model, in proximity of x_n .

4. The Primary Differences

The two methods aren’t equivalent, and as a general rule, we can’t replace one with the other.

The first difference lies in the fact that gradient descent is parametric according to the learning rate $\alpha$ . Newton’s method isn’t parametric, which means that we can apply it without worrying for hyperparameter optimization. A parametric version of Newton’s method also exists, in truth, but it only applies in cases for which we operate with a polynomial function with multiple roots.

The second difference has to do with the cost function on which we apply the algorithm. Newton’s method has stronger constraints in terms of the differentiability of the function than gradient descent. If the second derivative of the function is undefined in the function’s root, then we can apply gradient descent on it but not Newton’s method.

The third difference consists of the behavior around stationary points. If gradient descent encounters a stationary point during iteration, the program continues to run, albeit the parameters don’t update. Newton’s method, however, requires to compute $\frac{f'(x)}{f''(x)}$ for f'(x) = f''(x) = 0 . The program that runs it would therefore terminate with a division by zero error.

5. Conclusion

In this article, we compared gradient descent and Newton’s method for finding the minima in a cost function.

Persistence

REST

Security