In calculus, Newton's method (also called Newton–Raphson) is an iterative method for finding the roots of a differentiable function f, which are solutions to the equation f(x) = 0. As such, Newton's method can be applied to the derivative f' of a twice-differentiable function f to find the roots of the derivative (solutions to f'(x) = 0), also known as the critical points of f. These solutions may be minima, maxima, or saddle points; see section "Several variables" in Critical point (mathematics) and also section "Geometric interpretation" in this article. This is relevant in optimization, which aims to find (global) minima of the function f.
The central problem of optimization is minimization of functions. Let us first consider the case of univariate functions, i.e., functions of a single real variable. We will later consider the more general and more practically useful multivariate case.
Given a twice differentiable function f : \mathbb{R} \to \mathbb{R}, we seek to solve the optimization problem

\min_{x \in \mathbb{R}} f(x).

Newton's method attempts to solve this problem by constructing a sequence \{x_k\} from an initial guess (starting point) x_0 \in \mathbb{R} that converges towards a minimizer x_* of f by using a sequence of second-order Taylor approximations of f around the iterates. The second-order Taylor expansion of f around x_k is

f(x_k + t) \approx f(x_k) + f'(x_k)\, t + \tfrac{1}{2} f''(x_k)\, t^2.
The next iterate x_{k+1} is defined so as to minimize this quadratic approximation in t, setting x_{k+1} = x_k + t. If the second derivative is positive, the quadratic approximation is a convex function of t, and its minimum can be found by setting the derivative to zero. Since

0 = \frac{\mathrm{d}}{\mathrm{d}t}\left( f(x_k) + f'(x_k)\, t + \tfrac{1}{2} f''(x_k)\, t^2 \right) = f'(x_k) + f''(x_k)\, t,

the minimum is achieved for

t = -\frac{f'(x_k)}{f''(x_k)}.

Putting everything together, Newton's method performs the iteration

x_{k+1} = x_k + t = x_k - \frac{f'(x_k)}{f''(x_k)}.
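For illustration, here is a minimal Python sketch of this univariate iteration; the test function f(x) = x^4 - 3x^3 + 2, the starting point, and the stopping tolerance are arbitrary choices for the example, not part of the method itself.

```python
def newton_1d(df, d2f, x0, tol=1e-10, max_iter=50):
    """Minimize a twice differentiable function of one variable
    by iterating x_{k+1} = x_k - f'(x_k)/f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x**4 - 3*x**3 + 2, so f'(x) = 4x^3 - 9x^2 and f''(x) = 12x^2 - 18x.
x_min = newton_1d(lambda x: 4*x**3 - 9*x**2,
                  lambda x: 12*x**2 - 18*x,
                  x0=3.0)
print(x_min)  # approaches 9/4 = 2.25, the local minimum of f
```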
The geometric interpretation of Newton's method is that at each iteration, it amounts to fitting a parabola to the graph of f(x) at the trial value x_k, having the same slope and curvature as the graph at that point, and then proceeding to the maximum or minimum of that parabola (in higher dimensions, this may also be a saddle point). Note that if f happens to be a quadratic function, then the exact extremum is found in one step.
The above iterative scheme can be generalized to d > 1 dimensions by replacing the derivative with the gradient (different authors use different notation for the gradient, including f'(x) = \nabla f(x) = g_f(x) \in \mathbb{R}^d), and the reciprocal of the second derivative with the inverse of the Hessian matrix (different authors use different notation for the Hessian, including f''(x) = \nabla^2 f(x) = H_f(x) \in \mathbb{R}^{d \times d}). One thus obtains the iterative scheme

x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k), \qquad k \ge 0.
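As a sketch of a single multivariate step (assuming the gradient and Hessian at x_k are already available as NumPy arrays), the update can be written as a linear solve; the quadratic test function below is chosen only to illustrate that one Newton step is exact for a quadratic.

```python
import numpy as np

def newton_step(x_k, grad_k, hess_k):
    """One multivariate Newton step, written as a linear solve
    rather than an explicit inversion of the Hessian."""
    return x_k - np.linalg.solve(hess_k, grad_k)

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b;
# for a quadratic, a single Newton step from any starting point is exact.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x0 = np.zeros(2)
x1 = newton_step(x0, A @ x0 - b, A)
print(np.allclose(A @ x1, b))  # True
```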
Often Newton's method is modified to include a small step size 0 < \gamma \le 1 instead of \gamma = 1:

x_{k+1} = x_k - \gamma\, [f''(x_k)]^{-1} f'(x_k).

This is often done to ensure that the Wolfe conditions, or the much simpler Armijo condition, are satisfied at each step of the method. For step sizes other than exactly one, the method is often referred to as the relaxed or damped Newton's method.
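A minimal sketch of the damped iteration, assuming callables grad and hess that return the gradient vector and Hessian matrix; the fixed step size \gamma = 0.5 and the test function are arbitrary illustrative choices (a practical implementation would instead pick \gamma by a line search such as Armijo backtracking).

```python
import numpy as np

def damped_newton(grad, hess, x0, gamma=1.0, tol=1e-10, max_iter=100):
    """Damped Newton iteration x_{k+1} = x_k - gamma * [f''(x_k)]^{-1} f'(x_k);
    gamma = 1 recovers the pure Newton step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))
        x = x - gamma * step
        if np.linalg.norm(step) < tol:
            break
    return x

# Example: f(x, y) = exp(x + y) + x**2 + y**2, a strongly convex test function.
grad = lambda v: np.exp(v[0] + v[1]) + 2 * v            # [e^{x+y} + 2x, e^{x+y} + 2y]
hess = lambda v: np.exp(v[0] + v[1]) * np.ones((2, 2)) + 2 * np.eye(2)
print(damped_newton(grad, hess, x0=[1.0, 1.0], gamma=0.5))
```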
If f is a strongly convex function with Lipschitz Hessian, then provided that x_0 is close enough to x_* = \arg\min f(x), the sequence x_0, x_1, x_2, \dots generated by Newton's method will converge to the (necessarily unique) minimizer x_* of f quadratically fast. That is,

\|x_{k+1} - x_*\| \le \tfrac{1}{2} \|x_k - x_*\|^2, \qquad \forall k \ge 0.
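The quadratic convergence can be observed numerically. The following sketch uses the arbitrary test function f(x) = e^x - 2x, which is strongly convex with a smooth second derivative near its minimizer x_* = ln 2; the error roughly squares at each iteration.

```python
import numpy as np

# Illustrative test function: f(x) = exp(x) - 2x, with minimizer x* = ln 2.
df  = lambda x: np.exp(x) - 2.0   # f'(x)
d2f = lambda x: np.exp(x)         # f''(x)
x_star = np.log(2.0)

x = 1.0                           # starting point reasonably close to x*
for k in range(6):
    print(f"k={k}  |x_k - x*| = {abs(x - x_star):.3e}")
    x -= df(x) / d2f(x)           # Newton step
```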
Finding the inverse of the Hessian in high dimensions to compute the Newton direction h = -(f''(x_k))^{-1} f'(x_k) can be an expensive operation. In such cases, instead of directly inverting the Hessian, it is better to calculate the vector h as the solution to the system of linear equations

[f''(x_k)]\, h = -f'(x_k),

which may be solved by various factorizations or approximately (but to great accuracy) using iterative methods. Many of these methods are only applicable to certain types of equations; for example, the Cholesky factorization and conjugate gradient will only work if f''(x_k) is a positive definite matrix. While this may seem like a limitation, it is often a useful indicator of something gone wrong; for example, if a minimization problem is being approached and f''(x_k) is not positive definite, then the iterations are converging to a saddle point and not a minimum.
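A sketch of both routes in Python, assuming a positive definite Hessian and the gradient at the current iterate are available (the random matrices below merely stand in for them).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import cg

# Stand-ins for the Hessian f''(x_k) (symmetric positive definite) and gradient f'(x_k).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
hess = A @ A.T + 5 * np.eye(5)      # made positive definite for the demo
grad = rng.standard_normal(5)

# Direct route: Cholesky factorization of the Hessian.
c, low = cho_factor(hess)
h_direct = cho_solve((c, low), -grad)

# Iterative route: conjugate gradient, useful when the Hessian is large or sparse.
h_cg, info = cg(hess, -grad)        # info == 0 signals convergence

print(np.allclose(h_direct, h_cg, atol=1e-5))
```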
On the other hand, if a constrained optimization is done (for example, with Lagrange multipliers), the problem may become one of saddle point finding, in which case the Hessian will be symmetric indefinite and the computation of x_{k+1} will need to be done with a method that works for such systems, such as the LDL^\top variant of Cholesky factorization or the conjugate residual method.
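As an illustration (with a small, hypothetical symmetric indefinite matrix standing in for such a system), the LDL^\top factorization succeeds where a plain Cholesky factorization fails; the resulting factors can then be used with triangular solves to obtain the Newton direction.

```python
import numpy as np
from scipy.linalg import cholesky, ldl

# Hypothetical symmetric indefinite matrix (e.g. a small saddle-point system).
K = np.array([[4.0,  1.0, 1.0],
              [1.0, -2.0, 0.0],
              [1.0,  0.0, 0.0]])

# Cholesky requires positive definiteness and fails here.
try:
    cholesky(K)
except np.linalg.LinAlgError as err:
    print("Cholesky failed:", err)

# The LDL^T factorization still works for symmetric indefinite matrices.
lu, d, perm = ldl(K)
print(np.allclose(lu @ d @ lu.T, K))  # True: the factors reconstruct K
```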
There also exist various quasi-Newton methods, where an approximation for the Hessian (or its inverse directly) is built up from changes in the gradient.
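For example, SciPy's BFGS implementation builds an approximation to the inverse Hessian from successive gradient differences, so only first derivatives need to be supplied; the Rosenbrock function here is just a standard test problem.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# BFGS approximates the inverse Hessian from gradient changes; no second
# derivatives are provided.
result = minimize(rosen, x0=np.zeros(5), jac=rosen_der, method="BFGS")
print(result.x)  # close to the known minimizer (1, 1, 1, 1, 1)
```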
If the Hessian is close to a non-invertible matrix, the inverted Hessian can be numerically unstable and the solution may diverge. In this case, certain workarounds have been tried in the past, which have varied success with certain problems. One can, for example, modify the Hessian by adding a correction matrix B_k so as to make f''(x_k) + B_k positive definite. One approach is to diagonalize the Hessian and choose B_k so that f''(x_k) + B_k has the same eigenvectors as the Hessian, but with each negative eigenvalue replaced by \epsilon > 0.
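A sketch of this eigenvalue modification (a common variant clips every eigenvalue below \epsilon up to \epsilon, which also handles eigenvalues that are merely very small); the function name and the example matrix are illustrative.

```python
import numpy as np

def make_positive_definite(hess, eps=1e-6):
    """Replace every eigenvalue of a symmetric matrix that lies below eps
    with eps, keeping the eigenvectors unchanged."""
    eigval, eigvec = np.linalg.eigh(hess)
    eigval = np.maximum(eigval, eps)
    return eigvec @ np.diag(eigval) @ eigvec.T

hess = np.array([[2.0,  0.0],
                 [0.0, -1.0]])   # indefinite Hessian
print(np.linalg.eigvalsh(make_positive_definite(hess)))  # all eigenvalues >= eps
```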
An approach exploited in the Levenberg–Marquardt algorithm (which uses an approximate Hessian) is to add a scaled identity matrix \mu I to the Hessian, with the scale \mu adjusted at every iteration as needed. For large \mu and small Hessian, the iterations will behave like gradient descent with step size 1/\mu. This results in slower but more reliable convergence where the Hessian doesn't provide useful information.
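A sketch of the resulting damped step, which solves (f''(x_k) + \mu I) h = -f'(x_k); the matrices are illustrative, and for large \mu the direction approaches -f'(x_k)/\mu, i.e. a gradient-descent step of size 1/\mu.

```python
import numpy as np

def damped_step(hess, grad, mu):
    """Solve (f''(x_k) + mu*I) h = -f'(x_k) for the damped direction h."""
    return np.linalg.solve(hess + mu * np.eye(grad.shape[0]), -grad)

hess = np.array([[3.0, 1.0],
                 [1.0, 2.0]])
grad = np.array([1.0, -2.0])

print(damped_step(hess, grad, mu=0.0))   # pure Newton direction
print(damped_step(hess, grad, mu=1e4))   # ~ -grad/mu: gradient-descent-like
```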
Newton's method, in its original version, has several caveats: it does not work if the Hessian is not invertible, since the very definition of the method requires the inverse of the Hessian; it may not converge at all, but can enter a cycle having more than one point; and it can converge to a saddle point instead of a local minimum (see the geometric interpretation above).
The popular modifications of Newton's method, such as the quasi-Newton methods or the Levenberg–Marquardt algorithm mentioned above, also have caveats:
For example, it is usually required that the cost function is (strongly) convex and that the Hessian is globally bounded or Lipschitz continuous; this is mentioned, for instance, in the section "Convergence" in this article. The papers by Levenberg and Marquardt, the original sources cited for the Levenberg–Marquardt algorithm, contain essentially no theoretical analysis in Levenberg's case, while Marquardt's paper only analyses a local situation and does not prove a global convergence result. One can compare with the backtracking line search method for gradient descent, which has good theoretical guarantees under more general assumptions and can be implemented and works well in practical large-scale problems such as deep neural networks.
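For reference, a minimal sketch of backtracking (Armijo) line search for gradient descent; the constants and the quadratic test problem are conventional illustrative choices, not tied to any particular source.

```python
import numpy as np

def backtracking_gradient_descent(f, grad_f, x0, alpha0=1.0, beta=0.5,
                                  c=1e-4, tol=1e-8, max_iter=1000):
    """Gradient descent where each step size is found by backtracking
    until the Armijo sufficient-decrease condition holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha0
        # Shrink alpha until f(x - alpha*g) <= f(x) - c*alpha*||g||^2.
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta
        x = x - alpha * g
    return x

# Example on a simple convex quadratic (arbitrary test problem).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
print(backtracking_gradient_descent(f, grad_f, x0=[1.0, -1.0]))  # -> ~[0, 0]
```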