In mathematics and computing, the Levenberg–Marquardt algorithm (LMA or just LM), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. LMA can also be viewed as Gauss–Newton using a trust region approach.
The algorithm was first published in 1944 by Kenneth Levenberg, while working at the Frankford Army Arsenal. It was rediscovered in 1963 by Donald Marquardt, who worked as a statistician at DuPont, and independently by Girard, Wynne and Morrison.
The LMA is used in many software applications for solving generic curve-fitting problems. By using the Gauss–Newton algorithm it often converges faster than first-order methods.[1] However, like other iterative optimization algorithms, the LMA finds only a local minimum, which is not necessarily the global minimum.
The primary application of the Levenberg–Marquardt algorithm is in the least-squares curve fitting problem: given a set of
m
\left(xi,yi\right)
f\left(x,\boldsymbol\beta\right)
S\left(\boldsymbol\beta\right)
\hat{\boldsymbol\beta}\in\operatorname{argmin}\limits\boldsymbol\betaS\left(\boldsymbol\beta\right)\equiv\operatorname{argmin}\limits\boldsymbol\beta
m | |
\sum | |
i=1 |
\left[yi-f\left(xi,\boldsymbol\beta\right)\right]2,
Like other numeric minimization algorithms, the Levenberg–Marquardt algorithm is an iterative procedure. To start a minimization, the user has to provide an initial guess for the parameter vector . In cases with only one minimum, an uninformed standard guess like
\boldsymbol\betaT=\begin{pmatrix}1, 1, ..., 1\end{pmatrix}
In each iteration step, the parameter vector is replaced by a new estimate . To determine, the function
f\left(xi,\boldsymbol\beta+\boldsymbol\delta\right)
f\left(xi,\boldsymbol\beta+\boldsymbol\delta\right) ≈ f\left(xi,\boldsymbol\beta\right)+Ji\boldsymbol\delta,
where
Ji=
\partialf\left(xi,\boldsymbol\beta\right) | |
\partial\boldsymbol\beta |
The sum
S\left(\boldsymbol\beta\right)
f\left(xi,\boldsymbol\beta+\boldsymbol\delta\right)
S\left(\boldsymbol\beta+\boldsymbol\delta\right) ≈
m | |
\sum | |
i=1 |
\left[yi-f\left(xi,\boldsymbol\beta\right)-Ji\boldsymbol\delta\right]2,
\begin{align} S\left(\boldsymbol\beta+\boldsymbol\delta\right)& ≈ \left\|y-f\left(\boldsymbol\beta\right)-J\boldsymbol\delta\right\|2\\ &=\left[y-f\left(\boldsymbol\beta\right)-J\boldsymbol\delta\right]T\left[y-f\left(\boldsymbol\beta\right)-J\boldsymbol\delta\right]\\ &=\left[y-f\left(\boldsymbol\beta\right)\right]T\left[y-f\left(\boldsymbol\beta\right)\right]-\left[y-f\left(\boldsymbol\beta\right)\right]TJ\boldsymbol\delta-\left(J\boldsymbol\delta\right)T\left[y-f\left(\boldsymbol\beta\right)\right]+\boldsymbol\deltaTJTJ\boldsymbol\delta\\ &=\left[y-f\left(\boldsymbol\beta\right)\right]T\left[y-f\left(\boldsymbol\beta\right)\right]-2\left[y-f\left(\boldsymbol\beta\right)\right]TJ\boldsymbol\delta+\boldsymbol\deltaTJTJ\boldsymbol\delta. \end{align}
S\left(\boldsymbol\beta+\boldsymbol\delta\right)
\left(JTJ\right)\boldsymbol\delta=JT\left[y-f\left(\boldsymbol\beta\right)\right],
where
J
Ji
f\left(\boldsymbol\beta\right)
y
f\left(xi,\boldsymbol\beta\right)
yi
m x n
n
\boldsymbol\beta
\left(JTJ\right)
n x n
n
n
Levenberg's contribution is to replace this equation by a "damped version":
\left(JTJ+λI\right)\boldsymbol\delta=JT\left[y-f\left(\boldsymbol\beta\right)\right],
where is the identity matrix, giving as the increment to the estimated parameter vector .
The (non-negative) damping factor is adjusted at each iteration. If reduction of is rapid, a smaller value can be used, bringing the algorithm closer to the Gauss–Newton algorithm, whereas if an iteration gives insufficient reduction in the residual, can be increased, giving a step closer to the gradient-descent direction. Note that the gradient of with respect to equals
-2\left(JT\left[y-f\left(\boldsymbol\beta\right)\right]\right)T
When the damping factor is large relative to
\|JTJ\|
JTJ+λI
λ-1JT\left[y-f\left(\boldsymbol\beta\right)\right]
To make the solution scale invariant Marquardt's algorithm solved a modified problem with each component of the gradient scaled according to the curvature. This provides larger movement along the directions where the gradient is smaller, which avoids slow convergence in the direction of small gradient. Fletcher in his 1971 paper A modified Marquardt subroutine for non-linear least squares simplified the form, replacing the identity matrix with the diagonal matrix consisting of the diagonal elements of :
\left[JTJ+λ\operatorname{diag}\left(JTJ\right)\right]\boldsymbol\delta=JT\left[y-f\left(\boldsymbol\beta\right)\right].
A similar damping factor appears in Tikhonov regularization, which is used to solve linear ill-posed problems, as well as in ridge regression, an estimation technique in statistics.
Various more or less heuristic arguments have been put forward for the best choice for the damping parameter . Theoretical arguments exist showing why some of these choices guarantee local convergence of the algorithm; however, these choices can make the global convergence of the algorithm suffer from the undesirable properties of steepest descent, in particular, very slow convergence close to the optimum.
The absolute values of any choice depend on how well-scaled the initial problem is. Marquardt recommended starting with a value and a factor . Initially setting
λ=λ0
S\left(\boldsymbol\beta\right)
λ=λ0
If use of the damping factor results in a reduction in squared residual, then this is taken as the new value of (and the new optimum location is taken as that obtained with this damping factor) and the process continues; if using resulted in a worse residual, but using resulted in a better residual, then is left unchanged and the new optimum is taken as the value obtained with as damping factor.
An effective strategy for the control of the damping parameter, called delayed gratification, consists of increasing the parameter by a small amount for each uphill step, and decreasing by a large amount for each downhill step. The idea behind this strategy is to avoid moving downhill too fast in the beginning of optimization, therefore restricting the steps available in future iterations and therefore slowing down convergence. An increase by a factor of 2 and a decrease by a factor of 3 has been shown to be effective in most cases, while for large problems more extreme values can work better, with an increase by a factor of 1.5 and a decrease by a factor of 5.
When interpreting the Levenberg–Marquardt step as the velocity
\boldsymbol{v}k
\boldsymbol{a}k
\boldsymbol{v}k+
1 | |
2 |
\boldsymbol{a}k
where
\boldsymbol{a}k
\boldsymbol{J}k\boldsymbol{a}k=-fvv.
fvv=\sum\mu\nuv\muv\nu\partial\mu\partial\nuf(\boldsymbol{x})
\boldsymbol{v}
i | |
\begin{align} f | |
vv |
& ≈
fi(\boldsymbol{x | |
+ |
h\boldsymbol{\delta})-2fi(\boldsymbol{x})+fi(\boldsymbol{x}-h\boldsymbol{\delta})}{h2}\\ &=
2 | |
h |
\left(
fi(\boldsymbol{x | |
+ |
h\boldsymbol{\delta})-fi(\boldsymbol{x})}{h}-\boldsymbol{J}i\boldsymbol{\delta}\right) \end{align}
where
f(\boldsymbol{x})
\boldsymbol{J}
f(\boldsymbol{x}+h\boldsymbol{\delta})
h
Since the acceleration may point in opposite direction to the velocity, to prevent it to stall the method in case the damping is too small, an additional criterion on the acceleration is added in order to accept a step, requiring that
2\left\|\boldsymbol{a | |
k |
\right\|}{\left\|\boldsymbol{v}k\right\|}\le\alpha
where
\alpha
The addition of a geodesic acceleration term can allow significant increase in convergence speed and it is especially useful when the algorithm is moving through narrow canyons in the landscape of the objective function, where the allowed steps are smaller and the higher accuracy due to the second order term gives significant improvements.
In this example we try to fit the function
y=a\cos\left(bX\right)+b\sin\left(aX\right)
a=100
b=102
\cos\left(\betax\right)
\hat\beta
\hat\beta+2n\pi