Simultaneous perturbation stochastic approximation (SPSA) is an algorithmic method for optimizing systems with multiple unknown parameters. It is a type of stochastic approximation algorithm. As an optimization method, it is well suited to large-scale population models, adaptive modeling, simulation optimization, and atmospheric modeling. Many examples are presented at the SPSA website http://www.jhuapl.edu/SPSA. A comprehensive book on the subject is Bhatnagar et al. (2013). An early paper on the subject is Spall (1987), and the foundational paper providing the key theory and justification is Spall (1992).
SPSA is a descent method capable of finding global minima, sharing this property with other methods such as simulated annealing. Its main feature is the gradient approximation that requires only two measurements of the objective function, regardless of the dimension of the optimization problem. Recall that we want to find the optimal control $u^*$ with loss function $J(u)$:

$$u^* = \arg\min_{u} J(u).$$
Both Finite Differences Stochastic Approximation (FDSA) and SPSA use the same iterative process:

$$u_{n+1} = u_n - a_n \hat{g}_n(u_n),$$

where $u_n = \left((u_n)_1, (u_n)_2, \ldots, (u_n)_p\right)^T$ denotes the iterate at the $n$th step, $\hat{g}_n(u_n)$ is the estimate of the gradient $g(u) = \frac{\partial}{\partial u} J(u)$ at $u_n$, and $\{a_n\}$ is a positive number sequence converging to 0. If $u_n$ is a $p$-dimensional vector, the $i$th component of the symmetric finite difference gradient estimator is:
FD:
$$(\hat{g}_n(u_n))_i = \frac{J(u_n + c_n e_i) - J(u_n - c_n e_i)}{2 c_n}, \qquad 1 \le i \le p,$$

where $e_i$ is the unit vector with a 1 in the $i$th place, and $c_n$ is a small positive number that decreases with $n$. With this method, $2p$ evaluations of $J$ are needed for each gradient estimate $\hat{g}_n$.
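As a concrete illustration, the FD estimator above can be sketched in Python (a minimal sketch; the loss `J`, iterate `u`, and step size `c` are placeholders supplied by the caller):

```python
import numpy as np

def fd_gradient(J, u, c):
    """Finite-difference gradient estimate: requires 2p evaluations of J."""
    p = len(u)
    g = np.zeros(p)
    for i in range(p):
        e = np.zeros(p)
        e[i] = 1.0  # unit vector with a 1 in the i-th place
        g[i] = (J(u + c * e) - J(u - c * e)) / (2.0 * c)
    return g
```

For a quadratic loss the central difference is exact up to rounding, which makes this a convenient sanity check.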
Let now $\Delta_n$ be a random perturbation vector. The $i$th component of the simultaneous perturbation gradient estimator is:

SP:
$$(\hat{g}_n(u_n))_i = \frac{J(u_n + c_n \Delta_n) - J(u_n - c_n \Delta_n)}{2 c_n (\Delta_n)_i}.$$
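For comparison, a sketch of the SP estimator, assuming a symmetric $\pm 1$ (Rademacher) perturbation, a common choice that satisfies the bounded-inverse-moment condition discussed below:

```python
import numpy as np

def sp_gradient(J, u, c, rng):
    """Simultaneous perturbation gradient estimate: 2 evaluations of J, for any p."""
    delta = rng.choice([-1.0, 1.0], size=len(u))  # random +-1 perturbation vector
    diff = J(u + c * delta) - J(u - c * delta)    # same numerator for every component
    return diff / (2.0 * c * delta)               # componentwise division by (Delta_n)_i
```

A single sample is noisy, but averaging many samples recovers the gradient, reflecting the near-unbiasedness discussed below.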
Remark that FD perturbs only one direction at a time, while the SP estimator disturbs all directions at the same time (the numerator is identical in all $p$ components). The number of loss function measurements needed in the SPSA method for each $\hat{g}_n$ is always 2, independent of $p$, whereas FDSA needs $2p$ measurements, so SPSA becomes increasingly advantageous as the dimension grows.
Simple experiments with $p=2$ showed that SPSA converges in the same number of iterations as FDSA. FDSA follows approximately the steepest descent direction, behaving like the gradient method. SPSA, with its random search direction, does not follow the gradient path exactly. On average, though, it tracks it nearly, because the gradient approximation is an almost unbiased estimator of the gradient, as shown in the following lemma.
Denote by

$$b_n = E[\hat{g}_n \mid u_n] - \nabla J(u_n)$$

the bias in the estimator $\hat{g}_n$. Assume that the $\{(\Delta_n)_i\}$ are all mutually independent with zero mean and bounded second moments, and that $E\left(|(\Delta_n)_i|^{-1}\right)$ is uniformly bounded. Then $b_n \to 0$ w.p. 1.
The main idea is to use conditioning on $\Delta_n$ to express $E[(\hat{g}_n)_i]$ and then to use a second-order Taylor expansion of $J(u_n + c_n \Delta_n)$ and $J(u_n - c_n \Delta_n)$. After algebraic manipulations using the zero mean and the independence of the $\{(\Delta_n)_i\}$, we get

$$E[(\hat{g}_n)_i] = (g_n)_i + O(c_n^2).$$

The result follows from the hypothesis that $c_n \to 0$.
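The lemma can be checked numerically. For a smooth test loss (chosen here purely for illustration), the empirical mean of many SP gradient samples approaches the true gradient, with a residual bias on the order of $c^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
J = lambda u: float(u[0]**4 + u[1]**2)   # smooth test loss, illustrative only
u = np.array([1.0, -1.0])
true_grad = np.array([4.0, -2.0])        # exact gradient of J at u
c = 0.05                                 # small perturbation size: bias is O(c^2)

samples = []
for _ in range(40000):
    delta = rng.choice([-1.0, 1.0], size=2)
    samples.append((J(u + c*delta) - J(u - c*delta)) / (2.0*c*delta))
mean_est = np.mean(samples, axis=0)      # close to true_grad, up to O(c^2) bias
```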
Next we resume some of the hypotheses under which $u_n$ converges in probability to the set of global minima of $J(u)$. The efficiency of the method depends on the shape of $J(u)$, the values of the parameters $a_n$ and $c_n$, and the distribution of the perturbation terms $\Delta_{ni}$. First, the algorithm parameters must satisfy the following conditions:

- $a_n > 0$, $a_n \to 0$ as $n \to \infty$, and $\sum_{n=1}^{\infty} a_n = \infty$; a good choice is $a_n = \frac{a}{n}$ with $a > 0$;
- $c_n = \frac{c}{n^\gamma}$, where $c > 0$ and $\gamma \in \left[\frac{1}{6}, \frac{1}{2}\right]$;
- $\sum_{n=1}^{\infty} \left(\frac{a_n}{c_n}\right)^2 < \infty$;
- the $\Delta_{ni}$ must be mutually independent zero-mean random variables, symmetrically distributed about zero, with $|\Delta_{ni}| < a_1 < \infty$ almost surely and $E\left(|\Delta_{ni}|^{-1}\right)$ uniformly bounded.
The loss function $J(u)$ must be thrice continuously differentiable and the individual elements of the third derivative must be bounded: $|J^{(3)}(u)| < a_3 < \infty$. Also, $|J(u)| \to \infty$ as $\|u\| \to \infty$.
In addition, $\nabla J$ must be Lipschitz continuous and bounded, and the ODE $\dot{u} = g(u)$ must have a unique solution for each initial condition. Under these conditions, $u_k$ converges in probability to the set of global minima of $J(u)$.
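Putting the pieces together, a minimal SPSA loop under the conditions above might look as follows. This is a sketch: the gain constants `a`, `c` and the $\pm 1$ perturbation distribution are illustrative choices consistent with the stated conditions, not values prescribed by the theory:

```python
import numpy as np

def spsa_minimize(J, u0, n_iter=2000, a=0.5, c=0.1, gamma=1.0/6.0, seed=0):
    """Basic SPSA loop: a_n = a/n (so sum a_n diverges), c_n = c/n^gamma, gamma in [1/6, 1/2]."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u0, dtype=float)
    for n in range(1, n_iter + 1):
        a_n = a / n
        c_n = c / n**gamma
        delta = rng.choice([-1.0, 1.0], size=u.size)  # symmetric +-1 perturbations
        g_hat = (J(u + c_n*delta) - J(u - c_n*delta)) / (2.0*c_n*delta)
        u = u - a_n * g_hat                           # descent step
    return u
```

Each iteration costs exactly two loss evaluations regardless of the dimension of `u`, which is the defining economy of the method.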
It has been shown that differentiability is not required: continuity and convexity are sufficient for convergence.[1]
It is known that a stochastic version of the standard (deterministic) Newton-Raphson algorithm (a “second-order” method) provides an asymptotically optimal or near-optimal form of stochastic approximation. SPSA can also be used to efficiently estimate the Hessian matrix of the loss function based on either noisy loss measurements or noisy gradient measurements (stochastic gradients). As with the basic SPSA method, only a small fixed number of loss measurements or gradient measurements are needed at each iteration, regardless of the problem dimension p. See the brief discussion in Stochastic gradient descent.
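A loss-only Hessian estimate in the spirit of adaptive SPSA can be sketched as follows. This is a simplified single-sample construction following the simultaneous-perturbation idea: a second independent perturbation and a symmetrized outer product are used, while the averaging and weighting details of the full adaptive algorithm are omitted:

```python
import numpy as np

def sp_hessian(J, u, c1, c2, rng):
    """One simultaneous-perturbation Hessian sample from four loss evaluations."""
    p = len(u)
    d1 = rng.choice([-1.0, 1.0], size=p)   # primary perturbation
    d2 = rng.choice([-1.0, 1.0], size=p)   # secondary, independent perturbation
    # second difference of J across the two perturbation directions
    dJ = (J(u + c1*d1 + c2*d2) - J(u + c1*d1)
          - J(u - c1*d1 + c2*d2) + J(u - c1*d1))
    m = dJ / (2.0 * c1 * c2)
    outer = np.outer(1.0/d1, 1.0/d2)       # elementwise-inverse outer product
    return 0.5 * m * (outer + outer.T)     # symmetrized single-sample estimate
```

Averaging many such samples on a quadratic loss recovers the true Hessian, again with only a small fixed number of loss evaluations per sample.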