Compact quasi-Newton representation

The compact representation for quasi-Newton methods is a matrix decomposition that is typically used in gradient-based optimization algorithms or for solving nonlinear systems. The decomposition expresses the direct and/or inverse Hessian, or the Jacobian of a nonlinear system, as a low-rank update of an initial matrix. Because of this structure, the compact representation is widely used for large-scale problems and constrained optimization.

Definition

The compact representation of a quasi-Newton matrix for the inverse Hessian $H_k$ or direct Hessian $B_k$ of a nonlinear objective function $f(x): \mathbb{R}^n \to \mathbb{R}$ expresses a sequence of recursive rank-1 or rank-2 matrix updates as one rank-$k$ or rank-$2k$ update of an initial matrix.[1] [2] Because it is derived from quasi-Newton updates, it uses differences of iterates and gradients $\nabla f(x_k) = g_k$ in its definition, $\{\, s_{i-1} = x_i - x_{i-1},\; y_{i-1} = g_i - g_{i-1} \,\}_{i=1}^{k}$. In particular, for $r = k$ or $r = 2k$ the rectangular $n \times r$ matrices $U_k, J_k$ and the $r \times r$ square symmetric systems $M_k, N_k$ depend on the $s_i, y_i$ pairs and define the quasi-Newton representations

$$H_k = H_0 + U_k M_k^{-1} U_k^T \quad \text{and} \quad B_k = B_0 + J_k N_k^{-1} J_k^T.$$
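
For concreteness, the pairs $s_{i-1}, y_{i-1}$ can be accumulated during any gradient-based iteration. A minimal NumPy sketch (the quadratic objective, step size, and iteration count are hypothetical choices for illustration only):

    import numpy as np

    # Collect quasi-Newton pairs from a short gradient-descent run on a
    # quadratic f(x) = 0.5 x^T A x - b^T x with gradient g(x) = A x - b.
    rng = np.random.default_rng(0)
    n = 5
    A = np.eye(n) + 0.1 * np.ones((n, n))    # a simple symmetric positive definite Hessian
    b = rng.standard_normal(n)

    def grad(x):
        return A @ x - b

    x = np.zeros(n)
    g = grad(x)
    S_cols, Y_cols = [], []
    for _ in range(3):
        x_new = x - 0.1 * g                   # fixed-step gradient descent
        g_new = grad(x_new)
        S_cols.append(x_new - x)              # s_{i-1} = x_i - x_{i-1}
        Y_cols.append(g_new - g)              # y_{i-1} = g_i - g_{i-1}
        x, g = x_new, g_new

    S = np.column_stack(S_cols)               # n x k matrix S_k
    Y = np.column_stack(Y_cols)               # n x k matrix Y_k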

Applications

Because of its special structure the compact representation is implemented in state-of-the-art optimization software.[3] [4] [5] [6] When combined with limited-memory techniques it is a popular method for constrained optimization with gradients.[7] Linear algebra operations, such as matrix-vector products, solves, or eigendecompositions, can be done efficiently. The representation can be combined with line-search and trust-region techniques, and it has been developed for many quasi-Newton updates. For instance, the matrix-vector product of the direct quasi-Newton Hessian with an arbitrary vector $g \in \mathbb{R}^n$ is:
$$\begin{align} p_k^{(0)} &= J_k^T g \\ \text{solve } N_k\, p_k^{(1)} &= p_k^{(0)} \quad (N_k \text{ is small}) \\ p_k^{(2)} &= J_k p_k^{(1)} \\ p_k^{(3)} &= B_0 g \\ p_k &= p_k^{(2)} + p_k^{(3)} \end{align}$$
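
A minimal NumPy sketch of this product (assuming $J_k$, $N_k$, and $B_0$ have already been formed; the function name is illustrative):

    import numpy as np

    # Product B_k @ g with B_k = B0 + J N^{-1} J^T, without ever forming the
    # n x n matrix B_k. J is n x r and N is r x r with r << n, so the only
    # linear system solved is small.
    def compact_matvec(B0, J, N, g):
        p0 = J.T @ g                   # p_k^(0) = J_k^T g
        p1 = np.linalg.solve(N, p0)    # solve N_k p_k^(1) = p_k^(0)  (N_k is small)
        p2 = J @ p1                    # p_k^(2) = J_k p_k^(1)
        p3 = B0 @ g                    # p_k^(3) = B_0 g
        return p2 + p3                 # p_k = p_k^(2) + p_k^(3)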

Background

In the context of the GMRES method, Walker[8] showed that a product of Householder transformations (each an identity plus a rank-1 matrix) can be expressed as a compact matrix formula. This led to the derivation of an explicit matrix expression for the product of $k$ identity-plus-rank-1 matrices. Specifically, for

$$S_k = \begin{bmatrix} s_0 & s_1 & \ldots & s_{k-1} \end{bmatrix}, \quad Y_k = \begin{bmatrix} y_0 & y_1 & \ldots & y_{k-1} \end{bmatrix}, \quad \big(R_k\big)_{ij} = s_{i-1}^T y_{j-1} \text{ for } 1 \le i \le j \le k, \quad \rho_{i-1} = 1/s_{i-1}^T y_{i-1},$$

and $V_i = I - \rho_{i-1} y_{i-1} s_{i-1}^T$, the product of $k$ rank-1 updates to the identity is

$$\prod_{i=1}^{k} V_i = \left(I - \rho_0 y_0 s_0^T \right) \cdots \left(I - \rho_{k-1} y_{k-1} s_{k-1}^T \right) = I - Y_k R_k^{-1} S_k^T.$$

The BFGS update can be expressed in terms of products of the $V_i$'s, which have a compact matrix formula. Therefore, the BFGS recursion can exploit these block matrix representations.
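
The identity is easy to check numerically; a small NumPy sketch with random data (variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 8, 3
    S = rng.standard_normal((n, k))     # columns s_0, ..., s_{k-1}
    Y = rng.standard_normal((n, k))     # columns y_0, ..., y_{k-1}

    # Left-hand side: accumulate V_1 V_2 ... V_k with V_i = I - rho_{i-1} y_{i-1} s_{i-1}^T.
    P = np.eye(n)
    for i in range(k):
        rho = 1.0 / (S[:, i] @ Y[:, i])
        P = P @ (np.eye(n) - rho * np.outer(Y[:, i], S[:, i]))

    # Right-hand side: compact form I - Y_k R_k^{-1} S_k^T, where R_k is the
    # upper triangular part (including the diagonal) of S_k^T Y_k.
    R = np.triu(S.T @ Y)
    Q = np.eye(n) - Y @ np.linalg.solve(R, S.T)

    print(np.allclose(P, Q))            # True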

Recursive quasi-Newton updates

A parametric family of quasi-Newton updates includes many of the best known formulas.[9] For arbitrary vectors $v_k$ and $c_k$ such that $v_k^T y_k \ne 0$ and $c_k^T s_k \ne 0$, general recursive update formulas for the inverse and direct Hessian estimates are

$$H_{k+1} = H_k + \frac{(s_k - H_k y_k) v_k^T + v_k (s_k - H_k y_k)^T}{v_k^T y_k} - \frac{(s_k - H_k y_k)^T y_k}{\big(v_k^T y_k\big)^2} v_k v_k^T,$$

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k) c_k^T + c_k (y_k - B_k s_k)^T}{c_k^T s_k} - \frac{(y_k - B_k s_k)^T s_k}{\big(c_k^T s_k\big)^2} c_k c_k^T.$$

By making specific choices for the parameter vectors $v_k$ and $c_k$, well known methods are recovered (Table 1); a numerical illustration follows the table.
Table 1: Quasi-Newton updates parametrized by vectors $v_k$ and $c_k$

| $v_k$ | method | $c_k$ | method |
|---|---|---|---|
| $s_k$ | BFGS | $s_k$ | PSB (Powell Symmetric Broyden) |
| $y_k$ | Greenstadt's | $y_k$ | DFP |
| $s_k - H_k y_k$ | SR1 | $y_k - B_k s_k$ | SR1 |
| | | $P^S_k s_k$ [10] | MSS (Multipoint-Symmetric-Secant) |
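
As a numerical illustration of the parametric family, the following sketch (using the general direct update written above; all data random) checks that the choice $c_k = y_k - B_k s_k$ from Table 1 collapses the rank-2 update to the rank-1 SR1 formula:

    import numpy as np

    # One step of the parametric rank-2 update of the direct Hessian estimate.
    def general_direct_update(B, s, y, c):
        r = y - B @ s                        # residual of the secant condition
        cs = c @ s                           # c_k^T s_k (must be nonzero)
        return (B + (np.outer(r, c) + np.outer(c, r)) / cs
                - (r @ s) / cs**2 * np.outer(c, c))

    rng = np.random.default_rng(1)
    n = 6
    B = np.eye(n)
    s, y = rng.standard_normal(n), rng.standard_normal(n)

    c = y - B @ s                            # SR1 choice of c_k from Table 1
    B_general = general_direct_update(B, s, y, c)
    B_sr1 = B + np.outer(c, c) / (c @ s)     # classical rank-1 SR1 update
    print(np.allclose(B_general, B_sr1))     # True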

Compact Representations

Collecting the updating vectors of the recursive formulas into matrices, define

$$S_k = \begin{bmatrix} s_0 & s_1 & \ldots & s_{k-1} \end{bmatrix}, \quad Y_k = \begin{bmatrix} y_0 & y_1 & \ldots & y_{k-1} \end{bmatrix}, \quad V_k = \begin{bmatrix} v_0 & v_1 & \ldots & v_{k-1} \end{bmatrix}, \quad C_k = \begin{bmatrix} c_0 & c_1 & \ldots & c_{k-1} \end{bmatrix},$$

the upper triangular matrices

$$\big(R_k\big)_{ij} := \big(R^{SY}_k\big)_{ij} = s_{i-1}^T y_{j-1}, \quad \big(R^{VY}_k\big)_{ij} = v_{i-1}^T y_{j-1}, \quad \big(R^{CS}_k\big)_{ij} = c_{i-1}^T s_{j-1}, \quad \text{for } 1 \le i \le j \le k,$$

the lower triangular matrices

$$\big(L_k\big)_{ij} := \big(L^{SY}_k\big)_{ij} = s_{i-1}^T y_{j-1}, \quad \big(L^{VY}_k\big)_{ij} = v_{i-1}^T y_{j-1}, \quad \big(L^{CS}_k\big)_{ij} = c_{i-1}^T s_{j-1}, \quad \text{for } 1 \le j < i \le k,$$

and the diagonal matrix

$$\big(D_k\big)_{ii} := \big(D^{SY}_k\big)_{ii} = s_{i-1}^T y_{i-1}, \quad \text{for } 1 \le i \le k.$$

With these definitions, compact representations of the general rank-2 updates (including the well known quasi-Newton updates in Table 1) have been developed in Brust:[11]

$$U_k = \begin{bmatrix} V_k & S_k - H_0 Y_k \end{bmatrix}, \quad M_k = \begin{bmatrix} 0_{k \times k} & R^{VY}_k \\ \big(R^{VY}_k\big)^T & R_k + R_k^T - \big(D_k + Y_k^T H_0 Y_k\big) \end{bmatrix},$$

and the formula for the direct Hessian is

$$J_k = \begin{bmatrix} C_k & Y_k - B_0 S_k \end{bmatrix}, \quad N_k = \begin{bmatrix} 0_{k \times k} & R^{CS}_k \\ \big(R^{CS}_k\big)^T & R_k + R_k^T - \big(D_k + S_k^T B_0 S_k\big) \end{bmatrix}.$$

For instance, when $V_k = S_k$ this representation reduces to the compact formula for the BFGS recursion.
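
The equivalence can be verified numerically. A short NumPy sketch comparing $k$ recursive BFGS updates of the inverse Hessian against the compact form with $V_k = S_k$ (random data; the algebraic identity holds whenever all updates are well defined):

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 7, 4
    H0 = np.eye(n)
    S = rng.standard_normal((n, k))
    Y = rng.standard_normal((n, k))

    # k recursive BFGS updates of the inverse Hessian approximation
    H = H0.copy()
    for i in range(k):
        s, y = S[:, i], Y[:, i]
        rho = 1.0 / (s @ y)
        V = np.eye(n) - rho * np.outer(y, s)
        H = V.T @ H @ V + rho * np.outer(s, s)

    # Compact form with V_k = S_k: H_k = H_0 + U_k M_k^{-1} U_k^T
    SY = S.T @ Y
    R = np.triu(SY)                        # (R_k)_{ij} = s_{i-1}^T y_{j-1}, i <= j
    D = np.diag(np.diag(SY))
    U = np.hstack([S, S - H0 @ Y])
    M = np.block([[np.zeros((k, k)), R],
                  [R.T, R + R.T - (D + Y.T @ H0 @ Y)]])
    H_compact = H0 + U @ np.linalg.solve(M, U.T)
    print(np.allclose(H, H_compact))       # True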

Specific Representations

Prior to the development of the general compact representations above, equivalent representations had been discovered for most known updates (see Table 1).

BFGS

Along with the SR1 representation, the BFGS (Broyden-Fletcher-Goldfarb-Shanno) compact representation was the first compact formula known. In particular, the inverse representation is given by

$$H_k = H_0 + U_k M_k^{-1} U_k^T, \quad U_k = \begin{bmatrix} S_k & H_0 Y_k \end{bmatrix}, \quad M_k^{-1} = \begin{bmatrix} R_k^{-T}\big(D_k + Y_k^T H_0 Y_k\big) R_k^{-1} & -R_k^{-T} \\ -R_k^{-1} & 0 \end{bmatrix}.$$

The direct Hessian approximation can be found by applying the Sherman-Morrison-Woodbury identity to the inverse Hessian:

$$B_k = B_0 + J_k N_k^{-1} J_k^T, \quad J_k = \begin{bmatrix} B_0 S_k & Y_k \end{bmatrix}, \quad N_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}.$$

SR1

The SR1 (Symmetric Rank-1) compact representation was first proposed in Byrd, Nocedal and Schnabel.[7] Using the definitions of $D_k, L_k$ and $R_k$ from above, the inverse Hessian formula is given by

$$H_k = H_0 + U_k M_k^{-1} U_k^T, \quad U_k = S_k - H_0 Y_k, \quad M_k = R_k + R_k^T - \big(D_k + Y_k^T H_0 Y_k\big).$$

The direct Hessian is obtained by the Sherman-Morrison-Woodbury identity and has the form

$$B_k = B_0 + J_k N_k^{-1} J_k^T, \quad J_k = Y_k - B_0 S_k, \quad N_k = D_k + L_k + L_k^T - S_k^T B_0 S_k.$$
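
A corresponding numerical check for the SR1 formulas, comparing $k$ recursive rank-1 updates of the direct Hessian against the compact form (random data; the identity holds whenever all updates are well defined):

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 7, 3
    B0 = np.eye(n)
    S = rng.standard_normal((n, k))
    Y = rng.standard_normal((n, k))

    # k recursive SR1 updates of the direct Hessian approximation
    B = B0.copy()
    for i in range(k):
        s, y = S[:, i], Y[:, i]
        r = y - B @ s
        B = B + np.outer(r, r) / (r @ s)

    # Compact form: B_k = B_0 + J_k N_k^{-1} J_k^T
    SY = S.T @ Y
    D = np.diag(np.diag(SY))
    L = np.tril(SY, -1)                    # strictly lower triangular part L_k
    J = Y - B0 @ S
    N = D + L + L.T - S.T @ B0 @ S
    B_compact = B0 + J @ np.linalg.solve(N, J.T)
    print(np.allclose(B, B_compact))       # True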

MSS

The multipoint symmetric secant (MSS) method is a method that aims to satisfy multiple secant equations. The recursive update formula was originally developed by Burdakov.[12] The compact representation for the direct Hessian was derived in [13]

$$B_k = B_0 + J_k N_k^{-1} J_k^T, \quad J_k = \begin{bmatrix} S_k & Y_k - B_0 S_k \end{bmatrix}, \quad N_k = \begin{bmatrix} W_k^T \big( S_k^T B_0 S_k - \big(R_k - D_k + R_k^T\big) \big) W_k & W_k \\ W_k & 0 \end{bmatrix}^{-1}, \quad W_k = \big( S_k^T S_k \big)^{-1}.$$

Another equivalent compact representation for the MSS matrix is derived by rewriting $J_k$ in the form $J_k = \begin{bmatrix} S_k & B_0 Y_k \end{bmatrix}$.[14] The inverse representation can be obtained by application of the Sherman-Morrison-Woodbury identity.

DFP

Since the DFP (Davidon-Fletcher-Powell) update is the dual of the BFGS formula (i.e., swapping $H_k \leftrightarrow B_k$, $H_0 \leftrightarrow B_0$, and $y_k \leftrightarrow s_k$ in the BFGS update), the compact representation for DFP can be immediately obtained from the one for BFGS.[15]

PSB

The PSB (Powell-Symmetric-Broyden) compact representation was developed for the direct Hessian approximation.[16] It is equivalent to substituting $C_k = S_k$ in the general representation above:

$$B_k = B_0 + J_k N_k^{-1} J_k^T, \quad J_k = \begin{bmatrix} S_k & Y_k - B_0 S_k \end{bmatrix}, \quad N_k = \begin{bmatrix} 0_{k \times k} & R^{SS}_k \\ \big(R^{SS}_k\big)^T & R_k + R_k^T - \big(D_k + S_k^T B_0 S_k\big) \end{bmatrix}.$$

Structured BFGS

For structured optimization problems in which the objective function can be decomposed into two parts, $f(x) = \widehat{k}(x) + \widehat{u}(x)$, where the gradients and Hessian of $\widehat{k}(x)$ are known but only the gradient of $\widehat{u}(x)$ is known, structured BFGS formulas exist. The compact representation of these methods has the general form $B_k = B_0 + J_k N_k^{-1} J_k^T$, with specific $J_k$ and $N_k$.[17]

Reduced BFGS

The reduced compact representation (RCR) of BFGS is for linear equality constrained optimization

$$\text{minimize } f(x) \quad \text{subject to: } Ax = b,$$

where $A$ is underdetermined. In addition to the matrices $S_k, Y_k$, the RCR also stores the projections of the $y_i$'s onto the nullspace of $A$ in a matrix $Z_k$. With the BFGS matrix $B_k$ represented compactly (using a multiple of the identity as $B_0$), the (1,1) block of the inverse of the KKT matrix

$$K_k = \begin{bmatrix} B_k & A^T \\ A & 0 \end{bmatrix}, \quad B_0 = \frac{1}{\gamma_k} I, \quad H_0 = \gamma_k I, \quad \gamma_k > 0,$$

has the compact representation[18]

$$\big(K_k^{-1}\big)_{11} = H_0 + U_k M_k^{-1} U_k^T, \quad U_k = \begin{bmatrix} A^T & S_k & Z_k \end{bmatrix}, \quad M_k = \begin{bmatrix} -AA^T/\gamma_k & \\ & G_k \end{bmatrix}, \quad G_k = \begin{bmatrix} R_k^{-T}\big(D_k + Y_k^T H_0 Y_k\big) R_k^{-1} & -H_0 R_k^{-T} \\ -H_0 R_k^{-1} & 0 \end{bmatrix}^{-1}.$$

Limited Memory

The most common use of the compact representations is in the limited-memory setting, where $m \ll n$ denotes the memory parameter, with typical values around $m \in [5, 12]$ (see e.g.,[7]). Instead of storing the entire history of vectors, one limits this to the $m$ most recent pairs $\{ (s_i, y_i) \}_{i=k-m}^{k-1}$ and possibly $\{ v_i \}_{i=k-m}^{k-1}$ or $\{ c_i \}_{i=k-m}^{k-1}$. Further, the initialization is typically chosen as an adaptive multiple of the identity, $H_k^{(0)} = \gamma_k I$ with $\gamma_k = y_{k-1}^T s_{k-1} / y_{k-1}^T y_{k-1}$, and $B_k^{(0)} = \frac{1}{\gamma_k} I$. Limited-memory methods are frequently used for large-scale problems with many variables (i.e., $n$ can be large), in which case the limited-memory matrices $S_k \in \mathbb{R}^{n \times m}$ and $Y_k \in \mathbb{R}^{n \times m}$ (and possibly $V_k, C_k$) are tall and very skinny: $S_k = \begin{bmatrix} s_{k-m} & \ldots & s_{k-1} \end{bmatrix}$ and $Y_k = \begin{bmatrix} y_{k-m} & \ldots & y_{k-1} \end{bmatrix}$.
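
A minimal sketch of the limited-memory bookkeeping (update_memory is an illustrative helper, not part of any library):

    import numpy as np

    # Append the newest pair (s, y) and discard the oldest column once more
    # than m columns are stored. Also return the adaptive scaling gamma_k
    # used for the initializations H_k^(0) = gamma_k I and B_k^(0) = I / gamma_k.
    def update_memory(S, Y, s, y, m):
        S = np.hstack([S, s[:, None]])[:, -m:]
        Y = np.hstack([Y, y[:, None]])[:, -m:]
        gamma = (y @ s) / (y @ y)      # gamma_k = y_{k-1}^T s_{k-1} / y_{k-1}^T y_{k-1}
        return S, Y, gamma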

Implementations

Open source implementations include:

- the L-BFGS-B Fortran subroutines for large-scale bound-constrained optimization,[4] [21]
- the SC-SR1 MATLAB software for limited-memory SR1 trust-region methods,[5] [20] distributed through the Collected Algorithms of the ACM (CALGO),[19]
- the IPOPT interior-point solver for large-scale nonlinear programming.[6]

Non open source implementations include:

- the KNITRO solver for large-scale nonlinear optimization.[3]

Notes and References

  1. Book: Nocedal. J.. Wright. S.J.. 2006 . Numerical Optimization . Springer Series in Operations Research and Financial Engineering. Springer New York, NY . 10.1007/978-0-387-40065-5. 978-0-387-30303-1.
  2. Brust . J. J. . 2018 . Large-Scale Quasi-Newton Trust-Region Methods: High-Accuracy Solvers, Dense Initializations, and Extensions . PhD . University of California, Merced.
  3. Book: Byrd. R. H.. Nocedal. J.. Waltz. R. A.. 2006. KNITRO: An integrated package for nonlinear optimization. In: Di Pillo, G., Roma, M. (eds) Large-Scale Nonlinear Optimization. Nonconvex Optimization and Its Applications. 83. Springer, Boston, MA. 35-59. 10.1007/0-387-30065-1_4. 978-0-387-30063-4.
  4. 10.1145/279232.279236. Zhu. C.. Byrd. R. H.. Lu. P.. Nocedal. J.. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. 1997. ACM Transactions on Mathematical Software (TOMS). 23. 4. 550-560.
  5. 10.1145/3550269. Brust. J.. Burdakov. O.. Erway. J.. Marcia. R.. Algorithm 1030: SC-SR1: MATLAB software for limited-memory SR1 trust-region methods. 2022. ACM Transactions on Mathematical Software (TOMS). 48. 4. 1-33.
  6. 10.1007/s10107-004-0559-y. Wächter. A.. Biegler. L. T.. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. 2006. Mathematical Programming. 106. 25-57.
  7. 10.1007/BF01582063. Byrd. R. H.. Nocedal. J.. Schnabel. R. B.. Representations of Quasi-Newton Matrices and their use in Limited Memory Methods. 1994. Mathematical Programming. 63. 4. 129–156. 5581219.
  8. 10.1137/0909010. Walker. H. F.. Implementation of the GMRES Method Using Householder Transformations. 1988. SIAM Journal on Scientific and Statistical Computing. 9. 1. 152–163 .
  9. 10.1137/1019005. Dennis. Jr, J. E.. Moré. J. J.. Quasi-Newton methods, motivation and theory. 1977. SIAM Review. 19. 1. 46-89. 1813/6056 . free.
  10. $S_{k+1} = \begin{bmatrix} s_0 & \ldots & s_k \end{bmatrix}, \quad P^S_k = I - S_{k+1}\big(S_{k+1}^T S_{k+1}\big)^{-1} S_{k+1}^T$
  11. 2403.12206 . math.OC . J. J. . Brust . Useful Compact Representations for Data-Fitting . 2024.
  12. 10.1080/01630568308816160. Burdakov. O. P.. Methods of the secant type for systems of equations with symmetric Jacobian matrix. 1983. Numerical Functional Analysis and Optimization. 6. 2. 1–18.
  13. 10.1023/A:1021561204463. Burdakov. O. P.. Martínez. J. M. . Pilotta. E. A.. A limited-memory multipoint symmetric secant method for bound constrained optimization. 2002. Annals of Operations Research . 117. 51–70.
  14. 10.1080/10556788.2023.2296441. Brust. J. J.. Erway. J. B. . Marcia. R. F.. Shape-changing trust-region methods using multipoint symmetric secant matrices. 2024. Optimization Methods and Software . 39 . 5 . 990–1007. 2209.12057 .
  15. Erway . J. B. . Jain . V. . Marcia . R. F.. Shifted limited-memory DFP systems . 2013 . In 2013 Asilomar Conference on Signals, Systems and Computers . IEEE. 1033–1037 .
  16. 10.1007/s12532-023-00238-4. Kanzow. C. . Steck. D. . Regularization of limited memory quasi-Newton methods for large-scale nonconvex minimization. 2023. Mathematical Programming Computation. 15. 3. 417–444. free.
  17. 10.1007/s10589-021-00297-0. Brust. J. J . Di. Z.. Leyffer. S.. Petra. C. G. . Compact representations of structured BFGS matrices. 2021. Computational Optimization and Applications. 80. 1. 55–88.
  18. 10.1137/21M1393819. Brust. J. J . Marcia. R.F.. Petra. C.G.. Saunders. M. A. . Large-scale optimization with linear equality constraints using reduced compact representation. 2022. SIAM Journal on Scientific Computing. 44. 1. A103–A127. 2101.11048 . 2022SJSC...44A.103B .
  19. Web site: Collected Algorithms of the ACM (CALGO). calgo.acm.org.
  20. Web site: TOMS Alg. 1030. calgo.acm.org/1030.zip.
  21. Zhu. 10.1145/279232.279236. C.. Byrd. Richard H.. Lu. Peihuang. Nocedal. Jorge . L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization . 1997. ACM Transactions on Mathematical Software. 23. 4. 550–560. 207228122. free.