For certain applications in linear algebra, it is useful to know properties of the probability distribution of the largest eigenvalue of a finite sum of random matrices. Suppose $\{X_k\}$ is a finite sequence of random matrices. Analogous to the well-known Chernoff bound for sums of scalars, a bound on the following is sought for a given parameter $t$:

$$\Pr\left\{\lambda_{\max}\left(\sum_k X_k\right) \geq t\right\}.$$

The theorems below give such bounds under various assumptions, named by analogy with their classical scalar counterparts.
For matrix Gaussian and Rademacher series, consider a finite sequence $\{A_k\}$ of fixed, self-adjoint matrices with dimension $d$, and let $\{\xi_k\}$ be a finite sequence of independent standard normal or independent Rademacher random variables. Then, for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\sum_k \xi_k A_k\right) \geq t\right\} \leq d \cdot e^{-t^2/2\sigma^2},$$

where

$$\sigma^2 = \left\|\sum_k A_k^2\right\|.$$
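This bound is easy to probe numerically. The following is a minimal Python sketch, assuming an arbitrary batch of randomly generated symmetric matrices $A_k$; the dimensions, the threshold $t$, and the trial count are illustrative choices, not from the source. It compares the empirical tail of $\lambda_{\max}(\sum_k \xi_k A_k)$ for Rademacher signs $\xi_k$ against the theorem's bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 10, 50, 2000

# Fixed self-adjoint matrices A_k (illustrative random data).
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2

# Variance parameter: sigma^2 = || sum_k A_k^2 || (spectral norm).
sigma2 = np.linalg.norm(np.einsum("kij,kjl->il", A, A), ord=2)

t = 2.0 * np.sqrt(sigma2)
bound = d * np.exp(-t**2 / (2 * sigma2))

# Empirical tail probability of lambda_max(sum_k xi_k A_k) >= t.
hits = 0
for _ in range(trials):
    xi = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
    S = np.einsum("k,kij->ij", xi, A)
    hits += np.linalg.eigvalsh(S)[-1] >= t
print(f"empirical tail {hits / trials:.4f}  vs  bound {bound:.4f}")
```

The leading factor $d$ reflects the union-bound-like cost of controlling all eigenvalues at once; the empirical tail is typically far below the bound.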
An analogous result holds for rectangular matrices. Consider a finite sequence $\{B_k\}$ of fixed matrices with dimension $d_1 \times d_2$, and let $\{\xi_k\}$ be a finite sequence of independent standard normal or independent Rademacher random variables. Define the variance parameter

$$\sigma^2 = \max\left\{\left\|\sum_k B_k B_k^*\right\|, \left\|\sum_k B_k^* B_k\right\|\right\}.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\left\|\sum_k \xi_k B_k\right\| \geq t\right\} \leq (d_1 + d_2) \cdot e^{-t^2/2\sigma^2}.$$
The classical Chernoff bounds concern the sum of independent, nonnegative, and uniformly bounded random variables. In the matrix setting, the analogous theorem concerns a sum of positive semi-definite random matrices subject to a uniform eigenvalue bound.
Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies $X_k \succeq 0$ and $\lambda_{\max}(X_k) \leq R$ almost surely. Define

$$\mu_{\min} = \lambda_{\min}\left(\sum_k \operatorname{E} X_k\right) \quad \text{and} \quad \mu_{\max} = \lambda_{\max}\left(\sum_k \operatorname{E} X_k\right).$$

Then

$$\Pr\left\{\lambda_{\min}\left(\sum_k X_k\right) \leq (1-\delta)\mu_{\min}\right\} \leq d \cdot \left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\mu_{\min}/R} \quad \text{for } \delta \in [0,1),$$

and

$$\Pr\left\{\lambda_{\max}\left(\sum_k X_k\right) \geq (1+\delta)\mu_{\max}\right\} \leq d \cdot \left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu_{\max}/R} \quad \text{for } \delta \geq 0.$$
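As a worked example, the sketch below evaluates both bounds for a toy model that is an assumption of this illustration, not part of the source: each $X_k = u_k u_k^*$ is a rank-one projection onto a uniformly random unit vector in $\mathbb{R}^d$, so $X_k \succeq 0$, $\lambda_{\max}(X_k) \leq 1 = R$, $\operatorname{E} X_k = I/d$, and hence $\mu_{\min} = \mu_{\max} = n/d$.

```python
import numpy as np

# Toy model: X_k = u_k u_k^T, u_k uniform on the unit sphere in R^d,
# so sum_k E X_k = (n/d) I and mu_min = mu_max = n/d.
d, n, R = 5, 200, 1.0
mu_min = mu_max = n / d

def lower_tail(delta):
    # Pr{ lambda_min(sum_k X_k) <= (1 - delta) * mu_min }
    return d * (np.exp(-delta) / (1 - delta) ** (1 - delta)) ** (mu_min / R)

def upper_tail(delta):
    # Pr{ lambda_max(sum_k X_k) >= (1 + delta) * mu_max }
    return d * (np.exp(delta) / (1 + delta) ** (1 + delta)) ** (mu_max / R)

print(f"lower-tail bound at delta=0.5: {lower_tail(0.5):.3e}")
print(f"upper-tail bound at delta=0.5: {upper_tail(0.5):.3e}")
```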
A second form of the matrix Chernoff inequality concerns the average of the sequence. Consider a sequence $\{X_k : k = 1, 2, \ldots, n\}$ of independent, random, self-adjoint matrices with dimension $d$ that satisfy $X_k \succeq 0$ and $\lambda_{\max}(X_k) \leq 1$ almost surely. Compute the minimum and maximum eigenvalues of the average expectation,

$$\bar{\mu}_{\min} = \lambda_{\min}\left(\frac{1}{n}\sum_{k=1}^n \operatorname{E} X_k\right) \quad \text{and} \quad \bar{\mu}_{\max} = \lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^n \operatorname{E} X_k\right).$$

Then

$$\Pr\left\{\lambda_{\min}\left(\frac{1}{n}\sum_{k=1}^n X_k\right) \leq \alpha\right\} \leq d \cdot e^{-n D(\alpha \,\|\, \bar{\mu}_{\min})} \quad \text{for } 0 \leq \alpha \leq \bar{\mu}_{\min},$$

and

$$\Pr\left\{\lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^n X_k\right) \geq \alpha\right\} \leq d \cdot e^{-n D(\alpha \,\|\, \bar{\mu}_{\max})} \quad \text{for } \bar{\mu}_{\max} \leq \alpha \leq 1.$$

Here

$$D(a \,\|\, u) = a\left(\log a - \log u\right) + (1-a)\left(\log(1-a) - \log(1-u)\right)$$

is the binary Kullback–Leibler divergence, defined for $a, u \in [0,1]$.
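Since the exponent is the binary Kullback–Leibler divergence, the bound is simple to evaluate. Below is a small helper; the function names and the sample parameters are my own, chosen only for illustration, and the snippet computes the upper-tail bound $d \cdot e^{-n D(\alpha \| \bar{\mu}_{\max})}$.

```python
import numpy as np

def binary_kl(a, u):
    # D(a || u) = a log(a/u) + (1-a) log((1-a)/(1-u)), with 0 log 0 = 0.
    total = 0.0
    if a > 0:
        total += a * np.log(a / u)
    if a < 1:
        total += (1 - a) * np.log((1 - a) / (1 - u))
    return total

def chernoff_upper_tail(d, n, mu_max_bar, alpha):
    # Pr{ lambda_max((1/n) sum X_k) >= alpha } <= d * exp(-n D(alpha || mu_max_bar))
    assert mu_max_bar <= alpha <= 1
    return d * np.exp(-n * binary_kl(alpha, mu_max_bar))

# Example: d = 10, n = 100 summands, average expectation 0.3, threshold 0.5.
print(f"{chernoff_upper_tail(10, 100, 0.3, 0.5):.3e}")
```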
In the scalar setting, the Bennett and Bernstein inequalities describe the upper tail of a sum of independent, zero-mean random variables that are either bounded or subexponential. In the matrix case, the analogous results concern a sum of zero-mean random matrices.
In the bounded case, consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies $\operatorname{E} X_k = 0$ and $\lambda_{\max}(X_k) \leq R$ almost surely. Compute the norm of the total variance,

$$\sigma^2 = \left\|\sum_k \operatorname{E}\left(X_k^2\right)\right\|.$$

Then, the following chain of inequalities holds for all $t \geq 0$:

$$\begin{align} \Pr\left\{\lambda_{\max}\left(\sum_k X_k\right) \geq t\right\} &\leq d \cdot \exp\left(-\frac{\sigma^2}{R^2} \cdot h\left(\frac{Rt}{\sigma^2}\right)\right)\\ &\leq d \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right)\\ &\leq \begin{cases} d \cdot \exp\left(-3t^2/8\sigma^2\right) & \text{for } t \leq \sigma^2/R;\\ d \cdot \exp\left(-3t/8R\right) & \text{for } t \geq \sigma^2/R, \end{cases} \end{align}$$

where $h(u) = (1+u)\log(1+u) - u$ for $u \geq 0$.
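The three expressions in the chain can be compared directly. The sketch below evaluates them at a few values of $t$, with placeholder values for $d$, $R$, and $\sigma^2$ (my own choices), illustrating that each successive bound is weaker but simpler.

```python
import numpy as np

d, R, sigma2 = 10, 1.0, 4.0

def bennett(t):
    h = lambda u: (1 + u) * np.log(1 + u) - u
    return d * np.exp(-(sigma2 / R**2) * h(R * t / sigma2))

def bernstein(t):
    return d * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))

def split(t):
    # Piecewise simplification; the crossover point is t = sigma^2 / R.
    if t <= sigma2 / R:
        return d * np.exp(-3 * t**2 / (8 * sigma2))
    return d * np.exp(-3 * t / (8 * R))

for t in [1.0, 4.0, 16.0]:   # below, at, and above the crossover
    print(f"t={t:5.1f}  {bennett(t):.3e} <= {bernstein(t):.3e} <= {split(t):.3e}")
```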
A matrix Bernstein inequality also holds for summands with subexponential moment growth. Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that

$$\operatorname{E} X_k = 0 \quad \text{and} \quad \operatorname{E}\left(X_k^p\right) \preceq \frac{p!}{2} \cdot R^{p-2} A_k^2 \quad \text{for } p = 2, 3, 4, \ldots$$

Compute the variance parameter,

$$\sigma^2 = \left\|\sum_k A_k^2\right\|.$$

Then, the following chain of inequalities holds for all $t \geq 0$:

$$\begin{align} \Pr\left\{\lambda_{\max}\left(\sum_k X_k\right) \geq t\right\} &\leq d \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt}\right)\\ &\leq \begin{cases} d \cdot \exp\left(-t^2/4\sigma^2\right) & \text{for } t \leq \sigma^2/R;\\ d \cdot \exp\left(-t/4R\right) & \text{for } t \geq \sigma^2/R. \end{cases} \end{align}$$
The analogous result for general rectangular matrices reads as follows. Consider a finite sequence $\{Z_k\}$ of independent, random matrices with dimension $d_1 \times d_2$. Assume that each random matrix satisfies $\operatorname{E} Z_k = 0$ and $\|Z_k\| \leq R$ almost surely. Define the variance parameter

$$\sigma^2 = \max\left\{\left\|\sum_k \operatorname{E}\left(Z_k Z_k^*\right)\right\|, \left\|\sum_k \operatorname{E}\left(Z_k^* Z_k\right)\right\|\right\}.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\left\|\sum_k Z_k\right\| \geq t\right\} \leq (d_1 + d_2) \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right)$$

holds.[1]
The scalar version of Azuma's inequality states that a scalar martingale exhibits normal concentration about its mean value, with the scale for deviations controlled by the total maximum squared range of the difference sequence. The following is the extension to the matrix setting.
Consider a finite adapted sequence $\{X_k\}$ of self-adjoint matrices with dimension $d$, and a fixed sequence $\{A_k\}$ of self-adjoint matrices that satisfy

$$\operatorname{E}_{k-1} X_k = 0 \quad \text{and} \quad X_k^2 \preceq A_k^2 \quad \text{almost surely}.$$

Compute the variance parameter

$$\sigma^2 = \left\|\sum_k A_k^2\right\|.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\sum_k X_k\right) \geq t\right\} \leq d \cdot e^{-t^2/8\sigma^2}.$$
The constant 1/8 can be improved to 1/2 when there is additional information available. One case occurs when each summand $X_k$ is conditionally symmetric; another occurs when $X_k$ commutes almost surely with $A_k$.
Placing the additional assumption that the summands in the matrix Azuma inequality are independent gives a matrix extension of Hoeffding's inequalities.
Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with dimension $d$, and let $\{A_k\}$ be a sequence of fixed self-adjoint matrices. Assume that each random matrix satisfies

$$\operatorname{E} X_k = 0 \quad \text{and} \quad X_k^2 \preceq A_k^2 \quad \text{almost surely}.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\sum_k X_k\right) \geq t\right\} \leq d \cdot e^{-t^2/8\sigma^2},$$

where

$$\sigma^2 = \left\|\sum_k A_k^2\right\|.$$
An improvement of this result was established in subsequent work: for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\sum_k X_k\right) \geq t\right\} \leq d \cdot e^{-t^2/2\sigma^2},$$

where

$$\sigma^2 = \frac{1}{2}\left\|\sum_k \left(A_k^2 + \operatorname{E} X_k^2\right)\right\| \leq \left\|\sum_k A_k^2\right\|.$$
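The improvement is visible numerically whenever $\operatorname{E} X_k^2$ is strictly smaller than $A_k^2$. In the assumed model below (my own illustration), $X_k = \xi_k A_k$ with $\xi_k$ uniform on $[-1, 1]$, so $X_k^2 \preceq A_k^2$ holds while $\operatorname{E} X_k^2 = A_k^2/3$, and the improved variance parameter is $2/3$ of the original.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8, 30

# Illustrative model: X_k = xi_k * A_k, xi_k uniform on [-1, 1], so
# E X_k = 0, X_k^2 <= A_k^2 almost surely, and E X_k^2 = A_k^2 / 3.
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
A2 = np.einsum("kij,kjl->kil", A, A)        # the matrices A_k^2

sigma2_hoeffding = np.linalg.norm(A2.sum(axis=0), ord=2)
sigma2_improved = 0.5 * np.linalg.norm((A2 + A2 / 3).sum(axis=0), ord=2)

print(f"{sigma2_improved:.2f} <= {sigma2_hoeffding:.2f}")  # ratio 2/3 here
```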
In the scalar setting, McDiarmid's inequality provides one common way of bounding the differences, by applying Azuma's inequality to a Doob martingale. A version of the bounded differences inequality holds in the matrix setting.
Let $\{Z_k : k = 1, 2, \ldots, n\}$ be an independent family of random variables, and let $H$ be a function that maps $n$ variables to a self-adjoint matrix of dimension $d$. Consider a sequence $\{A_k\}$ of fixed self-adjoint matrices that satisfy

$$\left(H(z_1, \ldots, z_k, \ldots, z_n) - H(z_1, \ldots, z_k', \ldots, z_n)\right)^2 \preceq A_k^2,$$

where $z_i$ and $z_i'$ range over all possible values of $Z_i$ for each index $i$. Compute the variance parameter

$$\sigma^2 = \left\|\sum_k A_k^2\right\|.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(H(\mathbf{z}) - \operatorname{E} H(\mathbf{z})\right) \geq t\right\} \leq d \cdot e^{-t^2/8\sigma^2},$$

where $\mathbf{z} = (Z_1, \ldots, Z_n)$.
An improvement of this result was also established in subsequent work: for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(H(\mathbf{z}) - \operatorname{E} H(\mathbf{z})\right) \geq t\right\} \leq d \cdot e^{-t^2/\sigma^2},$$

where $\mathbf{z} = (Z_1, \ldots, Z_n)$ and

$$\sigma^2 = \left\|\sum_k A_k^2\right\|.$$
The first bounds of this type were derived by Ahlswede and Winter. Recall the theorem above for self-adjoint matrix Gaussian and Rademacher bounds: for a finite sequence $\{A_k\}$ of fixed, self-adjoint matrices with dimension $d$ and a finite sequence $\{\xi_k\}$ of independent standard normal or independent Rademacher random variables,

$$\Pr\left\{\lambda_{\max}\left(\sum_k \xi_k A_k\right) \geq t\right\} \leq d \cdot e^{-t^2/2\sigma^2}, \quad \text{where } \sigma^2 = \left\|\sum_k A_k^2\right\|.$$

The Ahlswede–Winter argument gives the same bound, except with

$$\sigma_{AW}^2 = \sum_k \lambda_{\max}\left(A_k^2\right).$$

By comparison, the $\sigma^2$ in the theorem above swaps the order of $\sum$ and $\lambda_{\max}$: it is the largest eigenvalue of the sum rather than the sum of the largest eigenvalues. By the triangle inequality it is never larger than the Ahlswede–Winter value, and it can be much smaller.
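The gap between the two variance parameters is easy to observe numerically. In the sketch below (with illustrative random symmetric matrices of my own choosing), the norm of the sum comes out much smaller than the sum of the largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 8, 100

A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
A2 = np.einsum("kij,kjl->kil", A, A)          # the matrices A_k^2

sigma2_tropp = np.linalg.norm(A2.sum(axis=0), ord=2)      # ||sum_k A_k^2||
sigma2_aw = sum(np.linalg.eigvalsh(M)[-1] for M in A2)    # sum_k lambda_max(A_k^2)

print(f"sigma^2 = {sigma2_tropp:.1f}  <=  sigma^2_AW = {sigma2_aw:.1f}")
```

For independent random summands the sum $\sum_k A_k^2$ enjoys cancellation across directions, while $\sigma_{AW}^2$ accumulates the worst direction of every summand, so the gap typically grows with the number of summands.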
The chief contribution of Ahlswede and Winter was the extension of the Laplace-transform method used to prove the scalar Chernoff bound (see Chernoff bound#Additive form (absolute error)) to the case of self-adjoint matrices. The procedure is given in the derivation below. All of the recent works on this topic follow the same procedure, and the chief differences arise in subsequent steps: Ahlswede and Winter use the Golden–Thompson inequality to proceed, whereas Tropp uses Lieb's theorem.
Suppose one wished to vary the length of the series ($n$) and the dimension of the matrices ($d$) while keeping the right-hand side approximately constant. Then $n$ must vary approximately as the logarithm of $d$. Several papers have attempted to establish a bound without a dependence on dimension. Rudelson and Vershynin give a result for matrices which are the outer product of two vectors. Later work provides a result without the dimensional dependence for low-rank matrices; the original result was derived independently of the Ahlswede–Winter approach, but a similar result has since been proved using it.
Finally, Oliveira proves a result for matrix martingales independently of the Ahlswede–Winter framework, and Tropp slightly improves upon that result using the framework. Neither result is presented in this article.
The Laplace transform argument found in Ahlswede and Winter's work is a significant result in its own right. Let $Y$ be a random self-adjoint matrix. Then

$$\Pr\left\{\lambda_{\max}(Y) \geq t\right\} \leq \inf_{\theta > 0}\left\{e^{-\theta t} \cdot \operatorname{E}\left[\operatorname{tr} e^{\theta Y}\right]\right\}.$$

To prove this, fix $\theta > 0$. Then

$$\begin{align} \Pr\left\{\lambda_{\max}(Y) \geq t\right\} &= \Pr\left\{\lambda_{\max}(\theta Y) \geq \theta t\right\}\\ &= \Pr\left\{e^{\lambda_{\max}(\theta Y)} \geq e^{\theta t}\right\}\\ &\leq e^{-\theta t}\operatorname{E}\, e^{\lambda_{\max}(\theta Y)}\\ &\leq e^{-\theta t}\operatorname{E}\operatorname{tr} e^{\theta Y}. \end{align}$$

The second-to-last inequality is Markov's inequality. The last inequality holds since $e^{\lambda_{\max}(\theta Y)} = \lambda_{\max}(e^{\theta Y}) \leq \operatorname{tr}(e^{\theta Y})$, because all eigenvalues of the exponential of a self-adjoint matrix are positive. Since the left-hand side is independent of $\theta$, taking the infimum over $\theta > 0$ preserves the bound.
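The spectral facts used in the last step can be checked directly. The following is a minimal sketch with arbitrary random data: the identity $e^{\lambda_{\max}(\theta Y)} = \lambda_{\max}(e^{\theta Y})$ holds up to floating-point error, and the trace dominates.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
d, theta = 6, 0.7
Y = rng.standard_normal((d, d))
Y = (Y + Y.T) / 2                  # a self-adjoint matrix

lhs = np.exp(theta * np.linalg.eigvalsh(Y)[-1])    # e^{lambda_max(theta Y)}
E = expm(theta * Y)                                # matrix exponential
# lambda_max(e^{theta Y}) equals lhs, and the trace dominates it because
# every eigenvalue of e^{theta Y} is positive.
print(lhs, np.linalg.eigvalsh(E)[-1], np.trace(E))
```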
Thus, our task is to understand $\operatorname{E}[\operatorname{tr}(e^{\theta Y})]$; this is the point where the approaches of Ahlswede–Winter and Tropp diverge. In what follows, call $M_Y(\theta) := \operatorname{E} e^{\theta Y}$ the matrix generating function.
The Golden–Thompson inequality implies that, for independent $X_1$ and $X_2$,

$$\operatorname{tr} M_{X_1 + X_2}(\theta) \leq \operatorname{tr}\left[\left(\operatorname{E} e^{\theta X_1}\right)\left(\operatorname{E} e^{\theta X_2}\right)\right] = \operatorname{tr}\left[M_{X_1}(\theta)\, M_{X_2}(\theta)\right].$$

Suppose $Y = \sum_k X_k$ with independent summands. Combining this bound with the fact that $\operatorname{tr}(AB) \leq \operatorname{tr}(A)\,\lambda_{\max}(B)$ for positive semi-definite $A$ and $B$ gives

$$\operatorname{tr} M_Y(\theta) \leq \operatorname{tr}\left[\left(\operatorname{E} e^{\theta \sum_{k=1}^{n-1} X_k}\right)\left(\operatorname{E} e^{\theta X_n}\right)\right] \leq \operatorname{tr}\left(\operatorname{E} e^{\theta \sum_{k=1}^{n-1} X_k}\right) \lambda_{\max}\left(\operatorname{E} e^{\theta X_n}\right).$$

Iterating over all summands yields

$$\operatorname{tr} M_Y(\theta) \leq (\operatorname{tr} I)\left[\prod_k \lambda_{\max}\left(\operatorname{E} e^{\theta X_k}\right)\right] = d\, e^{\sum_k \lambda_{\max}\left(\log \operatorname{E} e^{\theta X_k}\right)}.$$
So far we have found a bound with an infimum over $\theta$. In this form, one can also see how the Ahlswede–Winter parameter $\sigma_{AW}^2$ arises as a sum of largest eigenvalues over the separate summands.
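The Golden–Thompson inequality that drives this derivation, $\operatorname{tr} e^{A+B} \leq \operatorname{tr}(e^A e^B)$ for self-adjoint $A$ and $B$, can be verified numerically. The following is a minimal sketch with arbitrary random symmetric matrices.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
d = 6
def sym():
    M = rng.standard_normal((d, d))
    return (M + M.T) / 2

A, B = sym(), sym()
# Golden-Thompson: tr e^{A+B} <= tr(e^A e^B); equality iff A and B commute.
print(np.trace(expm(A + B)), "<=", np.trace(expm(A) @ expm(B)))
```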
The major contribution of Tropp was the application of Lieb's theorem where Ahlswede and Winter had applied the Golden–Thompson inequality. Tropp's corollary is the following: if $H$ is a fixed self-adjoint matrix and $X$ is a random self-adjoint matrix, then

$$\operatorname{E}\operatorname{tr} e^{H + X} \leq \operatorname{tr} e^{H + \log\left(\operatorname{E} e^X\right)}.$$

Proof: Let $Y = e^X$. Then Lieb's theorem tells us that $f(Y) = \operatorname{tr} e^{H + \log(Y)}$ is concave, so Jensen's inequality moves the expectation inside the function:

$$\operatorname{E}\operatorname{tr} e^{H + \log(Y)} \leq \operatorname{tr} e^{H + \log\left(\operatorname{E} Y\right)}.$$
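Tropp's corollary admits a quick numerical check. The sketch below is illustrative only: it takes $X$ to be a two-point random matrix, equal to one of two fixed symmetric matrices with probability 1/2 each.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(6)
d = 5
def sym():
    M = rng.standard_normal((d, d))
    return (M + M.T) / 2

H, X1, X2 = sym(), sym(), sym()   # X equals X1 or X2 with probability 1/2

lhs = 0.5 * (np.trace(expm(H + X1)) + np.trace(expm(H + X2)))  # E tr e^{H+X}
M = 0.5 * (expm(X1) + expm(X2))                                # E e^{X}
rhs = np.trace(expm(H + logm(M).real))              # tr e^{H + log E e^X}
print(lhs, "<=", rhs)
```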
This gives us the major result of the paper: the subadditivity of the log of the matrix generating function.
Let $\{X_k\}$ be a finite sequence of independent, random, self-adjoint matrices. Then, for all $\theta \in \mathbb{R}$,

$$\operatorname{tr} M_{\sum_k X_k}(\theta) \leq \operatorname{tr} e^{\sum_k \log M_{X_k}(\theta)}.$$

Proof: It is sufficient to let $\theta = 1$, since the general case follows by rescaling each $X_k$ by $\theta$. Expanding the definitions, we must show that

$$\operatorname{E}\operatorname{tr} e^{\sum_k \theta X_k} \leq \operatorname{tr} e^{\sum_k \log \operatorname{E} e^{\theta X_k}}.$$
To complete the proof, we use the law of total expectation. Let $\operatorname{E}_k$ denote the expectation conditioned on $X_1, \ldots, X_k$. Since we assume all the $X_i$ are independent,

$$\operatorname{E}_{k-1} e^{X_k} = \operatorname{E} e^{X_k}.$$

Define

$$\Xi_k = \log \operatorname{E}_{k-1} e^{X_k} = \log M_{X_k}(\theta).$$
Finally, we have

$$\begin{align} \operatorname{E}\operatorname{tr} e^{\sum_{k=1}^n X_k} &= \operatorname{E}_0 \cdots \operatorname{E}_{n-1} \operatorname{tr} e^{\sum_{k=1}^{n-1} X_k + X_n}\\ &\leq \operatorname{E}_0 \cdots \operatorname{E}_{n-2} \operatorname{tr} e^{\sum_{k=1}^{n-1} X_k + \log\left(\operatorname{E}_{n-1} e^{X_n}\right)}\\ &= \operatorname{E}_0 \cdots \operatorname{E}_{n-2} \operatorname{tr} e^{\sum_{k=1}^{n-1} X_k + \Xi_n}\\ &\;\;\vdots\\ &= \operatorname{tr} e^{\sum_{k=1}^n \Xi_k}, \end{align}$$

where at each step $m$ we apply Tropp's corollary with

$$H_m = \sum_{k=1}^{m-1} X_k + \sum_{k=m+1}^n \Xi_k.$$
The following master tail bound is immediate from the previous result:

$$\Pr\left\{\lambda_{\max}\left(\sum_k X_k\right) \geq t\right\} \leq \inf_{\theta > 0}\left\{e^{-\theta t}\operatorname{tr} e^{\sum_k \log M_{X_k}(\theta)}\right\}.$$
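As a sanity check, the master bound can be evaluated numerically. For a Rademacher series $\sum_k \xi_k A_k$, the summand generating functions have the closed form $M_{X_k}(\theta) = \operatorname{E} e^{\theta \xi_k A_k} = \cosh(\theta A_k)$, so the infimum can be approximated by a one-dimensional search. The data, threshold, and search interval below are illustrative choices, not from the source.

```python
import numpy as np
from scipy.linalg import expm, logm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
d, n = 4, 20
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2

def master_bound(t):
    # inf_theta e^{-theta t} tr exp( sum_k log M_{X_k}(theta) ),
    # with M_{X_k}(theta) = cosh(theta A_k) for Rademacher summands.
    def objective(theta):
        L = sum(logm((expm(theta * Ak) + expm(-theta * Ak)) / 2).real
                for Ak in A)
        return np.exp(-theta * t) * np.trace(expm(L))
    res = minimize_scalar(objective, bounds=(1e-3, 5.0), method="bounded")
    return res.fun

print(f"master tail bound at t=10: {master_bound(10.0):.3e}")
```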