Sub-Gaussian distribution
In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. This property gives subgaussian distributions their name.
Often in analysis, we divide an object (such as a random variable) into two parts, a central bulk and a distant tail, then analyze each separately. In probability, this division usually goes like "everything interesting happens near the center; the tail event is so rare that we may safely ignore it." Subgaussian distributions are worthy of study because the Gaussian distribution is well understood, so we can give sharp bounds on the rarity of the tail event. Similarly, the subexponential distributions are also worthy of study.
Formally, the probability distribution of a random variable X is called subgaussian if there is a positive constant C such that for every t\geq 0,
\operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/C^2).
There are many equivalent definitions. For example, a random variable X is sub-Gaussian if and only if its tail distribution function is bounded from above (up to a constant) by that of a Gaussian:
\operatorname{P}(|X|\geq t)\leq c\,\operatorname{P}(|Z|\geq t)\qquad\text{for all }t>0,
where c>0 is a constant and Z is a mean-zero Gaussian random variable.
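As a quick numerical illustration of the defining tail bound, the following Python sketch checks \operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/C^2) for a standard normal variable, for which the constant C=\sqrt 2 works; the choice of distribution and constant here are illustrative only.

```python
# Sketch: check P(|Z| >= t) <= 2*exp(-t^2 / C^2) for a standard normal Z,
# using C = sqrt(2).  Illustrative only; the constant C is not unique.
import numpy as np
from scipy.stats import norm

C = np.sqrt(2.0)
for t in np.linspace(0.0, 5.0, 11):
    tail = 2 * norm.sf(t)              # P(|Z| >= t) = 2 * (1 - Phi(t))
    bound = 2 * np.exp(-t**2 / C**2)   # sub-Gaussian tail bound
    assert tail <= bound + 1e-12
    print(f"t={t:4.1f}  P(|Z|>=t)={tail:.3e}  bound={bound:.3e}")
```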
Definitions
Subgaussian norm
The subgaussian norm of X, denoted as \|X\|_{\psi_2}, is
\|X\|_{\psi_2}=\inf\{c>0:\operatorname{E}[\exp(X^2/c^2)]\leq 2\}.
In other words, it is the Orlicz norm of X generated by the Orlicz function \Phi(u)=e^{u^2}-1. By condition (2) below, subgaussian random variables can be characterized as those random variables with finite subgaussian norm.
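The infimum in the definition can be computed numerically. The sketch below does so for a Rademacher variable, where the Orlicz expectation \operatorname{E}[\exp(X^2/c^2)] has a closed form and the norm equals 1/\sqrt{\ln 2}; the bisection bracket and iteration count are arbitrary choices.

```python
# Sketch: compute the subgaussian norm ||X||_{psi_2} = inf{ c>0 : E exp(X^2/c^2) <= 2 }
# for a Rademacher variable (X = +/-1 with probability 1/2 each), by bisection on c.
# For this X, X^2 = 1, so E exp(X^2/c^2) = exp(1/c^2) and the norm is 1/sqrt(ln 2).
import numpy as np

def orlicz_expectation(c):
    # E[exp(X^2 / c^2)] for Rademacher X (exact, since X^2 == 1)
    return np.exp(1.0 / c**2)

lo, hi = 1e-3, 10.0
for _ in range(100):               # bisect the monotone map c -> E exp(X^2/c^2)
    mid = 0.5 * (lo + hi)
    if orlicz_expectation(mid) <= 2.0:
        hi = mid
    else:
        lo = mid

print("numerical psi_2 norm   :", hi)
print("closed form 1/sqrt(ln2):", 1.0 / np.sqrt(np.log(2.0)))
```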
Variance proxy
If there exists some s^2>0 such that \operatorname{E}[e^{(X-\operatorname{E}[X])t}]\leq e^{s^2t^2/2} for all t\in\mathbb{R}, then s^2 is called a variance proxy, and the smallest such s^2 is called the optimal variance proxy and denoted by \|X\|_{\mathrm{vp}}^2.
Since \operatorname{E}[e^{(X-\operatorname{E}[X])t}]=e^{\sigma^2t^2/2} when X is Gaussian with variance \sigma^2, we then have \|X\|_{\mathrm{vp}}^2=\sigma^2, as it should.
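The optimal variance proxy can be approximated numerically by maximizing 2\ln\operatorname{E}[e^{t(X-\operatorname{E}[X])}]/t^2 over t. The following Python sketch does this for a centered Bernoulli variable; the parameter p=0.3 and the t-grid are arbitrary illustrative choices, and the result should be at least the variance.

```python
# Sketch: numerically estimate the optimal variance proxy of a centered Bernoulli
# variable X taking the value q with probability p and -p with probability q = 1 - p,
# as sup_t 2*ln E[exp(tX)] / t^2.  The exact MGF is used, so no sampling is involved.
import numpy as np

p = 0.3
q = 1.0 - p

def log_mgf(t):
    return np.log(p * np.exp(t * q) + q * np.exp(-t * p))

ts = np.linspace(-50, 50, 200001)
ts = ts[np.abs(ts) > 1e-6]                     # avoid division by zero at t = 0
proxy = np.max(2.0 * log_mgf(ts) / ts**2)

print("variance Var(X) = p*q           :", p * q)
print("numerical optimal variance proxy:", proxy)
```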
Equivalent definitions
Let X be a random variable. The following conditions are equivalent (Proposition 2.5.2 in [1]):
1. Tail probability bound: \operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/K_1^2) for all t\geq 0, where K_1 is a positive constant;
2. Finite subgaussian norm: \|X\|_{\psi_2}=K_2<\infty;
3. Moment: \operatorname{E}|X|^p\leq 2K_3^p\,\Gamma\!\left(\frac p2+1\right) for all p\geq 1, where K_3 is a positive constant and \Gamma is the Gamma function;
4. Moment: \operatorname{E}|X|^p\leq K_4^p\,p^{p/2} for all p\geq 1, where K_4 is a positive constant;
5. Moment-generating function (of X), or variance proxy: \operatorname{E}[e^{(X-\operatorname{E}[X])t}]\leq e^{K_5^2t^2} for all t, where K_5 is a positive constant;
6. Moment-generating function (of X^2): \operatorname{E}[e^{\lambda^2X^2}]\leq e^{K_6^2\lambda^2} for all \lambda with |\lambda|\leq\frac{1}{K_6}, where K_6 is a positive constant;
7. Union bound: for some c > 0, \operatorname{E}[\max\{|X_1-\operatorname{E}[X]|,\ldots,|X_n-\operatorname{E}[X]|\}]\leq c\sqrt{\log n} for all n > c, where X_1,\ldots,X_n are i.i.d. copies of X;
8. Subexponential: X^2 has a subexponential distribution.
Furthermore, the constant K_i is the same in the definitions (1) to (5), up to an absolute constant. So for example, given a random variable satisfying (1) and (2), the minimal constants K_1, K_2 in the two definitions satisfy K_1\leq cK_2 and K_2\leq c'K_1, where c, c' are constants independent of the random variable.
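As a small numerical illustration of condition (4), the sketch below evaluates the exact absolute moments of a standard normal variable and checks \operatorname{E}|X|^p\leq K^p p^{p/2} with K=1; the choice K=1 works for this particular distribution and is not claimed in general.

```python
# Sketch: check condition (4), E|X|^p <= K^p * p^(p/2), for a standard normal X,
# using the exact absolute moments E|X|^p = 2^(p/2) * Gamma((p+1)/2) / sqrt(pi).
import numpy as np
from scipy.special import gamma

K = 1.0
for p in range(1, 21):
    abs_moment = 2**(p / 2) * gamma((p + 1) / 2) / np.sqrt(np.pi)
    bound = K**p * p**(p / 2)
    print(f"p={p:2d}  E|X|^p={abs_moment:12.4e}  bound={bound:12.4e}  holds={abs_moment <= bound}")
```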
Proof of equivalence
As an example, the first four definitions are equivalent by the proof below.
Proof. (1)\Rightarrow(3): By the layer cake representation,
\operatorname{E}|X|^p=\int_0^\infty \operatorname{P}(|X|^p\geq u)\,du=\int_0^\infty pt^{p-1}\operatorname{P}(|X|\geq t)\,dt\leq\int_0^\infty 2pt^{p-1}e^{-t^2/K^2}\,dt.
After a change of variables u=t^2/K^2, we find that
\operatorname{E}|X|^p\leq 2K^p\,\frac p2\int_0^\infty u^{p/2-1}e^{-u}\,du=2K^p\,\frac p2\,\Gamma\!\left(\frac p2\right)=2K^p\,\Gamma\!\left(\frac p2+1\right).
(3)\Rightarrow(2): By the Taylor series of the exponential,
\operatorname{E}[e^{X^2/K'^2}]=1+\sum_{p=1}^\infty\frac{\operatorname{E}[X^{2p}]}{p!\,K'^{2p}}\leq 1+\sum_{p=1}^\infty\frac{2K^{2p}\,\Gamma(p+1)}{p!\,K'^{2p}}=1+2\sum_{p=1}^\infty\left(\frac{K^2}{K'^2}\right)^p=\frac{K'^2+K^2}{K'^2-K^2},
which is less than or equal to 2 for K'^2\geq 3K^2. Let K'=\sqrt 3\,K, then \operatorname{E}[e^{X^2/K'^2}]\leq 2, so \|X\|_{\psi_2}\leq\sqrt 3\,K.
(2)\Rightarrow(1): By Markov's inequality,
\operatorname{P}(|X|\geq t)=\operatorname{P}\!\left(e^{X^2/\|X\|_{\psi_2}^2}\geq e^{t^2/\|X\|_{\psi_2}^2}\right)\leq\frac{\operatorname{E}[e^{X^2/\|X\|_{\psi_2}^2}]}{e^{t^2/\|X\|_{\psi_2}^2}}\leq 2e^{-t^2/\|X\|_{\psi_2}^2}.
(3)\Leftrightarrow(4): by the asymptotic formula for the gamma function, \Gamma(p/2+1)\sim\sqrt{\pi p}\left(\frac{p}{2e}\right)^{p/2}.
From the proof, we can extract a cycle of three inequalities:
- If \operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/K^2), then \operatorname{E}|X|^p\leq 2K^p\,\Gamma\!\left(\frac p2+1\right) for all p\geq 1.
- If \operatorname{E}|X|^p\leq 2K^p\,\Gamma\!\left(\frac p2+1\right) for all p\geq 1, then \|X\|_{\psi_2}\leq\sqrt 3\,K.
- If \|X\|_{\psi_2}\leq K, then \operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/K^2) for all t\geq 0.
In particular, the constant K provided by each definition is the same up to a constant factor, so we can say that the definitions are equivalent up to a constant independent of X.
Similarly, because up to a positive multiplicative constant,
\Gamma(p/2+1)=p^{p/2}\left((2e)^{-1/2}p^{1/(2p)}\right)^p
for all p\geq 1, the definitions (3) and (4) are also equivalent up to a constant.
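The asymptotic formula used in the last step can be checked numerically; the sketch below compares \log\Gamma(p/2+1) with the logarithm of \sqrt{\pi p}\,(p/(2e))^{p/2} for a few values of p (chosen arbitrarily for illustration).

```python
# Sketch: numerically compare Gamma(p/2 + 1) with the asymptotic expression
# sqrt(pi*p) * (p/(2e))^(p/2) used to pass between conditions (3) and (4).
import numpy as np
from scipy.special import gammaln

for p in [2, 5, 10, 20, 50, 100]:
    exact = gammaln(p / 2 + 1)                                    # log Gamma(p/2 + 1)
    approx = 0.5 * np.log(np.pi * p) + (p / 2) * np.log(p / (2 * np.e))
    print(f"p={p:3d}  log Gamma={exact:10.4f}  log approx={approx:10.4f}  diff={exact - approx:.4f}")
```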
Basic properties
Proposition.
- If X is subgaussian and c is a constant, then cX is subgaussian, with \|cX\|_{\psi_2}=|c|\,\|X\|_{\psi_2} and \|cX\|_{\mathrm{vp}}^2=c^2\|X\|_{\mathrm{vp}}^2.
- If X, Y are subgaussian, then X+Y is subgaussian, with \|X+Y\|_{\psi_2}\leq\|X\|_{\psi_2}+\|Y\|_{\psi_2}.
Proposition. (Chernoff bound) If X is subgaussian, then
\operatorname{P}(X-\operatorname{E}[X]\geq t)\leq\exp\!\left(-\frac{t^2}{2\|X\|_{\mathrm{vp}}^2}\right)
for all t\geq 0.
Definition. X\lesssim Y means that X\leq CY, where the positive constant C is independent of X and Y.
Proposition. If X is subgaussian, then \|X-\operatorname{E}[X]\|_{\psi_2}\lesssim\|X\|_{\psi_2}.
Proof. By the triangle inequality, \|X-\operatorname{E}[X]\|_{\psi_2}\leq\|X\|_{\psi_2}+\|\operatorname{E}[X]\|_{\psi_2}. Now, the subgaussian norm of the constant \operatorname{E}[X] is
\|\operatorname{E}[X]\|_{\psi_2}=\frac{|\operatorname{E}[X]|}{\sqrt{\ln 2}}\leq\frac{\operatorname{E}[|X|]}{\sqrt{\ln 2}}\lesssim\operatorname{E}[|X|].
By the equivalence of definitions (2) and (4) of subgaussianity, given above, we have \operatorname{E}[|X|]\lesssim\|X\|_{\psi_2}, and the result follows.
Proposition. If X, Y are subgaussian, then \|X+Y\|_{\mathrm{vp}}\leq\|X\|_{\mathrm{vp}}+\|Y\|_{\mathrm{vp}}. If they are furthermore independent, then \|X+Y\|_{\mathrm{vp}}^2\leq\|X\|_{\mathrm{vp}}^2+\|Y\|_{\mathrm{vp}}^2.
Proof. Assume without loss of generality that X and Y have mean zero. If independent, then use that the cumulant generating function of a sum of independent random variables is additive. That is,
\ln\operatorname{E}[e^{t(X+Y)}]=\ln\operatorname{E}[e^{tX}]+\ln\operatorname{E}[e^{tY}]\leq\tfrac12\left(\|X\|_{\mathrm{vp}}^2+\|Y\|_{\mathrm{vp}}^2\right)t^2.
If not independent, then by Hölder's inequality, for any 1/p+1/q=1 we have
\operatorname{E}[e^{t(X+Y)}]\leq\left(\operatorname{E}[e^{ptX}]\right)^{1/p}\left(\operatorname{E}[e^{qtY}]\right)^{1/q}\leq\exp\!\left(\tfrac12\left(p\|X\|_{\mathrm{vp}}^2+q\|Y\|_{\mathrm{vp}}^2\right)t^2\right).
Solving the optimization problem
\begin{cases}\min\quad p\|X\|_{\mathrm{vp}}^2+q\|Y\|_{\mathrm{vp}}^2\\ \text{subject to}\quad 1/p+1/q=1\end{cases}
we obtain the optimal value \left(\|X\|_{\mathrm{vp}}+\|Y\|_{\mathrm{vp}}\right)^2, which gives the result.
Corollary. Linear sums of subgaussian random variables are subgaussian.
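As a numerical illustration of the independent case, the sketch below checks that for two independent Rademacher variables (each with optimal variance proxy 1) the exact moment generating function of the sum stays below the bound with proxy 1+1=2 on a grid of t values; the grid is an arbitrary choice.

```python
# Sketch: for two independent Rademacher variables X and Y (each with optimal
# variance proxy 1), check numerically that the sum has variance proxy at most
# 1 + 1 = 2, i.e. that E[exp(t(X+Y))] = cosh(t)^2 <= exp(2 * t^2 / 2) on a grid.
import numpy as np

ts = np.linspace(-20, 20, 4001)
mgf_sum = np.cosh(ts) ** 2          # exact MGF of X + Y
bound = np.exp(2 * ts**2 / 2)       # proxy s^2 = 2 in exp(s^2 t^2 / 2)
print("max ratio MGF / bound:", np.max(mgf_sum / bound))   # should be <= 1
```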
Strictly subgaussian
Expanding the cumulant generating function of a mean-zero X,
\ln\operatorname{E}[e^{tX}]=\tfrac12\operatorname{Var}[X]\,t^2+O(t^3)\qquad(t\to 0),
we find that \operatorname{Var}[X]\leq\|X\|_{\mathrm{vp}}^2. At the edge of possibility, we define that a random variable X satisfying \|X\|_{\mathrm{vp}}^2=\operatorname{Var}[X] is called strictly subgaussian.
Properties
Theorem.[2] Let X be a subgaussian random variable with mean zero. If all zeros of its characteristic function are real, then X is strictly subgaussian.
Corollary. If X_1,\ldots,X_n are independent and strictly subgaussian, then any linear sum of them is strictly subgaussian.
Examples
By calculating the characteristic functions, we can show that some distributions are strictly subgaussian: symmetric uniform distribution, symmetric Bernoulli distribution.
Since a symmetric uniform distribution is strictly subgaussian, its convolution with itself is strictly subgaussian. That is, the symmetric triangular distribution is strictly subgaussian.
Since the symmetric Bernoulli distribution is strictly subgaussian, any symmetric Binomial distribution is strictly subgaussian.
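These two examples can also be spot-checked directly from the moment generating functions; the sketch below verifies numerically that the symmetric Bernoulli and the symmetric uniform distribution on [-1,1] satisfy \operatorname{E}[e^{tX}]\leq\exp(\operatorname{Var}[X]\,t^2/2) on a grid of t values (a check, not a proof).

```python
# Sketch: check strict subgaussianity numerically for two examples, i.e. that the
# moment generating function satisfies E[exp(tX)] <= exp(Var(X) * t^2 / 2) on a grid.
import numpy as np

ts = np.linspace(-30, 30, 6001)
ts = ts[np.abs(ts) > 1e-9]

# Symmetric Bernoulli (Rademacher): MGF = cosh(t), Var = 1.
ratio_rademacher = np.cosh(ts) / np.exp(ts**2 / 2)

# Symmetric uniform on [-1, 1]: MGF = sinh(t)/t, Var = 1/3.
ratio_uniform = (np.sinh(ts) / ts) / np.exp(ts**2 / 6)

print("Rademacher : max MGF/bound =", ratio_rademacher.max())   # <= 1
print("Uniform    : max MGF/bound =", ratio_uniform.max())      # <= 1
```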
Examples
| Distribution | \Vert X\Vert_{\psi_2} | \Vert X\Vert_{\mathrm{vp}}^2 | strictly subgaussian? |
|---|---|---|---|
| Gaussian distribution N(0,\sigma^2) | \sqrt{8/3}\,\sigma | \sigma^2 | Yes |
| mean-zero Bernoulli distribution (value q with probability p, value -p with probability q) | solution to pe^{q^2/c^2}+qe^{p^2/c^2}=2 | \frac{p-q}{2(\ln p-\ln q)} | Iff p=q=\tfrac12 |
| symmetric Bernoulli distribution | \frac{1}{\sqrt{\ln 2}} | 1 | Yes |
| uniform distribution on [-1,1] | solution to \int_0^1 e^{x^2/c^2}\,dx=2, approximately 0.7727 | 1/3 | Yes |
| arbitrary distribution on interval [a,b] |  | \leq\frac{(b-a)^2}{4} |  |
The optimal variance proxy \|X\|_{\mathrm{vp}}^2 is known for many standard probability distributions, including the beta, Bernoulli, Dirichlet, Kumaraswamy, triangular, truncated Gaussian, and truncated exponential.
Bernoulli distribution
Let p, q be two positive numbers with p+q=1. Let X have the centered Bernoulli distribution taking the value q with probability p and -p with probability q, so that it has mean zero. Then its optimal variance proxy is
\|X\|_{\mathrm{vp}}^2=\frac{p-q}{2(\ln p-\ln q)}.
[3] Its subgaussian norm \|X\|_{\psi_2} is the unique positive solution c of pe^{q^2/c^2}+qe^{p^2/c^2}=2.
Let X be a random variable with symmetric Bernoulli distribution (or Rademacher distribution). That is, X takes values 1 and -1 with probabilities 1/2 each. Since X^2=1, it follows that
\|X\|_{\psi_2}=\inf\{c>0:\operatorname{E}[\exp(X^2/c^2)]\leq 2\}=\inf\{c>0:e^{1/c^2}\leq 2\}=\frac{1}{\sqrt{\ln 2}},
and hence X is a subgaussian random variable.
Bounded distributions
Bounded distributions have no tail at all, so clearly they are subgaussian.
If X is bounded within the interval [a,b], Hoeffding's lemma states that
\|X\|_{\mathrm{vp}}^2\leq\left(\frac{b-a}{2}\right)^2.
Hoeffding's inequality is the Chernoff bound obtained using this fact.
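The lemma can be spot-checked numerically; the sketch below does so for a small discrete distribution supported on [0,1] (the support points and probabilities are arbitrary illustrative choices, not taken from the sources).

```python
# Sketch: check Hoeffding's lemma numerically for a small discrete distribution
# supported on [a, b] = [0, 1]: E[exp(t(X - EX))] <= exp(t^2 * (b - a)^2 / 8).
import numpy as np

values = np.array([0.0, 0.2, 1.0])   # arbitrary support points in [0, 1]
probs = np.array([0.5, 0.3, 0.2])    # arbitrary probabilities summing to 1
mean = np.dot(probs, values)

ts = np.linspace(-40, 40, 8001)
mgf = np.array([np.dot(probs, np.exp(t * (values - mean))) for t in ts])
bound = np.exp(ts**2 * (1.0 - 0.0)**2 / 8)

print("max MGF / Hoeffding bound:", np.max(mgf / bound))   # should be <= 1
```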
Convolutions
Since the sum of subgaussian random variables is still subgaussian, the convolution of subgaussian distributions is still subgaussian. In particular, any convolution of the normal distribution with any bounded distribution is subgaussian.
Mixtures
Given subgaussian distributions X_1,X_2,\ldots,X_n, we can construct an additive mixture X as follows: first randomly pick an index i\in\{1,\ldots,n\} with probabilities p_1,\ldots,p_n, then set X=X_i.
Since
\operatorname{E}\!\left[\exp\!\left(\frac{X^2}{c^2}\right)\right]=\sum_i p_i\operatorname{E}\!\left[\exp\!\left(\frac{X_i^2}{c^2}\right)\right],
we have \|X\|_{\psi_2}\leq\max_i\|X_i\|_{\psi_2}, and so the mixture is subgaussian.
In particular, any gaussian mixture is subgaussian.
More generally, the mixture of infinitely many subgaussian distributions is also subgaussian, if the subgaussian norm has a finite supremum: \sup_i\|X_i\|_{\psi_2}<\infty.
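For a concrete check, the sketch below evaluates \operatorname{E}[\exp(X^2/c^2)] for a two-component Gaussian mixture, using the closed form 1/\sqrt{1-2\sigma^2/c^2} for a centered Gaussian component and taking c to be the largest component \psi_2 norm \sqrt{8/3}\,\sigma from the examples table; the weights and standard deviations are arbitrary illustrative values.

```python
# Sketch: for a two-component Gaussian mixture, verify E[exp(X^2/c^2)] <= 2 when
# c is the larger of the component psi_2 norms (sqrt(8/3)*sigma for a centered
# Gaussian of standard deviation sigma).
import numpy as np

sigmas = np.array([0.5, 1.5])    # arbitrary component standard deviations
weights = np.array([0.7, 0.3])   # arbitrary mixture weights

c = np.sqrt(8.0 / 3.0) * sigmas.max()          # sup of the component psi_2 norms
# For N(0, sigma^2): E[exp(X^2/c^2)] = 1/sqrt(1 - 2*sigma^2/c^2), valid when c^2 > 2*sigma^2
component_values = 1.0 / np.sqrt(1.0 - 2.0 * sigmas**2 / c**2)
mixture_value = np.dot(weights, component_values)

print("E[exp(X^2/c^2)] for the mixture:", mixture_value)   # should be <= 2
```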
Subgaussian random vectors
So far, we have discussed subgaussianity for real-valued random variables. We can also define subgaussianity for random vectors. The purpose of subgaussianity is to make the tails decay fast, so we generalize accordingly: a subgaussian random vector is a random vector where the tail decays fast.
Let X be a random vector taking values in \mathbb{R}^n. Define
\|X\|_{\psi_2}:=\sup_{v\in S^{n-1}}\|\langle X,v\rangle\|_{\psi_2},
where S^{n-1} is the unit sphere in \mathbb{R}^n. X is subgaussian iff \|X\|_{\psi_2}<\infty.
Theorem. (Theorem 3.4.6 in [1]) For any positive integer n, the uniformly distributed random vector X\sim U(\sqrt n\,S^{n-1}) is subgaussian, with \|X\|_{\psi_2}\lesssim 1.
This is not so surprising, because as n\to\infty, the projection of U(\sqrt n\,S^{n-1}) to the first coordinate converges in distribution to the standard normal distribution.
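The sketch below samples the uniform distribution on \sqrt n\,S^{n-1} (by normalizing Gaussian vectors) and compares the empirical tail of the first coordinate with a Gaussian-type bound; the constant in the bound, the dimension, and the sample size are arbitrary illustrative choices, since the theorem only guarantees some absolute constant.

```python
# Sketch: sample the uniform distribution on the sphere of radius sqrt(n) and look
# at one coordinate; its empirical tail is compared with a Gaussian-type bound.
# This only illustrates the theorem, it is not a proof.
import numpy as np

rng = np.random.default_rng(0)
n, samples = 50, 100_000
g = rng.standard_normal((samples, n))
x = np.sqrt(n) * g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on sqrt(n)*S^{n-1}
first_coord = x[:, 0]

for t in [1.0, 2.0, 3.0]:
    emp = np.mean(np.abs(first_coord) >= t)
    print(f"t={t}  empirical P(|X_1|>=t)={emp:.4f}  2*exp(-t^2/4)={2*np.exp(-t**2/4):.4f}")
```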
Maximum inequalities
Proposition. If X_1,\ldots,X_n are mean-zero subgaussians, with \|X_i\|_{\mathrm{vp}}\leq\sigma, then for any \delta>0, we have
\max(X_1,\ldots,X_n)\leq\sigma\sqrt{2\ln\frac n\delta}
with probability at least 1-\delta.
Proof. By the Chernoff bound, \operatorname{P}\!\left(X_i\geq\sigma\sqrt{2\ln(n/\delta)}\right)\leq\delta/n. Now apply the union bound.
Proposition. (Exercise 2.5.10 in [1]) If X_1,\ldots,X_n are subgaussians, with \|X_i\|_{\psi_2}\leq K, then
\operatorname{E}\!\left[\max_i|X_i|\right]\leq CK\sqrt{\ln n}
for some absolute constant C. Further, the bound is sharp, since when X_1,\ldots,X_N are IID samples of the standard normal distribution we have
\operatorname{E}\!\left[\max_{1\leq n\leq N}|X_n|\right]\gtrsim\sqrt{\ln N}.
[4] [5]
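A Monte Carlo illustration of both the upper bound and its sharpness: for N i.i.d. standard normals, the estimated \operatorname{E}[\max_i|X_i|] stays within a constant factor of \sqrt{2\ln N}. The sample sizes and the random seed below are arbitrary.

```python
# Sketch: Monte Carlo estimate of E[max_i |X_i|] for N i.i.d. standard normals,
# compared with sqrt(2 ln N); the ratio stays bounded, illustrating both the
# upper bound and its sharpness.
import numpy as np

rng = np.random.default_rng(1)
trials = 1000
for N in [10, 100, 1000, 10_000]:
    maxima = np.abs(rng.standard_normal((trials, N))).max(axis=1)
    est = maxima.mean()
    ref = np.sqrt(2 * np.log(N))
    print(f"N={N:6d}  E[max|X_i|]~{est:.3f}  sqrt(2 ln N)={ref:.3f}  ratio={est/ref:.3f}")
```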
Theorem. (over a finite set) If X_1,\ldots,X_n are mean-zero subgaussian, with \|X_i\|_{\mathrm{vp}}^2\leq\sigma^2, then
\operatorname{E}\!\left[\max_iX_i\right]\leq\sigma\sqrt{2\ln n},\qquad \operatorname{P}\!\left(\max_iX_i>t\right)\leq n\exp\!\left(-\frac{t^2}{2\sigma^2}\right),
\operatorname{E}\!\left[\max_i|X_i|\right]\leq\sigma\sqrt{2\ln(2n)},\qquad \operatorname{P}\!\left(\max_i|X_i|>t\right)\leq 2n\exp\!\left(-\frac{t^2}{2\sigma^2}\right).
Theorem. (over a convex polytope) Fix a finite set of vectors v_1,\ldots,v_n. If X is a random vector, such that each \langle X,v_i\rangle has mean zero and \|\langle X,v_i\rangle\|_{\mathrm{vp}}^2\leq\sigma^2, then the above 4 inequalities hold, with \max_{v\in T}\langle X,v\rangle replacing \max_iX_i.
Here, T is the convex polytope spanned by the vectors v_1,\ldots,v_n.
Theorem. (over a ball) If X is a random vector in \mathbb{R}^d, such that \|\langle X,v\rangle\|_{\mathrm{vp}}\leq\sigma for all v on the unit sphere S^{d-1}, then
\operatorname{E}\!\left[\max_{\|v\|_2\leq 1}\langle X,v\rangle\right]\leq 4\sigma\sqrt d.
For any \delta>0, with probability at least 1-\delta,
\max_{\|v\|_2\leq 1}\langle X,v\rangle\leq 4\sigma\sqrt d+2\sigma\sqrt{2\ln(1/\delta)}.
Inequalities
Theorem. (Theorem 2.6.1 in [1]) There exists a positive constant C such that given any number of independent mean-zero subgaussian random variables X_1,\ldots,X_n,
\left\|\sum_iX_i\right\|_{\psi_2}^2\leq C\sum_i\|X_i\|_{\psi_2}^2.
Theorem. (Hoeffding's inequality) (Theorem 2.6.3 in [1]) There exists a positive constant c such that given any number of independent mean-zero subgaussian random variables X_1,\ldots,X_n,
\operatorname{P}\!\left(\left|\sum_iX_i\right|\geq t\right)\leq 2\exp\!\left(-\frac{ct^2}{\sum_i\|X_i\|_{\psi_2}^2}\right)\qquad\text{for all }t\geq 0.
Theorem. (Bernstein's inequality) (Theorem 2.8.1 in [1]) There exists a positive constant c such that given any number of independent mean-zero subexponential random variables X_1,\ldots,X_n,
\operatorname{P}\!\left(\left|\sum_iX_i\right|\geq t\right)\leq 2\exp\!\left(-c\min\!\left(\frac{t^2}{\sum_i\|X_i\|_{\psi_1}^2},\frac{t}{\max_i\|X_i\|_{\psi_1}}\right)\right)\qquad\text{for all }t\geq 0,
where \|\cdot\|_{\psi_1} denotes the subexponential norm.
Theorem. (Khinchine inequality) (Exercise 2.6.5 in [1]) There exists a positive constant C such that given any number of independent mean-zero, variance-one subgaussian random variables X_1,\ldots,X_n, any p\geq 2, and any real numbers a_1,\ldots,a_n,
\left(\sum_ia_i^2\right)^{1/2}\leq\left(\operatorname{E}\left|\sum_ia_iX_i\right|^p\right)^{1/p}\leq CK\sqrt p\left(\sum_ia_i^2\right)^{1/2},
where K=\max_i\|X_i\|_{\psi_2}.
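For the special case of a weighted sum of independent Rademacher variables, Hoeffding's inequality takes the classical explicit form \operatorname{P}(|\sum_ia_iX_i|\geq t)\leq 2\exp(-t^2/(2\sum_ia_i^2)). The sketch below compares this bound with Monte Carlo estimates; the weights, sample size, and seed are arbitrary illustrative choices.

```python
# Sketch: Hoeffding's inequality for a weighted sum of independent Rademacher
# variables: P(|sum_i a_i X_i| >= t) <= 2 exp(-t^2 / (2 * ||a||_2^2)).
import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0.5, 1.5, size=100)      # arbitrary fixed weights
norm_sq = np.sum(a**2)

samples = 100_000
x = rng.choice([-1.0, 1.0], size=(samples, a.size))
s = x @ a

for t in [5.0, 10.0, 20.0]:
    emp = np.mean(np.abs(s) >= t)
    bound = 2 * np.exp(-t**2 / (2 * norm_sq))
    print(f"t={t:5.1f}  empirical={emp:.5f}  Hoeffding bound={bound:.5f}")
```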
Hanson-Wright inequality
The Hanson-Wright inequality states that if a random vector X is subgaussian in a certain sense, then any quadratic form A of this vector, X^TAX, is also subgaussian/subexponential. Further, the upper bound on the tail of X^TAX is uniform.
A weak version of the following theorem was proved in (Hanson, Wright, 1971).[6] There are many extensions and variants. Much like the central limit theorem, the Hanson-Wright inequality is more a cluster of theorems with the same purpose than a single theorem. The purpose is to take a subgaussian vector and uniformly bound its quadratic forms.
Theorem.[7] [8] There exists a constant c, such that:
Let n be a positive integer. Let X_1,\ldots,X_n be independent random variables, such that each satisfies \operatorname{E}[X_i]=0 and \|X_i\|_{\psi_2}\leq K. Combine them into a random vector X=(X_1,\ldots,X_n). For any n\times n matrix A, we have
\operatorname{P}\!\left(|X^TAX-\operatorname{E}[X^TAX]|>t\right)\leq 2\exp\!\left(-c\min\!\left(\frac{t^2}{K^4\|A\|_F^2},\frac{t}{K^2\|A\|}\right)\right),
where \|A\|_F=\left(\sum_{i,j}A_{ij}^2\right)^{1/2} is the Frobenius norm of the matrix, and \|A\|=\max_{\|x\|_2=1}\|Ax\|_2 is the operator norm of the matrix.
In words, the quadratic form X^TAX has its tail uniformly bounded by an exponential, or a gaussian, whichever is larger.
In the statement of the theorem, the constant c is an "absolute constant", meaning that it has no dependence on n, on the distribution of X, or on the matrix A. It is a mathematical constant much like pi and e.
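Because the constant c is not specified, a simulation can only illustrate the scales involved. The sketch below draws a Rademacher vector X and a fixed matrix A (both arbitrary illustrative choices) and shows that the fluctuations of X^TAX around its mean are on the order of \|A\|_F, as the Gaussian regime of the bound predicts.

```python
# Sketch: empirical illustration of the Hanson-Wright scales.  For a Rademacher
# vector X and a fixed matrix A, the fluctuations of X^T A X around its mean are
# compared with the Frobenius norm ||A||_F (the "Gaussian regime" scale) and the
# operator norm ||A|| (the "exponential regime" scale).
import numpy as np

rng = np.random.default_rng(3)
n = 200
A = rng.standard_normal((n, n)) / np.sqrt(n)    # arbitrary fixed matrix

samples = 20_000
X = rng.choice([-1.0, 1.0], size=(samples, n))
quad = ((X @ A) * X).sum(axis=1)                # X^T A X for each sample

fro = np.linalg.norm(A, 'fro')
op = np.linalg.norm(A, 2)
dev = quad - quad.mean()

print("||A||_F =", fro, "  ||A||_op =", op)
print("empirical std of X^T A X       :", dev.std())
print("empirical P(|dev| > 3 ||A||_F) :", np.mean(np.abs(dev) > 3 * fro))
```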
Consequences
Theorem (subgaussian concentration). There exists a constant c, such that:
Let n, m be positive integers. Let X_1,\ldots,X_n be independent random variables, such that each satisfies \operatorname{E}[X_i]=0, \operatorname{E}[X_i^2]=1, and \|X_i\|_{\psi_2}\leq K. Combine them into a random vector X=(X_1,\ldots,X_n). For any m\times n matrix A, we have
\operatorname{P}\!\left(\left|\,\|AX\|_2-\|A\|_F\,\right|>t\right)\leq 2\exp\!\left(-\frac{ct^2}{K^4\|A\|^2}\right)\qquad\text{for all }t\geq 0.
In words, the random vector AX is concentrated on a spherical shell of radius \|A\|_F, such that \|AX\|_2-\|A\|_F is subgaussian, with subgaussian norm \lesssim K^2\|A\|.
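The shell concentration can be seen in a simulation: for a standard Gaussian vector X (a subgaussian vector with K of order 1) and an arbitrary fixed matrix A, \|AX\|_2 stays within a few multiples of \|A\| of the radius \|A\|_F. The matrix sizes, sample size, and seed below are illustrative choices.

```python
# Sketch: concentration of ||A X||_2 around ||A||_F for a standard Gaussian vector X.
# The deviations should be on the scale of the operator norm ||A||, which is much
# smaller than ||A||_F itself for a "generic" matrix.
import numpy as np

rng = np.random.default_rng(4)
m, n = 300, 400
A = rng.standard_normal((m, n)) / np.sqrt(n)   # arbitrary fixed matrix

samples = 10_000
X = rng.standard_normal((n, samples))
norms = np.linalg.norm(A @ X, axis=0)          # ||A X||_2 for each sample

print("||A||_F          :", np.linalg.norm(A, 'fro'))
print("||A||_op         :", np.linalg.norm(A, 2))
print("mean of ||AX||_2 :", norms.mean())
print("std of ||AX||_2  :", norms.std())
```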
Notes and References
1. Vershynin, Roman (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge: Cambridge University Press.
2. Bobkov, S. G.; Chistyakov, G. P.; Götze, F. (2023). "Strictly subgaussian probability distributions". arXiv:2308.01749 [math.PR].
3. Bobkov, S. G.; Chistyakov, G. P.; Götze, F. (2023). "Strictly subgaussian probability distributions". arXiv:2308.01749 [math.PR].
4. Kamath, Gautam (2015). "Bounds on the expectation of the maximum of samples from a Gaussian".
5. "MIT 18.S997 Spring 2015 High-Dimensional Statistics, Chapter 1. Sub-Gaussian Random Variables". MIT OpenCourseWare. Retrieved 2024-04-03.
6. Hanson, D. L.; Wright, F. T. (1971). "A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables". The Annals of Mathematical Statistics. 42 (3): 1079–1083. doi:10.1214/aoms/1177693335.
7. Rudelson, Mark; Vershynin, Roman (2013). "Hanson-Wright inequality and sub-gaussian concentration". Electronic Communications in Probability. 18: 1–9. arXiv:1306.2872. doi:10.1214/ECP.v18-2865.
8. Vershynin, Roman (2018). "6. Quadratic Forms, Symmetrization, and Contraction". High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press. pp. 127–146. doi:10.1017/9781108231596.009. ISBN 978-1-108-41519-4.