Sub-Gaussian distribution
In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. This property gives subgaussian distributions their name.
Often in analysis, we divide an object (such as a random variable) into two parts, a central bulk and a distant tail, then analyze each separately. In probability, this division usually goes like "everything interesting happens near the center; the tail event is so rare that we may safely ignore it." Subgaussian distributions are worthy of study because the Gaussian distribution is well understood, so we can give sharp bounds on the rarity of the tail event. Similarly, the subexponential distributions are also worthy of study.
Formally, the probability distribution of a random variable X is called subgaussian if there is a positive constant C such that for every t\geq 0,
\operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/C^2).
There are many equivalent definitions. For example, a random variable X is sub-Gaussian if and only if its tail distribution function is bounded from above (up to a constant) by that of a Gaussian:
\operatorname{P}(|X|\geq t)\leq c\,\operatorname{P}(|Z|\geq t)\qquad\text{for all }t>0,
where c>0 is a constant and Z is a mean-zero Gaussian random variable.
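As a quick numerical illustration of the defining tail bound, the following Python sketch checks \operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/C^2) for a standard normal variable, for which the constant C=\sqrt 2 works; the choice of distribution and constant here are illustrative only.

```python
# Sketch: check P(|Z| >= t) <= 2*exp(-t^2 / C^2) for a standard normal Z,
# using C = sqrt(2).  Illustrative only; the constant C is not unique.
import numpy as np
from scipy.stats import norm

C = np.sqrt(2.0)
for t in np.linspace(0.0, 5.0, 11):
    tail = 2 * norm.sf(t)              # P(|Z| >= t) = 2 * (1 - Phi(t))
    bound = 2 * np.exp(-t**2 / C**2)   # sub-Gaussian tail bound
    assert tail <= bound + 1e-12
    print(f"t={t:4.1f}  P(|Z|>=t)={tail:.3e}  bound={bound:.3e}")
```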
Definitions
Subgaussian norm
The subgaussian norm of X, denoted as \|X\|_{\psi_2}, is
\|X\|_{\psi_2}=\inf\{c>0:\operatorname{E}[\exp(X^2/c^2)]\leq 2\}.
In other words, it is the Orlicz norm of X generated by the Orlicz function \Phi(u)=e^{u^2}-1. By condition (2) below, subgaussian random variables can be characterized as those random variables with finite subgaussian norm.
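The infimum in the definition can be computed numerically. The sketch below does so for a Rademacher variable, where the Orlicz expectation \operatorname{E}[\exp(X^2/c^2)] has a closed form and the norm equals 1/\sqrt{\ln 2}; the bisection bracket and iteration count are arbitrary choices.

```python
# Sketch: compute the subgaussian norm ||X||_{psi_2} = inf{ c>0 : E exp(X^2/c^2) <= 2 }
# for a Rademacher variable (X = +/-1 with probability 1/2 each), by bisection on c.
# For this X, X^2 = 1, so E exp(X^2/c^2) = exp(1/c^2) and the norm is 1/sqrt(ln 2).
import numpy as np

def orlicz_expectation(c):
    # E[exp(X^2 / c^2)] for Rademacher X (exact, since X^2 == 1)
    return np.exp(1.0 / c**2)

lo, hi = 1e-3, 10.0
for _ in range(100):               # bisect the monotone map c -> E exp(X^2/c^2)
    mid = 0.5 * (lo + hi)
    if orlicz_expectation(mid) <= 2.0:
        hi = mid
    else:
        lo = mid

print("numerical psi_2 norm   :", hi)
print("closed form 1/sqrt(ln2):", 1.0 / np.sqrt(np.log(2.0)))
```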
Variance proxy
If there exists some s^2>0 such that \operatorname{E}[e^{(X-\operatorname{E}[X])t}]\leq e^{s^2t^2/2} for all t\in\mathbb{R}, then s^2 is called a variance proxy, and the smallest such s^2 is called the optimal variance proxy and denoted by \|X\|_{\mathrm{vp}}^2.
Since \operatorname{E}[e^{(X-\operatorname{E}[X])t}]=e^{\sigma^2t^2/2} when X is Gaussian with variance \sigma^2, we then have \|X\|_{\mathrm{vp}}^2=\sigma^2, as it should.
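The optimal variance proxy can be approximated numerically by maximizing 2\ln\operatorname{E}[e^{t(X-\operatorname{E}[X])}]/t^2 over t. The following Python sketch does this for a centered Bernoulli variable; the parameter p=0.3 and the t-grid are arbitrary illustrative choices, and the result should be at least the variance.

```python
# Sketch: numerically estimate the optimal variance proxy of a centered Bernoulli
# variable X taking the value q with probability p and -p with probability q = 1 - p,
# as sup_t 2*ln E[exp(tX)] / t^2.  The exact MGF is used, so no sampling is involved.
import numpy as np

p = 0.3
q = 1.0 - p

def log_mgf(t):
    return np.log(p * np.exp(t * q) + q * np.exp(-t * p))

ts = np.linspace(-50, 50, 200001)
ts = ts[np.abs(ts) > 1e-6]                     # avoid division by zero at t = 0
proxy = np.max(2.0 * log_mgf(ts) / ts**2)

print("variance Var(X) = p*q           :", p * q)
print("numerical optimal variance proxy:", proxy)
```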
Equivalent definitions
Let X be a random variable. The following conditions are equivalent (Proposition 2.5.2 in [1]):
1. Tail probability bound: \operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/K_1^2) for all t\geq 0, where K_1 is a positive constant;
2. Finite subgaussian norm: \|X\|_{\psi_2}=K_2<\infty;
3. Moment: \operatorname{E}|X|^p\leq 2K_3^p\,\Gamma\!\left(\frac p2+1\right) for all p\geq 1, where K_3 is a positive constant and \Gamma is the Gamma function;
4. Moment: \operatorname{E}|X|^p\leq K_4^p\,p^{p/2} for all p\geq 1, where K_4 is a positive constant;
5. Moment-generating function (of X), or variance proxy: \operatorname{E}[e^{(X-\operatorname{E}[X])t}]\leq e^{K_5^2t^2} for all t, where K_5 is a positive constant;
6. Moment-generating function (of X^2): \operatorname{E}[e^{\lambda^2X^2}]\leq e^{K_6^2\lambda^2} for all \lambda with |\lambda|\leq\frac{1}{K_6}, where K_6 is a positive constant;
7. Union bound: for some c > 0, \operatorname{E}[\max\{|X_1-\operatorname{E}[X]|,\ldots,|X_n-\operatorname{E}[X]|\}]\leq c\sqrt{\log n} for all n > c, where X_1,\ldots,X_n are i.i.d. copies of X;
8. Subexponential: X^2 has a subexponential distribution.
Furthermore, the constant K_i is the same in the definitions (1) to (5), up to an absolute constant. So for example, given a random variable satisfying (1) and (2), the minimal constants K_1, K_2 in the two definitions satisfy K_1\leq cK_2 and K_2\leq c'K_1, where c, c' are constants independent of the random variable.
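As a small numerical illustration of condition (4), the sketch below evaluates the exact absolute moments of a standard normal variable and checks \operatorname{E}|X|^p\leq K^p p^{p/2} with K=1; the choice K=1 works for this particular distribution and is not claimed in general.

```python
# Sketch: check condition (4), E|X|^p <= K^p * p^(p/2), for a standard normal X,
# using the exact absolute moments E|X|^p = 2^(p/2) * Gamma((p+1)/2) / sqrt(pi).
import numpy as np
from scipy.special import gamma

K = 1.0
for p in range(1, 21):
    abs_moment = 2**(p / 2) * gamma((p + 1) / 2) / np.sqrt(np.pi)
    bound = K**p * p**(p / 2)
    print(f"p={p:2d}  E|X|^p={abs_moment:12.4e}  bound={bound:12.4e}  holds={abs_moment <= bound}")
```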
Proof of equivalence
As an example, the first four definitions are equivalent by the proof below.
Proof. (1)\Rightarrow(3): By the layer cake representation,
\operatorname{E}|X|^p=\int_0^\infty \operatorname{P}(|X|^p\geq u)\,du=\int_0^\infty pt^{p-1}\operatorname{P}(|X|\geq t)\,dt\leq\int_0^\infty 2pt^{p-1}e^{-t^2/K^2}\,dt.
After a change of variables u=t^2/K^2, we find that
\operatorname{E}|X|^p\leq 2K^p\,\frac p2\int_0^\infty u^{p/2-1}e^{-u}\,du=2K^p\,\frac p2\,\Gamma\!\left(\frac p2\right)=2K^p\,\Gamma\!\left(\frac p2+1\right).
(3)\Rightarrow(2): By the Taylor series of the exponential,
\operatorname{E}[e^{X^2/K'^2}]=1+\sum_{p=1}^\infty\frac{\operatorname{E}[X^{2p}]}{p!\,K'^{2p}}\leq 1+\sum_{p=1}^\infty\frac{2K^{2p}\,\Gamma(p+1)}{p!\,K'^{2p}}=1+2\sum_{p=1}^\infty\left(\frac{K^2}{K'^2}\right)^p=\frac{K'^2+K^2}{K'^2-K^2},
which is less than or equal to 2 for K'^2\geq 3K^2. Let K'=\sqrt 3\,K, then \operatorname{E}[e^{X^2/K'^2}]\leq 2, so \|X\|_{\psi_2}\leq\sqrt 3\,K.
(2)\Rightarrow(1): By Markov's inequality,
\operatorname{P}(|X|\geq t)=\operatorname{P}\!\left(e^{X^2/\|X\|_{\psi_2}^2}\geq e^{t^2/\|X\|_{\psi_2}^2}\right)\leq\frac{\operatorname{E}[e^{X^2/\|X\|_{\psi_2}^2}]}{e^{t^2/\|X\|_{\psi_2}^2}}\leq 2e^{-t^2/\|X\|_{\psi_2}^2}.
(3)\Leftrightarrow(4): by the asymptotic formula for the gamma function, \Gamma(p/2+1)\sim\sqrt{\pi p}\left(\frac{p}{2e}\right)^{p/2}.
From the proof, we can extract a cycle of three inequalities:
- If \operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/K^2), then \operatorname{E}|X|^p\leq 2K^p\,\Gamma\!\left(\frac p2+1\right) for all p\geq 1.
- If \operatorname{E}|X|^p\leq 2K^p\,\Gamma\!\left(\frac p2+1\right) for all p\geq 1, then \|X\|_{\psi_2}\leq\sqrt 3\,K.
- If \|X\|_{\psi_2}\leq K, then \operatorname{P}(|X|\geq t)\leq 2\exp(-t^2/K^2) for all t\geq 0.
In particular, the constant K provided by each definition is the same up to a constant factor, so we can say that the definitions are equivalent up to a constant independent of X.
Similarly, because up to a positive multiplicative constant,
\Gamma(p/2+1)=p^{p/2}\left((2e)^{-1/2}p^{1/(2p)}\right)^p
for all p\geq 1, the definitions (3) and (4) are also equivalent up to a constant.
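The asymptotic formula used in the last step can be checked numerically; the sketch below compares \log\Gamma(p/2+1) with the logarithm of \sqrt{\pi p}\,(p/(2e))^{p/2} for a few values of p (chosen arbitrarily for illustration).

```python
# Sketch: numerically compare Gamma(p/2 + 1) with the asymptotic expression
# sqrt(pi*p) * (p/(2e))^(p/2) used to pass between conditions (3) and (4).
import numpy as np
from scipy.special import gammaln

for p in [2, 5, 10, 20, 50, 100]:
    exact = gammaln(p / 2 + 1)                                    # log Gamma(p/2 + 1)
    approx = 0.5 * np.log(np.pi * p) + (p / 2) * np.log(p / (2 * np.e))
    print(f"p={p:3d}  log Gamma={exact:10.4f}  log approx={approx:10.4f}  diff={exact - approx:.4f}")
```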
Basic properties
Proposition.
- If X is subgaussian and c is a constant, then cX is subgaussian, with \|cX\|_{\psi_2}=|c|\,\|X\|_{\psi_2} and \|cX\|_{\mathrm{vp}}^2=c^2\|X\|_{\mathrm{vp}}^2.
- If X, Y are subgaussian, then X+Y is subgaussian, with \|X+Y\|_{\psi_2}\leq\|X\|_{\psi_2}+\|Y\|_{\psi_2}.
Proposition. (Chernoff bound) If X is subgaussian, then
\operatorname{P}(X-\operatorname{E}[X]\geq t)\leq\exp\!\left(-\frac{t^2}{2\|X\|_{\mathrm{vp}}^2}\right)
for all t\geq 0.
Definition. X\lesssim Y means that X\leq CY, where the positive constant C is independent of X and Y.
Proposition. If X is subgaussian, then \|X-\operatorname{E}[X]\|_{\psi_2}\lesssim\|X\|_{\psi_2}.
Proof. By the triangle inequality, \|X-\operatorname{E}[X]\|_{\psi_2}\leq\|X\|_{\psi_2}+\|\operatorname{E}[X]\|_{\psi_2}. Now, the subgaussian norm of the constant \operatorname{E}[X] is
\|\operatorname{E}[X]\|_{\psi_2}=\frac{|\operatorname{E}[X]|}{\sqrt{\ln 2}}\leq\frac{\operatorname{E}[|X|]}{\sqrt{\ln 2}}\lesssim\operatorname{E}[|X|].
By the equivalence of definitions (2) and (4) of subgaussianity, given above, we have \operatorname{E}[|X|]\lesssim\|X\|_{\psi_2}, and the result follows.
Proposition. If X, Y are subgaussian, then \|X+Y\|_{\mathrm{vp}}\leq\|X\|_{\mathrm{vp}}+\|Y\|_{\mathrm{vp}}. If they are furthermore independent, then \|X+Y\|_{\mathrm{vp}}^2\leq\|X\|_{\mathrm{vp}}^2+\|Y\|_{\mathrm{vp}}^2.
Proof. Assume without loss of generality that X and Y have mean zero. If independent, then use that the cumulant generating function of a sum of independent random variables is additive. That is,
\ln\operatorname{E}[e^{t(X+Y)}]=\ln\operatorname{E}[e^{tX}]+\ln\operatorname{E}[e^{tY}]\leq\tfrac12\left(\|X\|_{\mathrm{vp}}^2+\|Y\|_{\mathrm{vp}}^2\right)t^2.
If not independent, then by Hölder's inequality, for any 1/p+1/q=1 we have
\operatorname{E}[e^{t(X+Y)}]\leq\left(\operatorname{E}[e^{ptX}]\right)^{1/p}\left(\operatorname{E}[e^{qtY}]\right)^{1/q}\leq\exp\!\left(\tfrac12\left(p\|X\|_{\mathrm{vp}}^2+q\|Y\|_{\mathrm{vp}}^2\right)t^2\right).
Solving the optimization problem
\begin{cases}\min\quad p\|X\|_{\mathrm{vp}}^2+q\|Y\|_{\mathrm{vp}}^2\\ \text{subject to}\quad 1/p+1/q=1\end{cases}
we obtain the optimal value \left(\|X\|_{\mathrm{vp}}+\|Y\|_{\mathrm{vp}}\right)^2, which gives the result.
Corollary. Linear sums of subgaussian random variables are subgaussian.
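As a numerical illustration of the independent case, the sketch below checks that for two independent Rademacher variables (each with optimal variance proxy 1) the exact moment generating function of the sum stays below the bound with proxy 1+1=2 on a grid of t values; the grid is an arbitrary choice.

```python
# Sketch: for two independent Rademacher variables X and Y (each with optimal
# variance proxy 1), check numerically that the sum has variance proxy at most
# 1 + 1 = 2, i.e. that E[exp(t(X+Y))] = cosh(t)^2 <= exp(2 * t^2 / 2) on a grid.
import numpy as np

ts = np.linspace(-20, 20, 4001)
mgf_sum = np.cosh(ts) ** 2          # exact MGF of X + Y
bound = np.exp(2 * ts**2 / 2)       # proxy s^2 = 2 in exp(s^2 t^2 / 2)
print("max ratio MGF / bound:", np.max(mgf_sum / bound))   # should be <= 1
```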
Strictly subgaussian
Expanding the cumulant generating function of a mean-zero X,
\ln\operatorname{E}[e^{tX}]=\tfrac12\operatorname{Var}[X]\,t^2+O(t^3)\qquad(t\to 0),
we find that \operatorname{Var}[X]\leq\|X\|_{\mathrm{vp}}^2. At the edge of possibility, we define that a random variable X satisfying \|X\|_{\mathrm{vp}}^2=\operatorname{Var}[X] is called strictly subgaussian.
Properties
Theorem.[2] Let X be a subgaussian random variable with mean zero. If all zeros of its characteristic function are real, then X is strictly subgaussian.
Corollary. If X_1,\ldots,X_n are independent and strictly subgaussian, then any linear sum of them is strictly subgaussian.
Examples
By calculating the characteristic functions, we can show that some distributions are strictly subgaussian: symmetric uniform distribution, symmetric Bernoulli distribution.
Since a symmetric uniform distribution is strictly subgaussian, its convolution with itself is strictly subgaussian. That is, the symmetric triangular distribution is strictly subgaussian.
Since the symmetric Bernoulli distribution is strictly subgaussian, any symmetric Binomial distribution is strictly subgaussian.
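These two examples can also be spot-checked directly from the moment generating functions; the sketch below verifies numerically that the symmetric Bernoulli and the symmetric uniform distribution on [-1,1] satisfy \operatorname{E}[e^{tX}]\leq\exp(\operatorname{Var}[X]\,t^2/2) on a grid of t values (a check, not a proof).

```python
# Sketch: check strict subgaussianity numerically for two examples, i.e. that the
# moment generating function satisfies E[exp(tX)] <= exp(Var(X) * t^2 / 2) on a grid.
import numpy as np

ts = np.linspace(-30, 30, 6001)
ts = ts[np.abs(ts) > 1e-9]

# Symmetric Bernoulli (Rademacher): MGF = cosh(t), Var = 1.
ratio_rademacher = np.cosh(ts) / np.exp(ts**2 / 2)

# Symmetric uniform on [-1, 1]: MGF = sinh(t)/t, Var = 1/3.
ratio_uniform = (np.sinh(ts) / ts) / np.exp(ts**2 / 6)

print("Rademacher : max MGF/bound =", ratio_rademacher.max())   # <= 1
print("Uniform    : max MGF/bound =", ratio_uniform.max())      # <= 1
```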
Examples
| Distribution | \Vert X\Vert_{\psi_2} | \Vert X\Vert_{\mathrm{vp}}^2 | strictly subgaussian? |
|---|---|---|---|
| Gaussian distribution N(0,\sigma^2) | \sqrt{8/3}\,\sigma | \sigma^2 | Yes |
| mean-zero Bernoulli distribution (value q with probability p, value -p with probability q) | solution to pe^{q^2/c^2}+qe^{p^2/c^2}=2 | \frac{p-q}{2(\ln p-\ln q)} | Iff p=q=\tfrac12 |
| symmetric Bernoulli distribution | \frac{1}{\sqrt{\ln 2}} | 1 | Yes |
| uniform distribution on [-1,1] | solution to \int_0^1 e^{x^2/c^2}\,dx=2, approximately 0.7727 | 1/3 | Yes |
| arbitrary distribution on interval [a,b] |  | \leq\frac{(b-a)^2}{4} |  |
The optimal variance proxy \|X\|_{\mathrm{vp}}^2 is known for many standard probability distributions, including the beta, Bernoulli, Dirichlet, Kumaraswamy, triangular, truncated Gaussian, and truncated exponential.
Bernoulli distribution
Let p, q be two positive numbers with p+q=1. Let X have the centered Bernoulli distribution taking the value q with probability p and -p with probability q, so that it has mean zero. Then its optimal variance proxy is
\|X\|_{\mathrm{vp}}^2=\frac{p-q}{2(\ln p-\ln q)}.
[3] Its subgaussian norm \|X\|_{\psi_2} is the unique positive solution c of pe^{q^2/c^2}+qe^{p^2/c^2}=2.
Let X be a random variable with symmetric Bernoulli distribution (or Rademacher distribution). That is, X takes values 1 and -1 with probabilities 1/2 each. Since X^2=1, it follows that
\|X\|_{\psi_2}=\inf\{c>0:\operatorname{E}[\exp(X^2/c^2)]\leq 2\}=\inf\{c>0:e^{1/c^2}\leq 2\}=\frac{1}{\sqrt{\ln 2}},
and hence X is a subgaussian random variable.
Bounded distributions
Bounded distributions have no tail at all, so clearly they are subgaussian.
If X is bounded within the interval [a,b], Hoeffding's lemma states that
\|X\|_{\mathrm{vp}}^2\leq\left(\frac{b-a}{2}\right)^2.
Hoeffding's inequality is the Chernoff bound obtained using this fact.
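The lemma can be spot-checked numerically; the sketch below does so for a small discrete distribution supported on [0,1] (the support points and probabilities are arbitrary illustrative choices, not taken from the sources).

```python
# Sketch: check Hoeffding's lemma numerically for a small discrete distribution
# supported on [a, b] = [0, 1]: E[exp(t(X - EX))] <= exp(t^2 * (b - a)^2 / 8).
import numpy as np

values = np.array([0.0, 0.2, 1.0])   # arbitrary support points in [0, 1]
probs = np.array([0.5, 0.3, 0.2])    # arbitrary probabilities summing to 1
mean = np.dot(probs, values)

ts = np.linspace(-40, 40, 8001)
mgf = np.array([np.dot(probs, np.exp(t * (values - mean))) for t in ts])
bound = np.exp(ts**2 * (1.0 - 0.0)**2 / 8)

print("max MGF / Hoeffding bound:", np.max(mgf / bound))   # should be <= 1
```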
Convolutions
Since the sum of subgaussian random variables is still subgaussian, the convolution of subgaussian distributions is still subgaussian. In particular, any convolution of the normal distribution with any bounded distribution is subgaussian.
Mixtures
Given subgaussian distributions X_1,X_2,\ldots,X_n, we can construct an additive mixture X as follows: first randomly pick an index i\in\{1,\ldots,n\} with probabilities p_1,\ldots,p_n, then set X=X_i.
Since
\operatorname{E}\!\left[\exp\!\left(\frac{X^2}{c^2}\right)\right]=\sum_i p_i\operatorname{E}\!\left[\exp\!\left(\frac{X_i^2}{c^2}\right)\right],
we have \|X\|_{\psi_2}\leq\max_i\|X_i\|_{\psi_2}, and so the mixture is subgaussian.
In particular, any gaussian mixture is subgaussian.
More generally, the mixture of infinitely many subgaussian distributions is also subgaussian, if the subgaussian norm has a finite supremum: \sup_i\|X_i\|_{\psi_2}<\infty.
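For a concrete check, the sketch below evaluates \operatorname{E}[\exp(X^2/c^2)] for a two-component Gaussian mixture, using the closed form 1/\sqrt{1-2\sigma^2/c^2} for a centered Gaussian component and taking c to be the largest component \psi_2 norm \sqrt{8/3}\,\sigma from the examples table; the weights and standard deviations are arbitrary illustrative values.

```python
# Sketch: for a two-component Gaussian mixture, verify E[exp(X^2/c^2)] <= 2 when
# c is the larger of the component psi_2 norms (sqrt(8/3)*sigma for a centered
# Gaussian of standard deviation sigma).
import numpy as np

sigmas = np.array([0.5, 1.5])    # arbitrary component standard deviations
weights = np.array([0.7, 0.3])   # arbitrary mixture weights

c = np.sqrt(8.0 / 3.0) * sigmas.max()          # sup of the component psi_2 norms
# For N(0, sigma^2): E[exp(X^2/c^2)] = 1/sqrt(1 - 2*sigma^2/c^2), valid when c^2 > 2*sigma^2
component_values = 1.0 / np.sqrt(1.0 - 2.0 * sigmas**2 / c**2)
mixture_value = np.dot(weights, component_values)

print("E[exp(X^2/c^2)] for the mixture:", mixture_value)   # should be <= 2
```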
Subgaussian random vectors
So far, we have discussed subgaussianity for real-valued random variables. We can also define subgaussianity for random vectors. The purpose of subgaussianity is to make the tails decay fast, so we generalize accordingly: a subgaussian random vector is a random vector where the tail decays fast.
Let X be a random vector taking values in \mathbb{R}^n. Define
\|X\|_{\psi_2}:=\sup_{v\in S^{n-1}}\|\langle X,v\rangle\|_{\psi_2},
where S^{n-1} is the unit sphere in \mathbb{R}^n. X is subgaussian iff \|X\|_{\psi_2}<\infty.
Theorem. (Theorem 3.4.6 in [1]) For any positive integer n, the uniformly distributed random vector X\sim U(\sqrt n\,S^{n-1}) is subgaussian, with \|X\|_{\psi_2}\lesssim 1.
This is not so surprising, because as n\to\infty, the projection of U(\sqrt n\,S^{n-1}) to the first coordinate converges in distribution to the standard normal distribution.
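The sketch below samples the uniform distribution on \sqrt n\,S^{n-1} (by normalizing Gaussian vectors) and compares the empirical tail of the first coordinate with a Gaussian-type bound; the constant in the bound, the dimension, and the sample size are arbitrary illustrative choices, since the theorem only guarantees some absolute constant.

```python
# Sketch: sample the uniform distribution on the sphere of radius sqrt(n) and look
# at one coordinate; its empirical tail is compared with a Gaussian-type bound.
# This only illustrates the theorem, it is not a proof.
import numpy as np

rng = np.random.default_rng(0)
n, samples = 50, 100_000
g = rng.standard_normal((samples, n))
x = np.sqrt(n) * g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on sqrt(n)*S^{n-1}
first_coord = x[:, 0]

for t in [1.0, 2.0, 3.0]:
    emp = np.mean(np.abs(first_coord) >= t)
    print(f"t={t}  empirical P(|X_1|>=t)={emp:.4f}  2*exp(-t^2/4)={2*np.exp(-t**2/4):.4f}")
```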
Maximum inequalities
Proposition. If X_1,\ldots,X_n are mean-zero subgaussians, with \|X_i\|_{\mathrm{vp}}\leq\sigma, then for any \delta>0, we have
\max(X_1,\ldots,X_n)\leq\sigma\sqrt{2\ln\frac n\delta}
with probability at least 1-\delta.
Proof. By the Chernoff bound, \operatorname{P}\!\left(X_i\geq\sigma\sqrt{2\ln(n/\delta)}\right)\leq\delta/n. Now apply the union bound.
Proposition. (Exercise 2.5.10 in [1]) If X_1,\ldots,X_n are subgaussians, with \|X_i\|_{\psi_2}\leq K, then
\operatorname{E}\!\left[\max_i|X_i|\right]\leq CK\sqrt{\ln n}
for some absolute constant C. Further, the bound is sharp, since when X_1,\ldots,X_N are IID samples of the standard normal distribution we have
\operatorname{E}\!\left[\max_{1\leq n\leq N}|X_n|\right]\gtrsim\sqrt{\ln N}.
[4] [5]
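A Monte Carlo illustration of both the upper bound and its sharpness: for N i.i.d. standard normals, the estimated \operatorname{E}[\max_i|X_i|] stays within a constant factor of \sqrt{2\ln N}. The sample sizes and the random seed below are arbitrary.

```python
# Sketch: Monte Carlo estimate of E[max_i |X_i|] for N i.i.d. standard normals,
# compared with sqrt(2 ln N); the ratio stays bounded, illustrating both the
# upper bound and its sharpness.
import numpy as np

rng = np.random.default_rng(1)
trials = 1000
for N in [10, 100, 1000, 10_000]:
    maxima = np.abs(rng.standard_normal((trials, N))).max(axis=1)
    est = maxima.mean()
    ref = np.sqrt(2 * np.log(N))
    print(f"N={N:6d}  E[max|X_i|]~{est:.3f}  sqrt(2 ln N)={ref:.3f}  ratio={est/ref:.3f}")
```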
Theorem. (over a finite set) If X_1,\ldots,X_n are mean-zero subgaussian, with \|X_i\|_{\mathrm{vp}}^2\leq\sigma^2, then
\operatorname{E}\!\left[\max_iX_i\right]\leq\sigma\sqrt{2\ln n},\qquad \operatorname{P}\!\left(\max_iX_i>t\right)\leq n\exp\!\left(-\frac{t^2}{2\sigma^2}\right),
\operatorname{E}\!\left[\max_i|X_i|\right]\leq\sigma\sqrt{2\ln(2n)},\qquad \operatorname{P}\!\left(\max_i|X_i|>t\right)\leq 2n\exp\!\left(-\frac{t^2}{2\sigma^2}\right).
Theorem. (over a convex polytope) Fix a finite set of vectors v_1,\ldots,v_n. If X is a random vector, such that each \langle X,v_i\rangle has mean zero and \|\langle X,v_i\rangle\|_{\mathrm{vp}}^2\leq\sigma^2, then the above 4 inequalities hold, with \max_{v\in T}\langle X,v\rangle replacing \max_iX_i.
Here, T is the convex polytope spanned by the vectors v_1,\ldots,v_n.
Theorem. (over a ball) If X is a random vector in \mathbb{R}^d, such that \|\langle X,v\rangle\|_{\mathrm{vp}}\leq\sigma for all v on the unit sphere S^{d-1}, then
\operatorname{E}\!\left[\max_{\|v\|_2\leq 1}\langle X,v\rangle\right]\leq 4\sigma\sqrt d.
For any \delta>0, with probability at least 1-\delta,
\max_{\|v\|_2\leq 1}\langle X,v\rangle\leq 4\sigma\sqrt d+2\sigma\sqrt{2\ln(1/\delta)}.
Inequalities
Theorem. (Theorem 2.6.1 in [1]) There exists a positive constant C such that given any number of independent mean-zero subgaussian random variables X_1,\ldots,X_n,
\left\|\sum_iX_i\right\|_{\psi_2}^2\leq C\sum_i\|X_i\|_{\psi_2}^2.
Theorem. (Hoeffding's inequality) (Theorem 2.6.3 in [1]) There exists a positive constant c such that given any number of independent mean-zero subgaussian random variables X_1,\ldots,X_n,
\operatorname{P}\!\left(\left|\sum_iX_i\right|\geq t\right)\leq 2\exp\!\left(-\frac{ct^2}{\sum_i\|X_i\|_{\psi_2}^2}\right)\qquad\text{for all }t\geq 0.
Theorem. (Bernstein's inequality) (Theorem 2.8.1 in [1]) There exists a positive constant c such that given any number of independent mean-zero subexponential random variables X_1,\ldots,X_n,
\operatorname{P}\!\left(\left|\sum_iX_i\right|\geq t\right)\leq 2\exp\!\left(-c\min\!\left(\frac{t^2}{\sum_i\|X_i\|_{\psi_1}^2},\frac{t}{\max_i\|X_i\|_{\psi_1}}\right)\right)\qquad\text{for all }t\geq 0,
where \|\cdot\|_{\psi_1} denotes the subexponential norm.
Theorem. (Khinchine inequality) (Exercise 2.6.5 in [1]) There exists a positive constant C such that given any number of independent mean-zero, variance-one subgaussian random variables X_1,\ldots,X_n, any p\geq 2, and any real numbers a_1,\ldots,a_n,
\left(\sum_ia_i^2\right)^{1/2}\leq\left(\operatorname{E}\left|\sum_ia_iX_i\right|^p\right)^{1/p}\leq CK\sqrt p\left(\sum_ia_i^2\right)^{1/2},
where K=\max_i\|X_i\|_{\psi_2}.
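For the special case of a weighted sum of independent Rademacher variables, Hoeffding's inequality takes the classical explicit form \operatorname{P}(|\sum_ia_iX_i|\geq t)\leq 2\exp(-t^2/(2\sum_ia_i^2)). The sketch below compares this bound with Monte Carlo estimates; the weights, sample size, and seed are arbitrary illustrative choices.

```python
# Sketch: Hoeffding's inequality for a weighted sum of independent Rademacher
# variables: P(|sum_i a_i X_i| >= t) <= 2 exp(-t^2 / (2 * ||a||_2^2)).
import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0.5, 1.5, size=100)      # arbitrary fixed weights
norm_sq = np.sum(a**2)

samples = 100_000
x = rng.choice([-1.0, 1.0], size=(samples, a.size))
s = x @ a

for t in [5.0, 10.0, 20.0]:
    emp = np.mean(np.abs(s) >= t)
    bound = 2 * np.exp(-t**2 / (2 * norm_sq))
    print(f"t={t:5.1f}  empirical={emp:.5f}  Hoeffding bound={bound:.5f}")
```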
Hanson-Wright inequality
The Hanson-Wright inequality states that if a random vector X is subgaussian in a certain sense, then any quadratic form A of this vector, X^TAX, is also subgaussian/subexponential. Further, the upper bound on the tail of X^TAX is uniform.
A weak version of the following theorem was proved in (Hanson, Wright, 1971).[6] There are many extensions and variants. Much like the central limit theorem, the Hanson-Wright inequality is more a cluster of theorems with the same purpose than a single theorem. The purpose is to take a subgaussian vector and uniformly bound its quadratic forms.
Theorem.[7] [8] There exists a constant c, such that:
Let n be a positive integer. Let X_1,\ldots,X_n be independent random variables, such that each satisfies \operatorname{E}[X_i]=0 and \|X_i\|_{\psi_2}\leq K. Combine them into a random vector X=(X_1,\ldots,X_n). For any n\times n matrix A, we have
\operatorname{P}\!\left(|X^TAX-\operatorname{E}[X^TAX]|>t\right)\leq 2\exp\!\left(-c\min\!\left(\frac{t^2}{K^4\|A\|_F^2},\frac{t}{K^2\|A\|}\right)\right),
where \|A\|_F=\left(\sum_{i,j}A_{ij}^2\right)^{1/2} is the Frobenius norm of the matrix, and \|A\|=\max_{\|x\|_2=1}\|Ax\|_2 is the operator norm of the matrix.
In words, the quadratic form X^TAX has its tail uniformly bounded by an exponential, or a gaussian, whichever is larger.
In the statement of the theorem, the constant c is an "absolute constant", meaning that it has no dependence on n, on the distribution of X, or on the matrix A. It is a mathematical constant much like pi and e.
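Because the constant c is not specified, a simulation can only illustrate the scales involved. The sketch below draws a Rademacher vector X and a fixed matrix A (both arbitrary illustrative choices) and shows that the fluctuations of X^TAX around its mean are on the order of \|A\|_F, as the Gaussian regime of the bound predicts.

```python
# Sketch: empirical illustration of the Hanson-Wright scales.  For a Rademacher
# vector X and a fixed matrix A, the fluctuations of X^T A X around its mean are
# compared with the Frobenius norm ||A||_F (the "Gaussian regime" scale) and the
# operator norm ||A|| (the "exponential regime" scale).
import numpy as np

rng = np.random.default_rng(3)
n = 200
A = rng.standard_normal((n, n)) / np.sqrt(n)    # arbitrary fixed matrix

samples = 20_000
X = rng.choice([-1.0, 1.0], size=(samples, n))
quad = ((X @ A) * X).sum(axis=1)                # X^T A X for each sample

fro = np.linalg.norm(A, 'fro')
op = np.linalg.norm(A, 2)
dev = quad - quad.mean()

print("||A||_F =", fro, "  ||A||_op =", op)
print("empirical std of X^T A X       :", dev.std())
print("empirical P(|dev| > 3 ||A||_F) :", np.mean(np.abs(dev) > 3 * fro))
```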
Consequences
Theorem (subgaussian concentration). There exists a constant c, such that:
Let n, m be positive integers. Let X_1,\ldots,X_n be independent random variables, such that each satisfies \operatorname{E}[X_i]=0, \operatorname{E}[X_i^2]=1, and \|X_i\|_{\psi_2}\leq K. Combine them into a random vector X=(X_1,\ldots,X_n). For any m\times n matrix A, we have
\operatorname{P}\!\left(\left|\,\|AX\|_2-\|A\|_F\,\right|>t\right)\leq 2\exp\!\left(-\frac{ct^2}{K^4\|A\|^2}\right)\qquad\text{for all }t\geq 0.
In words, the random vector AX is concentrated on a spherical shell of radius \|A\|_F, such that \|AX\|_2-\|A\|_F is subgaussian, with subgaussian norm \lesssim K^2\|A\|.
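The shell concentration can be seen in a simulation: for a standard Gaussian vector X (a subgaussian vector with K of order 1) and an arbitrary fixed matrix A, \|AX\|_2 stays within a few multiples of \|A\| of the radius \|A\|_F. The matrix sizes, sample size, and seed below are illustrative choices.

```python
# Sketch: concentration of ||A X||_2 around ||A||_F for a standard Gaussian vector X.
# The deviations should be on the scale of the operator norm ||A||, which is much
# smaller than ||A||_F itself for a "generic" matrix.
import numpy as np

rng = np.random.default_rng(4)
m, n = 300, 400
A = rng.standard_normal((m, n)) / np.sqrt(n)   # arbitrary fixed matrix

samples = 10_000
X = rng.standard_normal((n, samples))
norms = np.linalg.norm(A @ X, axis=0)          # ||A X||_2 for each sample

print("||A||_F          :", np.linalg.norm(A, 'fro'))
print("||A||_op         :", np.linalg.norm(A, 2))
print("mean of ||AX||_2 :", norms.mean())
print("std of ||AX||_2  :", norms.std())
```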
Notes and References
1. Vershynin, Roman (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge: Cambridge University Press.
2. Bobkov, S. G.; Chistyakov, G. P.; Götze, F. (2023). "Strictly subgaussian probability distributions". arXiv:2308.01749 [math.PR].
3. Bobkov, S. G.; Chistyakov, G. P.; Götze, F. (2023). "Strictly subgaussian probability distributions". arXiv:2308.01749 [math.PR].
4. Kamath, Gautam (2015). "Bounds on the expectation of the maximum of samples from a Gaussian".
5. "MIT 18.S997 Spring 2015 High-Dimensional Statistics, Chapter 1. Sub-Gaussian Random Variables". MIT OpenCourseWare. Retrieved 2024-04-03.
6. Hanson, D. L.; Wright, F. T. (1971). "A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables". The Annals of Mathematical Statistics. 42 (3): 1079–1083. doi:10.1214/aoms/1177693335.
7. Rudelson, Mark; Vershynin, Roman (2013). "Hanson-Wright inequality and sub-gaussian concentration". Electronic Communications in Probability. 18: 1–9. arXiv:1306.2872. doi:10.1214/ECP.v18-2865.
8. Vershynin, Roman (2018). "6. Quadratic Forms, Symmetrization, and Contraction". High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press. pp. 127–146. doi:10.1017/9781108231596.009. ISBN 978-1-108-41519-4.