In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.
The lemma has applications in compressed sensing, manifold learning, dimensionality reduction, graph embedding, and natural language processing. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases.[1] It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure.
The lemma may also help explain how large language models (LLMs) such as transformers are able to represent highly nuanced word meanings and context in relatively low-dimensional embeddings.[2]
Given $0 < \varepsilon < 1$, a set $X$ of $m$ points in $\mathbb{R}^N$, and an integer $n > 8(\ln m)/\varepsilon^2$, there is a linear map $f : \mathbb{R}^N \to \mathbb{R}^n$ such that

$$(1-\varepsilon)\|u-v\|^2 \le \|f(u)-f(v)\|^2 \le (1+\varepsilon)\|u-v\|^2$$

for all $u, v \in X$.
The formula can be rearranged:

$$\frac{1}{1+\varepsilon}\,\|f(u)-f(v)\|^2 \le \|u-v\|^2 \le \frac{1}{1-\varepsilon}\,\|f(u)-f(v)\|^2.$$

Alternatively, for any $\varepsilon \in (0,1)$ and any integer $n \ge 15(\ln m)/\varepsilon^2$, there exists a linear function $f : \mathbb{R}^N \to \mathbb{R}^n$ such that the restriction $f|_X$ is $(1+\varepsilon)$-bi-Lipschitz.
Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size $m$ that needs dimension

$$\Omega\left(\frac{\log m}{\varepsilon^2}\right)$$

in order to preserve the distances between all pairs of points within a factor of $(1 \pm \varepsilon)$.
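As a concrete illustration of the guarantee, the following sketch embeds a point set with a scaled Gaussian random matrix, one standard construction that satisfies the lemma with high probability; the point set, the value of $\varepsilon$, and the sampled pairs are illustrative choices, not part of the statement.

```python
import numpy as np

rng = np.random.default_rng(0)

m, N = 1000, 10_000                        # number of points and ambient dimension (illustrative)
eps = 0.5                                  # allowed distortion
n = int(np.ceil(8 * np.log(m) / eps**2))   # target dimension from the lemma

X = rng.normal(size=(m, N))                # an arbitrary high-dimensional point set

# Scaled Gaussian projection: E[||f(u) - f(v)||^2] = ||u - v||^2.
A = rng.normal(size=(n, N)) / np.sqrt(n)
Y = X @ A.T

# Compare a sample of pairwise squared distances before and after.
i = rng.integers(0, m, size=200)
j = (i + rng.integers(1, m, size=200)) % m     # guarantees i != j
orig = np.sum((X[i] - X[j])**2, axis=1)
proj = np.sum((Y[i] - Y[j])**2, axis=1)
ratio = proj / orig
print(ratio.min(), ratio.max())            # typically within [1 - eps, 1 + eps]
```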
The classical proof of the lemma takes $f$ to be a suitable multiple of the orthogonal projection $P$ onto a random subspace of dimension $n$ in $\mathbb{R}^N$. An orthogonal projection shrinks distances on average, but under the conditions of the lemma, concentration of measure guarantees that with nonzero probability a random projection shrinks all pairwise distances between the points of $X$ by approximately the same constant factor $c$. Since this probability is nonzero, such a projection $P$ must exist, and one can set $f(v) = Pv/c$.

To obtain the projection algorithmically, it suffices, with high probability, to repeatedly sample orthogonal projection matrices at random; if you keep rolling the dice, you will eventually obtain a suitable one in polynomial random time.
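A minimal sketch of this resampling procedure, assuming a Gaussian-then-QR construction of the random subspace and the scaling $1/c = \sqrt{N/n}$ (a common choice that preserves squared lengths in expectation); the point set and parameters are illustrative.

```python
import numpy as np

def jl_by_random_projection(X, n, eps, rng, max_tries=100):
    """Sample random n-dimensional orthogonal projections of the rows of X until
    every pairwise squared distance is distorted by a factor in [1 - eps, 1 + eps]."""
    m, N = X.shape
    scale = np.sqrt(N / n)                       # plays the role of 1/c
    diffs = X[:, None, :] - X[None, :, :]
    orig = np.einsum('ijk,ijk->ij', diffs, diffs)
    mask = ~np.eye(m, dtype=bool)
    for _ in range(max_tries):
        # Orthonormal basis of a uniformly random n-dimensional subspace of R^N.
        Q, _ = np.linalg.qr(rng.normal(size=(N, n)))
        Y = scale * (X @ Q)                      # f(v) = Pv / c, in subspace coordinates
        d = Y[:, None, :] - Y[None, :, :]
        proj = np.einsum('ijk,ijk->ij', d, d)
        ratio = proj[mask] / orig[mask]
        if ratio.min() >= 1 - eps and ratio.max() <= 1 + eps:
            return Y
    raise RuntimeError("no suitable projection found; increase n or max_tries")

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))                  # 50 points in R^2000 (illustrative)
Y = jl_by_random_projection(X, n=400, eps=0.5, rng=rng)
print(Y.shape)                                   # (50, 400)
```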
A related lemma is the distributional JL lemma. This lemma states that for any $0 < \varepsilon, \delta < 1/2$ and positive integer $d$, there exists a distribution over $\mathbb{R}^{k \times d}$, from which the matrix $A$ is drawn, such that for $k = O(\varepsilon^{-2}\log(1/\delta))$ and for any unit-length vector $x \in \mathbb{R}^d$, the claim below holds:

$$P\left(\bigl|\,\|Ax\|_2^2 - 1\,\bigr| > \varepsilon\right) < \delta.$$

One can obtain the JL lemma from the distributional version by setting $x = (u-v)/\|u-v\|_2$ for each pair $u, v \in X$ and taking $\delta < 1/m^2$; the JL lemma then follows by a union bound over the fewer than $m^2$ such pairs.
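An empirical check of the distributional statement with the scaled Gaussian distribution $A_{ij} \sim N(0, 1/k)$, one standard distribution satisfying the lemma; the constant 8 in the choice of $k$ is an illustrative guess, since the lemma only fixes $k$ up to a constant.

```python
import numpy as np

rng = np.random.default_rng(1)

d, eps, delta = 500, 0.3, 0.05
k = int(np.ceil(8 * eps**-2 * np.log(1 / delta)))   # k = O(eps^-2 log(1/delta)), constant 8 assumed

x = rng.normal(size=d)
x /= np.linalg.norm(x)                               # unit-length vector

trials = 1000
failures = 0
for _ in range(trials):
    A = rng.normal(size=(k, d)) / np.sqrt(k)         # entries drawn i.i.d. N(0, 1/k)
    if abs(np.linalg.norm(A @ x)**2 - 1) > eps:
        failures += 1

print(failures / trials, "<", delta)                 # empirical failure rate vs. delta
```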
Given $A$, computing the matrix–vector product takes $O(kd)$ time. There has been some work on deriving distributions for which the matrix–vector product can be computed in less than $O(kd)$ time. There are two major lines of work. The first, the Fast Johnson–Lindenstrauss Transform (FJLT), was introduced by Ailon and Chazelle in 2006. This method allows the computation of the matrix–vector product in just $d\log d + k^{2+\gamma}$ time for any constant $\gamma > 0$.

Another approach is to build a distribution supported over matrices that are sparse.[6] This method allows keeping only an $\varepsilon$ fraction of the entries in the matrix, which means the computation can be done in just $kd\varepsilon$ time. Furthermore, if the vector has only $b$ nonzero entries, the sparse JL transform takes time $kb\varepsilon$, which may be much less than the $d\log d$ time used by the fast JL transform.
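A sketch of a sparse random projection in the spirit of this approach; the specific distribution below, keeping each entry independently with probability $\varepsilon$ and scaling it to $\pm 1/\sqrt{k\varepsilon}$, is a simplified illustration rather than the exact construction analyzed in the cited work.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)

d, k, eps = 10_000, 300, 0.05

# Keep each entry with probability eps; nonzero entries are +/- 1/sqrt(k * eps),
# which makes E[||Ax||^2] = ||x||^2.
A = sparse.random(k, d, density=eps, format='csr', random_state=2)
A.data = rng.choice([-1.0, 1.0], size=A.nnz) / np.sqrt(k * eps)

x = rng.normal(size=d)
y = A @ x                                  # costs O(k * d * eps) instead of O(k * d)
print(np.dot(y, y) / np.dot(x, x))         # approximately 1
```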
It is possible to combine two JL matrices by taking the so-called face-splitting product, which is defined as the tensor products of the rows (proposed by V. Slyusar in 1996 for radar and digital antenna array applications). More directly, let ${C} \in \mathbb{R}^{3 \times 3}$ and ${D} \in \mathbb{R}^{3 \times 3}$ be two matrices. Then the face-splitting product ${C} \bullet {D}$ is

$$C \bullet D = \begin{bmatrix} C_1 \otimes D_1 \\ C_2 \otimes D_2 \\ C_3 \otimes D_3 \end{bmatrix}.$$

JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity:

$$(C \bullet D)(x \otimes y) = Cx \circ Dy = \begin{bmatrix} (Cx)_1 (Dy)_1 \\ (Cx)_2 (Dy)_2 \\ \vdots \end{bmatrix},$$

where $\circ$ denotes the element-wise (Hadamard) product.
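A small numpy check of this identity; face_splitting below is a hypothetical helper written for the illustration.

```python
import numpy as np

def face_splitting(C, D):
    """Face-splitting product: row i of the result is kron(C[i], D[i])."""
    return np.stack([np.kron(Ci, Di) for Ci, Di in zip(C, D)])

rng = np.random.default_rng(3)
C = rng.normal(size=(3, 3))
D = rng.normal(size=(3, 3))
x = rng.normal(size=3)
y = rng.normal(size=3)

lhs = face_splitting(C, D) @ np.kron(x, y)   # (C • D)(x ⊗ y)
rhs = (C @ x) * (D @ y)                      # Cx ∘ Dy, element-wise product
print(np.allclose(lhs, rhs))                 # True
```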
In 2020 it was shown that if the matrices $C_1, C_2, \dots, C_c$ are independent $\pm 1$ or Gaussian matrices, the combined matrix $C_1 \bullet \dots \bullet C_c$ satisfies the distributional JL lemma if the number of rows is at least

$$O\!\left(\varepsilon^{-2}\log(1/\delta) + \varepsilon^{-1}\left(\tfrac{1}{c}\log(1/\delta)\right)^{c}\right).$$

For large $\varepsilon$ this is as good as the completely random Johnson–Lindenstrauss transform, but a matching lower bound in the same paper shows that this exponential dependence on $(\log(1/\delta))^{c}$ is necessary.
The alternative statement with the constant 15 follows from the first version of the lemma: it suffices to take any integer $n > 128(\ln m)/(9\varepsilon^2)$, which is what the first version gives when applied with $\varepsilon$ replaced by $3\varepsilon/4$, and $128/9 < 15$. Moreover, since $1/(1+\varepsilon) < \sqrt{1-3\varepsilon/4}$ and $\sqrt{1+3\varepsilon/4} < \sqrt{1+\varepsilon} < 1+\varepsilon$ for $\varepsilon \in (0,1)$, the resulting map $f|_X$ is $(1+\varepsilon)$-bi-Lipschitz.
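Spelling out this reduction as a worked restatement of the bounds above: applying the first version with $\varepsilon$ replaced by $3\varepsilon/4$ requires

$$n > \frac{8\ln m}{(3\varepsilon/4)^2} = \frac{128\ln m}{9\varepsilon^2}, \qquad \frac{128}{9} \approx 14.2 < 15,$$

and its squared-distance guarantee, combined with the two inequalities above, gives for all $u, v \in X$

$$\frac{\|u-v\|}{1+\varepsilon} < \sqrt{1-\tfrac{3\varepsilon}{4}}\,\|u-v\| \le \|f(u)-f(v)\| \le \sqrt{1+\tfrac{3\varepsilon}{4}}\,\|u-v\| < (1+\varepsilon)\,\|u-v\|,$$

which is exactly the $(1+\varepsilon)$-bi-Lipschitz condition for $f|_X$.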