In statistics, machine learning, and algorithms, a tensor sketch is a type of dimensionality reduction that is particularly efficient when applied to vectors that have tensor structure.[1][2] Such a sketch can be used to speed up explicit kernel methods and bilinear pooling in neural networks, and it is a cornerstone of many numerical linear algebra algorithms.[3]
Mathematically, a dimensionality reduction or sketching matrix is a matrix $M \in \mathbb{R}^{k \times d}$, where $k < d$, such that for any vector $x \in \mathbb{R}^d$,
$$|\|Mx\|_2 - \|x\|_2| < \varepsilon \|x\|_2$$
with high probability. In other words, $M$ preserves the norm of vectors up to a small relative error.
A tensor sketch has the extra property that if $x = y \otimes z$ for some vectors $y \in \mathbb{R}^{d_1}$ and $z \in \mathbb{R}^{d_2}$ with $d_1 d_2 = d$, then $M(y \otimes z)$ can be computed particularly fast. Here $\otimes$ denotes the Kronecker product of vectors, rather than the outer product, though the two are related by a flattening.
The speedup is achieved by first rewriting $M(y \otimes z) = M'y \circ M''z$, where $\circ$ denotes the element-wise (Hadamard) product. Each of $M'y$ and $M''z$ can be computed in time $O(kd_1)$ and $O(kd_2)$ respectively, so the whole sketch takes time $O(kd_1 + kd_2)$. In comparison, computing $M(y \otimes z)$ directly requires forming $y \otimes z$ in time $O(d_1 d_2)$ and then applying $M$ in time $O(kd) = O(kd_1 d_2)$.
For higher-order tensors, such as $x = y \otimes z \otimes t$, the savings are even greater.
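As an illustration of this identity (our own example, not taken from the cited sources), the following NumPy snippet builds $M$ as the row-wise Kronecker ("face-splitting") product of two smaller Gaussian matrices, a construction discussed below, and checks that sketching via $M'y \circ M''z$ gives the same result as applying $M$ to the explicit Kronecker product, while roughly preserving its norm.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d1, d2 = 100, 30, 40

# Two independent Gaussian sketching matrices.
M1 = rng.standard_normal((k, d1))
M2 = rng.standard_normal((k, d2))

# M is the row-wise Kronecker (face-splitting) product: row i of M is kron(M1[i], M2[i]).
M = np.vstack([np.kron(M1[i], M2[i]) for i in range(k)]) / np.sqrt(k)

y = rng.standard_normal(d1)
z = rng.standard_normal(d2)

# Slow path: form the length d1*d2 Kronecker product explicitly, then apply M.
slow = M @ np.kron(y, z)

# Fast path: sketch y and z separately and combine with an element-wise product.
fast = (M1 @ y) * (M2 @ z) / np.sqrt(k)

print(np.allclose(slow, fast))                              # True: identical sketches
print(np.linalg.norm(fast), np.linalg.norm(np.kron(y, z)))  # approximately equal norms
```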
The term tensor sketch was coined in 2013,[4] describing a technique by Rasmus Pagh[5] from the same year. Originally it was understood using the fast Fourier transform to do fast convolution of count sketches. Later research works generalized it to a much larger class of dimensionality reductions via tensor random embeddings.
Tensor random embeddings were introduced in 2010 in a paper[6] on differential privacy and were first analyzed by Rudelson et al. in 2012 in the context of sparse recovery.[7]
Avron et al.[8] were the first to study the subspace embedding properties of tensor sketches, focusing in particular on applications to polynomial kernels. In this context, the sketch is required not only to preserve the norm of each individual vector with a certain probability but to preserve the norm of all vectors in each individual linear subspace. This is a much stronger property, and it requires larger sketch sizes, but it allows the kernel methods to be used very broadly, as explored in the book by David Woodruff.
The face-splitting product is defined as the tensor products of the rows of two matrices (it was proposed by V. Slyusar[9] in 1996[10][11][12][13][14] for radar and digital antenna array applications). More directly, let $C \in \mathbb{R}^{3 \times 3}$ and $D \in \mathbb{R}^{3 \times 3}$ be two matrices. Then the face-splitting product $C \bullet D$ is
$$C \bullet D = \left[\begin{array}{c} C_1 \otimes D_1 \\\hline C_2 \otimes D_2 \\\hline C_3 \otimes D_3 \end{array}\right] = \left[\begin{array}{ccccccccc}
C_{1,1}D_{1,1} & C_{1,1}D_{1,2} & C_{1,1}D_{1,3} & C_{1,2}D_{1,1} & C_{1,2}D_{1,2} & C_{1,2}D_{1,3} & C_{1,3}D_{1,1} & C_{1,3}D_{1,2} & C_{1,3}D_{1,3} \\\hline
C_{2,1}D_{2,1} & C_{2,1}D_{2,2} & C_{2,1}D_{2,3} & C_{2,2}D_{2,1} & C_{2,2}D_{2,2} & C_{2,2}D_{2,3} & C_{2,3}D_{2,1} & C_{2,3}D_{2,2} & C_{2,3}D_{2,3} \\\hline
C_{3,1}D_{3,1} & C_{3,1}D_{3,2} & C_{3,1}D_{3,3} & C_{3,2}D_{3,1} & C_{3,2}D_{3,2} & C_{3,2}D_{3,3} & C_{3,3}D_{3,1} & C_{3,3}D_{3,2} & C_{3,3}D_{3,3}
\end{array}\right].$$
The reason the face-splitting product is useful is the identity
$$(C \bullet D)(x \otimes y) = Cx \circ Dy = \left[\begin{array}{c} (Cx)_1(Dy)_1 \\ (Cx)_2(Dy)_2 \\ \vdots \end{array}\right],$$
where $\circ$ is the element-wise (Hadamard) product. Since $Cx$ and $Dy$ can be computed without ever forming $x \otimes y$, the matrix $C \bullet D$ can be applied to vectors with tensor structure much faster than an ordinary matrix of the same size.
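A small NumPy check of this identity; the helper name face_splitting_product is ours.

```python
import numpy as np

def face_splitting_product(C, D):
    """Row-wise Kronecker product: row i of the result is kron(C[i], D[i])."""
    return np.vstack([np.kron(Ci, Di) for Ci, Di in zip(C, D)])

rng = np.random.default_rng(1)
C = rng.integers(-3, 4, size=(3, 3)).astype(float)
D = rng.integers(-3, 4, size=(3, 3)).astype(float)
x = rng.standard_normal(3)
y = rng.standard_normal(3)

lhs = face_splitting_product(C, D) @ np.kron(x, y)  # apply the 3x9 matrix to the tensor product
rhs = (C @ x) * (D @ y)                             # Hadamard product of the two small products
print(np.allclose(lhs, rhs))                        # True
```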
The tensor sketch of Pham and Pagh[4] computes $C^{(1)}x \ast C^{(2)}y$, where $C^{(1)}$ and $C^{(2)}$ are independent count sketch matrices and $\ast$ denotes vector convolution. They show that, remarkably, this equals $C(x \otimes y)$, a count sketch of the tensor product.
It turns out that this relation can be seen in terms of the face-splitting product as
$$C^{(1)}x \ast C^{(2)}y = \mathcal{F}^{-1}(\mathcal{F}C^{(1)}x \circ \mathcal{F}C^{(2)}y),$$
where $\mathcal{F}$ is the Fourier transform matrix. Since $\mathcal{F}$ is an orthonormal matrix, applying $\mathcal{F}^{-1}$ does not affect the norm of $Cx$ and may be ignored, so the combined sketch behaves like $C \sim \mathcal{F}C^{(1)} \bullet \mathcal{F}C^{(2)}$.
On the other hand,
$$\mathcal{F}(C^{(1)}x \ast C^{(2)}y) = \mathcal{F}C^{(1)}x \circ \mathcal{F}C^{(2)}y = (\mathcal{F}C^{(1)} \bullet \mathcal{F}C^{(2)})(x \otimes y).$$
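The following NumPy sketch (illustrative only; all names are ours) implements the count-sketch/FFT construction above and verifies that the FFT-based convolution of two count sketches agrees with a count sketch of $x \otimes y$ whose hash function combines the two original hashes.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 16, 64  # input dimension and sketch size

def count_sketch_matrix(d, k, rng):
    """Dense representation of a count sketch: one random +/-1 entry per column."""
    C = np.zeros((k, d))
    rows = rng.integers(0, k, size=d)        # hash h(j)
    signs = rng.choice([-1.0, 1.0], size=d)  # sign s(j)
    C[rows, np.arange(d)] = signs
    return C

C1 = count_sketch_matrix(d, k, rng)
C2 = count_sketch_matrix(d, k, rng)
x = rng.standard_normal(d)
y = rng.standard_normal(d)

# Tensor sketch: circular convolution of the two count sketches, computed with the FFT.
sketch = np.fft.ifft(np.fft.fft(C1 @ x) * np.fft.fft(C2 @ y)).real

# Reference: the count sketch of x (x) y that hashes the pair (i, j) to
# (h1(i) + h2(j)) mod k with sign s1(i) * s2(j), built from the dense matrices.
C = np.zeros((k, d * d))
for i in range(d):
    for j in range(d):
        r1, r2 = np.argmax(np.abs(C1[:, i])), np.argmax(np.abs(C2[:, j]))
        C[(r1 + r2) % k, i * d + j] = C1[r1, i] * C2[r2, j]

print(np.allclose(sketch, C @ np.kron(x, y)))  # True
```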
The problem with the original tensor sketch algorithm was that it used count sketch matrices, which aren't always very good dimensionality reductions.
In 2020 it was shown that any matrix with sufficiently random, independent rows suffices to create a tensor sketch. This allows using matrices with stronger guarantees, such as real Gaussian Johnson–Lindenstrauss matrices.
In particular, we get the following theorem:

Consider a matrix $T$ with i.i.d. rows $T_1, \dots, T_m \in \mathbb{R}^d$ such that $E[(T_1x)^2] = \|x\|_2^2$ and $E[(T_1x)^p]^{1/p} \le \sqrt{ap}\,\|x\|_2$ for all $x$. Let $T^{(1)}, \dots, T^{(c)}$ be independent copies of $T$ and set $M = T^{(1)} \bullet \dots \bullet T^{(c)}$.
Then $|\|Mx\|_2 - \|x\|_2| < \varepsilon\|x\|_2$ holds with probability $1 - \delta$ for any vector $x$, provided
$$m = (4a)^{2c}\,\varepsilon^{-2}\log(1/\delta) + (2ae)\,\varepsilon^{-1}(\log(1/\delta))^c.$$
In particular, if the entries of $T$ are $\pm 1$, the bound becomes $m = O\!\left(\varepsilon^{-2}\log(1/\delta) + \varepsilon^{-1}(\tfrac1c\log(1/\delta))^c\right)$, which matches the usual Johnson–Lindenstrauss bound $m = O(\varepsilon^{-2}\log(1/\delta))$ when $\varepsilon$ is sufficiently small.

The paper also shows that a dependency on $\varepsilon^{-1}(\tfrac1c\log(1/\delta))^c$ is necessary.
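As a quick numerical illustration of the bound above (the parameter values below are arbitrary, chosen only for the example):

```python
import math

# Evaluate m = (4a)^(2c) * eps^-2 * log(1/delta) + (2ae) * eps^-1 * (log(1/delta))^c
a, c, eps, delta = 1.0, 2, 0.1, 0.01
log1d = math.log(1 / delta)
m = (4 * a) ** (2 * c) * eps ** -2 * log1d + (2 * a * math.e) * eps ** -1 * log1d ** c
print(round(m))  # about 1.2e5 rows for these parameters
```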
Because of the exponential dependency on $c$ in sketches based on the face-splitting product, a different approach was developed in 2020 which applies
$$M(x \otimes y \otimes \cdots) = M^{(1)}(x \otimes M^{(2)}(y \otimes \cdots)).$$
We can achieve such an $M$ by taking
$$M = M^{(c)}(M^{(c-1)} \otimes I_d)(M^{(c-2)} \otimes I_{d^2}) \cdots (M^{(1)} \otimes I_{d^{c-1}}).$$
With this method, we only apply the general tensor sketch method to order-2 tensors, which avoids the exponential dependency of the number of rows on $c$.
It can be proved that combining $c$ dimensionality reductions like this only increases the error $\varepsilon$ by a factor of $\sqrt{c}$.
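A minimal sketch of this recursive idea for $c = 3$, using Gaussian face-splitting sketches for the order-2 steps; the construction and names here are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 8, 200  # input dimension per factor and sketch size

def pair_sketch(A, B, u, v):
    """Order-2 tensor sketch of u (x) v: (Au) o (Bv) / sqrt(m), with A, B Gaussian."""
    return (A @ u) * (B @ v) / np.sqrt(A.shape[0])

# M2 sketches y (x) z; M1 then sketches x (x) M2(y (x) z), which lives in R^(d*m).
A2, B2 = rng.standard_normal((m, d)), rng.standard_normal((m, d))
A1, B1 = rng.standard_normal((m, d)), rng.standard_normal((m, m))

x, y, z = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

inner = pair_sketch(A2, B2, y, z)      # sketch of y (x) z
outer = pair_sketch(A1, B1, x, inner)  # sketch of x (x) (sketch of y (x) z)

# The composite map is a sketch of x (x) y (x) z: it approximately preserves its norm.
print(np.linalg.norm(outer), np.linalg.norm(x) * np.linalg.norm(y) * np.linalg.norm(z))
```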
The fast Johnson–Lindenstrauss transform is a dimensionality reduction matrix that can be applied to a vector in near-linear time. Given a general matrix $M \in \mathbb{R}^{k \times d}$, computing the product $Mx$ takes $kd$ time. A version of the fast transform instead takes $M = \operatorname{SHD}$, where $D$ is a diagonal matrix whose diagonal entries $D_{i,i}$ are independent $\pm 1$ values, so $Dx$ can be computed in $O(d)$ time; $H$ is a Hadamard matrix, which allows matrix–vector multiplication in $O(d \log d)$ time; and $S$ is a $k \times d$ sampling matrix that is all zeros except for a single 1 in each row.
If the diagonal matrix is replaced by one which has a tensor product of independent $\pm 1$ values on its diagonal, instead of fully independent values, it becomes possible to compute $\operatorname{SHD}(x \otimes y)$ fast.
For an example of this, let $\rho, \sigma \in \{-1, 1\}^2$ be two independent $\pm 1$ vectors and let $D$ be the diagonal matrix whose diagonal is $\sigma \otimes \rho$. We can then split up $\operatorname{SHD}(x \otimes y)$ as follows:
\begin{align}
\operatorname{SHD}(x \otimes y)
& = \begin{bmatrix} 1&0&0&0\\ 0&0&1&0\\ 0&1&0&0 \end{bmatrix}
\begin{bmatrix} 1&1&1&1\\ 1&-1&1&-1\\ 1&1&-1&-1\\ 1&-1&-1&1 \end{bmatrix}
\begin{bmatrix} \sigma_1\rho_1&0&0&0\\ 0&\sigma_1\rho_2&0&0\\ 0&0&\sigma_2\rho_1&0\\ 0&0&0&\sigma_2\rho_2 \end{bmatrix}
\begin{bmatrix} x_1y_1\\ x_1y_2\\ x_2y_1\\ x_2y_2 \end{bmatrix} \\[5pt]
& = \left(\begin{bmatrix} 1&0\\ 0&1\\ 1&0 \end{bmatrix} \bullet \begin{bmatrix} 1&0\\ 1&0\\ 0&1 \end{bmatrix}\right)
\left(\begin{bmatrix} 1&1\\ 1&-1 \end{bmatrix} \otimes \begin{bmatrix} 1&1\\ 1&-1 \end{bmatrix}\right)
\left(\begin{bmatrix} \sigma_1&0\\ 0&\sigma_2 \end{bmatrix} \otimes \begin{bmatrix} \rho_1&0\\ 0&\rho_2 \end{bmatrix}\right)
\left(\begin{bmatrix} x_1\\ x_2 \end{bmatrix} \otimes \begin{bmatrix} y_1\\ y_2 \end{bmatrix}\right) \\[5pt]
& = \left(\begin{bmatrix} 1&0\\ 0&1\\ 1&0 \end{bmatrix} \bullet \begin{bmatrix} 1&0\\ 1&0\\ 0&1 \end{bmatrix}\right)
\left(\begin{bmatrix} 1&1\\ 1&-1 \end{bmatrix}\begin{bmatrix} \sigma_1&0\\ 0&\sigma_2 \end{bmatrix}\begin{bmatrix} x_1\\ x_2 \end{bmatrix} \otimes \begin{bmatrix} 1&1\\ 1&-1 \end{bmatrix}\begin{bmatrix} \rho_1&0\\ 0&\rho_2 \end{bmatrix}\begin{bmatrix} y_1\\ y_2 \end{bmatrix}\right) \\[5pt]
& = \begin{bmatrix} 1&0\\ 0&1\\ 1&0 \end{bmatrix}\begin{bmatrix} 1&1\\ 1&-1 \end{bmatrix}\begin{bmatrix} \sigma_1&0\\ 0&\sigma_2 \end{bmatrix}\begin{bmatrix} x_1\\ x_2 \end{bmatrix} \circ \begin{bmatrix} 1&0\\ 1&0\\ 0&1 \end{bmatrix}\begin{bmatrix} 1&1\\ 1&-1 \end{bmatrix}\begin{bmatrix} \rho_1&0\\ 0&\rho_2 \end{bmatrix}\begin{bmatrix} y_1\\ y_2 \end{bmatrix}.
\end{align}
In other words, $\operatorname{SHD} = S^{(1)}HD^{(1)} \bullet S^{(2)}HD^{(2)}$ splits into two independent fast Johnson–Lindenstrauss transforms, so applying it to a tensor-structured vector takes time $O(d_1 \log d_1 + d_2 \log d_2)$ rather than the $O(d_1 d_2 \log(d_1 d_2))$ of the direct approach.
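The split can be checked numerically. The snippet below (our own illustration, with $d_1 = d_2 = 4$) compares the direct computation of $\operatorname{SHD}(x \otimes y)$ with the element-wise product of the two small transforms; it uses a dense Hadamard matrix for clarity rather than a fast transform.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix, n a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(4)
d1 = d2 = 4
k = 6  # number of sampled rows

sigma = rng.choice([-1.0, 1.0], size=d1)  # signs for the x part
rho = rng.choice([-1.0, 1.0], size=d2)    # signs for the y part
rows1 = rng.integers(0, d1, size=k)       # sampled coordinates, split as (i1, i2)
rows2 = rng.integers(0, d2, size=k)

x, y = rng.standard_normal(d1), rng.standard_normal(d2)

# Direct computation: diagonal sigma (x) rho, big Hadamard transform, then sampling.
big = hadamard(d1 * d2) @ (np.kron(sigma, rho) * np.kron(x, y))
direct = big[rows1 * d2 + rows2]

# Split computation: two independent small fast JL transforms, combined element-wise.
split = (hadamard(d1) @ (sigma * x))[rows1] * (hadamard(d2) @ (rho * y))[rows2]

print(np.allclose(direct, split))  # True
```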
The same approach can be extended to compute higher-degree products, such as $\operatorname{SHD}(x \otimes y \otimes z)$.
Ahle et al. show that if $\operatorname{SHD}$ has $\varepsilon^{-2}(\log 1/\delta)^{c+1}$ rows, then $|\|\operatorname{SHD}x\|_2 - \|x\|_2| \le \varepsilon\|x\|_2$ for any vector $x \in \mathbb{R}^{d^c}$ with probability $1 - \delta$, while still allowing fast multiplication with degree-$c$ tensors.
Jin et al.,[16] the same year, showed a similar result for the more general class of matrices called RIP matrices, which includes the subsampled Hadamard matrices. They showed that these matrices allow splitting into tensors provided the number of rows is $\varepsilon^{-2}(\log 1/\delta)^{2c-1}\log d$. In the case $c = 2$ this matches the previous result.
These fast constructions can again be combined with the recursion approach mentioned above, giving the fastest overall tensor sketch.
It is also possible to do so-called "data-aware" tensor sketching. Instead of applying a random matrix to the data, the data points are sampled independently with a probability depending on the norm of the point.[17]
Kernel methods are popular in machine learning because they give the algorithm designer the freedom to design a "feature space" in which to measure the similarity of their data points. A simple kernel-based binary classifier is based on the following computation:
$$\hat{y}(x') = \operatorname{sgn}\sum_{i=1}^n y_i\, k(x_i, x'),$$
where $x_i \in \mathbb{R}^d$ are the data points, $y_i$ is the label of the $i$th point (either $-1$ or $+1$), and $\hat{y}(x')$ is the predicted class of the new point $x'$. The function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is the kernel. Typical examples are the radial basis function kernel, $k(x, x') = \exp(-\|x - x'\|_2^2)$, and polynomial kernels such as $k(x, x') = (1 + \langle x, x'\rangle)^2$.
When used this way, the kernel method is called "implicit". Sometimes it is faster to do an "explicit" kernel method, in which a pair of functions $f, g : \mathbb{R}^d \to \mathbb{R}^D$ is found such that $k(x, x') = \langle f(x), g(x')\rangle$. This allows the above computation to be expressed as
$$\hat{y}(x') = \operatorname{sgn}\sum_{i=1}^n y_i\langle f(x_i), g(x')\rangle = \operatorname{sgn}\left\langle\left(\sum_{i=1}^n y_i f(x_i)\right), g(x')\right\rangle,$$
where the value $\sum_{i=1}^n y_i f(x_i)$ can be computed in advance.
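For illustration, a tiny example of this precomputation with the explicit degree-2 feature map $f(x) = x \otimes x$ (the data and names here are ours, not from the references):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 5

def phi(x):
    """Explicit feature map for the kernel k(x, x') = <x, x'>^2, i.e. f(x) = x (x) x."""
    return np.kron(x, x)

X = rng.standard_normal((n, d))
labels = rng.choice([-1, 1], size=n)

# Precompute the weight vector w = sum_i y_i f(x_i) once.
w = sum(yi * phi(xi) for xi, yi in zip(X, labels))

# Classifying a new point is now a single inner product in R^D.
x_new = rng.standard_normal(d)
prediction = np.sign(w @ phi(x_new))

# Sanity check: same answer as the implicit kernel computation sum_i y_i <x_i, x_new>^2.
implicit = np.sign(sum(yi * (xi @ x_new) ** 2 for xi, yi in zip(X, labels)))
print(prediction == implicit)  # True
```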
The problem with this method is that the feature space can be very large; that is, $D \gg d$. For example, for the polynomial kernel $k(x, x') = \langle x, x'\rangle^3$ we have $f(x) = x \otimes x \otimes x$ and $g(x') = x' \otimes x' \otimes x'$, where $\otimes$ is the tensor product, so $f(x), g(x') \in \mathbb{R}^D$ with $D = d^3$. For even moderately sized $d$, the dimension $D$ can easily exceed the number of data points $n$, and the explicit method becomes slower than the implicit one.
The idea of tensor sketch is that we can compute approximate functions $f', g' : \mathbb{R}^d \to \mathbb{R}^t$, where $t$ can even be smaller than $d$, which still have the property that $\langle f'(x), g'(x')\rangle \approx k(x, x')$.
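An illustrative NumPy version of such approximate feature maps for the degree-2 polynomial kernel $k(x, x') = \langle x, x'\rangle^2$, built from the count-sketch/FFT construction described earlier; the sketch size and names are our own choices. Since the kernel is symmetric, a single map $f' = g'$ is used.

```python
import numpy as np

rng = np.random.default_rng(6)
d, t = 50, 2000  # original dimension and sketch dimension

# Two fixed count-sketch hash functions define the shared feature map.
rows = [rng.integers(0, t, size=d) for _ in range(2)]
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(2)]

def count_sketch(v, which):
    out = np.zeros(t)
    np.add.at(out, rows[which], signs[which] * v)
    return out

def approx_features(v):
    """Approximate feature map f'(v) with <f'(x), f'(x')> ~ <x, x'>^2."""
    return np.fft.ifft(np.fft.fft(count_sketch(v, 0)) * np.fft.fft(count_sketch(v, 1))).real

x = rng.standard_normal(d)
xp = x + 0.1 * rng.standard_normal(d)  # a nearby point, so the kernel value is large

exact = (x @ xp) ** 2
approx = approx_features(x) @ approx_features(xp)
print(exact, approx)  # approx is typically within a few percent of exact
```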
This method was shown in 2020[18] to work even for high degree polynomials and radial basis function kernels.
Assume we have two large datasets, represented as matrices $X, Y \in \mathbb{R}^{n \times d}$, and we want to find the pairs of rows $i, j$ with the largest inner products $\langle X_i, Y_j\rangle$. We could compute $Z = XY^T \in \mathbb{R}^{n \times n}$ and simply look for the largest entries; however, this would take at least $n^2$ time just to write down the $n^2$ entries, and $n^2 d$ time using standard matrix multiplication.
The idea of Compressed Matrix Multiplication is the general identity
$$XY^T = \sum_{i=1}^d X_i \otimes Y_i,$$
where $\otimes$ denotes the outer product and $X_i, Y_i$ here are the $i$th columns of $X$ and $Y$. Since each term is a simple tensor, a linear tensor sketch of it can be computed fast, and by linearity the sketches can be summed to give a compressed representation of the whole product $XY^T$.
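A small illustration of this identity combined with the count-sketch/FFT tensor sketch (our own example; the planted large entry stands in for the pairs one would want to find):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k = 100, 20, 8192  # rows, columns, sketch size

X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d))
a, b = 3, 7
Y[b] = 5 * X[a] + rng.standard_normal(d)  # plant one large inner product <X_a, Y_b>

# Independent count-sketch hashes for the row indices of X and of Y.
h1, s1 = rng.integers(0, k, size=n), rng.choice([-1.0, 1.0], size=n)
h2, s2 = rng.integers(0, k, size=n), rng.choice([-1.0, 1.0], size=n)

def cs(v, h, s):
    out = np.zeros(k)
    np.add.at(out, h, s * v)
    return out

# Sketch of X Y^T = sum_i X_i (x) Y_i, sketching each outer product via FFT convolution.
acc = np.zeros(k, dtype=complex)
for i in range(d):
    acc += np.fft.fft(cs(X[:, i], h1, s1)) * np.fft.fft(cs(Y[:, i], h2, s2))
sketch = np.fft.ifft(acc).real  # k numbers summarizing the n*n product matrix

# Any entry (a, b) of X Y^T can be estimated from a single cell of the sketch.
estimate = s1[a] * s2[b] * sketch[(h1[a] + h2[b]) % k]
print(estimate, X[a] @ Y[b])  # the planted large entry is recovered approximately
```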
Bilinear pooling is the technique of taking two input vectors, $x, y$, from different sources, and using the tensor product $x \otimes y$ as the input layer to a neural network.
In [19] the authors considered using tensor sketch to reduce the number of variables needed. In 2017 another paper[20] took the FFT of the input features before combining them with an element-wise product. This again corresponds to the original tensor sketch.
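For illustration, a minimal version of such a pooling layer using the same count-sketch/FFT combination (our own sketch; in practice $x$ and $y$ would be feature vectors from two networks):

```python
import numpy as np

rng = np.random.default_rng(8)
d1, d2, k = 512, 512, 4096  # feature dimensions and pooled dimension

# Fixed random hashes, drawn once and shared across all inputs.
h1, s1 = rng.integers(0, k, size=d1), rng.choice([-1.0, 1.0], size=d1)
h2, s2 = rng.integers(0, k, size=d2), rng.choice([-1.0, 1.0], size=d2)

def count_sketch(v, h, s):
    out = np.zeros(k)
    np.add.at(out, h, s * v)
    return out

def compact_bilinear_pool(x, y):
    """k-dimensional surrogate for the d1*d2-dimensional tensor product x (x) y."""
    return np.fft.ifft(np.fft.fft(count_sketch(x, h1, s1)) *
                       np.fft.fft(count_sketch(y, h2, s2))).real

x = rng.standard_normal(d1)  # stand-ins for two feature vectors from different sources
y = rng.standard_normal(d2)
pooled = compact_bilinear_pool(x, y)
print(pooled.shape)  # (4096,) instead of the 262144 entries of x (x) y
```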