Review | Random Features for Large-Scale Kernel Machines

Goal

Accelerate the computation of the kernel matrix by directly approximating the kernel function.

Map $x$ into a low-dimensional random feature space, $z: \mathcal{R}^d \rightarrow \mathcal{R}^D$, such that $k(x, y) \approx z(x)^T z(y)$.

Basic Thought

Rewrite $k(x, y)$ as an expectation over a suitable probability distribution, so that it can be approximated by a sample mean.
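
In symbols (a schematic sketch; $\phi_\omega$ stands for a generic random feature, not the paper's notation):

$$
k(x, y) = \mathbb{E}_{\omega \sim p}\left[ \phi_\omega(x)\, \phi_\omega(y) \right] \approx \frac{1}{D} \sum_{i=1}^{D} \phi_{\omega_i}(x)\, \phi_{\omega_i}(y), \qquad \omega_1, \dots, \omega_D \overset{\text{i.i.d.}}{\sim} p(\omega).
$$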

Method: Random Fourier Features

Some kernels are shift-invariant and depend only on $\Delta = x - y$, for example the Gaussian kernel $k(\Delta) = e^{-\|\Delta\|_2^2 / 2}$.

There is a theorem by Bochner: A continuous kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$ on $\mathcal{R}^d$ is positive definite if and only if $k(\Delta)$ is the Fourier transform of a non-negative measure.

Recall the Fourier transform, $X(j\omega) = \int_{-\infty}^{+\infty} x(t) e^{-j\omega t} dt$. We let $p(\omega)$ be the Fourier transform of $k(\Delta)$; if the kernel is properly scaled, $p(\omega)$ is a proper probability density.

Then, by the inverse formula, $k(x - y) = \int_{\mathcal{R}^d} p(\omega)\, e^{j\omega^T (x - y)}\, d\omega = E_\omega\left[ \zeta_\omega(x)\, \zeta_\omega(y)^* \right]$, where $\zeta_\omega(x) = e^{j\omega^T x}$.

By sampling $D$ i.i.d. values $\omega_1, \dots, \omega_D$ from $p(\omega)$, $z(x)$ becomes the vector of the $D$ corresponding $\zeta_\omega(\mathbf{x})$ (each component normalized by $\sqrt{D}$), and $z(x)^* z(y) = \frac{1}{D} \sum_{i=1}^{D} \zeta_{\omega_i}(x)\, \zeta_{\omega_i}(y)^*$ is a sample mean whose expectation is $k(x, y)$.

Also, since both $k(\Delta)$ and $p(\omega)$ are real-valued, we can ignore the imaginary part and replace the complex exponential with a cosine: $E_\omega\left[ \cos\left( \omega^T (x - y) \right) \right] = k(x, y)$.

We then add a random offset $b$ drawn uniformly from $[0, 2\pi]$ (this avoids additionally computing $\sin(\omega^T x)$, since $\cos(\omega^T (x - y)) = \cos(\omega^T x)\cos(\omega^T y) + \sin(\omega^T x)\sin(\omega^T y)$).

So the final map is $z_\omega(x) = \sqrt{2} \cos\left( \omega^T x + b \right)$
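
As a minimal sketch, assuming the Gaussian kernel $k(x, y) = e^{-\|x - y\|^2 / 2}$ (whose $p(\omega)$ is the standard normal $N(0, I_d)$); the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def random_fourier_features(X, D, rng):
    """Map rows of X (n x d) to z(x) = sqrt(2/D) * cos(W^T x + b).

    Assumes the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2), whose
    spectral density p(omega) is the standard normal N(0, I_d).
    """
    n, d = X.shape
    W = rng.standard_normal((d, D))           # omega_1, ..., omega_D ~ p(omega)
    b = rng.uniform(0.0, 2 * np.pi, size=D)   # random offsets in [0, 2*pi]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Z = random_fourier_features(X, D=2000, rng=rng)

approx = Z @ Z.T                                          # z(x)^T z(y)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact = np.exp(-sq_dists / 2.0)                           # k(x, y)
print(np.abs(approx - exact).max())                       # shrinks as D grows
```

Downstream, a linear learner only needs the $n \times D$ matrix $Z$ instead of the $n \times n$ kernel matrix.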

How close is the approximation to the original kernel function?

By Hoeffding's inequality (since each term $z_{\omega_i}(x)\, z_{\omega_i}(y)$ is bounded), $\Pr\left[ \left| z(x)^T z(y) - k(x, y) \right| \ge \epsilon \right] \le 2 \exp\left( -D\epsilon^2 / 4 \right)$.

Method: Random Binning Features

First, approximate a special “hat” kernel, $\hat{k}(x, y; \delta) = \max\left( 0,\ 1 - \frac{|x - y|}{\delta} \right)$.

Partition the real number line with a grid of pitch $\delta$, and shift this grid randomly by an amount $u$ drawn uniformly at random from $[0, \delta]$. This grid partitions the real number line into intervals $[u + n\delta,\ u + (n + 1)\delta]$ for all integers $n$. Encode each point by a one-hot indicator of the bin it falls into; the probability that $x$ and $y$ fall into the same bin is exactly $\max(0, 1 - |x - y|/\delta) = \hat{k}(x, y; \delta)$.
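
A minimal sketch for the one-dimensional hat kernel with a fixed pitch $\delta$ (the collision-frequency form below is equivalent to an inner product of one-hot bin indicators; names are illustrative):

```python
import numpy as np

def hat_kernel(x, y, delta):
    """Exact hat kernel: max(0, 1 - |x - y| / delta)."""
    return max(0.0, 1.0 - abs(x - y) / delta)

def binning_estimate(x, y, delta, P, rng):
    """Fraction of P randomly shifted grids of pitch delta in which
    x and y land in the same bin; its expectation is the hat kernel."""
    u = rng.uniform(0.0, delta, size=P)   # random shifts in [0, delta)
    bins_x = np.floor((x - u) / delta)    # bin index of x under each shift
    bins_y = np.floor((y - u) / delta)
    return np.mean(bins_x == bins_y)

rng = np.random.default_rng(0)
print(hat_kernel(0.3, 0.9, delta=1.0))              # 0.4
print(binning_estimate(0.3, 0.9, 1.0, 10000, rng))  # approximately 0.4
```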

Extend it to more general kernel functions by also drawing the pitch $\delta$ at random.

We can then rewrite a kernel that depends only on the $L_1$ distance between pairs of points as an expectation over hat kernels, $k(x, y) = \int_0^\infty \hat{k}(x, y; \delta)\, p(\delta)\, d\delta$.

The distribution can be set as $p(\delta) = \delta\, \ddot{k}(\delta)$, which is guaranteed by Lemma 1 of the paper.
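
As a quick sanity check of this choice (a sketch, assuming $k(\Delta) \to 0$ and $\dot{k}(\Delta) \to 0$ as $\Delta \to \infty$), with $\Delta = |x - y|$ and integration by parts:

$$
\int_0^\infty \delta\, \ddot{k}(\delta)\, \max\!\left(0,\ 1 - \tfrac{\Delta}{\delta}\right) d\delta
= \int_\Delta^\infty (\delta - \Delta)\, \ddot{k}(\delta)\, d\delta
= \Big[ (\delta - \Delta)\, \dot{k}(\delta) \Big]_\Delta^\infty - \int_\Delta^\infty \dot{k}(\delta)\, d\delta
= k(\Delta).
$$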

Random maps for separable multivariate shift-invariant kernels of the form $k(x - y) = \prod_{m=1}^{d} k_m\left( |x^m - y^m| \right)$ can also be constructed with this method: bin each coordinate with its own randomly shifted, randomly pitched grid, so that the probability that $x$ and $y$ land in the same $d$-dimensional cell is the product of the per-coordinate hat-kernel probabilities.
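
A minimal sketch for the multivariate Laplacian kernel $k(x, y) = e^{-\|x - y\|_1}$, for which $\ddot{k}(\delta) = e^{-\delta}$ and hence $p(\delta) = \delta e^{-\delta}$ per coordinate, i.e. a Gamma(2, 1) distribution (names are illustrative, not from the paper):

```python
import numpy as np

def laplacian_kernel(X):
    """Exact Laplacian kernel matrix: k(x, y) = exp(-||x - y||_1)."""
    l1 = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)
    return np.exp(-l1)

def random_binning_estimate(X, P, rng):
    """Estimate the Laplacian kernel by random binning.

    Each of the P repetitions draws, per coordinate, a pitch
    delta ~ Gamma(2, 1) (i.e. p(delta) = delta * exp(-delta)) and a shift
    u ~ Uniform[0, delta]; two points contribute 1 when they share the same
    d-dimensional cell. Averaging over repetitions estimates exp(-||x - y||_1).
    """
    n, d = X.shape
    same_cell = np.zeros((n, n))
    for _ in range(P):
        delta = rng.gamma(shape=2.0, scale=1.0, size=d)
        u = rng.uniform(0.0, delta)           # one shift per coordinate
        bins = np.floor((X - u) / delta)      # n x d cell indices
        same_cell += (bins[:, None, :] == bins[None, :, :]).all(-1)
    return same_cell / P

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
print(np.abs(random_binning_estimate(X, 5000, rng) - laplacian_kernel(X)).max())
```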