Stochastic Variance Reduced Gradient (SVRG)

This is a (non-state-of-the-art) read-through of the Stochastic Variance Reduced Gradient (SVRG) method [1]. Gradient descent and stochastic gradient descent (SGD) play a central role in the optimization of machine learning problems. With large-scale datasets, especially in deep learning applications, SGD and its variants may be the only practical choice. The paper proposes to accelerate the convergence of SGD by reducing the variance of the stochastic gradient, which is introduced by random sampling when evaluating the gradient. This work has since been extended to many other settings, such as non-convex optimization and sparse learning, so this post is just a late-bird note.

Gradient Descent and Stochastic Gradient Descent

In machine learning, we usually consider the following optimization problem, $$ \min_w P(w), \quad P(w) := \frac{1}{n} \sum_{i=1}^n \psi_i(w), $$ where $w$ represents the model parameters and $\psi_i(w)$ is a sequence of loss functions that evaluate the cost of the current parameter $w$. Usually, $\psi_i(w)$ depends on a training example $(x_i, y_i)$ (supervised learning).

Examples

  • square loss: $\psi_i(w) = (w^T x_i - y_i)^2$
  • log loss: $\psi_i(w) = -y_i \ln(\sigma(w^T x_i)) - (1-y_i) \ln(1- \sigma(w^T x_i))$ for labels $y_i \in \{0, 1\}$, which equals $\ln(1 + \exp(-y_i w^T x_i))$ when the labels are recoded as $y_i \in \{-1, +1\}$; here $\sigma(\cdot)$ is the sigmoid function
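As a concrete illustration, here is a small NumPy sketch of these two losses and their gradients, written in the per-example form $\psi_i(w)$ used above (the function names are my own):

```python
import numpy as np

def square_loss(w, x_i, y_i):
    """psi_i(w) = (w^T x_i - y_i)^2 and its gradient 2 (w^T x_i - y_i) x_i."""
    r = w @ x_i - y_i
    return r ** 2, 2.0 * r * x_i

def log_loss(w, x_i, y_i):
    """Logistic loss ln(1 + exp(-y_i w^T x_i)) for labels y_i in {-1, +1}."""
    m = y_i * (w @ x_i)                    # margin
    loss = np.logaddexp(0.0, -m)           # numerically stable ln(1 + e^{-m})
    grad = -y_i * x_i / (1.0 + np.exp(m))  # d/dw of ln(1 + e^{-m})
    return loss, grad
```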

Gradient Descent
Gradient descent is derived from the Taylor expansion of a function. For smooth $P(w)$, in a small neighborhood of $w_0$, we have $$ P(w) \approx P(w_0) + \langle w - w_0, \nabla P(w_0) \rangle $$

Or along an arbitrary unit direction $d$, we have for small $t > 0$, $$ P(w_0 + td) \approx P(w_0) + t \nabla P(w_0)^T d $$
It turns out that the negative gradient $d = -\nabla P(w_0)$ is a local descent direction, that is, for small $t>0$, the following holds, $$ P(w_0 - t \nabla P(w_0)) \approx P(w_0) - t \| \nabla P(w_0)\|_2^2 \leq P(w_0) $$

This leads to the standard gradient descent method, $$ w_t = w_{t-1} - \eta_t \nabla P(w_{t-1}) $$
In our problem, we have $\nabla P(w_{t-1}) = \frac{1}{n} \sum_{i=1}^n \nabla \psi_i(w_{t-1})$. I need another post to review the selection of the step length $\eta_t$ (a.k.a. the learning rate in machine learning), the convergence of gradient descent, and recent theoretical results.
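The update rule above can be sketched in a few lines of NumPy; the toy quadratic objective below is my own choice for illustration:

```python
import numpy as np

def gradient_descent(grad_P, w0, eta=0.1, n_steps=100):
    """Full-batch gradient descent: w_t = w_{t-1} - eta * grad P(w_{t-1})."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_P(w)
    return w

# Toy problem: P(w) = ||w - 1||^2 / 2, so grad P(w) = w - 1, minimized at w = (1, 1).
w_star = gradient_descent(lambda w: w - 1.0, w0=np.zeros(2))
```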

Stochastic Gradient Descent
At each step, gradient descent requires evaluating $n$ derivatives. This is very expensive for large-scale problems, which are common in machine learning. A popular variant is the SGD method with the following update rule, $$ w_t = w_{t-1} - \eta_t \nabla \psi_{i_t}(w_{t-1}), $$
where $i_t$ is drawn uniformly at random from $\{1, 2, \ldots, n \}$ at each iteration $t$. We have $\mathbb{E}[\nabla \psi_{i_t}(w_{t-1}) \mid w_{t-1}] = \nabla P(w_{t-1})$.

Generalized form
The SGD update rule can be further generalized, $$ w_t = w_{t-1} - \eta_t g_t(w_{t-1}, \xi_t) $$
where $\xi_t$ is a random variable that may depend on $w_{t-1}$ and $g_t(\cdot)$ is an approximate gradient. We only require that it is an unbiased estimator, $\nabla P(w_{t-1}) = \mathbb{E}_{\xi_t}[g_t(w_{t-1}, \xi_t) \mid w_{t-1}]$. For example, the widely used mini-batch version samples indices $i_1, \ldots, i_m$ and sets $$ g_t(w_{t-1}, \xi_t) = \frac{1}{m} \sum_{j=1}^m \nabla \psi_{i_j}(w_{t-1}). $$

Actually, many algorithms use a biased estimator, especially in neural network training. I need another post to review the SGD variants and their theoretical results.
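A minimal sketch of the mini-batch SGD update with the decaying step size $\eta_t = 1/t$ discussed below; the interface `grad_psi(w, i)`, returning $\nabla \psi_i(w)$, is my own convention:

```python
import numpy as np

def sgd_minibatch(grad_psi, n, w0, batch_size=8, n_steps=500, seed=0):
    """Mini-batch SGD with decaying step size eta_t = 1/t.
    grad_psi(w, i) should return the gradient of the i-th loss psi_i at w."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for t in range(1, n_steps + 1):
        idx = rng.integers(0, n, size=batch_size)            # sample xi_t
        g = np.mean([grad_psi(w, i) for i in idx], axis=0)   # unbiased estimate of grad P(w)
        w = w - (1.0 / t) * g                                # eta_t = 1/t
    return w
```

For example, with $\psi_i(w) = (w - c_i)^2 / 2$ the minimizer of $P$ is the mean of the $c_i$, and the iterates hover around it.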

Why SVRG

Disadvantages of GD and SGD

|     | Cost per iteration       | Learning rate | Convergence rate (strongly convex)       |
| --- | ------------------------ | ------------- | ---------------------------------------- |
| GD  | $n$ gradient evaluations | constant      | linear: $O(\log(1/\epsilon))$ iterations |
| SGD | 1 gradient evaluation    | $O(1/t)$      | sub-linear: $O(1/t)$                     |

Since the stochastic gradient is just a random approximation (from a small batch of samples, or even a single example) of the full-batch gradient, we must be careful when updating along this direction. To ensure convergence, the learning rate $\eta_t$ has to decay to zero due to the variance of random sampling; we generally choose $\eta_t = O(1/t)$. The small learning rate leads to the slower sub-linear convergence rate of $O(1/t)$. We face a trade-off between computation per iteration and convergence rate.

Fortunately, one can design methods that reduce the variance of the stochastic gradient. This may allow us to use a larger learning rate for SGD.

SVRG algorithm

The proposed SVRG algorithm updates by the following rule, $$
w_t = w_{t-1} - \eta_t (\nabla \psi_{i_t}(w_{t-1}) - \nabla \psi_{i_t}(\tilde{w}) + \nabla P(\tilde{w})), $$ where $\tilde{w}$ is a snapshot that is updated every $m$ SGD iterations.

Intuitions
If the snapshot $\tilde{w}$ is close to the optimum $w^\ast$, then $\nabla \psi_i(\tilde{w}) \rightarrow \nabla \psi_i(w^\ast)$,

  • Let $\tilde{\mu} := \nabla P(\tilde{w})$; then $\tilde{\mu} - \nabla P(w_{t-1}) \approx \nabla \psi_i(\tilde{w}) - \nabla \psi_i(w_{t-1})$. Intuitively, this update rule cancels the randomness induced by the random sampling of $i$;
  • $\tilde{\mu} \rightarrow 0$ as $\tilde{w} \rightarrow w^\ast$, and $\nabla \psi_i(w_{t-1}) - \nabla \psi_i(\tilde{w}) + \tilde{\mu} \rightarrow \nabla \psi_i(w_{t-1}) - \nabla \psi_i(w^\ast) \rightarrow 0$. The vanishing gradient estimate allows the use of a constant learning rate. For plain SGD, in contrast, $\nabla \psi_i(w_{t-1})$ may not converge to 0.

Algorithm
(SVRG pseudocode figure from [1].)
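The algorithm can be sketched as follows; this is a minimal NumPy rendering of the two-loop structure (full gradient at the snapshot, then $m$ variance-reduced inner steps), assuming the same `grad_psi(w, i)` interface as before. Taking the last inner iterate as the new snapshot is one of the paper's update options:

```python
import numpy as np

def svrg(grad_psi, n, w0, eta=0.05, m=None, n_epochs=20, seed=0):
    """SVRG sketch: refresh the snapshot w_tilde every m inner steps, and update
    with the variance-reduced gradient grad_psi_i(w) - grad_psi_i(w_tilde) + mu."""
    rng = np.random.default_rng(seed)
    m = m if m is not None else 2 * n      # inner loop length, same order as n
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(n_epochs):
        # Full gradient at the snapshot: mu = grad P(w_tilde).
        mu = np.mean([grad_psi(w_tilde, i) for i in range(n)], axis=0)
        w = w_tilde
        for _ in range(m):
            i = rng.integers(0, n)
            g = grad_psi(w, i) - grad_psi(w_tilde, i) + mu
            w = w - eta * g                # constant learning rate
        w_tilde = w                        # take the last iterate as the new snapshot
    return w_tilde
```

On the toy problem $\psi_i(w) = (w - c_i)^2 / 2$, the correction terms cancel the sampling noise exactly, so the iterates converge geometrically to the mean of the $c_i$ under the constant step size.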

Theorem

Suppose each $\psi_i$ is convex and $L$-smooth, and $P(w)$ is $\gamma$-strongly convex. Let $w^\ast = \arg\min_w P(w)$. We have geometric convergence in expectation for SVRG: $$\mathbb{E} P(\tilde{w}_s) - P(w^\ast) \leq \alpha^s (P(\tilde{w}_0) - P(w^\ast))$$ given that $m$ is sufficiently large so that $$\alpha = \frac{1}{\gamma \eta m(1-2L\eta)} + \frac{2L \eta}{1-2L \eta} < 1$$

Actually, $m$ is of the same order as $n$ in the paper. Thus it might be more precise to state the rate in terms of the total number of gradient evaluations $t$, $$ \mathbb{E} P(\tilde{w}_s) - P(w^\ast) \leq \alpha^{t/n} (P(\tilde{w}_0) - P(w^\ast))$$

Summary

  • The randomness of SGD induces variance in the gradient, which forces a decaying learning rate and a sub-linear convergence rate
  • Reducing the variance of the stochastic gradient allows a constant learning rate and yields linear convergence in expectation
  • We do not need to store the historical gradients $\nabla \psi_{i_1}(w_0), \nabla \psi_{i_2}(w_1), \ldots$ in SVRG. However, the number of gradient evaluations per iteration increases
  • SVRG can be applied to non-strongly convex problems, leading to an $O(1/T)$ convergence rate (vs. $O(1/\sqrt{T})$ for standard SGD)
  • SVRG can also be used for non-convex optimization problems, such as neural network training

  1. Johnson, Rie, and Tong Zhang. “Accelerating stochastic gradient descent using predictive variance reduction.” In Advances in Neural Information Processing Systems, pp. 315-323. 2013.