Mathematics behind backpropagation in neural networks

UFLDL gives a proof of backpropagation for neural networks. Let's summarize it here and include the sparsity penalty.

What's $\delta_i^{(l)}$?

Definition
$\delta_i^{(l)} = \frac{\partial}{\partial z_i^{(l)}}J(W,b)$
where $l$ denotes the $l$th layer of the network and $i$ the $i$th node in that layer.
In vector form, we can define $\delta^{(l)} = \frac{\partial}{\partial z^{(l)}}J(W,b)$.
So we work in vector form (dropping the subscript $i$) to simplify.
Firstly, we have the following equations by definition:
$z^{(l)} = W^{(l-1)}a^{(l-1)} + b^{(l-1)}$, where $l = 2, 3, \dots, L$
$a^{(l)} = f(z^{(l)})$, where $f$ is the activation function, applied element-wise.
Typically we choose the sigmoid function as $f$:
$f(z)=\frac{1}{1+e^{-z}}$
The sigmoid function has the property
$f'(z)=(1-f(z))f(z)$
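As a quick sanity check, here is a minimal NumPy sketch (the helper names `sigmoid` and `sigmoid_prime` are my own) that verifies $f'(z)=(1-f(z))f(z)$ against a finite-difference approximation; the later snippets reuse these two helpers.

```python
import numpy as np

def sigmoid(z):
    """f(z) = 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """f'(z) = f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Check the derivative identity against a central finite difference.
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
finite_diff = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
assert np.allclose(sigmoid_prime(z), finite_diff)
```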

Backpropagation without sparsity

Cost function: $J(W,b) = \frac{1}{2}(y-h_{W,b}(x))^2$
Let $L$ denote the output layer.

$\delta^{(L)} = \frac{\partial}{\partial z^{(L)}}J(W,b) = \frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y-h_{W,b}(x))^2$
$= \frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y - a^{(L)})^2 = \frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y - f(z^{(L)}))^2$
$= -(y - f(z^{(L)}))\cdot f'(z^{(L)})$
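In code, this base case is a one-liner. A sketch, assuming `z_L` and `y` are NumPy arrays for a single training example and reusing the helpers above:

```python
# delta_L = dJ/dz_L = -(y - f(z_L)) * f'(z_L), element-wise.
a_L = sigmoid(z_L)
delta_L = -(y - a_L) * sigmoid_prime(z_L)
```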

For $l = L-1, L-2, \dots, 2$, we calculate $\delta^{(l)} = ((W^{(l)})^T\delta^{(l+1)}) .* f'(z^{(l)})$, where $.*$ denotes the element-wise product.
We prove this by backward induction. The base case $\delta^{(L)}$ is computed above; now assume $\delta^{(l)}$ is known for layer $l$.

For layer $l-1$, the chain rule for vectors gives
$\delta^{(l-1)} = \frac{\partial}{\partial z^{(l-1)}}J(W,b) = \left(\frac{\partial z^{(l)}}{\partial z^{(l-1)}}\right)^T\frac{\partial}{\partial z^{(l)}}J(W,b) = \left(\frac{\partial z^{(l)}}{\partial z^{(l-1)}}\right)^T\delta^{(l)}$
Since $z^{(l)} = W^{(l-1)}a^{(l-1)}+b^{(l-1)} = W^{(l-1)}f(z^{(l-1)})+b^{(l-1)}$, the Jacobian is
$\frac{\partial z^{(l)}}{\partial z^{(l-1)}} = W^{(l-1)}\,\mathrm{diag}(f'(z^{(l-1)}))$
and therefore
$\delta^{(l-1)} = \mathrm{diag}(f'(z^{(l-1)}))\,(W^{(l-1)})^T\delta^{(l)} = ((W^{(l-1)})^T\delta^{(l)}) .* f'(z^{(l-1)})$
By induction, the formula holds for every hidden layer, which completes the proof.

So we have
$\delta^{(l)} = \frac{\partial}{\partial z^{(l)}}J(W,b)=((W^{(l)})^T\delta^{(l+1)}) .* f'(z^{(l)})$
Backpropagation uses dynamic programming to compute the $\delta^{(l)}$ efficiently: each $\delta^{(l)}$ is obtained from the already-computed $\delta^{(l+1)}$ instead of being re-derived from the output layer.
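A sketch of that backward sweep, assuming the column-vector convention used above: `W[l]` maps layer $l$ to layer $l+1$, `z[l]` comes from a forward pass, lists are 0-based with index 0 the input layer (whose entry is unused), and `sigmoid`/`sigmoid_prime` are the helpers defined earlier.

```python
def backward_deltas(W, z, y):
    """Compute delta[l] = dJ/dz[l] for every non-input layer of one example."""
    L = len(z) - 1                      # index of the output layer
    delta = [None] * (L + 1)
    # Base case: output layer.
    delta[L] = -(y - sigmoid(z[L])) * sigmoid_prime(z[L])
    # Recurrence, reusing delta[l + 1] (the dynamic-programming step).
    for l in range(L - 1, 0, -1):
        delta[l] = (W[l].T @ delta[l + 1]) * sigmoid_prime(z[l])
    return delta
```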
For the partial derivatives with respect to the parameters:
$\frac{\partial}{\partial W^{(l)}}J(W,b) = \frac{\partial}{\partial z^{(l+1)}}J(W,b)\cdot \frac{\partial}{\partial W^{(l)}}z^{(l+1)}$
$=\delta^{(l+1)}\cdot \frac{\partial}{\partial W^{(l)}}z^{(l+1)} =\delta^{(l+1)}\cdot \frac{\partial}{\partial W^{(l)}}(W^{(l)}a^{(l)}+b^{(l)})$
$=\delta^{(l+1)}(a^{(l)})^T$

$\frac{\partial}{\partial b^{(l)}}J(W,b) = \frac{\partial}{\partial z^{(l+1)}}J(W,b)\cdot \frac{\partial}{\partial b^{(l)}}z^{(l+1)}$
$=\delta^{(l+1)}\cdot \frac{\partial}{\partial b^{(l)}}z^{(l+1)} =\delta^{(l+1)}\cdot \frac{\partial}{\partial b^{(l)}}(W^{(l)}a^{(l)}+b^{(l)})$
$=\delta^{(l+1)}$
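Given the deltas and the activations from the forward pass, the gradients follow directly. A sketch for a single example, with `a[l]` the activation of layer $l$ (0-based, `a[0]` the input) and `delta` produced by `backward_deltas` above:

```python
def parameter_gradients(a, delta):
    """dJ/dW[l] = delta[l+1] * a[l]^T and dJ/db[l] = delta[l+1], for one example."""
    grad_W, grad_b = [], []
    for l in range(len(a) - 1):
        grad_W.append(np.outer(delta[l + 1], a[l]))
        grad_b.append(delta[l + 1])
    return grad_W, grad_b
```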

Backpropagation with sparsity

The following derivation only covers a neural network with a single hidden layer. I do not know how it works in other scenarios; this is left as an open question.
$\rho_j^{(2)}=\frac{1}{m}\sum_{i=1}^m a_j^{(2)}(x^{(i)})$, the average activation of hidden unit $j$ over all $m$ training examples.
$S_l$ is the number of nodes in layer $l$.
Sparsity penalty:
$KL_j^{(2)} = \rho \log\frac{\rho}{\rho_j^{(2)}}+(1-\rho) \log\frac{1-\rho}{1-\rho_j^{(2)}}$
$k'(x)=\left(a\log\frac{a}{x}+(1-a)\log\frac{1-a}{1-x}\right)'=-\frac{a}{x}+\frac{1-a}{1-x}$
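A quick numerical check of this derivative (the values of $a$ and $x$ below are arbitrary test points):

```python
# k(x) = a*log(a/x) + (1-a)*log((1-a)/(1-x))  =>  k'(x) = -a/x + (1-a)/(1-x)
a, x, eps = 0.05, 0.2, 1e-6
k = lambda t: a * np.log(a / t) + (1.0 - a) * np.log((1.0 - a) / (1.0 - t))
finite_diff = (k(x + eps) - k(x - eps)) / (2.0 * eps)
assert np.isclose(-a / x + (1.0 - a) / (1.0 - x), finite_diff)
```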
The overall cost function is
$J_{sparse}(W,b) = J(W,b) + \beta\sum_{j=1}^{S_2}KL_j^{(2)}$
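A sketch of the penalty term, assuming the hidden activations of all $m$ examples are collected in a matrix `A2` of shape $(S_2, m)$, one column per example; `rho` is the target sparsity and `beta` the penalty weight:

```python
def sparsity_penalty(A2, rho, beta):
    """beta * sum_j KL(rho || rho_hat_j), where rho_hat_j is unit j's mean activation."""
    rho_hat = A2.mean(axis=1)          # rho_j^{(2)}, shape (S_2,)
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return beta * kl.sum()
```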
Let’s calculate
$\frac{\partial}{\partial w_{i,k}^{(1)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)}$
$=\frac{\partial}{\partial \rho_{k}^{(2)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)}\cdot \frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$
$=\beta\sum_{j=1}^{S_2}\frac{\partial}{\partial \rho_{k}^{(2)}}\left(\rho \log\frac{\rho}{\rho_j^{(2)}}+(1-\rho) \log\frac{1-\rho}{1-\rho_j^{(2)}}\right)\cdot \frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$
$=\beta\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot\frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$ (only the $j=k$ term depends on $\rho_k^{(2)}$)
$=\beta\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot\frac{\partial}{\partial w_{i,k}^{(1)}}\frac{1}{m}\sum_{j=1}^m a_k^{(2)}$
$=\frac{\beta}{m}\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot\sum_{j=1}^m\frac{\partial a_k^{(2)}}{\partial z_{k}^{(2)}}\cdot \frac{\partial z_k^{(2)}}{\partial w_{i,k}^{(1)}}$
$=\frac{\beta}{m}\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot \sum_{j=1}^m f'(z_k^{(2)})\cdot a_i^{(1)}$
$=\frac{\beta}{m}\sum_{j=1}^m\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot f'(z_k^{(2)})\cdot a_i^{(1)}$
Here the sum over $j$ runs over the $m$ training examples, and $z_k^{(2)}$, $a_i^{(1)}$ denote their values for the $j$th example.
Similarly,
$\frac{\partial}{\partial b_{k}^{(1)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)} =\frac{\beta}{m}\sum_{j=1}^m\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot f'(z_k^{(2)})$
So in the backpropagation pass we can fold the penalty in by updating
$\delta_k^{(2)} += \beta\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot f'(z_k^{(2)})$
Note, however, that after this update $\delta_k^{(2)} \ne \frac{\partial}{\partial z_k^{(2)}}J_{sparse}(W,b)$; the modified $\delta_k^{(2)}$ is only a bookkeeping device that makes the gradient formulas for $W^{(1)}$ and $b^{(1)}$ come out right.
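In matrix form, that update looks like the following sketch, assuming UFLDL's single-hidden-layer setup: `Z2` and `delta3` hold per-example columns, `W2` maps the hidden layer to the output layer, `rho_hat` is the vector of average hidden activations, and `rho`, `beta` are the sparsity target and weight.

```python
# Hidden-layer deltas with the sparsity term folded in, for all m examples at once:
# delta2 = ((W2^T delta3) + beta * (-rho/rho_hat + (1-rho)/(1-rho_hat))) .* f'(Z2)
sparsity_term = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))   # shape (S_2,)
delta2 = (W2.T @ delta3 + sparsity_term[:, None]) * sigmoid_prime(Z2)     # shape (S_2, m)
```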

Open Questions for Sparsity Penalty

  • Why is the sparsity penalty important? How does it work?
  • Does the sparsity penalty only work for a neural network with a single hidden layer?
  • If not, how does the sparsity penalty work for a neural network with more than one hidden layer, and how should the backpropagation algorithm be implemented in that case?