Mathematics behind backpropagation in neural networks

UFLDL gives a proof of backpropagation for neural networks. Let's summarize it here and include the sparsity penalty.

What's $\delta_i^{(l)}$?

Definition
$\delta_i^{(l)} = \frac{\partial}{\partial z_i^{(l)}}J(W,b)$
where $l$ denotes the $l$th layer of the network and $i$ the $i$th node in that layer.
In vector form, we can define $\delta^{(l)} = \frac{\partial}{\partial z^{(l)}}J(W,b)$.
So we work in vector form (dropping the subscript $i$) to simplify.
Firstly, we have the following equations by definition:
$z^{(l)} = W^{(l-1)}a^{(l-1)} + b^{(l-1)}$, where $l = 2, 3, \dots, L$
$a^{(l)} = f(z^{(l)})$, where $f$ is the activation function, applied element-wise.
Typically we choose the sigmoid function as $f$:
$f(z)=\frac{1}{1+e^{-z}}$
The sigmoid function has the property
$f'(z)=(1-f(z))f(z)$
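As a quick sanity check, here is a minimal NumPy sketch (the helper names `sigmoid` and `sigmoid_prime` are my own) that verifies $f'(z)=(1-f(z))f(z)$ against a finite-difference approximation; the later snippets reuse these two helpers.

```python
import numpy as np

def sigmoid(z):
    """f(z) = 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """f'(z) = f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Check the derivative identity against a central finite difference.
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
finite_diff = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
assert np.allclose(sigmoid_prime(z), finite_diff)
```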

Backpropagation without sparsity

Cost function: $J(W,b) = \frac{1}{2}(y-h_{W,b}(x))^2$
Let $L$ denote the output layer.

$\delta^{(L)} = \frac{\partial}{\partial z^{(L)}}J(W,b) = \frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y-h_{W,b}(x))^2$
$= \frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y - a^{(L)})^2 = \frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y - f(z^{(L)}))^2$
$= -(y - f(z^{(L)}))\cdot f'(z^{(L)})$
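In code, this base case is a one-liner. A sketch, assuming `z_L` and `y` are NumPy arrays for a single training example and reusing the helpers above:

```python
# delta_L = dJ/dz_L = -(y - f(z_L)) * f'(z_L), element-wise.
a_L = sigmoid(z_L)
delta_L = -(y - a_L) * sigmoid_prime(z_L)
```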

For $l = L-1, L-2, \dots, 2$, we calculate $\delta^{(l)} = ((W^{(l)})^T\delta^{(l+1)}) .* f'(z^{(l)})$, where $.*$ denotes the element-wise product.
We prove this by backward induction. The base case $\delta^{(L)}$ is computed above; now assume $\delta^{(l)}$ is known for layer $l$.

For layer $l-1$, the chain rule for vectors gives
$\delta^{(l-1)} = \frac{\partial}{\partial z^{(l-1)}}J(W,b) = \left(\frac{\partial z^{(l)}}{\partial z^{(l-1)}}\right)^T\frac{\partial}{\partial z^{(l)}}J(W,b) = \left(\frac{\partial z^{(l)}}{\partial z^{(l-1)}}\right)^T\delta^{(l)}$
Since $z^{(l)} = W^{(l-1)}a^{(l-1)}+b^{(l-1)} = W^{(l-1)}f(z^{(l-1)})+b^{(l-1)}$, the Jacobian is
$\frac{\partial z^{(l)}}{\partial z^{(l-1)}} = W^{(l-1)}\,\mathrm{diag}(f'(z^{(l-1)}))$
and therefore
$\delta^{(l-1)} = \mathrm{diag}(f'(z^{(l-1)}))\,(W^{(l-1)})^T\delta^{(l)} = ((W^{(l-1)})^T\delta^{(l)}) .* f'(z^{(l-1)})$
By induction, the formula holds for every hidden layer, which completes the proof.

So we have
$\delta^{(l)} = \frac{\partial}{\partial z^{(l)}}J(W,b)=((W^{(l)})^T\delta^{(l+1)}) .* f'(z^{(l)})$
Backpropagation uses dynamic programming to compute the $\delta^{(l)}$ efficiently: each $\delta^{(l)}$ is obtained from the already-computed $\delta^{(l+1)}$ instead of being re-derived from the output layer.
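A sketch of that backward sweep, assuming the column-vector convention used above: `W[l]` maps layer $l$ to layer $l+1$, `z[l]` comes from a forward pass, lists are 0-based with index 0 the input layer (whose entry is unused), and `sigmoid`/`sigmoid_prime` are the helpers defined earlier.

```python
def backward_deltas(W, z, y):
    """Compute delta[l] = dJ/dz[l] for every non-input layer of one example."""
    L = len(z) - 1                      # index of the output layer
    delta = [None] * (L + 1)
    # Base case: output layer.
    delta[L] = -(y - sigmoid(z[L])) * sigmoid_prime(z[L])
    # Recurrence, reusing delta[l + 1] (the dynamic-programming step).
    for l in range(L - 1, 0, -1):
        delta[l] = (W[l].T @ delta[l + 1]) * sigmoid_prime(z[l])
    return delta
```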
For the partial derivatives with respect to the parameters:
$\frac{\partial}{\partial W^{(l)}}J(W,b) = \frac{\partial}{\partial z^{(l+1)}}J(W,b)\cdot \frac{\partial}{\partial W^{(l)}}z^{(l+1)}$
$=\delta^{(l+1)}\cdot \frac{\partial}{\partial W^{(l)}}z^{(l+1)} =\delta^{(l+1)}\cdot \frac{\partial}{\partial W^{(l)}}(W^{(l)}a^{(l)}+b^{(l)})$
$=\delta^{(l+1)}(a^{(l)})^T$

$\frac{\partial}{\partial b^{(l)}}J(W,b) = \frac{\partial}{\partial z^{(l+1)}}J(W,b)\cdot \frac{\partial}{\partial b^{(l)}}z^{(l+1)}$
$=\delta^{(l+1)}\cdot \frac{\partial}{\partial b^{(l)}}z^{(l+1)} =\delta^{(l+1)}\cdot \frac{\partial}{\partial b^{(l)}}(W^{(l)}a^{(l)}+b^{(l)})$
$=\delta^{(l+1)}$
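Given the deltas and the activations from the forward pass, the gradients follow directly. A sketch for a single example, with `a[l]` the activation of layer $l$ (0-based, `a[0]` the input) and `delta` produced by `backward_deltas` above:

```python
def parameter_gradients(a, delta):
    """dJ/dW[l] = delta[l+1] * a[l]^T and dJ/db[l] = delta[l+1], for one example."""
    grad_W, grad_b = [], []
    for l in range(len(a) - 1):
        grad_W.append(np.outer(delta[l + 1], a[l]))
        grad_b.append(delta[l + 1])
    return grad_W, grad_b
```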

Backpropagation with sparsity

The following derivation only covers a neural network with a single hidden layer. I do not know how it works in other scenarios; this is left as an open question.
$\rho_j^{(2)}=\frac{1}{m}\sum_{i=1}^m a_j^{(2)}(x^{(i)})$, the average activation of hidden unit $j$ over all $m$ training examples.
$S_l$ is the number of nodes in layer $l$.
Sparsity penalty:
$KL_j^{(2)} = \rho \log\frac{\rho}{\rho_j^{(2)}}+(1-\rho) \log\frac{1-\rho}{1-\rho_j^{(2)}}$
$k'(x)=\left(a\log\frac{a}{x}+(1-a)\log\frac{1-a}{1-x}\right)'=-\frac{a}{x}+\frac{1-a}{1-x}$
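A quick numerical check of this derivative (the values of $a$ and $x$ below are arbitrary test points):

```python
# k(x) = a*log(a/x) + (1-a)*log((1-a)/(1-x))  =>  k'(x) = -a/x + (1-a)/(1-x)
a, x, eps = 0.05, 0.2, 1e-6
k = lambda t: a * np.log(a / t) + (1.0 - a) * np.log((1.0 - a) / (1.0 - t))
finite_diff = (k(x + eps) - k(x - eps)) / (2.0 * eps)
assert np.isclose(-a / x + (1.0 - a) / (1.0 - x), finite_diff)
```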
The overall cost function is
$J_{sparse}(W,b) = J(W,b) + \beta\sum_{j=1}^{S_2}KL_j^{(2)}$
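A sketch of the penalty term, assuming the hidden activations of all $m$ examples are collected in a matrix `A2` of shape $(S_2, m)$, one column per example; `rho` is the target sparsity and `beta` the penalty weight:

```python
def sparsity_penalty(A2, rho, beta):
    """beta * sum_j KL(rho || rho_hat_j), where rho_hat_j is unit j's mean activation."""
    rho_hat = A2.mean(axis=1)          # rho_j^{(2)}, shape (S_2,)
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return beta * kl.sum()
```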
Let’s calculate
$\frac{\partial}{\partial w_{i,k}^{(1)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)}$
$=\frac{\partial}{\partial \rho_{k}^{(2)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)}\cdot \frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$
$=\beta\sum_{j=1}^{S_2}\frac{\partial}{\partial \rho_{k}^{(2)}}\left(\rho \log\frac{\rho}{\rho_j^{(2)}}+(1-\rho) \log\frac{1-\rho}{1-\rho_j^{(2)}}\right)\cdot \frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$
$=\beta\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot\frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$ (only the $j=k$ term depends on $\rho_k^{(2)}$)
$=\beta\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot\frac{\partial}{\partial w_{i,k}^{(1)}}\frac{1}{m}\sum_{j=1}^m a_k^{(2)}$
$=\frac{\beta}{m}\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot\sum_{j=1}^m\frac{\partial a_k^{(2)}}{\partial z_{k}^{(2)}}\cdot \frac{\partial z_k^{(2)}}{\partial w_{i,k}^{(1)}}$
$=\frac{\beta}{m}\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot \sum_{j=1}^m f'(z_k^{(2)})\cdot a_i^{(1)}$
$=\frac{\beta}{m}\sum_{j=1}^m\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot f'(z_k^{(2)})\cdot a_i^{(1)}$
Here the sum over $j$ runs over the $m$ training examples, and $z_k^{(2)}$, $a_i^{(1)}$ denote their values for the $j$th example.
Similarly,
$\frac{\partial}{\partial b_{k}^{(1)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)} =\frac{\beta}{m}\sum_{j=1}^m\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot f'(z_k^{(2)})$
So in the backpropagation pass we can fold the penalty in by updating
$\delta_k^{(2)} += \beta\left(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}}\right)\cdot f'(z_k^{(2)})$
Note, however, that after this update $\delta_k^{(2)} \ne \frac{\partial}{\partial z_k^{(2)}}J_{sparse}(W,b)$; the modified $\delta_k^{(2)}$ is only a bookkeeping device that makes the gradient formulas for $W^{(1)}$ and $b^{(1)}$ come out right.
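In matrix form, that update looks like the following sketch, assuming UFLDL's single-hidden-layer setup: `Z2` and `delta3` hold per-example columns, `W2` maps the hidden layer to the output layer, `rho_hat` is the vector of average hidden activations, and `rho`, `beta` are the sparsity target and weight.

```python
# Hidden-layer deltas with the sparsity term folded in, for all m examples at once:
# delta2 = ((W2^T delta3) + beta * (-rho/rho_hat + (1-rho)/(1-rho_hat))) .* f'(Z2)
sparsity_term = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))   # shape (S_2,)
delta2 = (W2.T @ delta3 + sparsity_term[:, None]) * sigmoid_prime(Z2)     # shape (S_2, m)
```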

Open Questions for Sparsity Penalty

  • Why is the sparsity penalty important? How does it work?
  • Does the sparsity penalty only work for a neural network with a single hidden layer?
  • If not, how does the sparsity penalty work for a neural network with more than one hidden layer, and how should the backpropagation algorithm be implemented in that case?