machine learning – cross-entropy loss for softmax

When we use Cross-Entropy Loss as the cost function for Softmax, we need to know its partial derivatives. Here I want to work through the derivation of those derivatives.

May the Fourth be with you.

Definitions

Hypothesis Function

$h(x^{(i)}) = \frac{1}{\sum_{j=1}^K e^{\theta_j x^{(i)}}}\left[\begin{matrix}e^{\theta_1 x^{(i)}}\\ e^{\theta_2 x^{(i)}}\\ \vdots\\ e^{\theta_K x^{(i)}}\end{matrix}\right]$

In vector form
$h(x^{(i)}) = \frac{1}{\sum_{j=1}^K e^{\theta_j x^{(i)}}}e^{\Theta x^{(i)}}$

where $x^{(i)}$ is an $N\times 1$ vector for a single training example,
and $\Theta$ is a $K\times N$ weight matrix, so each $\theta_j$ (the $j$-th row of $\Theta$) is a $1\times N$ vector
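
As a quick sanity check, here is a minimal NumPy sketch of this hypothesis function (the helper name `softmax_hypothesis` and the toy sizes are my own choices, not part of the derivation); it subtracts the max score before exponentiating, a standard trick that leaves the result unchanged but avoids overflow:

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """h(x) = softmax(Theta @ x); Theta is K x N, x is N x 1, result is K x 1."""
    z = Theta @ x                      # K x 1 scores, z_j = theta_j . x
    z = z - np.max(z)                  # numerical stability; does not change the softmax
    e = np.exp(z)
    return e / np.sum(e)               # normalize so the K entries sum to 1

# tiny example: K = 3 classes, N = 4 features
rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 4))
x = rng.normal(size=(4, 1))
h = softmax_hypothesis(Theta, x)
print(h.ravel(), h.sum())              # K probabilities that sum to 1
```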

Cross-Entropy Loss Function

$E(\Theta)=-\frac{1}{N}\sum_{i\in N}y^{(i)}\circ \log(h(x^{(i)}))$

where $y^{(i)}$ is a $K\times 1$ one-hot label vector of the form $[0, \dots, 0, 1, 0, \dots, 0]^T$,
and $\log(h(x^{(i)}))$ is also a $K\times 1$ vector
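
Continuing the sketch, the loss over a batch could look like this; stacking examples as columns of `X` and one-hot labels as columns of `Y` is my own convention (the post itself works with a single $x^{(i)}$ at a time):

```python
def cross_entropy_loss(Theta, X, Y):
    """E = -(1/N) * sum_i  y^(i) . log(h(x^(i))).

    X: features x samples, Y: K x samples (one-hot columns). N below is the sample count.
    """
    N = X.shape[1]
    loss = 0.0
    for i in range(N):
        h = softmax_hypothesis(Theta, X[:, i:i + 1])       # K x 1 predicted probabilities
        loss += -(Y[:, i:i + 1].T @ np.log(h)).item()      # equals -log(h_k) for the true class k
    return loss / N
```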

Partial Derivatives

Let’s define
$z^{(i)} = \Theta\cdot x^{(i)}$, which is a $K \times 1$ vector
$o^{(i)} = h(x^{(i)})$

$\frac{\partial E}{\partial z^{(i)}} = \frac{\partial }{\partial z^{(i)}}\left[-\frac{1}{N}\sum_{n\in N}\left(y^{(n)}\circ \log(o^{(n)})\right)\right]$

Here we can drop $\sum_{n\in N}$ because $z^{(i)}$ only affects the term $y^{(i)}\circ \log(o^{(i)})$:
$\frac{\partial E}{\partial z^{(i)}}=-\frac{1}{N} \frac{\partial }{\partial z^{(i)}}\left(y^{(i)}\circ \log(o^{(i)})\right)$

Let’s assume $y_j^{(i)}=1$ when $j=k$ and $y_j^{(i)}=0$ when $j \neq k$, i.e. class $k$ is the true label of example $i$. Then
$y^{(i)}\circ \log(o^{(i)})=\sum_{j=1}^K y_j^{(i)}\log(o_j^{(i)})=\log(o_k^{(i)}),$ where $y_k^{(i)}=1$

$\frac{\partial E}{\partial z^{(i)}}=-\frac{1}{N} \frac{\partial }{\partial z^{(i)}}\log(o_k^{(i)})\cdot I$, where $I$ is a $K\times 1$ vector
$=-\frac{1}{N} \frac{\partial }{\partial z^{(i)}}\log\left(\frac{e^{z_k^{(i)}}}{\sum_{j=1}^Ke^{z_j^{(i)}}}\right)\cdot I$
$=-\frac{1}{N} \frac{\partial }{\partial z^{(i)}}\left(z_k^{(i)}- \log\left(\sum_{j=1}^Ke^{z_j^{(i)}}\right)\right)\cdot I$
$=-\frac{1}{N} \left(I_k-\frac{\partial }{\partial z^{(i)}}\log\left(\sum_{j=1}^Ke^{z_j^{(i)}}\right)\cdot I\right)$, where $I_k$ is the $k$-th standard basis vector, so $I_k=y^{(i)}$ because $y_k^{(i)}=1$
$=-\frac{1}{N} \left(y^{(i)}-\frac{1}{\sum_{j=1}^Ke^{z_j^{(i)}}}\frac{\partial }{\partial z^{(i)}}\sum_{j=1}^Ke^{z_j^{(i)}}\cdot I\right)$

Let’s consider each element of the vector form. For every component $m$:
$\frac{\partial }{\partial z_m^{(i)}}\sum_{j=1}^Ke^{z_j^{(i)}} = e^{z_m^{(i)}}$
so
$\frac{1}{\sum_{j=1}^Ke^{z_j^{(i)}}}\frac{\partial }{\partial z^{(i)}}\sum_{j=1}^Ke^{z_j^{(i)}}\cdot I=\frac{1}{\sum_{j=1}^Ke^{z_j^{(i)}}}\left[\begin{matrix}e^{z_1^{(i)}}\\ e^{z_2^{(i)}}\\ \vdots\\ e^{z_K^{(i)}}\end{matrix}\right]=o^{(i)}$
So
$\frac{\partial E}{\partial z^{(i)}}=-\frac{1}{N} \left(y^{(i)}-o^{(i)}\right)$
In vector form
$\frac{\partial E}{\partial z}=-\frac{1}{N} (y-o)$
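
A quick way to believe this result is a finite-difference check. The sketch below (reusing the hypothetical `softmax_hypothesis`, `Theta`, and `x` from earlier) compares the analytic gradient $o - y$ for a single example ($N=1$) against numerical differentiation of the loss written directly in terms of $z$:

```python
def loss_from_z(z, y):
    """Cross-entropy expressed directly in the scores z (K x 1), single example."""
    o = np.exp(z - np.max(z))
    o = o / o.sum()
    return -(y.T @ np.log(o)).item()

z = Theta @ x                                    # scores for the earlier toy example
y = np.zeros((3, 1)); y[1, 0] = 1.0              # pretend class 1 is the true label
o = softmax_hypothesis(Theta, x)

analytic = o - y                                 # derived gradient dE/dz with N = 1
numeric = np.zeros_like(z)
eps = 1e-6
for m in range(z.shape[0]):
    zp, zm = z.copy(), z.copy()
    zp[m, 0] += eps
    zm[m, 0] -= eps
    numeric[m, 0] = (loss_from_z(zp, y) - loss_from_z(zm, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))        # tiny, e.g. around 1e-10
```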
So for the weight matrix:
$\frac{\partial E}{\partial \Theta}=\frac{\partial E}{\partial z}\frac{\partial z}{\partial \Theta}=-\frac{1}{N} (y-o)\frac{\partial}{\partial \Theta}(\Theta\cdot x)=-\frac{1}{N} (y-o)\cdot x^T$
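
The same check works for the gradient with respect to $\Theta$. This sketch builds $\frac{\partial E}{\partial \Theta}$ as the average of the outer products $(o^{(i)}-y^{(i)})(x^{(i)})^T$ and compares it against finite differences of the `cross_entropy_loss` defined above (again, all helper names are mine):

```python
def cross_entropy_grad(Theta, X, Y):
    """dE/dTheta = (1/N) * sum_i (o^(i) - y^(i)) @ x^(i).T  -> K x features matrix."""
    N = X.shape[1]
    grad = np.zeros_like(Theta)
    for i in range(N):
        x_i = X[:, i:i + 1]
        o_i = softmax_hypothesis(Theta, x_i)
        grad += (o_i - Y[:, i:i + 1]) @ x_i.T    # outer product, same shape as Theta
    return grad / N

# finite-difference check over a small random batch
X = rng.normal(size=(4, 5))                      # 5 examples, 4 features
Y = np.eye(3)[:, rng.integers(0, 3, size=5)]     # random one-hot labels, K x 5
analytic = cross_entropy_grad(Theta, X, Y)
numeric = np.zeros_like(Theta)
eps = 1e-6
for r in range(Theta.shape[0]):
    for c in range(Theta.shape[1]):
        Tp, Tm = Theta.copy(), Theta.copy()
        Tp[r, c] += eps
        Tm[r, c] -= eps
        numeric[r, c] = (cross_entropy_loss(Tp, X, Y) - cross_entropy_loss(Tm, X, Y)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))        # tiny, e.g. around 1e-9
```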

Summary

$\delta = \frac{\partial E}{\partial z}=-\frac{1}{N} (y-h(x))$

$\frac{\partial E}{\partial \Theta}=-\frac{1}{N} (y-h(x))\cdot x^T$
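
With the summary formulas in hand, a plain gradient-descent update is a one-liner; the learning rate and step count below are arbitrary illustration values:

```python
lr = 0.1
for step in range(200):
    Theta -= lr * cross_entropy_grad(Theta, X, Y)   # Theta := Theta - lr * dE/dTheta
print(cross_entropy_loss(Theta, X, Y))              # the loss should have dropped
```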