Introduction to deep learning

It took me a long time to finish the courses from DeepLearning.AI, which was founded by Andrew Ng. Having been in the field of ML for a while, I decided to walk through neural networks again systematically at the end of this semester. And here I am, taking Andrew's courses on Coursera.

This is the first post of the DL series. It starts with my notes from the courses and will be followed by some Kaggle competition experience or other projects in which I might implement a network.

Deep learning is sometimes also described as unsupervised feature learning. The first term, deep, is named relative to shallow in machine learning. Shallow learning usually refers to SVM, boosting, MaxEnt, or other methods that contain zero or one hidden layer. In contrast, deep learning uses multi-layer perceptrons, networks with several hidden layers, and trains them with back propagation.

There are two main parts of deep learning:

  1. Depth of model.
  2. Feature learning.

The main reasons DL has taken off are the access to huge amounts of data in recent years, through the digitization of society, and the rapid development of computation power. These two factors feed the two ingredients of a high-performing NN:

  1. Being able to train a big enough model.
  2. A huge amount of labeled data.

Training example

Cost function & Gradient Descent

The cost function, which is the average of the loss function over the training examples, is one of the key concepts in neural networks. It evaluates the performance of the current model and gives us a way to update the parameters to get better performance. That way is called gradient descent.

In the course, Andrew used logistic regression as the example.

As a recap, logistic regression is:

\[
\begin{aligned}
&\text{Given } x,\; \hat{y}=P(y=1 \mid x), \text{ where } 0\leq\hat{y}\leq1 \\
&\hat{y}^{(j)}=\sigma(w^Tx^{(j)}+b), \text{ where } \sigma(z^{(j)})=\frac{1}{1+e^{-z^{(j)}}}
\end{aligned}
\]
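The recap above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `sigmoid` and `predict`, and the example weights and inputs, are my own assumptions and do not come from the course.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    """Return y_hat = sigma(w^T x + b) for a single example x."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights and input, chosen so that z = 0.
y_hat = predict(w=[0.0, 0.0], x=[1.0, 2.0], b=0.0)
print(y_hat)  # 0.5, because sigma(0) = 1 / (1 + 1)
```

With all-zero parameters the model is maximally unsure, so the predicted probability is exactly one half regardless of the input.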

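The course notes that a squared-error loss, one half of the squared difference between the prediction and the label, would make the optimization problem non-convex, so logistic regression uses the cross-entropy loss instead:

\[
\begin{aligned}
L(\hat{y}^{(j)}, y^{(j)})=-\left(y^{(j)}\log(\hat{y}^{(j)})+(1-y^{(j)})\log(1-\hat{y}^{(j)})\right)
\end{aligned}
\]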
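To get a feel for the cross-entropy form of the loss, here is a small sketch for a single example. The function name `cross_entropy_loss` and the probabilities 0.9 and 0.1 are made up for illustration.

```python
import math

def cross_entropy_loss(y_hat, y):
    """Loss for one example; y is the label (0 or 1), y_hat is in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident, correct prediction gives a small loss ...
print(cross_entropy_loss(0.9, 1))  # about 0.105
# ... and a confident, wrong prediction gives a large one.
print(cross_entropy_loss(0.1, 1))  # about 2.303
```

Note that the sketch assumes `y_hat` is strictly between 0 and 1; at exactly 0 or 1 the logarithm is undefined.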

The cost function is:

\[
\begin{aligned}
J(w, b)&=\frac{1}{m}\sum_{j=1}^{m}L(\hat{y}^{(j)}, y^{(j)}) \\
&=\frac{1}{m}\sum_{j=1}^{m}\left[-\left(y^{(j)}\log(\hat{y}^{(j)})+(1-y^{(j)})\log(1-\hat{y}^{(j)})\right)\right]
\end{aligned}
\]
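The cost is simply the per-example loss averaged over the m training examples. A minimal sketch, assuming the predictions are already computed (the names `loss` and `cost` and the toy values are hypothetical):

```python
import math

def loss(y_hat, y):
    """Cross-entropy loss for one example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(y_hats, ys):
    """Average loss over m examples: J = (1/m) * sum of per-example losses."""
    m = len(ys)
    return sum(loss(y_hat, y) for y_hat, y in zip(y_hats, ys)) / m

# Two toy examples where the model predicts 0.5 for both labels.
print(cost([0.5, 0.5], [1, 0]))  # log(2), about 0.693
```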

From the loss function and cost function, we can calculate the gradient with respect to each parameter we need to update. The gradient, the vector of partial derivatives, points in the direction in which the cost function increases fastest, so moving each parameter in the opposite direction decreases the cost fastest. By decreasing it step by step, we minimize the cost function and achieve our training goal.

Later in the course, the logistic regression example Andrew used is slightly different from the model above. The new model, with two input features and a single training example, is:

\[
\begin{aligned}
z&=w_1x_1+w_2x_2+b \\
a&=\sigma(z) \\
L(a, y)&=-\left(y\log(a)+(1-y)\log(1-a)\right)
\end{aligned}
\]

The derivatives:

\[
\begin{aligned}
da&=\frac{dL(a, y)}{da}=-\frac{y}{a}+\frac{1-y}{1-a} \\
dz&=\frac{dL(a, y)}{dz}=\frac{dL}{da}\cdot\frac{da}{dz}=\left(-\frac{y}{a}+\frac{1-y}{1-a}\right)a(1-a)=a-y \\
dw_1&=\frac{\partial L}{\partial w_1}=x_1\cdot dz \\
dw_2&=\frac{\partial L}{\partial w_2}=x_2\cdot dz \\
db&=\frac{\partial L}{\partial b}=dz
\end{aligned}
\]
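One way to convince yourself that these derivatives are right is to compare them with a numerical estimate. The sketch below is my own check, not something from the course: it evaluates the analytic gradients on one made-up example and compares them with central finite differences. All names and values (`gradients`, `eps`, the parameters, the example) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Forward pass for one example: z -> a -> L(a, y)."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def gradients(w1, w2, b, x1, x2, y):
    """Analytic derivatives: dz = a - y, dw_i = x_i * dz, db = dz."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    dz = a - y
    return x1 * dz, x2 * dz, dz

# Hypothetical parameters and one training example.
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.5, -2.0, 1

dw1, dw2, db = gradients(w1, w2, b, x1, x2, y)

# Central finite differences: (L(p + eps) - L(p - eps)) / (2 * eps).
eps = 1e-6
num_dw1 = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
num_dw2 = (loss(w1, w2 + eps, b, x1, x2, y) - loss(w1, w2 - eps, b, x1, x2, y)) / (2 * eps)
num_db = (loss(w1, w2, b + eps, x1, x2, y) - loss(w1, w2, b - eps, x1, x2, y)) / (2 * eps)

for name, analytic, numeric in [("dw1", dw1, num_dw1), ("dw2", dw2, num_dw2), ("db", db, num_db)]:
    print(name, analytic, numeric)
```

The analytic and numerical columns should agree to several decimal places; a large gap would point to a mistake in the derivation.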

Once we have all the derivatives, the parameters can be updated as follows, where alpha is the learning rate:

\[
\begin{aligned}
w_1&=w_1-\alpha\cdot dw_1 \\
w_2&=w_2-\alpha\cdot dw_2 \\
b&=b-\alpha\cdot db
\end{aligned}
\]

At this point, we have performed gradient descent once on one instance. The procedure from the input x to the loss function and cost function is called forward propagation, and the procedure from the loss function and cost function back to the parameter updates is called back propagation.
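Putting the forward pass, the backward pass, and the update together gives one gradient descent step. The following is a rough sketch under my own assumptions: the function name `gradient_step`, the learning rate of 0.1, the zero initialization, and the single example are all made up. The step is repeated three times on the same example only to show the loss shrinking.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w1, w2, b, x1, x2, y, alpha):
    """One gradient descent step on a single example."""
    # Forward propagation: input -> z -> a -> loss.
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))

    # Back propagation: loss -> dz -> dw1, dw2, db.
    dz = a - y
    dw1, dw2, db = x1 * dz, x2 * dz, dz

    # Parameter update with learning rate alpha.
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db, loss

# Hypothetical starting point, example, and learning rate.
w1, w2, b = 0.0, 0.0, 0.0
losses = []
for step in range(3):
    w1, w2, b, current_loss = gradient_step(w1, w2, b, x1=1.0, x2=2.0, y=1, alpha=0.1)
    losses.append(current_loss)

print(losses)  # each value is smaller than the one before
```

The returned loss is the one measured before the update, so the first entry is the loss of the untrained model.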

End

The example above demonstrates training on only one input. To perform the whole model training procedure, we need to iterate over all the instances many times, which will be discussed in the following posts.