Advanced Optimization of Logistic Regression (CS229) Example

To compute $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ for a given $\theta$ more efficiently, here are some algorithms:

  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS

The three algorithms after gradient descent have two advantages: there is no need to manually pick $\alpha$, and they are often faster than gradient descent. The trade-off is that they are more sophisticated.

These three algorithms have a clever inner loop called a line search algorithm, which automatically tries out different learning rates $\alpha$ and picks a good one for every iteration, so we don't have to choose it ourselves.

Because a better $\alpha$ is chosen at each iteration, these algorithms end up converging much faster than gradient descent.
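To make the line-search idea concrete, here is a minimal Python sketch of gradient descent with a backtracking line search (the Armijo sufficient-decrease rule, one common choice); the function names and constants are illustrative, not from the lecture:

```python
import numpy as np

def backtracking_line_search(J, grad, theta, alpha0=1.0, beta=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    alpha = alpha0
    g = grad(theta)
    while J(theta - alpha * g) > J(theta) - c * alpha * (g @ g):
        alpha *= beta
    return alpha

def gradient_descent_ls(J, grad, theta, iters=50):
    """Gradient descent where each step's alpha is picked by line search."""
    for _ in range(iters):
        alpha = backtracking_line_search(J, grad, theta)
        theta = theta - alpha * grad(theta)
    return theta

# The example cost from this note: J(theta) = (theta1-5)^2 + (theta2-5)^2
J = lambda t: (t[0] - 5) ** 2 + (t[1] - 5) ** 2
grad = lambda t: np.array([2 * (t[0] - 5), 2 * (t[1] - 5)])
theta = gradient_descent_ls(J, grad, np.zeros(2))
```

On this quadratic the line search quickly settles on a step that jumps straight toward $(5, 5)$, which is why no hand-tuned $\alpha$ is needed.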

Example

For $n=2$

Given
$$\mathbf{\theta}=
\begin{bmatrix}
\theta_1 \\
\theta_2
\end{bmatrix}$$

$$J(\theta)=(\theta_1-5)^2+(\theta_2-5)^2\tag{1}$$

Derivative terms
$$\frac{\partial}{\partial\theta_1}J(\theta)=2(\theta_1-5)$$
$$\frac{\partial}{\partial\theta_2}J(\theta)=2(\theta_2-5)$$

We can get $\min_{\theta}J(\theta)$ easily by observing $(1)$, i.e.
$$\mathbf{\theta}=
\begin{bmatrix}
5 \\
5
\end{bmatrix}$$

Implementation

function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);
end

costFunction returns two values: the cost function value jVal, and the derivative terms as a column vector assigned to gradient.

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] ...
  = fminunc(@costFunction, initialTheta, options);

fminunc stands for function minimization unconstrained in Octave.
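For readers working in Python rather than Octave, a rough equivalent of the fminunc call is a sketch using scipy.optimize.minimize (assuming SciPy is available); jac=True plays the role of setting 'GradObj' to 'on', telling the optimizer the function returns both cost and gradient:

```python
import numpy as np
from scipy.optimize import minimize

def cost_function(theta):
    """Same costFunction as the Octave snippet: returns (jVal, gradient)."""
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = np.array([2 * (theta[0] - 5), 2 * (theta[1] - 5)])
    return j_val, gradient

initial_theta = np.zeros(2)
res = minimize(cost_function, initial_theta, jac=True, method='BFGS',
               options={'maxiter': 100})
opt_theta = res.x  # converges to approximately [5, 5]
```

method='BFGS' is chosen here because it is one of the algorithms listed at the top of this note; fminunc similarly picks a quasi-Newton method internally.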

Note that fminunc requires $\theta\in\mathbb{R}^d$, $d\ge2$.

For general $\theta\in\mathbb{R}^{n+1}$

Implementation

theta = [theta0; theta1; theta2; ...; thetaN];
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient(1) = [code to compute the first derivative of J(theta) with respect to theta0];
  gradient(2) = [code to compute the first derivative of J(theta) w.r.t. theta1];
  ...
  gradient(n+1) = [code to compute the first derivative of J(theta) w.r.t. thetaN];

Where

$$\frac{\partial}{\partial\theta_0}J(\theta)=\frac{1}{m}\sum_{i=1}^m\left[(h_{\theta}(x^{(i)})-y^{(i)})\,x_0^{(i)}\right]$$
$$\frac{\partial}{\partial\theta_1}J(\theta)=\frac{1}{m}\sum_{i=1}^m\left[(h_{\theta}(x^{(i)})-y^{(i)})\,x_1^{(i)}\right]$$
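Those partial derivatives can be computed for all $j$ at once with a single matrix product. Here is a Python sketch of a vectorized logistic-regression cost function in the same two-return-value shape as costFunction; the tiny dataset at the bottom is hypothetical, just for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Vectorized J(theta) and gradient for logistic regression.
    X is m x (n+1) with a leading column of ones (x_0 = 1); y is in {0, 1}."""
    m = len(y)
    h = sigmoid(X @ theta)
    j_val = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    # Column j of X.T @ (h - y) is sum_i (h(x_i) - y_i) * x_j_i,
    # matching the per-component formulas above.
    gradient = X.T @ (h - y) / m
    return j_val, gradient

# Hypothetical toy data: 3 examples, 1 feature plus the intercept column.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([0.0, 1.0, 1.0])
j, g = logistic_cost(np.zeros(2), X, y)  # at theta = 0, j = log(2)
```

This single function is exactly what you would hand to an fminunc-style optimizer: one call yields both the cost and every gradient component.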