Normal equation (CS229)

A method to solve for $\theta$ analytically.

1-D

$\theta\in\mathbb{R}$,
$$J(\theta)=a\theta^2+b\theta+c$$
Let
$$\frac{d}{d\theta}J(\theta)=0$$
Solve for $\theta$.
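
For instance, with $a\neq 0$ the 1-D case has a simple closed-form solution:

$$\frac{d}{d\theta}J(\theta)=2a\theta+b=0\quad\Longrightarrow\quad\theta=-\frac{b}{2a}$$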

n-D

$\theta\in\mathbb{R}^{n+1}$,
$$J(\theta_0,\theta_1,\dots,\theta_n)=\frac{1}{2m}\sum_{i=1}^m\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2$$
For every $j$, let
$$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1,\dots,\theta_n)=\dots=0$$
Solve for $\theta_0,\theta_1,\dots,\theta_n$.

$$\boldsymbol{\theta}_{(n+1)\times1}=\bigl(\mathbf X^\top_{(n+1)\times m}\,\mathbf X_{m\times(n+1)}\bigr)^{-1}\mathbf X^\top_{(n+1)\times m}\,\mathbf y_{m\times1}$$

For $m$ training examples, say $(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})$, and $n$ features,

$$\mathbf{x}^{(i)}=
\begin{bmatrix}
x_0^{(i)}\\
x_1^{(i)}\\
x_2^{(i)}\\
\vdots \\
x_n^{(i)}
\end{bmatrix}_{(n+1)\times1}
\in\mathbb{R}^{n+1}$$

Design matrix $\mathbf{X}$:

$$
\mathbf{X}=
\begin{bmatrix}
(\mathbf{x}^{(1)})^\top \\
(\mathbf{x}^{(2)})^\top \\
(\mathbf{x}^{(3)})^\top \\
\vdots \\
(\mathbf{x}^{(m)})^\top
\end{bmatrix}_{m\times(n+1)}
$$
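
Setting the gradient to zero in matrix form,
$$\nabla_{\theta}J(\theta)=\frac{1}{m}\mathbf X^\top(\mathbf X\theta-\mathbf y)=0\quad\Longrightarrow\quad\mathbf X^\top\mathbf X\,\theta=\mathbf X^\top\mathbf y,$$
which gives the closed form above. A minimal NumPy sketch (the variable names and the tiny data set are assumptions for illustration; solving the linear system is preferred over forming the inverse explicitly):

```python
import numpy as np

def normal_equation(X_raw, y):
    """Solve theta = (X^T X)^{-1} X^T y for linear regression.

    X_raw : (m, n) array of m examples with n features (no intercept column).
    y     : (m,) array of targets.
    Returns theta of shape (n+1,), where theta[0] is the intercept term.
    """
    m = X_raw.shape[0]
    # Prepend the column x_0 = 1 to form the (m, n+1) design matrix X.
    X = np.hstack([np.ones((m, 1)), X_raw])
    # Solve X^T X theta = X^T y; numerically more stable than inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny synthetic example where y is roughly 1 + 2x.
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])
print(normal_equation(X_raw, y))  # approximately [1.0, 2.0]
```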

Gradient descent vs Normal equation

  • Feature scaling is not needed for the normal equation, but it matters for gradient descent.

  • Gradient descent needs a learning rate $\alpha$ and many iterations (see the sketch after this list). The normal equation, by contrast, requires neither choosing $\alpha$ nor iterating.

  • For $m$ training examples and $n$ features, gradient descent works well even when $n$ is large. With the normal equation, by contrast, we need to compute $(\mathbf{X}^\top\mathbf{X})_{n\times n}^{-1}$, which takes $O(n^3)$ time, so it slows down as $n$ grows; if $n\ge 10^4$ or so, gradient descent is usually the better choice.
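
For comparison, a minimal batch gradient descent sketch under the same setup; the learning rate `alpha` and iteration count `num_iters` below are assumed hyperparameters that need tuning:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X : (m, n+1) design matrix, already including the column of ones.
    y : (m,) targets.
    Repeats theta := theta - alpha * (1/m) * X^T (X theta - y).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / X.shape[0]
        theta -= alpha * grad
    return theta
```

Unlike the normal equation, each iteration here costs only $O(mn)$, which is why gradient descent scales better when $n$ is very large.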