
Machine learning: regularization

This is a lecture note about regularization.

Overfitting is a situation in which the hypothesis fits the training examples well but fails to generalize to the whole input space. It is usually caused by an overly complicated function whose unnecessary features create a lot of unnecessary curves and angles.

There are two main options for addressing overfitting:

  • Reduce the number of features.
    • Manually select which features to keep.
    • Use a model selection algorithm.
  • Regularization: Keep all the features, but reduce the magnitude of the parameters $\boldsymbol\theta$.

Reducing redundant features is the more principled approach, of which regularization is a crude simulation.

Linear regression

Recall the original loss function is

$$
J(\boldsymbol\theta)
=
\frac{1}{2m}\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)^\mathsf T\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)
\,.
$$

The regularized loss function is

$$
J(\boldsymbol\theta)
=
\frac{1}{2m}\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)^\mathsf T\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)
+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
\,.
$$

The regularization parameter $\lambda$ determines how much the costs of the parameters $\theta_j$ are inflated. If it is chosen too large, it causes underfitting.

Note that the bias parameter $\theta_0$ is not penalized. That is because regularization is a crude simulation of removing redundant features, and the bias feature $x_0$ is not one we want to remove.
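
As a rough illustration (not part of the original note), here is a minimal Python sketch of this regularized cost. The function name `linreg_cost` is my own, and it assumes the design matrix `X` already contains the bias column of ones corresponding to $x_0$.

```python
import numpy as np

def linreg_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    Assumes X already contains a leading column of ones for the bias.
    """
    m = len(y)
    residual = X @ theta - y
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return residual @ residual / (2 * m) + penalty
```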

Gradient descent

$$
\nabla J(\boldsymbol\theta)
=
\frac{1}{m}
\boldsymbol X^\mathsf T\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)
+
\frac{\lambda}{m}\boldsymbol l
\,,
$$

where

$$
\boldsymbol l^\mathsf T
=
\begin{pmatrix}0&\theta_1&\cdots&\theta_n\end{pmatrix}\,.
$$
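
A minimal sketch of this gradient and a plain batch gradient descent loop, assuming the same conventions as above (names `linreg_gradient` and `gradient_descent` are hypothetical, and the step size and iteration count are arbitrary):

```python
import numpy as np

def linreg_gradient(theta, X, y, lam):
    """Gradient of the regularized linear regression cost."""
    m = len(y)
    l = theta.copy()
    l[0] = 0.0                       # the bias theta_0 is left unpenalized
    return X.T @ (X @ theta - y) / m + lam / m * l

def gradient_descent(X, y, lam, alpha=0.01, iters=5000):
    """Batch gradient descent on the regularized cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * linreg_gradient(theta, X, y, lam)
    return theta
```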

Normal equation

Solving the first-order condition $\nabla J(\boldsymbol\theta)=\boldsymbol 0$, we obtain

$$
\boldsymbol\theta
=
\left(\boldsymbol X^\mathsf T\boldsymbol X+\lambda\boldsymbol L\right)^{-1}\boldsymbol X^\mathsf T\boldsymbol y
\,,
$$

where

$$\boldsymbol L=\begin{pmatrix}0\\&1\\&&\ddots\\&&&1\end{pmatrix}\,.$$
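
A short sketch of the closed-form solution, with the same assumptions as before (`normal_equation` is a name of my choosing; the zero in the top-left corner of $\boldsymbol L$ leaves the bias unpenalized):

```python
import numpy as np

def normal_equation(X, y, lam):
    """Closed-form solution theta = (X^T X + lambda L)^{-1} X^T y."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                    # do not penalize the bias theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```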

Logistic regression

Recall the original loss function is

$$
J(\boldsymbol\theta)
=-\frac{1}{m}\left\{\ln\left[g\left(\boldsymbol X\boldsymbol\theta\right)\right]^\mathsf T\boldsymbol y+\ln\left[\boldsymbol 1-g\left(\boldsymbol X\boldsymbol\theta\right)\right]^\mathsf T\left(\boldsymbol 1-\boldsymbol y\right)\right\}\,.
$$

As with linear regression, the regularized loss function of logistic regression is

$$
J(\boldsymbol\theta)
=
-\frac{1}{m}\left\{\ln\left[g\left(\boldsymbol X\boldsymbol\theta\right)\right]^\mathsf T\boldsymbol y+\ln\left[\boldsymbol 1-g\left(\boldsymbol X\boldsymbol\theta\right)\right]^\mathsf T\left(\boldsymbol 1-\boldsymbol y\right)\right\}
+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
\,.
$$
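
A minimal sketch of this regularized cost in Python, where $g$ is the sigmoid function and, as before, `X` is assumed to include the bias column (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_cost(theta, X, y, lam):
    """Regularized logistic regression cost: cross-entropy plus L2 penalty."""
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)  # skip theta_0
    return cross_entropy + penalty
```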

Gradient descent

$$
\nabla J(\boldsymbol\theta)
=
\frac{1}{m}\boldsymbol X^\mathsf T\left[g(\boldsymbol X\boldsymbol\theta)-\boldsymbol y\right]
+
\frac{\lambda}{m}\boldsymbol l
\,,
$$

where

$$
\boldsymbol l^\mathsf T
=
\begin{pmatrix}0&\theta_1&\cdots&\theta_n\end{pmatrix}\,.
$$
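
For completeness, a sketch of this gradient under the same assumptions (it mirrors the linear regression case, with the hypothesis passed through the sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gradient(theta, X, y, lam):
    """Gradient of the regularized logistic regression cost."""
    m = len(y)
    l = theta.copy()
    l[0] = 0.0                       # bias term is not regularized
    return X.T @ (sigmoid(X @ theta) - y) / m + lam / m * l
```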

Created by Daniel Zhou on 27 Jul 2015, this work is licensed under the Creative Commons Attribution 4.0 International License.