This is a lecture note about regularization.
Overfitting is a situation in which the hypothesis fits the training examples well but fails to generalize to the whole input space. It is usually caused by an overly complicated function whose unnecessary features create many unnecessary curves and angles.
There are two main options to address the issue of overfitting:
- Reduce the number of features:
  - Manually select which features to keep.
  - Use a model selection algorithm.
- Regularization: keep all the features, but reduce the magnitude of the parameters $\boldsymbol\theta$.
Reducing redundant features is the principled approach; regularization is a dumb simulation of it.
## Linear regression
Recall the original loss function is
$$
J(\boldsymbol\theta)
=
\frac{1}{2m}
\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)^\mathsf T\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)
\,.
$$
The regularized loss function is
$$
J(\boldsymbol\theta)
=
\frac{1}{2m}
\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)^\mathsf T\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)
+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
\,.
$$
The regularization parameter $\lambda$ determines how much the costs of the $\theta$ parameters are inflated. If it is chosen too large, the parameters are penalized too heavily and the model underfits.
Note that the bias parameter $\theta_0$ is not penalized. That is because regularization is a dumb simulation of removing redundant features, and the bias feature $x_0$ is not one we want to remove.
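As a concrete illustration, the regularized loss can be computed as follows. This is a minimal NumPy sketch, not code from the note; the names `regularized_cost`, `X`, `y`, `theta`, and `lam` are my own.

```python
import numpy as np

def regularized_cost(X, y, theta, lam):
    """Regularized linear-regression loss J(theta).

    X is the m-by-(n+1) design matrix whose first column is the
    bias feature x_0 = 1; theta_0 is excluded from the penalty.
    """
    m = X.shape[0]
    residual = X @ theta - y
    fit_term = (residual @ residual) / (2 * m)
    penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)  # skip theta_0
    return fit_term + penalty
```

Note that the penalty sums over `theta[1:]` only, matching the convention that $\theta_0$ is not penalized.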
### Gradient descent
$$
\nabla J(\boldsymbol\theta)
=
\frac{1}{m}
\boldsymbol X^\mathsf T\left(\boldsymbol X\boldsymbol\theta-\boldsymbol y\right)
+
\frac{\lambda}{m}\boldsymbol l
\,,
$$
where
$$
\boldsymbol l^\mathsf T
=
\begin{pmatrix}0&\theta_1&\cdots&\theta_n\end{pmatrix}\,.
$$
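The gradient above translates directly into a descent loop. The following is a sketch under the same conventions as before (first column of `X` is the bias feature); the function names and the hyperparameter values are my own.

```python
import numpy as np

def gradient(X, y, theta, lam):
    """Gradient of the regularized linear-regression loss:
    (1/m) X^T (X theta - y) + (lam/m) l, where
    l = (0, theta_1, ..., theta_n)^T leaves theta_0 unpenalized."""
    m = X.shape[0]
    l = theta.copy()
    l[0] = 0.0
    return (X.T @ (X @ theta - y)) / m + lam * l / m

def gradient_descent(X, y, theta, lam, alpha, steps):
    """Plain batch gradient descent: theta <- theta - alpha * grad."""
    for _ in range(steps):
        theta = theta - alpha * gradient(X, y, theta, lam)
    return theta
```

With $\lambda = 0$ this reduces to ordinary least-squares gradient descent; a positive $\lambda$ shrinks every parameter except $\theta_0$ toward zero.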
### Normal equation
Solving the first-order condition $\nabla J(\boldsymbol\theta)=\boldsymbol 0$, we obtain
$$
\boldsymbol\theta
=
\left(\boldsymbol X^\mathsf T\boldsymbol X+\lambda\boldsymbol L\right)^{-1}\boldsymbol X^\mathsf T\boldsymbol y
\,,
$$
where
$$
\boldsymbol L
=
\begin{pmatrix}0&&&\\&1&&\\&&\ddots&\\&&&1\end{pmatrix}\,.
$$
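The closed-form solution $\boldsymbol\theta=(\boldsymbol X^\mathsf T\boldsymbol X+\lambda\boldsymbol L)^{-1}\boldsymbol X^\mathsf T\boldsymbol y$, with $\boldsymbol L=\operatorname{diag}(0,1,\ldots,1)$, can be sketched as follows; the function name is my own.

```python
import numpy as np

def normal_equation(X, y, lam):
    """Closed-form regularized solution
    theta = (X^T X + lam * L)^{-1} X^T y,
    where L = diag(0, 1, ..., 1) leaves theta_0 unpenalized."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0  # do not penalize the bias parameter
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

Incidentally, for $\lambda>0$ the matrix $\boldsymbol X^\mathsf T\boldsymbol X+\lambda\boldsymbol L$ is better conditioned than $\boldsymbol X^\mathsf T\boldsymbol X$ alone, which is one practical side benefit of regularization.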
## Logistic regression
Recall the original loss function is
$$
J(\boldsymbol\theta)
=-\frac{1}{m}\left\{\ln\left[g\left(\boldsymbol X\boldsymbol\theta\right)\right]^\mathsf T\boldsymbol y+\ln\left[\boldsymbol 1-g\left(\boldsymbol X\boldsymbol\theta\right)\right]^\mathsf T\left(\boldsymbol 1-\boldsymbol y\right)\right\}\,.
$$
Similarly to linear regression, the regularized loss function of logistic regression is
$$
J(\boldsymbol\theta)
=
-\frac{1}{m}\left\{\ln\left[g\left(\boldsymbol X\boldsymbol\theta\right)\right]^\mathsf T\boldsymbol y+\ln\left[\boldsymbol 1-g\left(\boldsymbol X\boldsymbol\theta\right)\right]^\mathsf T\left(\boldsymbol 1-\boldsymbol y\right)\right\}
+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
\,.
$$
### Gradient descent
$$
\nabla J(\boldsymbol\theta)
=
\frac{1}{m}\boldsymbol X^\mathsf T\left[g(\boldsymbol X\boldsymbol\theta)-\boldsymbol y\right]
+
\frac{\lambda}{m}\boldsymbol l
\,,
$$
where
$$
\boldsymbol l^\mathsf T
=
\begin{pmatrix}0&\theta_1&\cdots&\theta_n\end{pmatrix}\,.
$$
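The logistic gradient has the same structure as the linear one, only with the sigmoid $g$ applied to $\boldsymbol X\boldsymbol\theta$. A minimal sketch, with names of my own choosing:

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient(X, y, theta, lam):
    """Gradient of the regularized logistic-regression loss:
    (1/m) X^T [g(X theta) - y] + (lam/m) l, with l_0 = 0."""
    m = X.shape[0]
    l = theta.copy()
    l[0] = 0.0
    return (X.T @ (sigmoid(X @ theta) - y)) / m + lam * l / m
```

Plugging this gradient into the same descent loop used for linear regression gives regularized logistic regression; only the model's prediction function changes.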
Created by Daniel Zhou on 27 Jul 2015, this work is licensed under the Creative Commons Attribution 4.0 International License.