Machine Learning – Linear Regression: Multivariate Linear Regression, Features and Polynomial Regression, Normal Equation

In machine learning, when the data roughly follows a straight-line relationship between the input and the output, Linear Regression can be very useful and accurate.

Cost Function

J(θ0, θ1) = 1/(2m) ∑(i=1,m) (hθ(xi) - yi)²
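
As a concrete illustration, here is a minimal NumPy sketch of this cost function for the one-feature case; the names x, y, theta0 and theta1 are my own choices, with x and y as equal-length 1-D arrays.

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """J(theta0, theta1) = 1/(2m) * sum((h_theta(x_i) - y_i)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x        # h_theta(x_i) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)
```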

Algorithm

repeat until convergence:
{

θ0 := θ0 - α/m ∑(i=1,m) [(hθ(xi) - yi)]

θ1 := θ1 - α/m ∑(i=1,m) [(hθ(xi) - yi)·xi]

}
(θ0 and θ1 are updated simultaneously.)

In this gradient descent algorithm, we use the entire training set on every step, so it is also called “batch gradient descent”.
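
To make the update rule concrete, here is a small Python sketch of batch gradient descent for the one-feature case; the names x, y, alpha and num_iters are assumptions of mine, not from the original notes.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """One-feature batch gradient descent: every step uses all m examples."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x                 # predictions h_theta(x_i)
        grad0 = (h - y).sum() / m               # partial derivative w.r.t. theta0
        grad1 = ((h - y) * x).sum() / m         # partial derivative w.r.t. theta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```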

Define x0 = 1; then the first update takes the same general form as the others, θ0 := θ0 - α/m ∑(i=1,m) [(hθ(xi) - yi)·x0]. With this convention, the predictions can be computed more conveniently by the following matrix equation (here I will use {} to denote a matrix):

{Prediction} = {1, DataMatrix} * {parameters(θ)}

In other words, we add an all-ones column in front of the data matrix. The only thing that matters is not forgetting to add it, and knowing why that “all-ones column” is there.
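
As a sketch of that matrix form (the variable names are mine, assuming data_matrix is an m×n NumPy array and theta has n+1 entries):

```python
import numpy as np

def predict(data_matrix, theta):
    """Prediction = {1, DataMatrix} @ theta: prepend the all-ones column, then multiply."""
    m = data_matrix.shape[0]
    X = np.hstack([np.ones((m, 1)), data_matrix])   # add the all-ones bias column
    return X @ theta
```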

Multivariate Linear Regression

If we have a number of features, we just do the same thing. Actually, I have already told you above how to deal with multivariate linear regression; we just need to write down more update lines, one per θj.

repeat until convergence:
{

θ0 := θ0 - α/m ∑(i=1,m) [(hθ(xi) - yi)·xi0]

θ1 := θ1 - α/m ∑(i=1,m) [(hθ(xi) - yi)·xi1]

θ2 := θ2 - α/m ∑(i=1,m) [(hθ(xi) - yi)·xi2]

…

θj := θj - α/m ∑(i=1,m) [(hθ(xi) - yi)·xij]

}

Cost Function

J(θ0, θ1, θ2, …) = 1/(2m) ∑(i=1,m) [(hθ(xi) - yi)²]
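
A vectorized sketch of these multivariate updates and the cost, assuming X already includes the all-ones column and using variable names of my own choosing:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Vectorized batch updates: theta_j := theta_j - alpha/m * sum((h - y) * x_ij)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m    # all partial derivatives at once
        theta = theta - alpha * gradient        # simultaneous update of every theta_j
    return theta

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2)."""
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)
```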

Features and Polynomial Regression

Feature Scaling

Feature scaling is very important in machine learning; it helps gradient descent converge faster and saves a lot of time. Just imagine: if we have a two-dimensional data set with one dimension in [1, 10000] and the other in [1, 100], the contour plot of the cost function looks like a very long, thin ellipse, and gradient descent has a hard time converging. But if we use feature scaling to bring the two dimensions into similar ranges, say -1 < xi < 1, the contour plot becomes close to a circle and the problem is much easier to solve.

Get every feature into approximately a (-1 < xi < 1) range

This method is used when all the features already have values of similar magnitude.
xi := xi / (max(xi) - min(xi))

Mean normalization (-1/2 < xi < 1/2)

This method is used when the features have very different values, ranging from 1 to 100 for instance.

(xi - μi)/(max(xi) - min(xi))

Here μi is the average of feature xi over the training set; subtracting it makes the feature approximately zero-mean.
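
A small NumPy sketch of mean normalization applied column by column (the function and variable names are mine):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature to roughly [-0.5, 0.5]: (x_i - mu_i) / (max(x_i) - min(x_i))."""
    mu = X.mean(axis=0)                            # per-feature average mu_i
    feature_range = X.max(axis=0) - X.min(axis=0)  # per-feature max - min
    return (X - mu) / feature_range, mu, feature_range

# Usage: X_scaled, mu, rng = mean_normalize(X)
# Remember to apply the same mu and rng to new examples before predicting.
```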

Learning Rate “α”

  • If α is too small: slow convergence.
  • If α is too large: J(θ) may not decrease on every iteration and thus may not converge (a quick way to check this is sketched below).
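
One practical way to pick α, sketched below with made-up candidate values, is to run a few iterations for each candidate and check whether J(θ) decreases on every iteration:

```python
import numpy as np

def try_learning_rates(X, y, candidates=(0.001, 0.01, 0.1, 1.0), num_iters=50):
    """Run a few gradient-descent steps per alpha and report whether J kept decreasing."""
    m, n = X.shape
    for alpha in candidates:
        theta = np.zeros(n)
        costs = []
        for _ in range(num_iters):
            theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
            costs.append(np.sum((X @ theta - y) ** 2) / (2 * m))
        decreasing = all(b <= a for a, b in zip(costs, costs[1:]))
        print(f"alpha={alpha}: final J={costs[-1]:.4g}, monotonically decreasing={decreasing}")
```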

Polynomial Regression

Sometimes, if we want to make more accurate predictions, we can improve our features and the form of our hypothesis function in a couple of different ways. For example, we can combine multiple features into one.

E.g. combine x1 and x2 into a new feature x3 by taking x3 = x1·x2.

In this situation, the hypothesis function can take forms such as the following:

hθ(x) = θ0 + θ1x1 + θ2x1²

hθ(x) = θ0 + θ1x1 + θ2x1³

and so on …

※ Feature scaling becomes very important here, because x1² and x1³ cover much larger ranges than x1.
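
A brief sketch of building such polynomial features by hand before running ordinary linear regression (the helper name add_polynomial_features is my own):

```python
import numpy as np

def add_polynomial_features(x1, degree=3):
    """Stack x1, x1^2, ..., x1^degree as columns so linear regression can fit a polynomial."""
    X = np.column_stack([x1 ** d for d in range(1, degree + 1)])
    # Mean-normalize each column: powers of x1 have wildly different ranges.
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range
```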

Normal Equation

θ = (XᵀX)⁻¹Xᵀy
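
A minimal NumPy sketch of the normal equation, assuming X already includes the all-ones column; np.linalg.solve is used instead of forming the inverse explicitly, which is the usual numerical practice:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y, without iteration."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```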

Normal Equation

  • No need to choose alpha
  • No need to iterate
  • O(n³); need to calculate (XᵀX)⁻¹
  • Slow if n is large

Gradient Descent

  • Need to choose alpha
  • Need many iterations
  • O(kn²)
  • Works well when n is very large

Therefore, if we have a very large number of features, using the normal equation will be very slow. In practice, when n exceeds about 10,000, it is better to move from the normal-equation solution to an iterative process such as gradient descent.

What if XᵀX is non-invertible?

  • Redundant features, where two features are very closely related (linearly dependent): reduce them by removing one.
  • Too many features (e.g. m < n): in this case, delete some features or use “regularization” (to be discussed later).

In Octave, however, we can use pinv() rather than inv() when computing θ, so that θ = pinv(X'*X)*X'*y still returns a value even if XᵀX is non-invertible, though I doubt whether that value is always the one we want.
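
As an illustration of that point, here is a contrived NumPy example with a deliberately duplicated column playing the role of a redundant feature; np.linalg.pinv still returns a θ even though XᵀX is singular, which mirrors the behaviour of Octave's pinv.

```python
import numpy as np

# Contrived example: the third column duplicates the second, so X^T X is singular.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0],
              [1.0, 7.0, 7.0]])
y = np.array([4.0, 6.0, 10.0, 14.0])

# X^T X is singular here, so inverting it directly is not meaningful;
# the pseudo-inverse still gives a usable theta (the minimum-norm solution).
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)       # one valid theta among infinitely many
print(X @ theta)   # predictions still fit the training targets
```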