Machine Learning – Linear Regression: Multivariate Linear Regression, Features and Polynomial Regression, Normal Equation

In machine learning, when the data roughly follows a straight-line relationship between the input and the output, Linear Regression can be very useful and accurate.

Cost Function

J(θ0, θ1) = 1/(2m) ∑(i=1,m) (hθ(xi) - yi)²
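
As a concrete illustration, here is a minimal NumPy sketch of this cost function for the one-feature case; the names x, y, theta0 and theta1 are my own choices, with x and y as equal-length 1-D arrays.

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """J(theta0, theta1) = 1/(2m) * sum((h_theta(x_i) - y_i)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x        # h_theta(x_i) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)
```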

Algorithm

repeat until convergence:
{

θ0 := θ0 - α/m ∑(i=1,m) [(hθ(xi) - yi)]

θ1 := θ1 - α/m ∑(i=1,m) [(hθ(xi) - yi)·xi]

}
(θ0 and θ1 are updated simultaneously.)

In this gradient descent algorithm, we use the entire training set on every step, so it is also called “batch gradient descent”.
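
To make the update rule concrete, here is a small Python sketch of batch gradient descent for the one-feature case; the names x, y, alpha and num_iters are assumptions of mine, not from the original notes.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """One-feature batch gradient descent: every step uses all m examples."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x                 # predictions h_theta(x_i)
        grad0 = (h - y).sum() / m               # partial derivative w.r.t. theta0
        grad1 = ((h - y) * x).sum() / m         # partial derivative w.r.t. theta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```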

Define x0 = 1; then the first update takes the same general form as the others, θ0 := θ0 - α/m ∑(i=1,m) [(hθ(xi) - yi)·x0]. With this convention, the predictions can be computed more conveniently by the following matrix equation (here I will use {} to denote a matrix):

{Prediction} = {1, DataMatrix} * {parameters(θ)}

In other words, we add an all-ones column in front of the data matrix. The only thing that matters is not forgetting to add it, and knowing why that “all-ones column” is there.
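
As a sketch of that matrix form (the variable names are mine, assuming data_matrix is an m×n NumPy array and theta has n+1 entries):

```python
import numpy as np

def predict(data_matrix, theta):
    """Prediction = {1, DataMatrix} @ theta: prepend the all-ones column, then multiply."""
    m = data_matrix.shape[0]
    X = np.hstack([np.ones((m, 1)), data_matrix])   # add the all-ones bias column
    return X @ theta
```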

Multivariate Linear Regression

If we have a number of features, we just do the same thing. Actually, I have already told you above how to deal with multivariate linear regression; we just need to write down more update lines, one per θj.

repeat until convergence:
{

θ0 := θ0 - α/m ∑(i=1,m) [(hθ(xi) - yi)·xi0]

θ1 := θ1 - α/m ∑(i=1,m) [(hθ(xi) - yi)·xi1]

θ2 := θ2 - α/m ∑(i=1,m) [(hθ(xi) - yi)·xi2]

…

θj := θj - α/m ∑(i=1,m) [(hθ(xi) - yi)·xij]

}

Cost Function

J(θ0, θ1, θ2, …) = 1/(2m) ∑(i=1,m) [(hθ(xi) - yi)²]
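
A vectorized sketch of these multivariate updates and the cost, assuming X already includes the all-ones column and using variable names of my own choosing:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Vectorized batch updates: theta_j := theta_j - alpha/m * sum((h - y) * x_ij)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m    # all partial derivatives at once
        theta = theta - alpha * gradient        # simultaneous update of every theta_j
    return theta

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2)."""
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)
```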

Features and Polynomial Regression

Feature Scaling

Feature scaling is very important in machine learning; it helps gradient descent converge faster and saves a lot of time. Just imagine: if we have a two-dimensional data set with one dimension in [1, 10000] and the other in [1, 100], the contour plot of the cost function looks like a very long, thin ellipse, and gradient descent has a hard time converging. But if we use feature scaling to bring the two dimensions into similar ranges, say -1 < xi < 1, the contour plot becomes close to a circle and the problem is much easier to solve.

Get every feature into approximately a (-1 < xi < 1) range

This method is used when all the features already have values of similar magnitude.
xi := xi / (max(xi) - min(xi))

Mean normalization (-1/2 < xi < 1/2)

This method is used when the features have very different values, ranging from 1 to 100 for instance.

(xi - μi)/(max(xi) - min(xi))

Here μi is the average of feature xi over the training set; subtracting it makes the feature approximately zero-mean.
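
A small NumPy sketch of mean normalization applied column by column (the function and variable names are mine):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature to roughly [-0.5, 0.5]: (x_i - mu_i) / (max(x_i) - min(x_i))."""
    mu = X.mean(axis=0)                            # per-feature average mu_i
    feature_range = X.max(axis=0) - X.min(axis=0)  # per-feature max - min
    return (X - mu) / feature_range, mu, feature_range

# Usage: X_scaled, mu, rng = mean_normalize(X)
# Remember to apply the same mu and rng to new examples before predicting.
```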

Learning Rate “α”

  • If α is too small: slow convergence.
  • If α is too large: J(θ) may not decrease on every iteration and thus may not converge (a quick way to check this is sketched below).
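
One practical way to pick α, sketched below with made-up candidate values, is to run a few iterations for each candidate and check whether J(θ) decreases on every iteration:

```python
import numpy as np

def try_learning_rates(X, y, candidates=(0.001, 0.01, 0.1, 1.0), num_iters=50):
    """Run a few gradient-descent steps per alpha and report whether J kept decreasing."""
    m, n = X.shape
    for alpha in candidates:
        theta = np.zeros(n)
        costs = []
        for _ in range(num_iters):
            theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
            costs.append(np.sum((X @ theta - y) ** 2) / (2 * m))
        decreasing = all(b <= a for a, b in zip(costs, costs[1:]))
        print(f"alpha={alpha}: final J={costs[-1]:.4g}, monotonically decreasing={decreasing}")
```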

Polynomial Regression

Sometimes, if we want to make more accurate predictions, we can improve our features and the form of our hypothesis function in a couple of different ways. For example, we can combine multiple features into one.

E.g. combine x1 and x2 into a new feature x3 by taking x3 = x1·x2.

In this situation, the hypothesis function can take forms such as the following:

hθ(x) = θ0 + θ1x1 + θ2x1²

hθ(x) = θ0 + θ1x1 + θ2x1³

and so on …

※ Feature scaling becomes very important here, because x1² and x1³ cover much larger ranges than x1.
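
A brief sketch of building such polynomial features by hand before running ordinary linear regression (the helper name add_polynomial_features is my own):

```python
import numpy as np

def add_polynomial_features(x1, degree=3):
    """Stack x1, x1^2, ..., x1^degree as columns so linear regression can fit a polynomial."""
    X = np.column_stack([x1 ** d for d in range(1, degree + 1)])
    # Mean-normalize each column: powers of x1 have wildly different ranges.
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range
```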

Normal Equation

θ = (XᵀX)⁻¹Xᵀy
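
A minimal NumPy sketch of the normal equation, assuming X already includes the all-ones column; np.linalg.solve is used instead of forming the inverse explicitly, which is the usual numerical practice:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y, without iteration."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```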

Normal Equation

  • No need to choose alpha
  • No need to iterate
  • O(n³); need to calculate (XᵀX)⁻¹
  • Slow if n is large

Gradient Descent

  • Need to choose alpha
  • Need many iterations
  • O(kn²)
  • Works well when n is very large

Therefore, if we have a very large number of features, using the normal equation will be very slow. In practice, when n exceeds about 10,000, it is better to move from the normal-equation solution to an iterative process such as gradient descent.

What if XᵀX is non-invertible?

  • Redundant features, where two features are very closely related (linearly dependent): reduce them by removing one.
  • Too many features (e.g. m < n): in this case, delete some features or use “regularization” (to be discussed later).

In Octave, however, we can use pinv() rather than inv() when computing θ, so that θ = pinv(X'*X)*X'*y still returns a value even if XᵀX is non-invertible, though I doubt whether that value is always the one we want.
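
As an illustration of that point, here is a contrived NumPy example with a deliberately duplicated column playing the role of a redundant feature; np.linalg.pinv still returns a θ even though XᵀX is singular, which mirrors the behaviour of Octave's pinv.

```python
import numpy as np

# Contrived example: the third column duplicates the second, so X^T X is singular.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0],
              [1.0, 7.0, 7.0]])
y = np.array([4.0, 6.0, 10.0, 14.0])

# X^T X is singular here, so inverting it directly is not meaningful;
# the pseudo-inverse still gives a usable theta (the minimum-norm solution).
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)       # one valid theta among infinitely many
print(X @ theta)   # predictions still fit the training targets
```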