
Arthur Samuel(1959): Field of study that gives compuers the ability to learn without being explicitly programmed.
Tom Mitchell(1998): A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Machine learning algorithms
- Suspervised learning
- Unsuspervised learning
- Others like: Reinforcement learning, recommender systems.
Example: Playing chess
E = the Experience of playing many chess games.
T = Task of playing chess
P = the probability of win the chess game
Suspervised learning
Given a data set and know how the correct output should look like. Having the idea that there is a relationship between the “input” and “output”.
“Regression” Problems
input variables –> continuous function
Examples:
① Data about house size & price. –> Predict the price of a house sized 100 m^2.
② Pictures of many people with different ages. –> Predict a picture of “John”, and predict his age.
“Classification” Problems
input variables –> discrete categories
Examples:
① Data about a house price & time. –> Predict if tomorrow’s price is higher than $100,000
② Data about humors —— malignant / benign –> Predict the patient’s tumor is malignant or benig.
Unsuspervised learning
Allow us to approach problems with little or no idea what out results should be like. We can get some information by using the data.
- Clustering the data on relationships among the variables in the data.
- No feedback based on the prediction results.
Example:
① Find a way to group 1,000,000 different genes into similar or related groups. (Clustering)
② The “Cocktail Party Algorithm”, allows you to find structure in a chaotic environment. (Non-clustering)
Model Representation
Some symbols
- x(i): “input” variables (features)
- y(i): “output/target” variables
- (x(i), y(i)): a training example
- m: the training set
- X: the space of input
- Y: the space of output
The following is a simple example picture of machine learning ↓

where h(x) refers to “Hypothesis Function“
The Cost Function
J(θ0, θ1) = 1/(2m)*(∑(hθ(xi)- yi)2
This function is otherwise called the “Squared error function” or “Mean squared error”.
The mean is halved (1/2) as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the 1/2 term.
So the key to solve the problem is to find the global minimum of J(θ0, θ1).
The simpliest way to find minimum - Gradient Descent Algorithm
θj := θj - α∂/∂θj*J(θ0,θ1)
:= means assignment
where j = 0, 1 represents the feature index number. At each iteration of j, one should simulaneously update the parameters, θ. Until we reach the minimum. One thing we should be considered is that we must calculated all the value of θ first and then try to assign a new value to all the θ. Otherwise, it can not be called Gradient Descent Algorithm and may lead to errors.
How does Gradient Descent converge with a fixed step size α?
Updated @ 2019.04.24 17点03分




近期评论