
This is a crawler that collects 3000 second-hand house listings from ‘lianjia.com’, a Chinese real-estate website, and then builds a linear regression model to predict second-hand house prices.
In the first part of this program, I crawl the basic information for the second-hand houses on Lianjia, such as the total price, the location, and the unit price.
The area and the room arrangement are not listed as separate fields on the website; they appear in a single string whose elements are separated by the sign ‘ | ’. So I first need to split the string and extract the area and arrangement information one by one.
#Area
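The split described above can be sketched as follows. This is a minimal illustration, not the original crawler code: the function name `split_house_info`, the sample string, and the field order (arrangement first, area second) are all assumptions, and the real Lianjia string may be laid out differently.

```python
def split_house_info(info):
    """Split a '|'-separated listing string into a list of stripped fields."""
    return [field.strip() for field in info.split("|")]


# Hypothetical sample string; the real page content may differ.
sample = "3 rooms 2 halls | 89.5 m^2 | south"

fields = split_house_info(sample)
# Assumed field order: arrangement first, area second.
arrangement, area = fields[0], fields[1]

print(arrangement)
print(area)
```

Splitting on the bare `|` and stripping each field afterwards tolerates uneven spacing around the separator.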
Downloading page https://nj.lianjia.com/ershoufang/pg0...
Downloading page https://nj.lianjia.com/ershoufang/pg1...
Downloading page https://nj.lianjia.com/ershoufang/pg2...
Downloading page https://nj.lianjia.com/ershoufang/pg3...
Downloading page https://nj.lianjia.com/ershoufang/pg4...
Downloading page https://nj.lianjia.com/ershoufang/pg5...
Downloading page https://nj.lianjia.com/ershoufang/pg6...
Downloading page https://nj.lianjia.com/ershoufang/pg7...
Downloading page https://nj.lianjia.com/ershoufang/pg8...
Downloading page https://nj.lianjia.com/ershoufang/pg9...
Downloading page https://nj.lianjia.com/ershoufang/pg10...
Downloading page https://nj.lianjia.com/ershoufang/pg11...
Downloading page https://nj.lianjia.com/ershoufang/pg12...
Downloading page https://nj.lianjia.com/ershoufang/pg13...
Downloading page https://nj.lianjia.com/ershoufang/pg14...
Downloading page https://nj.lianjia.com/ershoufang/pg15...
Downloading page https://nj.lianjia.com/ershoufang/pg16...
Downloading page https://nj.lianjia.com/ershoufang/pg17...
Downloading page https://nj.lianjia.com/ershoufang/pg18...
Downloading page https://nj.lianjia.com/ershoufang/pg19...
Downloading page https://nj.lianjia.com/ershoufang/pg20...
Downloading page https://nj.lianjia.com/ershoufang/pg21...
Downloading page https://nj.lianjia.com/ershoufang/pg22...
Downloading page https://nj.lianjia.com/ershoufang/pg23...
Downloading page https://nj.lianjia.com/ershoufang/pg24...
Downloading page https://nj.lianjia.com/ershoufang/pg25...
Downloading page https://nj.lianjia.com/ershoufang/pg26...
Downloading page https://nj.lianjia.com/ershoufang/pg27...
Downloading page https://nj.lianjia.com/ershoufang/pg28...
Downloading page https://nj.lianjia.com/ershoufang/pg29...
Downloading page https://nj.lianjia.com/ershoufang/pg30...
Downloading page https://nj.lianjia.com/ershoufang/pg31...
Downloading page https://nj.lianjia.com/ershoufang/pg32...
Downloading page https://nj.lianjia.com/ershoufang/pg33...
Downloading page https://nj.lianjia.com/ershoufang/pg34...
Downloading page https://nj.lianjia.com/ershoufang/pg35...
Downloading page https://nj.lianjia.com/ershoufang/pg36...
Downloading page https://nj.lianjia.com/ershoufang/pg37...
Downloading page https://nj.lianjia.com/ershoufang/pg38...
Downloading page https://nj.lianjia.com/ershoufang/pg39...
Downloading page https://nj.lianjia.com/ershoufang/pg40...
Downloading page https://nj.lianjia.com/ershoufang/pg41...
Downloading page https://nj.lianjia.com/ershoufang/pg42...
Downloading page https://nj.lianjia.com/ershoufang/pg43...
Downloading page https://nj.lianjia.com/ershoufang/pg44...
Downloading page https://nj.lianjia.com/ershoufang/pg45...
Downloading page https://nj.lianjia.com/ershoufang/pg46...
Downloading page https://nj.lianjia.com/ershoufang/pg47...
Downloading page https://nj.lianjia.com/ershoufang/pg48...
Downloading page https://nj.lianjia.com/ershoufang/pg49...
Downloading page https://nj.lianjia.com/ershoufang/pg50...
Downloading page https://nj.lianjia.com/ershoufang/pg51...
Downloading page https://nj.lianjia.com/ershoufang/pg52...
Downloading page https://nj.lianjia.com/ershoufang/pg53...
Downloading page https://nj.lianjia.com/ershoufang/pg54...
Downloading page https://nj.lianjia.com/ershoufang/pg55...
Downloading page https://nj.lianjia.com/ershoufang/pg56...
Downloading page https://nj.lianjia.com/ershoufang/pg57...
Downloading page https://nj.lianjia.com/ershoufang/pg58...
Downloading page https://nj.lianjia.com/ershoufang/pg59...
Downloading page https://nj.lianjia.com/ershoufang/pg60...
Downloading page https://nj.lianjia.com/ershoufang/pg61...
Downloading page https://nj.lianjia.com/ershoufang/pg62...
Downloading page https://nj.lianjia.com/ershoufang/pg63...
Downloading page https://nj.lianjia.com/ershoufang/pg64...
Downloading page https://nj.lianjia.com/ershoufang/pg65...
Downloading page https://nj.lianjia.com/ershoufang/pg66...
Downloading page https://nj.lianjia.com/ershoufang/pg67...
Downloading page https://nj.lianjia.com/ershoufang/pg68...
Downloading page https://nj.lianjia.com/ershoufang/pg69...
Downloading page https://nj.lianjia.com/ershoufang/pg70...
Downloading page https://nj.lianjia.com/ershoufang/pg71...
Downloading page https://nj.lianjia.com/ershoufang/pg72...
Downloading page https://nj.lianjia.com/ershoufang/pg73...
Downloading page https://nj.lianjia.com/ershoufang/pg74...
Downloading page https://nj.lianjia.com/ershoufang/pg75...
Downloading page https://nj.lianjia.com/ershoufang/pg76...
Downloading page https://nj.lianjia.com/ershoufang/pg77...
Downloading page https://nj.lianjia.com/ershoufang/pg78...
Downloading page https://nj.lianjia.com/ershoufang/pg79...
Downloading page https://nj.lianjia.com/ershoufang/pg80...
Downloading page https://nj.lianjia.com/ershoufang/pg81...
Downloading page https://nj.lianjia.com/ershoufang/pg82...
Downloading page https://nj.lianjia.com/ershoufang/pg83...
Downloading page https://nj.lianjia.com/ershoufang/pg84...
Downloading page https://nj.lianjia.com/ershoufang/pg85...
Downloading page https://nj.lianjia.com/ershoufang/pg86...
Downloading page https://nj.lianjia.com/ershoufang/pg87...
Downloading page https://nj.lianjia.com/ershoufang/pg88...
Downloading page https://nj.lianjia.com/ershoufang/pg89...
Downloading page https://nj.lianjia.com/ershoufang/pg90...
Downloading page https://nj.lianjia.com/ershoufang/pg91...
Downloading page https://nj.lianjia.com/ershoufang/pg92...
Downloading page https://nj.lianjia.com/ershoufang/pg93...
Downloading page https://nj.lianjia.com/ershoufang/pg94...
Downloading page https://nj.lianjia.com/ershoufang/pg95...
Downloading page https://nj.lianjia.com/ershoufang/pg96...
Downloading page https://nj.lianjia.com/ershoufang/pg97...
Downloading page https://nj.lianjia.com/ershoufang/pg98...
Downloading page https://nj.lianjia.com/ershoufang/pg99...
Downloading page https://nj.lianjia.com/ershoufang/pg100...
Data visualization
# Separate the prices into different levels
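One way to separate prices into levels is to compare each price against a sorted list of cut-offs. The sketch below is only an illustration: the cut-off values, the labels, the sample prices, and the name `price_level` are invented here, since the thresholds actually used are not shown.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut-offs and labels; choose values that suit the real data.
BOUNDS = [100, 200, 300, 500]
LABELS = ["<100", "100-200", "200-300", "300-500", ">=500"]


def price_level(price):
    """Return the label of the level a price falls into (lower bound inclusive)."""
    return LABELS[bisect_right(BOUNDS, price)]


# Made-up total prices, for illustration only.
prices = [85, 150, 199.9, 200, 320, 760]
counts = Counter(price_level(p) for p in prices)

for label in LABELS:
    print(label, counts[label])
```

The per-level counts are the kind of summary that can then be fed to a bar or pie chart.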

Linear regression with sklearn
# Draw the scatter plot of the relation between total price and area
plt.scatter(a_, b_)

# Linear regression
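The fitting code itself is not shown here, so the following is a dependency-free stand-in rather than the sklearn call: it computes the one-feature ordinary least squares line in closed form, which is the line a least-squares linear regression fits to area versus total price. The name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up sample: area against total price.
areas = [50.0, 70.0, 90.0, 110.0]
prices = [160.0, 220.0, 280.0, 340.0]

slope, intercept = fit_line(areas, prices)
predicted = slope * 100.0 + intercept  # predicted price for an area of 100

print(slope, intercept, predicted)
```

With real crawled data the points will not lie exactly on a line, and the fit minimizes the sum of squared vertical distances to it.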

Problems & Evaluations
1. When crawling information from Lianjia.com, the crawler cannot automatically load the next page of the site. The solution is to write a loop that builds each URL with ‘format’, since the pages of the website differ only in the number at the end of the link, shown as ‘pg1’ in ‘https://nj.lianjia.com/ershoufang/pg1’.
for page in range(0, 101):
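As a minimal sketch of the loop in item 1, the page URLs can be generated with `str.format`. The helper name `page_urls` is hypothetical, and the actual download step is left out; only the URL construction is shown.

```python
BASE_URL = "https://nj.lianjia.com/ershoufang/pg{}"


def page_urls(first, last):
    """Build the listing URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(page) for page in range(first, last + 1)]


# Print the first few URLs; fetching each page would happen here.
for url in page_urls(0, 2):
    print("Downloading page {}...".format(url))
```

Calling `page_urls(0, 100)` yields the same 101 pages as `range(0, 101)` in the loop above.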
2. After splitting the string into a list, it is hard to pick out one particular piece of information from the whole list. For instance, to get the area ‘11.3 m^2’ from the list [‘Hello’, ‘11.3 m^2’, ‘x’, ‘y’], we can use find(‘m^2’) to check whether a string represents the area. Then, by slicing with [:index], we can get the magnitude of the area.
# Plan B
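The find-and-slice idea from item 2 can be sketched as follows. The function name `extract_area` and the `unit` parameter are hypothetical, and the sketch assumes the number comes directly before the unit in the string.

```python
def extract_area(fields, unit="m^2"):
    """Return the numeric area from the first field that contains the unit, else None."""
    for field in fields:
        index = field.find(unit)  # -1 when the unit is not in this field
        if index == -1:
            continue
        try:
            # Everything before the unit is taken as the magnitude.
            return float(field[:index].strip())
        except ValueError:
            # The text before the unit was not a number; keep looking.
            continue
    return None


print(extract_area(["Hello", "11.3 m^2", "x", "y"]))
```

Returning `None` when no field matches lets the caller decide how to handle listings without an area.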



