Crawling second-hand house listings from Lianjia and predicting prices with linear regression

This is a crawler built to collect about 3000 second-hand house listings from 'lianjia.com', a Chinese property-listing website, together with a linear regression model to predict second-hand house prices.

In the first part of this program, I crawl the basic information for the second-hand houses on Lianjia, such as the total price, the location, and the unit price.


import requests
import os
import bs4

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
import mpl_toolkits


# set up a group of empty lists to hold the crawled information
HouseInfoList = []
PriceInfoList = []
LocationList = []
Listlist = []
AreaList = []
ArrangementList = []
FullInformationList = []
unitPriceList = []

os.makedirs('HouseInfo', exist_ok=True)

for page in range(0, 101):
    url = 'https://nj.lianjia.com/ershoufang/pg{}'.format(page)
    print('Downloading page %s...' % url)

    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "lxml")

    # crawl the basic house information (listing titles)
    postElem = soup.select('.title a')
    for i in range(0, len(postElem) - 1):    # skip the last '.title a' match on the page
        HouseInfoList.append(postElem[i].text)

    # crawl the total prices
    TotalPrice = soup.select('.totalPrice span')
    for i in range(0, len(TotalPrice)):
        PriceInfoList.append(TotalPrice[i].text)

    # crawl the locations
    Location = soup.select('.houseInfo a')
    for i in range(0, len(Location)):
        LocationList.append(Location[i].text)

    # crawl the unit prices
    unitPrice = soup.select('.unitPrice')
    for i in range(0, len(unitPrice)):
        unitPriceList.append(unitPrice[i].text)
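In practice, sites like Lianjia sometimes reject requests that do not look like a browser, so it can help to send a User-Agent header and pause between pages. A minimal sketch of how the request inside the loop above could be hardened (the header value and the one-second delay are assumptions, not part of the original run):

import time

headers = {'User-Agent': 'Mozilla/5.0'}    # hypothetical browser-like header
res = requests.get(url, headers=headers)
res.raise_for_status()
time.sleep(1)    # be polite: wait between page requests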

The area and layout information are not exposed as separate fields on the website; they appear as a single string whose elements are separated by the sign ' | '. So I first need to split the string and pick out the area and layout information one by one.
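For instance, a minimal sketch with a made-up listing string (the exact field order on the real page may differ):

info = '3室2厅 | 89.5平米 | 南北 | 精装'    # hypothetical houseInfo string
fields = info.split(' | ')    # ['3室2厅', '89.5平米', '南北', '精装']
for field in fields:
    index = field.find('平米')    # '平米' means square metres
    if index != -1:
        print(field[:index])    # prints '89.5'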

    # the full information string for each listing (this select runs inside the page loop above)
    FullInformation = soup.select('div.houseInfo')
    for i in range(0, len(FullInformation)):
        FullInformationList.append(FullInformation[i].text)


# plan B: split every string in the list into a new list (a list of lists)
for i in range(0, len(FullInformationList)):
    Listlist.append(FullInformationList[i].split(' | '))

for i in range(0, len(Listlist)):
    o = Listlist[i]
    for k in range(0, len(o)):
        index = o[k].find('平米')    # '平米' means square metres
        if index != -1:
            area = o[k][:index]
            AreaList.append(area)

# strip the redundant characters around the unit price
unitPriceList2 = []
for i in range(0, len(unitPriceList)):
    unitPriceList2.append(unitPriceList[i][2:-4])


# build a DataFrame with 5 columns and write it to a CSV file
df = pd.DataFrame(columns=('HouseInformation', 'Price', 'Location', 'Area', 'unitPrice'))
for i in range(0, len(HouseInfoList)):
    df.loc[i] = [HouseInfoList[i], PriceInfoList[i], LocationList[i], AreaList[i], unitPriceList2[i]]
df.to_csv('2ndHouseInfo.csv', sep=',')
Downloading page https://nj.lianjia.com/ershoufang/pg0...
Downloading page https://nj.lianjia.com/ershoufang/pg1...
Downloading page https://nj.lianjia.com/ershoufang/pg2...
...
Downloading page https://nj.lianjia.com/ershoufang/pg100...

Data visualization

# separate the prices into different levels
Pricelist = []
for i in range(0, len(PriceInfoList)):
    Pricelist.append(float(PriceInfoList[i]))

list_0 = []
list_1 = []
list_2 = []
list_3 = []
list_4 = []
list_5 = []

# append each price to its group (total prices are in units of 10,000 yuan)
for i in range(0, len(Pricelist)):
    if 0 < Pricelist[i] < 100:
        list_0.append(Pricelist[i])
    elif 100 <= Pricelist[i] < 200:
        list_1.append(Pricelist[i])
    elif 200 <= Pricelist[i] < 300:
        list_2.append(Pricelist[i])
    elif 300 <= Pricelist[i] < 400:
        list_3.append(Pricelist[i])
    elif 400 <= Pricelist[i] < 700:
        list_4.append(Pricelist[i])
    elif Pricelist[i] >= 700:
        list_5.append(Pricelist[i])

groupslist = [len(list_0), len(list_1), len(list_2), len(list_3), len(list_4), len(list_5)]

# plot a bar chart of the number of listings in each price group
plt.bar(range(len(groupslist)), groupslist)
plt.show()

[Figure 1: bar chart of the number of listings in each price group]
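The manual binning above can also be written more compactly with pandas. A minimal sketch using pd.cut, where the bin edges mirror the price groups above and the open-ended top group gets an infinite upper edge:

bins = [0, 100, 200, 300, 400, 700, float('inf')]
counts = pd.cut(pd.Series(Pricelist), bins=bins).value_counts(sort=False)
print(counts)    # number of listings in each price interval, in bin order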

Linear regression with sklearn

# scatter plot of the relation between the total price and the area
a_ = []
b_ = []
for i in range(0, len(AreaList)):
    a_.append(float(PriceInfoList[i]))
    b_.append(float(AreaList[i]))
plt.scatter(a_, b_)
plt.xlabel('total price (10,000 yuan)')
plt.ylabel('area (square metres)')
plt.show()

[Figure 2: scatter plot of total price against area]
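Before fitting a line, it is worth checking how linear the relation actually looks; a quick sketch using NumPy's correlation coefficient on the two lists built above:

r = np.corrcoef(a_, b_)[0, 1]    # Pearson correlation between price and area
print('correlation coefficient: %.3f' % r)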

# linear regression
from sklearn.linear_model import LinearRegression

data = pd.read_csv('2ndHouseInfo.csv')
X1 = data[['Price']]
y1 = data[['Area']]
X2 = np.array([[0], [2500]])    # two extreme price values used to draw the fitted line
lin_reg = LinearRegression()
lin_reg.fit(X1, y1)
print(lin_reg.intercept_, lin_reg.coef_)

# plot the fitted line over the scatter points
plt.plot(a_, b_, "b.")
plt.plot(X2, lin_reg.predict(X2), color='red', linewidth=1)
plt.show()

[Figure 3: scatter plot with the fitted regression line]
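To put a number on the quality of the fit, sklearn's score method returns the R² of the regression; a minimal sketch, evaluated on the same data the model was trained on:

r2 = lin_reg.score(X1, y1)    # coefficient of determination on the training data
print('R^2 on the training data: %.3f' % r2)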

Problems & Evaluations

1. When crawling information from Lianjia.com, the crawler has to load the next page of the site automatically. The solution is to write a loop using 'format', since the only difference between pages is the single number at the end of the link, shown as 'pg1' in 'https://nj.lianjia.com/ershoufang/pg1'.

for page in range(0, 101):
    url = 'https://nj.lianjia.com/ershoufang/pg{}'.format(page)

2. After splitting the strings in the list, there is the problem of picking out one specific piece of information from the whole information list. For instance, to get the area '11.3 m^2' from the list ['Hello', '11.3 m^2', 'x', 'y'], we can use find('m^2') to decide whether a string represents the area, and by slicing with [:index] we can get the magnitude of the area, as in the code and the short sketch below.

# plan B
for i in range(0, len(FullInformationList)):
    Listlist.append(FullInformationList[i].split(' | '))

for i in range(0, len(Listlist)):
    o = Listlist[i]
    for k in range(0, len(o)):
        index = o[k].find('平米')    # '平米' means square metres
        if index != -1:
            area = o[k][:index]
            AreaList.append(area)
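Applied to the toy list from the example above, the same find-and-slice idea looks like this (a minimal, self-contained sketch):

items = ['Hello', '11.3 m^2', 'x', 'y']
for item in items:
    index = item.find('m^2')    # -1 when the marker is absent
    if index != -1:
        print(item[:index].strip())    # prints '11.3'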