daniel_wu — Data visualization, linear regression with sklearn, Problems & Evaluations
admin, November 13, 2020
This is a crawler built to collect information on about 3,000 second-hand houses from lianjia.com, a Chinese property-listing website, and then fit a linear regression model to predict second-hand house prices.
In the first part of this program, I crawl basic information about the second-hand houses on Lianjia, such as the total price, the location, and the unit price.
import os
import requests
import bs4
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Empty lists to hold the crawled information
HouseInfoList = []
PriceInfoList = []
LocationList = []
Listlist = []
AreaList = []
ArrangementList = []
FullInformationList = []
unitPriceList = []

for page in range(1, 101):  # listing pages start at pg1; 100 pages of ~30 listings ≈ 3000 houses
    url = 'https://nj.lianjia.com/ershoufang/pg{}'.format(page)
    print('Downloading page %s...' % url)
    os.makedirs('HouseInfo', exist_ok=True)

    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "lxml")

    # Crawl basic house information
    postElem = soup.select('.title a')
    for i in range(0, len(postElem)):
        HouseInfoList.append(postElem[i].text)

    # Crawl total house prices
    TotalPrice = soup.select('.totalPrice span')
    for i in range(0, len(TotalPrice)):
        PriceInfoList.append(TotalPrice[i].text)

    # Crawl locations
    Location = soup.select('.houseInfo a')
    for i in range(0, len(Location)):
        LocationList.append(Location[i].text)

    # Crawl unit prices
    unitPrice = soup.select('.unitPrice')
    for i in range(0, len(unitPrice)):
        unitPriceList.append(unitPrice[i].text)
The area and room arrangement are not provided as separate fields on the website; they appear in a single string whose elements are separated by ' | '. So I first need to split each string and then extract the area and arrangement one by one.
# Area: collect the full houseInfo string (this runs inside the page loop above)
FullInformation = soup.select('div.houseInfo')
for i in range(0, len(FullInformation)):
    FullInformationList.append(FullInformation[i].text)

# Plan B: split every string in the list into a new list (a list of lists)
for i in range(0, len(FullInformationList)):
    Listlist.append(FullInformationList[i].split(' | '))
for i in range(0, len(Listlist)):
    o = Listlist[i]
    for k in range(0, len(o)):
        index = o[k].find('平米')  # '平米' = square metres
        if index != -1:
            area = o[k][:index]
            AreaList.append(area)
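As a more compact alternative to the nested loops above (not from the original post), a regular expression can pull the number in front of '平米' (square metres) directly; the sample string here is a made-up houseInfo value:

```python
import re

# Hypothetical houseInfo string of the kind Lianjia returns
info = '2室1厅 | 88.5平米 | 南 | 精装'

# Capture the digits (and decimal point) immediately before '平米'
match = re.search(r'([\d.]+)平米', info)
area = float(match.group(1)) if match else None
print(area)  # → 88.5
```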
# Strip the redundant parts of each unitPrice string
unitPriceList2 = []
for i in range(0, len(unitPriceList)):
    unitPriceList2.append(unitPriceList[i][2:-4])
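For instance, assuming the raw unitPrice text has the shape '单价33333元/平米', the slice [2:-4] drops the two leading characters ('单价') and the four trailing ones ('元/平米'), keeping only the digits:

```python
# Hypothetical raw unitPrice string; the real page text may differ
s = '单价33333元/平米'
print(s[2:-4])  # → 33333
```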
# Set up a CSV file and write the information into 5 columns
df = pd.DataFrame(columns=('HouseInformation', 'Price', 'Location', 'Area', 'unitPrice'))
for i in range(0, len(HouseInfoList)):
    df.loc[i] = [HouseInfoList[i], PriceInfoList[i], LocationList[i], AreaList[i], unitPriceList2[i]]
df.to_csv('2ndHouseInfo.csv', sep=',')
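Appending rows one at a time with df.loc[i] works, but it is slow for thousands of rows; a common alternative (not used in the original post) is to build the DataFrame from a dict of columns in one call. The sample rows below are made up:

```python
import pandas as pd

# Made-up sample data standing in for the crawled lists
HouseInfoList = ['Bright 2-room flat', 'Renovated 3-room flat']
PriceInfoList = ['300', '450']
LocationList = ['Gulou', 'Jianye']
AreaList = ['90', '120']
unitPriceList2 = ['33333', '37500']

df = pd.DataFrame({
    'HouseInformation': HouseInfoList,
    'Price': PriceInfoList,
    'Location': LocationList,
    'Area': AreaList,
    'unitPrice': unitPriceList2,
})
csv_text = df.to_csv(index=False)  # same content that to_csv would write to disk
print(df.shape)  # → (2, 5)
```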
# Convert the price strings to floats before grouping them into levels
Pricelist = []
for i in range(0, len(PriceInfoList)):
    Pricelist.append(float(PriceInfoList[i]))
# Sort the prices into buckets separated by price level
list_0, list_1, list_2, list_3, list_4, list_5 = [], [], [], [], [], []
for i in range(0, len(Pricelist)):
    if 0 < Pricelist[i] < 100:
        list_0.append(Pricelist[i])
    elif 100 <= Pricelist[i] < 200:
        list_1.append(Pricelist[i])
    elif 200 <= Pricelist[i] < 300:
        list_2.append(Pricelist[i])
    elif 300 <= Pricelist[i] < 400:
        list_3.append(Pricelist[i])
    elif 400 <= Pricelist[i] < 700:  # widened from 500 so no price falls into a gap
        list_4.append(Pricelist[i])
    elif Pricelist[i] >= 700:
        list_5.append(Pricelist[i])

# Plot a bar chart of the bucket sizes
groupslist = [len(list_0), len(list_1), len(list_2), len(list_3), len(list_4), len(list_5)]
plt.bar(range(len(groupslist)), groupslist)
plt.show()
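The manual if/elif chain can also be replaced by pandas.cut, which bins the prices in a single call; the prices below are invented, and the bins mirror the groups above:

```python
import pandas as pd

prices = [50, 150, 250, 350, 450, 650, 800]  # made-up prices (万元)
bins = [0, 100, 200, 300, 400, 700, float('inf')]

# value_counts gives the size of each bucket; sort_index keeps bucket order
counts = pd.Series(pd.cut(prices, bins=bins)).value_counts().sort_index()
print(counts.tolist())  # → [1, 1, 1, 1, 2, 1]
```

The resulting counts can be passed straight to plt.bar just like groupslist.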
Linear regression with sklearn
# Scatter plot of total price against area
a_ = []  # total prices
b_ = []  # areas
for i in range(0, len(AreaList)):
    a_.append(float(PriceInfoList[i]))
    b_.append(float(AreaList[i]))

# Fit the model (the fitting step is missing from the original post; a plausible reconstruction)
from sklearn.linear_model import LinearRegression
X = np.array(b_).reshape(-1, 1)  # area as the single feature
y = np.array(a_)                 # total price as the target
lin_reg = LinearRegression().fit(X, y)

# Plot the scatter and the fitted line
X2 = np.array([[min(b_)], [max(b_)]])
plt.plot(b_, a_, "b.")
plt.plot(X2, lin_reg.predict(X2), color='red', linewidth=1)
plt.show()
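The post stops at plotting the fitted line; one way to evaluate the fit numerically is the R² score returned by LinearRegression.score. A minimal, self-contained sketch on synthetic stand-in data (the real crawled areas and prices would replace it):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: price roughly 3x the area plus noise
rng = np.random.default_rng(0)
area = rng.uniform(40, 150, size=100).reshape(-1, 1)
price = 3.0 * area.ravel() + rng.normal(0, 10, size=100)

model = LinearRegression().fit(area, price)
r2 = model.score(area, price)  # coefficient of determination, close to 1 here
```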
Problems & Evaluations
1. When crawling Lianjia.com, the crawler needs to load the next page of results automatically. The solution is to write a loop using str.format, since the only difference between pages is the single number at the end of the link, e.g. 'pg1' in 'https://nj.lianjia.com/ershoufang/pg1'.
for page in range(1, 101):
    url = 'https://nj.lianjia.com/ershoufang/pg{}'.format(page)
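Run on its own, the loop simply enumerates the page URLs:

```python
# First three page URLs produced by the format-based loop
urls = ['https://nj.lianjia.com/ershoufang/pg{}'.format(page) for page in range(1, 4)]
print(urls[0])  # → https://nj.lianjia.com/ershoufang/pg1
```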
2. After splitting the strings in the list, we still need to pick out one specific field from the whole record. For instance, to get the area '11.3 m^2' from the list ['Hello', '11.3 m^2', 'x', 'y'], we can use find('m^2') to detect which string represents the area, and then slicing with [:index] yields the numeric magnitude of the area.
# Plan B
for i in range(0, len(FullInformationList)):
    Listlist.append(FullInformationList[i].split(' | '))
for i in range(0, len(Listlist)):
    o = Listlist[i]
    for k in range(0, len(o)):
        index = o[k].find('平米')
        if index != -1:
            area = o[k][:index]
            AreaList.append(area)
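The same find-and-slice idea, run on the English example list from above:

```python
info = ['Hello', '11.3 m^2', 'x', 'y']
area = None
for item in info:
    index = item.find('m^2')
    if index != -1:               # this string contains the area
        area = item[:index].strip()
print(area)  # → 11.3
```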