My Python crawler journey: scraping Zhaopin (智联招聘) job listings

Looking back at when I first started playing with Python, what struck me was how concise the code is — writing a program no longer means the long-winded essays of C, Java, or C++ (okay, that's an exaggeration). More importantly, I could be lazy (that part is sincere): no fussing over data types, no bewildered wrestling with complicated libraries. Python ships with a rich standard library, and there are packages for just about everything. This series records the bits and pieces of my Python journey — the pits I fell into and how I climbed out — in the hope that we can all improve together!

import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
import re
import time

url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=全国&kw=python&p=1&kt=3'
wbdata = requests.get(url).content
soup = BeautifulSoup(wbdata, 'lxml')

# Total number of positions matching the search
job_count = re.findall(r"共<em>(.*?)</em>个职位满足条件", str(soup))[0]
items = soup.select("div#newlist_list_content_table > table")
# Number of job entries per page (minus one for the header table)
count = len(items) - 1

print('****** Search results ******')
print('## Total positions:', job_count)
print('## Positions per page:', count)
# Number of result pages
pages = (int(job_count) // count) + 1
print('## Total pages:', pages)
time.sleep(0.8)

def get_zhaopin(page):
    # Build the URL for this page and fetch it — the original code forgot
    # this request and kept re-parsing the first page's soup
    url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=全国&kw=python&p={0}&kt=3'.format(page)
    print("Page {0}".format(page))
    wbdata = requests.get(url).content
    soup = BeautifulSoup(wbdata, 'lxml')
    job_names = soup.select("table.newlist > tr > td.zwmc > div > a")
    salarys = soup.select("table.newlist > tr > td.zwyx")
    locations = soup.select("table.newlist > tr > td.gzdd")
    times = soup.select("table.newlist > tr > td.gxsj > span")
    # "posted" instead of "time" so the loop variable no longer shadows the time module
    for name, salary, location, posted in zip(job_names, salarys, locations, times):
        data = {
            'name': name.get_text(),
            'salary': salary.get_text(),
            'location': location.get_text(),
            'time': posted.get_text(),
        }
        print(data)

if __name__ == '__main__':
    pool = Pool(processes=1)
    pool.map_async(get_zhaopin, range(1, pages + 1))
    pool.close()
    pool.join()
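The page-count logic above hinges on one regex pulling the total job count out of the page markup, followed by integer division. Here is a minimal standalone sketch of just that step, using a made-up HTML fragment and a made-up per-page count of 60 in place of the live page:

```python
import re

# Hypothetical fragment of the search-result markup (illustration only)
html = '共<em>1234</em>个职位满足条件'

# Same non-greedy pattern as in the crawler: capture the number inside <em>
job_count = re.findall(r"共<em>(.*?)</em>个职位满足条件", html)[0]
print(job_count)  # → 1234

# Page count via integer division, assuming 60 jobs per page
count = 60
pages = (int(job_count) // count) + 1
print(pages)  # → 21
```

Note the `(.*?)` is non-greedy, so it stops at the first closing `</em>` rather than swallowing the rest of the page; the `+ 1` accounts for the final, partially filled page.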