Crawl slides with Python

Code for crawling lecture slides with Python.

# -*- coding: utf-8 -*-
# @Author: hejw005
# @Date:   2017-03-25 09:51:18
# @Last Modified by:   h005
# @Last Modified time: 2017-03-26 09:58:59

import re

import bs4
import requests
import wget

# the course website
prefix = 'https://courses.engr.illinois.edu/cs543/sp2015/'
response = requests.get(prefix)

soup = bs4.BeautifulSoup(response.text, "html.parser")

# regular expression matching hrefs that start with 'lectures/' and end
# with '.pdf' (the dot before 'pdf' is escaped so it matches literally)
# ref http://www.runoob.com/regexp/regexp-tutorial.html
pattern = re.compile(r'lectures/.*\.pdf')

for ind, ele in enumerate(soup.find_all(href=pattern)):
    print(ind)
    # get the link
    # ref https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
    tmpHref = ele.get('href')

    # the href looks like 'lectures/NN%20-%20Some%20Title.pdf':
    # split on the URL-encoded ' - ' separator, then strip the
    # remaining '%20' escapes to build a clean local file name
    lis = tmpHref.split('%20-%20')
    filename = ''.join(lis[1].split('%20'))  # already ends with '.pdf'
    filename = str(ind) + '_' + filename
    wget.download(prefix + tmpHref, filename)

print('done')
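The href matching and file-name rebuilding can be checked in isolation. The link below is a made-up example in the same `NN%20-%20Title.pdf` shape as the real ones on the course page:

```python
import re

pattern = re.compile(r'lectures/.*\.pdf')

# hypothetical href in the shape the course page uses
href = 'lectures/03%20-%20Camera%20Models.pdf'
assert pattern.match(href)

# split on the URL-encoded ' - ' separator, then drop the '%20' escapes
lis = href.split('%20-%20')
filename = ''.join(lis[1].split('%20'))
print(filename)  # -> CameraModels.pdf
```

`urllib`'s `unquote` would decode the `%20` escapes more generally, but a plain split is enough for hrefs of this shape.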

Character encoding problem when crawling

I crawled https://www.baidu.com and tried to save the request's text into a log file, but ran into an encoding error.

import requests
r = requests.get('https://www.baidu.com')
print(r.text)

Save the text to the logIn.log file:

python logIn.py > logIn.log

The error info:

Traceback (most recent call last):
  File "logIn.py", line 11, in <module>
    print(r.text)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 317-343: ordinal not in range(128)
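The failing step can be reproduced in isolation (the two Chinese characters below are a hypothetical stand-in for the page content): when stdout is redirected to a file, Python 2 falls back to the 'ascii' codec, which simply has no representation for non-ASCII characters.

```python
# -*- coding: utf-8 -*-
# minimal reproduction: encoding non-ASCII text with the 'ascii' codec
# raises the same UnicodeEncodeError as in the traceback
try:
    u'\u767e\u5ea6'.encode('ascii')  # two Chinese characters
except UnicodeEncodeError as e:
    print(e)
```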

Then I found that this is caused by an encoding problem, so I tried to solve it by writing the text into a file opened with 'utf-8' encoding.

Here is the code:

import requests
import io
f = io.open('logIn.log', 'w', encoding='utf-8')
r = requests.get('https://www.baidu.com')
f.write(r.text)
f.close()

The output file was still full of garbled characters.

At last, I found that the web page's encoding is not the same as the file's encoding. We can use r.encoding to check which encoding requests assumed when decoding the response text.

The code is:

import requests
import io
f = io.open('logIn.log', 'w', encoding='ISO-8859-1')
r = requests.get('https://www.baidu.com')
print(r.encoding)
f.write(r.text)
f.close()

By printing r.encoding, we can see that requests decoded the response as 'ISO-8859-1'. After saving the text to a file with this same encoding, everything goes well: re-encoding with the codec requests used for decoding restores the server's original bytes.
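A minimal sketch (using a hypothetical two-character string) of why writing with ISO-8859-1 round-trips the server's bytes:

```python
# -*- coding: utf-8 -*-
# ISO-8859-1 maps every byte 0x00-0xFF to a character, so decoding
# arbitrary bytes with it never fails -- and re-encoding restores
# exactly the bytes the server sent
raw = u'\u767e\u5ea6'.encode('utf-8')    # the server's actual UTF-8 bytes
text = raw.decode('iso-8859-1')          # what requests hands back as r.text
assert text.encode('iso-8859-1') == raw  # writing with the same codec round-trips
print('round-trip ok')
```

So the saved logIn.log actually contains valid UTF-8 again, which is why it opens cleanly.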