Python Web Programming: urllib2 and Beautiful Soup

Some modules that are needed when working with web pages

The urlparse module parses a URL into its components

from urlparse import urlparse


print urlparse("https://www.youtube.com/results?sp=CAM%253D&q=metasploitable+2")

Result:

ParseResult(scheme='https', netloc='www.youtube.com', path='/results', params='', query='sp=CAM%253D&q=metasploitable+2', fragment='')
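
The fields of the returned result can also be read by name, and the query string can be split further. The following is a minimal sketch, not part of the original example: it uses the standard-library helper parse_qs, and the try/except import is an assumption added so that the same snippet also runs on Python 3, where these functions live in urllib.parse.

```python
try:
    from urlparse import urlparse, parse_qs  # Python 2
except ImportError:
    from urllib.parse import urlparse, parse_qs  # Python 3 (assumed fallback)

url = "https://www.youtube.com/results?sp=CAM%253D&q=metasploitable+2"
parts = urlparse(url)

# Each component is available as a named attribute.
print(parts.scheme)
print(parts.netloc)
print(parts.path)

# parse_qs decodes the query one level: %25 becomes %, and + becomes a space.
params = parse_qs(parts.query)
print(params["q"][0])
```

Note that the value of sp is still percent-encoded after one pass, because it was encoded twice in the original URL.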

urllib2

The urllib2 module handles opening URLs

import urllib2


f = urllib2.urlopen("http://cuiqingcai.com/1319.html")
print f.getcode() # HTTP status code
print f.geturl() # URL of the page actually retrieved (after any redirects)
print f.info() # meta information (response headers)
print f.read() # page content
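
The call above raises an exception when the request fails. A rough sketch of how that could be handled is shown below; the helper name fetch, its return shape, and the 10-second timeout are hypothetical choices for illustration, and the import fallback is an assumption so the snippet also runs on Python 3. The demonstration uses a nonexistent file:// URL so that it does not need network access.

```python
try:
    from urllib2 import urlopen, HTTPError, URLError  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3 (assumed fallback)
    from urllib.error import HTTPError, URLError


def fetch(url, timeout=10):
    """Hypothetical helper: return (status, body, error) for a URL."""
    try:
        response = urlopen(url, timeout=timeout)
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return err.code, None, str(err)
    except URLError as err:
        # No usable response: DNS failure, refused connection, missing file...
        return None, None, str(err.reason)
    try:
        return response.getcode(), response.read(), None
    finally:
        response.close()


# A file:// URL that does not exist exercises the URLError branch offline.
status, body, error = fetch("file:///no/such/file/for/this/example")
```

For a real page, fetch("http://cuiqingcai.com/1319.html") would return the status code and the page body, with error set to None.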

Some examples

HTTP authentication

import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')
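
Installing the opener globally is optional. As a sketch of an alternative, the opener can be kept local and its open() method called directly; the version below also assumes that the realm is not known in advance and therefore uses HTTPPasswordMgrWithDefaultRealm, which is a swap from the example above rather than part of it. The alias request_lib and the import fallback are assumptions so the snippet runs on both Python 2 and Python 3. No request is sent here.

```python
try:
    import urllib2 as request_lib  # Python 2
except ImportError:
    import urllib.request as request_lib  # Python 3 (assumed fallback)

uri = 'https://mahler:8092/site-updates.py'

# Passing None as the realm makes these credentials the default for the URI.
password_mgr = request_lib.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, uri, 'klem', 'kadidd!ehopper')

auth_handler = request_lib.HTTPBasicAuthHandler(password_mgr)
opener = request_lib.build_opener(auth_handler)

# opener.open(uri) would use the credentials without affecting urlopen().
# Here we only check which credentials would be picked for a given realm.
credentials = password_mgr.find_user_password('PDQ Application', uri)
```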

Adding HTTP headers

import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
# Customize the default User-Agent header value:
req.add_header('User-Agent', 'urllib-example/0.1 (Contact: . . .)')
r = urllib2.urlopen(req)
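
Headers can also be passed to the Request constructor as a dictionary and inspected before anything is sent. The sketch below sends no request; the shortened User-Agent value is illustrative, and the import fallback is an assumption so the snippet also runs on Python 3. One detail worth knowing is that Request stores header names in capitalized form, so 'User-Agent' has to be looked up as 'User-agent'.

```python
try:
    from urllib2 import Request  # Python 2
except ImportError:
    from urllib.request import Request  # Python 3 (assumed fallback)

req = Request(
    'http://www.example.com/',
    headers={
        'Referer': 'http://www.python.org/',
        'User-Agent': 'urllib-example/0.1',  # illustrative value
    },
)

# Header names are normalized with str.capitalize() when they are stored.
referer = req.get_header('Referer')
user_agent = req.get_header('User-agent')
missing = req.get_header('User-Agent')  # None: the stored key is 'User-agent'
```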

Beautiful Soup

The Beautiful Soup library is used to parse web pages
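
As a small illustration of what parsing looks like, the sketch below extracts the title and the links from an inline HTML string. It assumes the bs4 package (Beautiful Soup 4) is installed and uses Python's built-in "html.parser" backend; the HTML snippet and its URLs are made up for the example.

```python
from bs4 import BeautifulSoup  # assumes Beautiful Soup 4 is installed

# Made-up page content, used instead of a downloaded page.
html = """<html><head><title>Example page</title></head>
<body><p class="intro">Hello</p>
<a href="http://www.example.com/a">A</a>
<a href="http://www.example.com/b">B</a></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# The <title> text and the href attribute of every <a> tag.
title = soup.title.string
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(links)
```

In practice the html string would typically come from the read() call of a urllib2 response.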

Reference: http://cuiqingcai.com/1319.html