Web Scraping with Python

1.1 Connecting to the Network

Example code:

from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

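Here urllib is part of the Python standard library. It contains functions for requesting data over the network, handling cookies, and even changing metadata such as request headers and the user agent.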
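As a rough illustration of the header handling mentioned above, the sketch below builds a request that carries a custom User-Agent. The helper name build_request and the user-agent string are made up for this example; nothing is sent over the network until the request object is handed to urlopen.

```python
from urllib.request import Request

def build_request(url, user_agent="example-scraper/0.1"):
    # build_request and the default user-agent string are hypothetical,
    # introduced only for this illustration.
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("http://pythonscraping.com/pages/page1.html")
# Request stores header names via str.capitalize(), so the key is "User-agent".
print(req.get_header("User-agent"))
```

Calling urlopen(req) instead of urlopen with a plain URL string would send the request with that header attached.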

urlopen opens and reads a remote object fetched over the network. It is quite generic: it can read HTML files, image files, or any other file stream with ease.
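One detail worth seeing in action is that read() hands back raw bytes, whatever kind of resource was opened. The sketch below uses a data: URL, which carries its payload inline, so it runs without a network connection; the payload text is made up for this example.

```python
from urllib.request import urlopen

# A data: URL embeds its content, so no server is contacted here.
# The payload "Hello scraper" is an assumption made for this illustration.
with urlopen("data:text/plain;charset=utf-8,Hello%20scraper") as response:
    raw = response.read()                               # bytes, not str
    content_type = response.headers.get_content_type()  # media type of the stream

text = raw.decode("utf-8")  # decode explicitly before treating it as text
print(type(raw).__name__, content_type, text)
```

The same read-then-decode pattern applies to an HTML page fetched over HTTP; for an image you would keep the bytes as they are.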

1.2 Introduction to BeautifulSoup

BeautifulSoup formats and organizes messy web content by locating HTML tags, and presents the XML structure as easy-to-use Python objects.
Example code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)

Output:
<h1>An Interesting Title</h1>

Structure breakdown:

- html → <html><head>...</head><body>...</body></html>

    - head → <head><title>A Useful Page</title></head>

        - title → <title>A Useful Page</title>

    - body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>

        - h1 → <h1>An Interesting Title</h1>

        - div → <div>Lorem Ipsum dolor...</div>

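As the outline shows, the <h1> tag sits two levels deep in the bsObj BeautifulSoup object (html → body → h1). To pull the h1 tag out of the object, though, you can call it directly: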
bsObj.h1
Other calls that reach the same tag:
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
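To check that these paths really lead to the same place, here is a small sketch that parses an inline copy of the page structure instead of downloading it. The inline markup is a stand-in written for this example, and "html.parser" is the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (an assumption for this example).
page = ("<html><head><title>A Useful Page</title></head>"
        "<body><h1>An Interesting Title</h1>"
        "<div>Lorem ipsum dolor...</div></body></html>")
bsObj = BeautifulSoup(page, "html.parser")

candidates = [bsObj.h1, bsObj.html.body.h1, bsObj.body.h1, bsObj.html.h1]
# Attribute access returns the first matching descendant, so every
# path ends at the same Tag object in the parsed tree.
print(all(tag is bsObj.h1 for tag in candidates))
```

The shorter forms are convenient, but the longer ones make it clearer where in the document the tag is expected to live.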

1.2.3 Connecting Reliably

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
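Two things can go wrong on this line:

- The page does not exist on the server (or an error occurred while retrieving it)
- The server does not exist

In the first case, the server returns an HTTP error such as "404 Page Not Found" or "500 Internal Server Error". In all such cases, urlopen raises an HTTPError exception, which can be handled like this: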

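from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # Return None, stop the program, or fall back to another plan
else:
    # The program continues. Note: if you already return or break inside
    # the except block above, the else clause is unnecessary and this
    # code will not run.
    pass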
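Beyond printing the exception, it can help to branch on the status code it carries. The sketch below is one possible way to do that; describe_http_error is a hypothetical helper, and the sample error is constructed by hand (with a made-up URL) so the example runs offline.

```python
import io
from urllib.error import HTTPError

def describe_http_error(error):
    # describe_http_error is a hypothetical helper for this illustration.
    # HTTPError exposes the numeric status as .code and the phrase as .reason.
    if error.code == 404:
        return "page not found"
    if 500 <= error.code < 600:
        return "server-side failure: " + str(error.reason)
    return "HTTP error " + str(error.code)

# Built by hand so no request is made; urlopen raises the same exception type.
sample = HTTPError("http://www.pythonscraping.com/pages/missing.html",
                   404, "Not Found", None, io.BytesIO())
print(describe_http_error(sample))
```

Inside a real except HTTPError block you would pass the caught exception to the helper in place of sample.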

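In the second case, the server cannot be reached at all (the site is down or the URL is mistyped). urlopen does not return None in this situation; it raises URLError. HTTPError is a subclass of URLError, so when both are handled, the except HTTPError clause must come first. A check for the unreachable server looks like this:

from urllib.error import URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except URLError as e:
    print("The server could not be found")
else:
    print("Connected; the program continues")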
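In current Python 3 releases an unreachable resource surfaces as a URLError exception. One possible way to fold both failure modes into a single None result, so that callers only need an `is None` test, is a small wrapper like the sketch below. The name safe_open is hypothetical, and the file: URL points at a path assumed not to exist so the demo stays offline.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def safe_open(url):
    # safe_open is a hypothetical wrapper for this illustration.
    try:
        return urlopen(url)
    except HTTPError as e:
        # The server answered, but with an error status.
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all (unknown host, refused connection, missing file).
        print("Could not reach the resource:", e.reason)
    return None

# This path is assumed not to exist on the local machine.
result = safe_open("file:///no/such/dir/page1.html")
print(result is None)
```

HTTPError is listed first because it is the more specific of the two exception classes.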

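The page may also be fetched successfully yet not contain what we expect. If the tag you ask for does not exist, BeautifulSoup returns None. Accessing a child tag on that None object then raises AttributeError.
👇 This line (nonExistentTag is a made-up tag that the BeautifulSoup object does not contain):
print(bsObj.nonExistentTag)
returns None. Checking for and handling this object is essential: if you skip the check and access a child tag on None directly, you get an error.
For example:
print(bsObj.nonExistentTag.someTag)
raises an exception: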
AttributeError: 'NoneType' object has no attribute 'someTag'

The fix is to check for both cases explicitly:

try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
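If the same check is needed in several places, it could be wrapped in a helper that walks down one tag at a time and stops at the first missing one. This is only a sketch: get_nested_tag is a hypothetical name, and the inline markup stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

def get_nested_tag(root, *names):
    # get_nested_tag is a hypothetical helper for this illustration.
    # Walk down one tag name at a time and stop at the first missing tag.
    current = root
    for name in names:
        if current is None:
            return None
        current = current.find(name)
    return current

soup = BeautifulSoup("<html><body><h1>An Interesting Title</h1></body></html>",
                     "html.parser")
print(get_nested_tag(soup, "body", "h1"))
print(get_nested_tag(soup, "nonExistentTag", "someTag"))
```

The helper never raises AttributeError for a missing tag, so the caller only has to compare the result with None.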

Combining these checks, here is the 👆 scraper code written a different way:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTP error status, or the server could not be reached
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # The page has no <body> tag
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
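The parsing half of getTitle can be tried out without any network access by feeding it markup directly. The sketch below does that with a hypothetical parse_title helper and three made-up inputs covering the success case and both failure cases.

```python
from bs4 import BeautifulSoup

def parse_title(markup):
    # parse_title is a hypothetical helper: the parsing half of getTitle.
    try:
        return BeautifulSoup(markup, "html.parser").body.h1
    except AttributeError:
        # Raised when the document has no <body> tag at all.
        return None

# Made-up inputs: a heading present, a heading missing, a body missing.
print(parse_title(b"<html><body><h1>An Interesting Title</h1></body></html>"))
print(parse_title(b"<html><body><p>No heading here</p></body></html>"))
print(parse_title(b"<p>No body tag</p>"))
```

Both failure cases end up as None, which is why the caller's single `is None` test is enough.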
