This article was first published on my blog: http://gongyanli.com
Code: https://github.com/Gladysgong/seCrawler
Jianshu: https://www.jianshu.com/p/4e244563849a
CSDN: https://blog.csdn.net/u012052168/article/details/79762586
A Scrapy project that crawls the search results returned by Google, Bing, and Baidu for a given keyword.
1.refer
Copied from https://github.com/xtt129/seCrawler and rewritten, adding the title and abstract of each result.
prerequisite
Python 3.5 and Scrapy are required.
commands
Run a single command to fetch 50 pages of results from a search engine for a keyword. The results are saved to “urls.txt” in the current directory.
####Bing
scrapy crawl keywordSpider -a keyword=Spider-Man -a se=bing -a pages=50
####Google
scrapy crawl keywordSpider -a keyword=Spider-Man -a se=google -a pages=50
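To illustrate how the `keyword`, `se`, and `pages` arguments might turn into the list of pages to crawl, here is a minimal sketch. The URL templates, the 10-results-per-page offset, and the helper name `build_start_urls` are assumptions for illustration, not the project's actual code. Scrapy passes `-a` values to the spider as strings, which is why `pages` is converted with `int`.

```python
from urllib.parse import quote_plus

# Assumed URL templates; the real project may use different ones.
SEARCH_URLS = {
    "google": "https://www.google.com/search?q={keyword}&start={offset}",
    "bing": "https://www.bing.com/search?q={keyword}&first={offset}",
    "baidu": "https://www.baidu.com/s?wd={keyword}&pn={offset}",
}

RESULTS_PER_PAGE = 10  # assumption: each engine returns 10 results per page


def build_start_urls(keyword, se, pages):
    """Return one search URL per result page for the chosen engine (hypothetical helper)."""
    engine = se.lower()
    template = SEARCH_URLS[engine]  # raises KeyError for an unsupported engine
    encoded = quote_plus(keyword)   # URL-encode the keyword
    urls = []
    for page in range(int(pages)):  # -a arguments arrive as strings
        offset = page * RESULTS_PER_PAGE
        if engine == "bing":
            offset += 1  # assumption: Bing's `first` parameter is 1-based
        urls.append(template.format(keyword=encoded, offset=offset))
    return urls


urls = build_start_urls("Spider-Man", "google", "50")
print(len(urls))   # number of result pages requested
print(urls[0])     # first page of Google results
```

A spider could assign the returned list to its `start_urls` attribute in `__init__`.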
results
The URL, title, and abstract of each result are stored in urls.txt.
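One way such a file could be written is with a Scrapy item pipeline, which is a plain Python class exposing `process_item`. The sketch below is an assumption about the approach, not the project's implementation: the class name `UrlsFilePipeline`, the field names `url`, `title`, and `abstract`, and the tab-separated line format are all hypothetical.

```python
class UrlsFilePipeline:
    """Append one tab-separated line per result to a text file (hypothetical pipeline)."""

    def __init__(self, path="urls.txt"):
        self.path = path  # assumed output location: current directory
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # Assumed field names; items are expected to be dicts or dict-like.
        fields = (item.get("url", ""), item.get("title", ""), item.get("abstract", ""))
        # Collapse tabs and newlines so each result stays on one line.
        cleaned = [" ".join(str(field).split()) for field in fields]
        self.file.write("\t".join(cleaned) + "\n")
        return item
```

To use a pipeline like this, it would be registered under `ITEM_PIPELINES` in settings.py with the module path that matches its location in the project.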
limitation
The project doesn’t provide any workaround for anti-spider measures such as CAPTCHAs, IP ban lists, etc.
To reduce the chance of triggering these measures, we recommend setting DOWNLOAD_DELAY=10
in the settings.py file to add a delay (in seconds) between the crawl of two pages; see the Scrapy settings documentation for details.
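As a sketch, the relevant fragment of settings.py could look like the following. `DOWNLOAD_DELAY` is the setting recommended above; `RANDOMIZE_DOWNLOAD_DELAY` is a standard Scrapy setting that is not mentioned in the original and is shown here only as an optional addition.

```python
# settings.py (fragment)

# Wait 10 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 10

# Optional (already Scrapy's default): vary each wait between
# 0.5x and 1.5x of DOWNLOAD_DELAY so requests look less regular.
RANDOMIZE_DOWNLOAD_DELAY = True
```

With a 10-second delay, a 50-page crawl of one engine takes at least several minutes, so the `pages` argument may be worth lowering while testing.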