一 介绍
Scrapy 是基于twisted框架开发而来,twisted是一个流行的事件驱动的python网络框架。因此Scrapy使用了一种非阻塞(又名异步)的代码来实现并发。整体架构大致如下
The data flow in Scrapy is controlled by the execution engine, and goes like this:
- The Engine gets the initial Requests to crawl from the Spider.
- The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
- The Scheduler returns the next Requests to the Engine.
- The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see
process_request()). - Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see
process_response()). - The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see
process_spider_input()). - The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see
process_spider_output()). - The Engine sends processed items to Item Pipelines, then send processed Requests to the Scheduler and asks for possible next Requests to crawl.
- The process repeats (from step 1) until there are no more requests from the Scheduler.
Components:
-
引擎(EGINE)
引擎负责控制系统所有组件之间的数据流,并在某些动作发生时触发事件。有关详细信息,请参见上面的数据流部分。
-
调度器(SCHEDULER)
用来接受引擎发过来的请求, 压入队列中, 并在引擎再次请求的时候返回. 可以想像成一个URL的优先级队列, 由它来决定下一个要抓取的网址是什么, 同时去除重复的网址 -
下载器(DOWLOADER)
用于下载网页内容, 并将网页内容返回给EGINE,下载器是建立在twisted这个高效的异步模型上的 -
爬虫(SPIDERS)
SPIDERS是开发人员自定义的类,用来解析responses,并且提取items,或者发送新的请求 -
项目管道(ITEM PIPLINES)
在items被提取后负责处理它们,主要包括清理、验证、持久化(比如存到数据库)等操作 -
下载器中间件(Downloader Middlewares)
位于Scrapy引擎和下载器之间,主要用来处理从EGINE传到DOWLOADER的请求request,已经从DOWNLOADER传到EGINE的响应response,你可用该中间件做以下几件事- process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
- change received response before passing it to a spider;
- send a new Request instead of passing received response to a spider;
- pass response to a spider without fetching a web page;
- silently drop some requests.
-
爬虫中间件(Spider Middlewares)
位于EGINE和SPIDERS之间,主要工作是处理SPIDERS的输入(即responses)和输出(即requests)
官网链接:https://docs.scrapy.org/en/latest/topics/architecture.html
二 安装
#Windows平台
1、pip3 install wheel #安装后,便支持通过wheel文件安装软件,wheel文件官网:https://www.lfd.uci.edu/~gohlke/pythonlibs
3、pip3 install lxml
4、pip3 install pyopenssl
5、下载并安装pywin32:https://sourceforge.net/projects/pywin32/files/pywin32/
6、下载twisted的wheel文件:http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
7、执行pip3 install 下载目录\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
8、pip3 install scrapy
#Linux平台
1、pip3 install scrapy
三 命令行工具
Global commands: scrapy + (下面的命令)
scrapy startproject *** #创建项目 ***(这是项目名,不是爬虫名) cd myproject
scrapy genspider baidu www.baidu.com #创建爬虫程序baidu.py,以及爬虫名称 #百度网址是域名,不一定非要有,有的话只能爬百度开头的网址
# 一个项目可以有多个爬虫,使用 scrapy list 查看
scrapy settings #如果是在项目目录下,则得到的是该项目的配置 scrapy runspider baidu.py # 爬虫文件的绝对路径(不是项目文件)
#运行一个独立的python文件,不必创建项目 scrapy shell http://www.baidu.com #在交互式调试,如选择器规则正确与否 response.status / text / body
response.xpth()
scrapy view(response) #下载到本地,完毕后直接弹出浏览器,以此可以分辨出哪些数据是ajax请求
Project-only commands:
scrapy crawl baidu (爬虫名称) #运行爬虫,必须创建项目才行,确保配置文件中ROBOTSTXT_OBEY = False scrapy check #检测项目中有无语法错误 scrapy list #列出项目中所包含的爬虫名 scrapy parse http://www.baodu.com/ --callback parse #以此可以验证我们的回调函数是否正确 scrapy bench #scrapy bentch压力测试,看每分钟爬多少页面 官网链接 https://docs.scrapy.org/en/latest/topics/commands.html
#1、执行全局命令:请确保不在某个项目的目录下,排除受该项目配置的影响
scrapy startproject MyProject
cd MyProject
scrapy genspider baidu www.baidu.com
scrapy settings --get XXX #如果切换到项目目录下,看到的则是该项目的配置
scrapy runspider baidu.py
scrapy shell https://www.baidu.com
response
response.status
response.body
view(response)
scrapy view https://www.taobao.com #如果页面显示内容不全,不全的内容则是ajax请求实现的,以此快速定位问题
scrapy fetch --nolog --headers https://www.taobao.com
scrapy version #scrapy的版本
scrapy version -v #依赖库的版本
#2、执行项目命令:切到项目目录下
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench