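【Posted】:2020-11-03 22:15:21
【Question】:
I have tried several tutorials, but no matter what I try, I always get the same result: "Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)".
My code is simple: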
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com/']
    start_urls = ['http://books.toscrape.com//']

    def parse(self, response):
        print(response.url)
The output is:
2020-11-03 22:11:52 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: books)
2020-11-03 22:11:52 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (default, Jul 2 2020, 11:26:31) - [Clang 10.0.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform macOS-10.15.7-x86_64-i386-64bit
2020-11-03 22:11:52 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-11-03 22:11:52 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'books',
 'NEWSPIDER_MODULE': 'books.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['books.spiders']}
2020-11-03 22:11:52 [scrapy.extensions.telnet] INFO: Telnet Password: ae1669f089ac9e66
2020-11-03 22:11:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-11-03 22:11:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-11-03 22:11:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-11-03 22:11:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-11-03 22:11:52 [scrapy.core.engine] INFO: Spider opened
2020-11-03 22:11:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-03 22:11:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-11-03 22:11:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2020-11-03 22:11:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com//> (referer: None)
http://books.toscrape.com//
2020-11-03 22:11:53 [scrapy.core.engine] INFO: Closing spider (finished)
2020-11-03 22:11:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 455,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 6065,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.593427,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 11, 3, 22, 11, 53, 534397),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'memusage/max': 49852416,
 'memusage/startup': 49852416,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 11, 3, 22, 11, 52, 940970)}
2020-11-03 22:11:53 [scrapy.core.engine] INFO: Spider closed (finished)
【Discussion】:
Tags: python python-3.x web-scraping scrapy