【Posted】:2017-06-17 03:41:56
【Question】:
I am using Scrapy to parse the following site: http://www.banki.ru/services/responses/. When I parse it step by step in the shell, everything works, i.e. this line works:
response.xpath("//script[contains(., 'banksData')]/text()").re(r'"name":"(.*?)","code"')
But when I start the crawl, I get the following log.
2017-06-16 20:59:27 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: banksru)
2017-06-16 20:59:27 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'banksru', 'FEED_FORMAT': 'json', 'NEWSPIDER_MODULE': 'banksru.spiders', 'SPIDER_MODULES': ['banksru.spiders'], 'FEED_URI': 'banki.json'}
2017-06-16 20:59:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.feedexport.FeedExporter']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Spider opened
2017-06-16 20:59:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-16 20:59:28 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-16 20:59:28 [scrapy.core.engine] DEBUG: Crawled (429) <GET http://www.banki.ru/services/responses/> (referer: None)
2017-06-16 20:59:28 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.banki.ru/services/responses/>: HTTP status code is not handled or not allowed
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-16 20:59:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 229,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 119,
'downloader/response_count': 1,
'downloader/response_status_count/429': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 6, 16, 17, 59, 28, 827696),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/429': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 16, 17, 59, 28, 573054)}
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Spider closed (finished)
I know the site blocks bots and has an issue with the user agent, so I changed settings.py. These are the Scrapy settings for my project:
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'banksru'
SPIDER_MODULES = ['banksru.spiders']
NEWSPIDER_MODULE = 'banksru.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'www.example.com'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'banksru.middlewares.BanksruSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'banksru.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'banksru.pipelines.BanksruPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
The code I am trying to run is simple:
import scrapy


class BankRating(scrapy.Spider):
    name = "banki"
    start_urls = [
        "http://www.banki.ru/services/responses/",
    ]

    def parse(self, response):
        # Both data sets are embedded as JSON inside inline <script> tags.
        banks = response.xpath("//script[contains(., 'banksData')]/text()")
        ratings = response.xpath("//script[contains(., 'ratingData')]/text()")
        # A Scrapy callback must yield dicts, Items or Requests, not a bare tuple.
        yield {
            "name": banks.re(r'"name":"(.*?)","code"'),
            "rating": ratings.re(r'"rating":(.*?),"responseCount"'),
            "avg_grade": ratings.re(r'"middleGrade":(.*?),"middleRating"'),
            "checked_responses": ratings.re(
                r'"checkedResponseCount":(.*?),"checkedResponseCountForYear"'),
            "num_responses": ratings.re(r'"responseCount":(.*?),"responseCountForYear"'),
            "solved_problems": ratings.re(r'"solvedResponseCount":(.*?),"withAgentAnswer"'),
            "bank_answers": ratings.re(r'"withAgentAnswer":(.*?),"middleGrade"'),
        }
My machine runs Windows 8.1, and Scrapy is installed for Python 3.5. Thanks in advance for any help.
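The extraction pattern from the question can be checked offline, independently of the 429 response, with nothing but the standard library. The sketch below is only an illustration: `SAMPLE_SCRIPT` is a hypothetical stand-in for the inline script text (the real page's JSON may be laid out differently), and `extract_names` is a helper name invented here.

```python
import re

# Hypothetical stand-in for the inline script text; the real page's JSON may differ.
SAMPLE_SCRIPT = (
    'var banksData = [{"id":1,"name":"Bank A","code":"bank-a"},'
    '{"id":2,"name":"Bank B","code":"bank-b"}];'
)

# Same pattern as in the question: lazily capture the text between
# "name":" and ","code".
NAME_PATTERN = re.compile(r'"name":"(.*?)","code"')


def extract_names(script_text):
    """Return every bank name found in the script text, in document order."""
    return NAME_PATTERN.findall(script_text)


if __name__ == "__main__":
    print(extract_names(SAMPLE_SCRIPT))
```

If the pattern behaves as expected on saved page text, the empty crawl output is more likely caused by the blocked request than by the selectors.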
【Comments】:
-
The HTTP code you are getting appears to be 429, which means too many requests were sent in a given period of time.
-
Try setting AUTOTHROTTLE_ENABLED = True in settings.py.
-
@KaushikNP Thanks. I removed the # from USER_AGENT, ROBOTSTXT_OBEY and AUTOTHROTTLE_ENABLED, and it went fine.
-
Sure. Glad I could help.
-
Removing the hash sign to uncomment a line of code is rather important if you want it to take effect, lol.
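For reference, the fix described in the comments could look roughly like the settings.py excerpt below. It is a sketch, not the asker's actual file: the first three values are the ones shown in the question and comments, while the `RETRY_HTTP_CODES` line is an extra assumption that nobody in the thread suggested.

```python
# settings.py (excerpt): the three lines the asker reports uncommenting.

# Send an explicit User-Agent. This value is the placeholder from the
# question's settings file, not a recommendation.
USER_AGENT = 'www.example.com'

# Explicit robots.txt setting, with the value shown in the question.
ROBOTSTXT_OBEY = False

# Let Scrapy adapt the delay between requests to the server's response times.
AUTOTHROTTLE_ENABLED = True

# Assumption, not from the thread: retry 429 responses instead of dropping them.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```

Whether retrying helps depends on how the site rate-limits. A fixed `DOWNLOAD_DELAY` is another option already present, commented out, in the settings file above.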