【Question Title】: Scrapy works in shell, but crawls 0 pages
【Posted】: 2017-06-17 03:41:56
【Question Description】:

I am using scrapy to parse the following site: http://www.banki.ru/services/responses/. When I step through it in the shell, everything works fine, i.e. this line returns results:

response.xpath("//script[contains(., 'banksData')]/text()").re(r'"name":"(.*?)","code"')
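
For anyone reproducing this, such a shell session can be started with an explicit User-Agent, which helps separate a selector problem from a blocked request (a sketch, not part of the original post; the -s flag sets any Scrapy setting for that session):

scrapy shell -s USER_AGENT='Mozilla/5.0' 'http://www.banki.ru/services/responses/'
>>> response.xpath("//script[contains(., 'banksData')]/text()").re(r'"name":"(.*?)","code"')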

But when I start the crawl, I get the following log:

2017-06-16 20:59:27 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: banksru)
2017-06-16 20:59:27 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'banksru', 'FEED_FORMAT': 'json', 'NEWSPIDER_MODULE': 'banksru.spiders', 'SPIDER_MODULES': ['banksru.spiders'], 'FEED_URI': 'banki.json'}
2017-06-16 20:59:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.feedexport.FeedExporter']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Spider opened
2017-06-16 20:59:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-16 20:59:28 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-16 20:59:28 [scrapy.core.engine] DEBUG: Crawled (429) <GET http://www.banki.ru/services/responses/> (referer: None)
2017-06-16 20:59:28 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.banki.ru/services/responses/>: HTTP status code is not handled or not allowed
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-16 20:59:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 229,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 119,
 'downloader/response_count': 1,
 'downloader/response_status_count/429': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 16, 17, 59, 28, 827696),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/429': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 6, 16, 17, 59, 28, 573054)}
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Spider closed (finished)

I know the site blocks bots and is picky about the user agent, so I changed settings.py, my project's Scrapy settings:

# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'banksru'

SPIDER_MODULES = ['banksru.spiders']
NEWSPIDER_MODULE = 'banksru.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'www.example.com'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banksru.middlewares.BanksruSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banksru.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'banksru.pipelines.BanksruPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The code I am trying to run is straightforward:

import scrapy

class BankRating(scrapy.Spider):
    name = "banki"
    start_urls = [
        "http://www.banki.ru/services/responses/",
    ]


    def parse(self, response):
        # The data sits in inline <script> blocks; extract each block's text
        # once and run the regexes against it.
        banks = response.xpath("//script[contains(., 'banksData')]/text()")
        ratings = response.xpath("//script[contains(., 'ratingData')]/text()")
        # parse() must yield dicts, Items or Requests -- a bare tuple is
        # rejected by the engine, so wrap the extracted lists in a dict.
        yield {
            'name': banks.re(r'"name":"(.*?)","code"'),
            'rating': ratings.re(r'"rating":(.*?),"responseCount"'),
            'avg_grade': ratings.re(r'"middleGrade":(.*?),"middleRating"'),
            'checked_responses': ratings.re(r'"checkedResponseCount":(.*?),"checkedResponseCountForYear"'),
            'num_responses': ratings.re(r'"responseCount":(.*?),"responseCountForYear"'),
            'solved_problems': ratings.re(r'"solvedResponseCount":(.*?),"withAgentAnswer"'),
            'bank_answers': ratings.re(r'"withAgentAnswer":(.*?),"middleGrade"'),
        }
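
For completeness, the FEED_FORMAT and FEED_URI values in the log above suggest the spider was launched with JSON feed export enabled, presumably along these lines (reconstructed; the exact command is not shown in the original):

scrapy crawl banki -o banki.json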

My machine runs Windows 8.1, and scrapy is installed for Python 3.5. Thanks in advance for any help.

【Comments】:

  • The HTTP code you are getting is 429, which means too many requests have been sent in a given amount of time.
  • Try AUTOTHROTTLE_ENABLED = True in settings.py.
  • @KaushikNP Thanks. Removed the # for USER_AGENT, ROBOTSTXT_OBEY and AUTOTHROTTLE_ENABLED, and it worked (the uncommented lines are sketched below).
  • Sure. Glad to help.
  • Turns out removing the hash marks to uncomment those lines in the settings was the important part, haha.
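
A sketch of what those uncommented lines in settings.py presumably looked like (the USER_AGENT string here is a hypothetical placeholder; the actual value used is not shown in the thread):

# settings.py -- sketch of the relevant uncommented lines
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; rv:50.0) Gecko/20100101 Firefox/50.0'  # hypothetical value
ROBOTSTXT_OBEY = False
AUTOTHROTTLE_ENABLED = True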

Tags: python scrapy


【Solution 1】:

Scrapy, as a bot, is very resource-intensive for the servers it crawls, because it is fast and makes asynchronous calls. This leads to some clear guidelines that should be followed so that crawling stays forgiving and friendly and does no harm to the web. These are nicely highlighted in Valdir Stumm Jr.'s blog post How to Crawl the Web Politely with Scrapy.

  • Website owners use the robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. The file usually lives at the root of the website, and your crawler should follow the rules defined in it.

  • The number of requests a website can handle varies widely. AutoThrottle automatically adjusts the delay between requests according to the current web server load: it first computes the latency of a single request, then adjusts the delays between requests to the same domain so that no more than AUTOTHROTTLE_TARGET_CONCURRENCY requests are active at the same time.

Enabling these in settings.py should allow scrapy to crawl the website, as sketched below. Thanks to @Ding for pointing out that "HTTP code 429 means too many requests were sent in a given time frame".
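
As a concrete illustration of the two points above, a polite configuration in settings.py might look like this (the values are illustrative defaults, not taken from the answer):

# Honour the Robots Exclusion Protocol
ROBOTSTXT_OBEY = True

# Let AutoThrottle adapt the delay to the measured server latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound when latencies are high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per server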

【Discussion】:

【Solution 2】:

When something about a request looks wrong, websites try to protect themselves by responding with various error statuses.

This particular case is common but simple to get past: you can bypass it by setting a common USER_AGENT:

settings.py

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:50.0) Gecko/20100101 Firefox/50.0'

because by default scrapy uses something like:

"Scrapy/1.3.0 (+http://scrapy.org)"

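If editing the project-wide settings.py is not desirable, the same override can also be scoped to a single spider via the custom_settings class attribute (a sketch, reusing the spider from the question):

import scrapy

class BankRating(scrapy.Spider):
    name = "banki"
    # Per-spider override; takes precedence over the project settings.py
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:50.0) Gecko/20100101 Firefox/50.0',
    }
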

【Discussion】:
