使用scrapy时访问网页答案

【问题标题】：webpage access while using scrapy使用scrapy时访问网页
【发布时间】：2016-10-19 17:55:24
【问题描述】：

我是 python 和 scrapy 的新手。我按照教程并尝试抓取一些网页。我使用了 tutorial 中的代码并分别替换了 URL - http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0 和 http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1。

生成 html 文件时，不会显示整个数据。仅显示此 URL - http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=0&sd=0&states=ALL&near=&ps=20&p=0 之前的数据。

此外，当命令运行时，第二个 URL 已被删除，说明它是重复的，并且只创建了一个 html 文件。

我想知道网页是否拒绝访问该特定数据，或者我应该更改我的代码以获取精确数据。

当我进一步给出 shell 命令时，我得到了错误。我使用 crawl 命令和 shell 命令时的结果是 -

    C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline crawl citydata
2016-10-19 12:00:27 [scrapy] INFO: Scrapy 1.2.0 started (bot: tutorial)
2016-10-19 12:00:27 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tu
torial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True,
 'BOT_NAME': 'tutorial'}
2016-10-19 12:00:27 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-10-19 12:00:27 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-19 12:00:27 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-19 12:00:27 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 12:00:27 [scrapy] INFO: Spider opened
2016-10-19 12:00:27 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2016-10-19 12:00:27 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 12:00:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.
city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=
&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX
&i6819=1&ps=20&p=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to
show all duplicates)
2016-10-19 12:00:28 [scrapy] DEBUG: Crawled (200) <GET http://www.city-data.com/
robots.txt> (referer: None)
2016-10-19 12:00:29 [scrapy] DEBUG: Crawled (200) <GET http://www.city-data.com/
advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=691
4&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20
&p=0> (referer: None)
2016-10-19 12:00:29 [citydata] DEBUG: Saved file citydata-advanced.html
2016-10-19 12:00:29 [scrapy] INFO: Closing spider (finished)
2016-10-19 12:00:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 459,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 44649,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 10, 19, 6, 30, 29, 751000),
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 10, 19, 6, 30, 27, 910000)}
2016-10-19 12:00:29 [scrapy] INFO: Spider closed (finished)

C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline shell 'http://w
ww.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&ne
ar=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=
MAX&i6819=1&ps=20&p=0'
2016-10-19 12:21:51 [scrapy] INFO: Scrapy 1.2.0 started (bot: tutorial)
2016-10-19 12:21:51 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tu
torial.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters
.BaseDupeFilter', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'
, 'LOGSTATS_INTERVAL': 0}
2016-10-19 12:21:51 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-10-19 12:21:51 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-19 12:21:51 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-19 12:21:51 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 12:21:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 12:21:51 [scrapy] INFO: Spider opened
2016-10-19 12:21:53 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (fai
led 1 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getad
drinfo failed.
2016-10-19 12:21:56 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (fai
led 2 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getad
drinfo failed.
2016-10-19 12:21:58 [scrapy] DEBUG: Gave up retrying <GET http://'http:/robots.t
xt> (failed 3 times): DNS lookup failed: address "'http:" not found: [Errno 1100
4] getaddrinfo failed.
2016-10-19 12:21:58 [scrapy] ERROR: Error downloading <GET http://'http:/robots.
txt>: DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo f
ailed.
DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11004] get
addrinfo failed.
2016-10-19 12:22:00 [scrapy] DEBUG: Retrying <GET http://'http://www.city-data.c
om/advanced/search.php#body?fips=0> (failed 1 times): DNS lookup failed: address
 "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:03 [scrapy] DEBUG: Retrying <GET http://'http://www.city-data.c
om/advanced/search.php#body?fips=0> (failed 2 times): DNS lookup failed: address
 "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:05 [scrapy] DEBUG: Gave up retrying <GET http://'http://www.cit
y-data.com/advanced/search.php#body?fips=0> (failed 3 times): DNS lookup failed:
 address "'http:" not found: [Errno 11004] getaddrinfo failed.
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 161, in <module>
    execute()
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 88, in _run_print
_help
    func(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 149, in _run_comm
and
    cmd.run(args, opts)
  File "C:\Python27\lib\site-packages\scrapy\commands\shell.py", line 71, in run

    shell.start(url=url)
  File "C:\Python27\lib\site-packages\scrapy\shell.py", line 47, in start
    self.fetch(url, spider)
  File "C:\Python27\lib\site-packages\scrapy\shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "C:\Python27\lib\site-packages\twisted\internet\threads.py", line 122, in
 blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.internet.error.DNSLookupError: DNS lookup failed: address "'http:" not f
ound: [Errno 11004] getaddrinfo failed.
'csize' is not recognized as an internal or external command,
operable program or batch file.
'sc' is not recognized as an internal or external command,
operable program or batch file.
'sd' is not recognized as an internal or external command,
operable program or batch file.
'states' is not recognized as an internal or external command,
operable program or batch file.
'near' is not recognized as an internal or external command,
operable program or batch file.
'nam_crit1' is not recognized as an internal or external command,
operable program or batch file.
'b6914' is not recognized as an internal or external command,
operable program or batch file.
'e6914' is not recognized as an internal or external command,
operable program or batch file.
'i6914' is not recognized as an internal or external command,
operable program or batch file.
'nam_crit2' is not recognized as an internal or external command,
operable program or batch file.
'b6819' is not recognized as an internal or external command,
operable program or batch file.
'e6819' is not recognized as an internal or external command,
operable program or batch file.
'i6819' is not recognized as an internal or external command,
operable program or batch file.
'ps' is not recognized as an internal or external command,
operable program or batch file.
'p' is not recognized as an internal or external command,
operable program or batch file.

我的代码是

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = [
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0',
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'citydata-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

请有人指导我完成此操作。

【问题讨论】：

很高兴看到你的蜘蛛。错误消息看起来像 Python 会尝试将您的 URL 用作变量——也许某处缺少一些引号。
@GHajba 我已在问题中插入代码。如果代码有任何错误，请告诉我
似乎问题出在您的项目而不是您的蜘蛛，因为如果我运行 shell 命令，我不会收到任何错误 - 但当然我没有使用机器人 tutorial，如您所见在控制台输出中。
@GHajba 我应该怎么做才能解决这个问题？该网页是否不授予访问权限？
请注意，在 Windows 上，单引号 (') 无法按预期工作（请参阅 this recent bug report）。所以你应该使用双引号（"）和scrapy shell "url"，例如python -m scrapy.cmdline shell "http://w ww.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&ne ar=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819= MAX&i6819=1&ps=20&p=0"

标签： python python-2.7 web-scraping scrapy scrapy-spider

【解决方案1】：

首先，这个网站看起来像是一个重 JavaScript 的网站。 Scrapy 本身只从服务器下载 HTML，但不解释 JavaScript 语句。

其次，URL 片段（即包括#body 和之后的所有内容）不会发送到服务器，仅获取http://www.city-data.com/advanced/search.php（scrapy 与您的浏览器相同。您可以使用浏览器的开发工具网络选项卡进行确认。）

所以对于 Scrapy，对

的请求

http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0

和

http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1

是相同的资源，所以它只获取一次。它们的区别仅在于它们的 URL 片段。

您需要的是一个 JavaScript 渲染器。你可以使用 Selenium 或类似 Splash 的东西。我建议使用 scrapy-splash plugin，其中包含一个考虑 URL 片段的重复过滤器。

【讨论】：