【Question Title】: How to save crawled web pages in memory using scrapy
【Posted】: 2016-06-09 11:02:11
【Question Description】:

I can crawl the web with the following Scrapy script:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from lxml import html

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from scrapy.spider import BaseSpider
from scrapy import log

#from tutorial.items import TutorialItem
from tutorial.items import DmozItem


class StayuncleCrawlerSpider(CrawlSpider):

    name = 'stayuncle_crawler'

    allowed_domains = ['stayuncle.com']
    start_urls = ['http://www.stayuncle.com/']
    CrawlSpider.DOWNLOAD_DELAY=.25;



    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)     ]

def parse_item(self,response,spider):

             doc = html.fromstring(response.body)
             item = DmozItem()
             item['title'] = doc.xpath('//meta[@property="og:title"]/@content')
             item['link'] = response.url
             item['desc'] = doc.xpath('//meta[@name="description"]/@content')
             yield self.parse_save(self,response)
             yield item



    # self.log('A response from %s just arrived!' % response.url)

def parse_save(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Here is the log:

/Users/Nand/crawledData/tutorial/tutorial/spiders/stack_crawler.py:16: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  Rule(SgmlLinkExtractor(allow=('pages/')), callback='parse_item', follow=True),
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:7: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import CrawlSpider, Rule
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:8: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:8: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:11: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import BaseSpider
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:12: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:28: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:29: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  Rule(SgmlLinkExtractor(), callback='parse_save', follow=True)
2016-06-09 17:13:28 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-06-09 17:13:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-06-09 17:13:28 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-06-09 17:13:28 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-09 17:13:28 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-09 17:13:28 [scrapy] INFO: Enabled item pipelines:
[]
2016-06-09 17:13:28 [scrapy] INFO: Spider opened
2016-06-09 17:13:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-09 17:13:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-09 17:13:28 [py.warnings] WARNING: /usr/local/lib/python2.7/site-packages/scrapy/core/downloader/__init__.py:65: UserWarning: StayuncleCrawlerSpider.DOWNLOAD_DELAY attribute is deprecated, use StayuncleCrawlerSpider.download_delay instead
  (type(spider).__name__, type(spider).__name__))

2016-06-09 17:13:29 [scrapy] DEBUG: Crawled (404) <GET http://www.stayuncle.com/robots.txt> (referer: None)
2016-06-09 17:13:29 [scrapy] DEBUG: Redirecting (302) to <GET http://www.stayuncle.com/home> from <GET http://www.stayuncle.com/>
2016-06-09 17:13:29 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/home> (referer: None)
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'stayuncle.tumblr.com': <GET http://stayuncle.tumblr.com/>
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'facebook.com': <GET http://facebook.com/stayuncle>
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET http://twitter.com/stayuncle>
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/home> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.stayuncle.com/home> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/cdn-cgi/l/email-protection> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.cloudflare.com': <GET https://www.cloudflare.com/sign-up?utm_source=email_protection>
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/career> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/StayUncle?ref=hl>
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.twitter.com': <GET https://www.twitter.com/stayuncle>
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/howwechose> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (404) <GET http://www.stayuncle.com/index.html> (referer: http://www.stayuncle.com/career)
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/about> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:31 [scrapy] DEBUG: Ignoring response <404 http://www.stayuncle.com/index.html>: HTTP status code is not handled or not allowed
2016-06-09 17:13:31 [scrapy] DEBUG: Filtered offsite request to 'in.linkedin.com': <GET https://in.linkedin.com/pub/nand-singh/1b/31b/464>
2016-06-09 17:13:31 [scrapy] INFO: Closing spider (finished)
2016-06-09 17:13:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2748,
 'downloader/request_count': 9,
 'downloader/request_method_count/GET': 9,
 'downloader/response_bytes': 32186,
 'downloader/response_count': 9,
 'downloader/response_status_count/200': 6,
 'downloader/response_status_count/302': 1,
 'downloader/response_status_count/404': 2,
 'dupefilter/filtered': 23,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 6, 9, 11, 43, 31, 709558),
 'log_count/DEBUG': 19,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'offsite/domains': 7,
 'offsite/filtered': 22,
 'request_depth_max': 2,
 'response_received_count': 8,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'start_time': datetime.datetime(2016, 6, 9, 11, 43, 28, 793762)}
2016-06-09 17:13:31 [scrapy] INFO: Spider closed (finished)

But I want to save every crawled page as an HTML file. I tried saving the crawled pages as described in http://doc.scrapy.org/en/latest/intro/tutorial.html, but it does not work for me. Can someone guide me with a code snippet so I can achieve this?

【Question Discussion】:

  • What do you mean by "in memory"? The Scrapy tutorial shows an example writing raw HTML to disk. It can help you get started.
  • @paultrmbrth I have updated my question. I tried the same approach, but it does not work for me. Can you help me figure out what I am doing wrong?
  • Please add details about what "not working" means (e.g., nothing written to disk, an exception, the console log, etc.).
  • @paultrmbrth "NOT WORKING": I cannot get the file saved to disk. Can you point out what I am doing wrong? I followed the link you mentioned.
  • Don't yield self.parse_save; call self.parse_save(...) directly (see the sketch below).
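
That last comment points at one of the bugs: parse_save writes a file and returns None, so yield self.parse_save(self, response) just yields None into Scrapy's pipeline. A toy sketch of the difference (hypothetical save/gen names, not part of the question's code):

    # Yielding the result of a side-effecting function that returns None
    # pushes None downstream; call it directly and yield only real items.
    def save(body):
        with open('page.html', 'wb') as f:
            f.write(body)          # side effect only; implicit return None

    def gen(body):
        save(body)                 # correct: plain call, nothing yielded
        yield {'saved': True}      # yield actual items, not save(body)

    for item in gen(b'<html></html>'):
        print(item)                # {'saved': True}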

Tags: web-scraping scrapy scrapy-spider scrapy-pipeline


【Solution 1】:

The snippet works once def parse_item(self, response, spider): and the methods below it are indented correctly, so that they are defined inside the class rather than at module level.
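
For reference, here is a minimal sketch of the corrected spider with the callbacks indented into the class. It also swaps the deprecated SgmlLinkExtractor for LinkExtractor and DOWNLOAD_DELAY for download_delay, which silences the warnings in the log; the tutorial.items.DmozItem import is assumed from the question's project:

    from lxml import html
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from tutorial.items import DmozItem  # assumed from the question's project


    class StayuncleCrawlerSpider(CrawlSpider):
        name = 'stayuncle_crawler'
        allowed_domains = ['stayuncle.com']
        start_urls = ['http://www.stayuncle.com/']
        download_delay = 0.25  # lowercase attribute; DOWNLOAD_DELAY is deprecated

        rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

        def parse_item(self, response):
            # CrawlSpider callbacks receive only the response, so the extra
            # 'spider' argument from the question's signature is dropped.
            self.parse_save(response)  # plain call; the helper just writes a file

            doc = html.fromstring(response.body)
            item = DmozItem()
            item['title'] = doc.xpath('//meta[@property="og:title"]/@content')
            item['link'] = response.url
            item['desc'] = doc.xpath('//meta[@name="description"]/@content')
            yield item

        def parse_save(self, response):
            # Naive file naming: last non-empty path segment of the URL.
            parts = [p for p in response.url.split('/') if p]
            with open(parts[-1] + '.html', 'wb') as f:
                f.write(response.body)

Run it with scrapy crawl stayuncle_crawler from the project root; one .html file per crawled page lands in the directory the crawl is launched from.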

【Discussion】:
