Get scrapy spider to crawl entire site: answers

[Question title]: Get scrapy spider to crawl entire site
[Posted]: 2016-04-25 10:09:59
[Question]:

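I am using scrapy to crawl an old site that I own, with the code below as my spider. I don't mind whether the output is one file per page or a database containing everything. But I do need the spider to crawl the whole site without my having to enter every single URL, as I currently do.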

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "http://www.example.com/contactus"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

[Question comments]:

Tags: python scrapy scrapy-spider

[Solution 1]:

To crawl an entire site, you should use CrawlSpider instead of scrapy.Spider.

Here's an example

For your purposes, try using something like this:

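import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # Extract every link on each page, pass the fetched page to parse_item,
    # and keep following the links found on it (follow=True).
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    # CrawlSpider reserves parse() for its own logic, so the callback
    # needs a different name.
    def parse_item(self, response):
        # [-2] is the second-to-last "/"-separated piece of the URL.
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)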
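
The `[-2]` index in the spider takes the second-to-last piece of the URL after splitting on "/", so it only yields a page-specific name when the URL ends with a trailing slash. A minimal sketch of an alternative, assuming one file per page is wanted: the helper `url_to_filename` is hypothetical (it is not part of Scrapy) and builds the name from the whole URL path using only the standard library. It ignores query strings, so pages that differ only by query would still collide.

```python
from urllib.parse import urlparse


def url_to_filename(url):
    """Hypothetical helper: build a distinct .html filename from a URL path."""
    path = urlparse(url).path.strip("/")
    # The site root has an empty path, so give it a fixed name.
    if not path:
        return "index.html"
    # Replace path separators so nested pages do not need subdirectories.
    return path.replace("/", "_") + ".html"


# Compare the original expression with the helper for a few URL shapes.
for url in (
    "http://www.example.com/contactus",
    "http://www.example.com/contactus/",
    "http://www.example.com/products/widget/",
):
    print(url.split("/")[-2], "->", url_to_filename(url))
```

For `http://www.example.com/contactus` the original expression returns `www.example.com`, which is why pages without a trailing slash would overwrite each other. Inside `parse_item`, the helper could be called as `url_to_filename(response.url)`.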

Also, take a look at this article

[Comments]:

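You may want to add follow=True to that rule to keep crawling links.
@Daniil Mashkin, your solution helped me too, thanks, but now I wonder what the [-2] is for? Thanks in advance. Also, how can I save all the crawled links to a .csv file? If I run "scrapy crawl spider -o .example.csv", I get an empty .csv file :/
@y.y You should have your own Item class with a url field and return it from the parse_item method. Then your scrapy crawl spider -o .example.csv will work.
Or try return {'url': response.url}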
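@Daniil Mashkin I can confirm that works, thanks. Do you know how I can go through all the crawled links stored in the csv file to pick out the product links, and maybe some category links too? In the end I get a csv with productfiles and a csv with categorylinks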
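
One way to approach this last question is to post-process the exported file with the standard `csv` module. A minimal sketch, assuming the export has a `url` column and that product and category pages can be told apart by their URL path; the `/product/` and `/category/` markers and all file names are hypothetical and would need to match the real site.

```python
import csv


def split_links(rows, product_marker="/product/", category_marker="/category/"):
    """Sort exported rows into product and category URL lists."""
    products, categories = [], []
    for row in rows:
        url = row["url"]
        if product_marker in url:
            products.append(url)
        elif category_marker in url:
            categories.append(url)
        # URLs matching neither marker are skipped.
    return products, categories


def write_urls(path, urls):
    """Write one URL per row under a 'url' header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([u] for u in urls)


def split_csv(source, product_path, category_path):
    """Read the exported CSV and write the two filtered CSV files."""
    with open(source, newline="") as f:
        products, categories = split_links(csv.DictReader(f))
    write_urls(product_path, products)
    write_urls(category_path, categories)
```

A call such as `split_csv("links.csv", "products.csv", "categories.csv")` would then produce the two files. The same filtering could instead be done during the crawl with separate rules, but filtering afterwards keeps the spider unchanged.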