Posted: 2013-03-08 04:10:41
Question:
I can't get Scrapy to crawl the whole website — it only scratches the surface, and I want it to crawl deeper. I've been googling for the past 5-6 hours without any luck. My code is below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log

class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
Comments:
-
Just tried your code against stackoverflow — my IP got banned. It definitely works! :)
-
@Alexander - that's encouraging, makes me want to debug some more :) :) ... Sorry about the IP ban, mate!
-
Are you actually trying to crawl example.com? You do know that isn't a real website.
-
Which website are you trying to crawl?
-
"example.com" is just a placeholder. I'm actually trying to crawl landmarkshops.com
Tags: web web-scraping scrapy