【Posted】: 2014-04-26 04:27:20
【Question】:
I wrote a Scrapy spider for http://www.shop.ginakdesigns.com/main.sc that tries to collect items:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from .. import items

class GinakSpider(CrawlSpider):
    name = "ginak"
    start_urls = [
        "http://www.shop.ginakdesigns.com/main.sc"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'category\.sc\?categoryId=\d+'])),
             Rule(SgmlLinkExtractor(allow=[r'product\.sc\?productId=\d+&categoryId=\d+']), callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        self.log(response.url)
        item = items.GinakItem()
        item['name'] = sel.xpath('//*[@id="wrapper2"]/div/div/div[1]/div/div/div[2]/div/div/div[1]/div[1]/h2/text()').extract()
        item['price'] = sel.xpath('//*[@id="listPrice"]/text()').extract()
        item['description'] = sel.xpath('//*[@id="wrapper2"]/div/div/div[1]/div/div/div[2]/div/div/div[1]/div[4]/div/p/text()').extract()
        item['category'] = sel.xpath('//*[@id="breadcrumbs"]/a[2]/text()').extract()
        return item
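However, it never gets past the start page to follow any links. I have tried various things and checked the regular expressions passed to SgmlLinkExtractor. What is wrong here?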
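For reference, the regex check can be reproduced outside Scrapy with a small sketch like the one below. It assumes the link extractor applies each `allow` pattern as a regex search over the full URL. The `classify` helper and the sample URLs are hypothetical and not copied from the real site; they only illustrate how a session id in the path or a different parameter order would keep a link from matching either rule.

```python
import re

# The two allow patterns, copied from the spider's rules.
CATEGORY_RE = re.compile(r'category\.sc\?categoryId=\d+')
PRODUCT_RE = re.compile(r'product\.sc\?productId=\d+&categoryId=\d+')


def classify(url):
    """Return 'product', 'category', or None depending on which pattern is found.

    Hypothetical helper: it uses re.search on the whole URL, which is an
    assumption about how the link extractor applies its allow patterns.
    """
    if PRODUCT_RE.search(url):
        return "product"
    if CATEGORY_RE.search(url):
        return "category"
    return None


# Hypothetical URLs for illustration only.
samples = [
    "http://www.shop.ginakdesigns.com/category.sc?categoryId=5",
    "http://www.shop.ginakdesigns.com/category.sc;jsessionid=ABC123?categoryId=5",
    "http://www.shop.ginakdesigns.com/product.sc?productId=12&categoryId=5",
    "http://www.shop.ginakdesigns.com/product.sc?categoryId=5&productId=12",
]

for url in samples:
    print(classify(url), url)
```

Running the same check against the hrefs actually present in the start page's HTML (for example, copied from `scrapy shell`) would show whether any link can match the rules at all.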
【Discussion】:
Tags: python html web-scraping web-crawler scrapy