[Posted]: 2014-03-07 18:37:11
[Problem description]:
I'm trying to build a scraper that pulls the link, title, price, and post body from listings on craigslist. I've been able to get the price, but it returns the price of every listing on the page rather than just the price for that specific row. I also can't get it to move on to the next page and continue scraping.
Here is the tutorial I'm following - http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/
I've tried the suggestions from this post, but still can't get it to work - Scrapy Python Craigslist Scraper
The page I'm trying to scrape is - http://medford.craigslist.org/cto/
In the price variable, if I remove the // before span[@class="l2"] it returns no prices, but if I leave it in it includes every price on the page.
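For what it's worth, this matches how XPath context nodes behave in general: a path that starts with // always searches from the document root, even when evaluated against a sub-selector, while a path starting with .// searches relative to the current node. A minimal stdlib sketch (using xml.etree instead of Scrapy, with made-up markup, not real craigslist HTML) illustrates the difference:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing markup -- not real craigslist HTML.
doc = ET.fromstring(
    '<listings>'
    '<p class="row"><span class="price">$1200</span></p>'
    '<p class="row"><span class="price">$3400</span></p>'
    '</listings>'
)

# Searching from the document root finds every price on the page
# (the analogue of using '//span...' inside the per-row loop).
all_prices = [s.text for s in doc.findall('.//span[@class="price"]')]

# Searching relative to each row finds only that row's price
# (the analogue of using './/span...' on the row selector).
per_row = [[s.text for s in row.findall('.//span[@class="price"]')]
           for row in doc.findall('.//p[@class="row"]')]

print(all_prices)  # every price on the page
print(per_row)     # one price per row
```

Mapped back to the spider, title.select('//span...') behaves like the document-wide search, which would explain getting every price; the relative form would be scoped to each title node.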
For the rules, I've tried playing with the class tags, but it seems to hang on the first page. I'm thinking I might need separate spider classes?
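As an aside (a guess at what is going wrong, not a confirmed fix): rules are only processed by CrawlSpider, so on a BaseSpider they are silently ignored, and a CrawlSpider must not override parse, because the built-in parse method is what drives the rules. The allow pattern itself does match craigslist-style pagination URLs, which a quick sketch can check (the URLs below are illustrative):

```python
import re

# The allow pattern from the rule; craigslist pagination at the time used
# URLs like index100.html, index200.html, ... (URLs here are illustrative).
next_page = re.compile(r"index\d00\.html")

candidates = [
    "http://medford.craigslist.org/cto/",               # first page: no match
    "http://medford.craigslist.org/cto/index100.html",  # page 2
    "http://medford.craigslist.org/cto/index200.html",  # page 3
]
followed = [url for url in candidates if next_page.search(url)]
print(followed)  # only the index100/index200 links survive
```

So the pattern looks fine; the hang on the first page is consistent with the rule never being evaluated at all.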
Here is my code:
#-------------------------------------------------------------------------------
# Name: module1
# Purpose:
#
# Author: CD
#
# Created: 02/03/2014
# Copyright: (c) CD 2014
# Licence: <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *
import sys
class PageSpider(BaseSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (Rule(SgmlLinkExtractor(allow=("index\d00\.html", ),
                                    restrict_xpaths=('//span[@class="button next"]',)),
                  callback="parse", follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"] | //span[@class="l2"]')
        for title in titles:
            item = CraigslistSampleItem()
            item['title'] = title.select("a/text()").extract()
            item['link'] = title.select("a/@href").extract()
            item['price'] = title.select('//span[@class="l2"]//span[@class="price"]/text()').extract()
            url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item
[Comments]:
Tags: python recursion xpath web-scraping scrapy