Posted: 2016-11-17 07:39:00
Problem description:
I am trying to learn Scrapy by scraping entry titles from a real-estate site that uses pagination. I cannot get any entries from the "next page" pages matched by the rule defined in rules.
Code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem
import re

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['http://chennai.vivastreet.co.in/']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//*[text()[contains(., "Next")]]')),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        a = Selector(response).xpath('//a[contains(@id, "vs-detail-link")]/text()').extract()
        i = 1
        for b in a:
            print('testtttttttttttttt ' + str(i) + '\n' + str(b))
            i += 1
        item = PropertyItem()
        item['title'] = a[0]
        yield item
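One likely culprit in this version is allowed_domains: it should contain bare host names, not full URLs. Scrapy's offsite middleware compares the hostname of each extracted link against the entries in allowed_domains, roughly like the stdlib sketch below, so an entry with a scheme and trailing slash can never match and every "Next" link gets dropped as offsite:

```python
from urllib.parse import urlsplit

bad_domain = 'http://chennai.vivastreet.co.in/'  # as written in the question
good_domain = 'chennai.vivastreet.co.in'         # what Scrapy expects

# Hostname of a link the spider would follow
host = urlsplit('http://chennai.vivastreet.co.in/rent+chennai/').hostname

print(host == bad_domain)   # False: a hostname never contains a scheme or slash
print(host == good_domain)  # True
```

This is only an approximation of the middleware's matching (it also accepts subdomains), but it shows why the scheme-prefixed entry filters out everything.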
Edit: I replaced the parse method with parse_item, and now nothing is scraped at all.
Ignore the item-object code at the end; I plan to replace it with a request callback to another method that scrapes more details from each entry's URL.
I can post the logs if needed.
Edit #2: I now collect the URLs from the paginated pages and issue requests to another method, which in turn scrapes the details from each entry's page. The parse_start_url() method works, but the parse_item() method is never called.
Code:
from scrapy import Request
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem
import sys

reload(sys)
sys.setdefaultencoding('utf8')  # To prevent UnicodeDecodeError, UnicodeEncodeError.

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['chennai.vivastreet.co.in']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//*[text()[contains(., "Next")]]'),
             callback='parse_start_url', follow=True)
    ]

    def parse_start_url(self, response):
        urls = Selector(response).xpath('//a[contains(@id, "vs-detail-link")][@href]').extract()
        print('test0000000000000000000' + str(urls[0]))
        for url in urls:
            yield Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        #item = PropertyItem()
        a = Selector(response).xpath('//*h1[@class = "kiwii-font-xlarge kiwii-margin-none"').extract()
        print('test tttttttttttttttttt ' + str(a))
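A probable reason parse_item is never reached: the XPath in parse_start_url selects whole a elements, not their href attributes, so extract() yields HTML snippets rather than URLs, and the resulting Requests are invalid (selecting '//a[contains(@id, "vs-detail-link")]/@href' would return the links themselves; the parse_item XPath also looks malformed, with a stray * and a missing closing bracket). Even with href values, listing pages often emit relative links that must be joined against the page URL before requesting, which is what response.urljoin does. A small stdlib sketch (both example hrefs below are hypothetical, not taken from the site):

```python
from urllib.parse import urljoin

base = 'http://chennai.vivastreet.co.in/rent+chennai/'

# Hypothetical hrefs as they might appear in a listing page:
relative_href = '/classifieds/ad123'
absolute_href = 'http://chennai.vivastreet.co.in/ad456'

print(urljoin(base, relative_href))  # http://chennai.vivastreet.co.in/classifieds/ad123
print(urljoin(base, absolute_href))  # absolute URLs pass through unchanged
```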
Discussion:
Tags: python pagination scrapy