【Posted】: 2014-08-16 21:09:56
【Question】:
I want to collect the URLs of some job postings, so I wrote a Scrapy spider. I want to extract all the values matching the XPath //article/dl/dd/h2/a[@class="job-title"]/@href, but when I run the spider with the command:
scrapy crawl auseek -a addsthreshold=3
the variable "urls" that is supposed to hold the extracted values is empty. Can anyone help me figure out why?
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy import log
from scrapy import signals
from myProj.items import ADItem
import urlparse
import time

class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self, **kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self, response):
        print 'This is start url function'
        log.msg("Pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:', urls
        if urls:
            print 'test element:', urls[0].encode("ascii")
        for postfix in urls:
            # extract() already returns the href strings themselves,
            # so join each one with the page URL directly -- there is
            # no getAttribute() method on a string
            print 'postfix:', postfix
            url = urlparse.urljoin(response.url, postfix)
            yield Request(url, callback=self.parse_ad)

    def parse_ad(self, response):
        print 'this is parse_ad function'
        hxs = Selector(response)
        item = ADItem()
        log.msg("Pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%Y%m%d', time.localtime(time.time()))
        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('Get enough website address')
        return item
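For reference, the urljoin call in parse_start_url resolves a relative href against the page URL. A quick sketch of that behavior (the /job/12345 path is a made-up example, not a real Seek URL; the try/except import covers both Python 2 and 3):

```python
# urljoin resolves a site-relative href against the page it came from.
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = "http://www.seek.com.au/jobs/in-australia/"
# An absolute path keeps the scheme and host but replaces the path:
print(urljoin(base, "/job/12345"))
# http://www.seek.com.au/job/12345
```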
The problem is this line:
urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
urls is empty when I try to print it. I just can't figure out why it doesn't work or how to fix it. Thanks for any help.
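One way to narrow this down (a diagnostic sketch, not part of the original code; the HTML fragment below is invented): run the same XPath over a static snippet using lxml, which Scrapy's selectors are built on. If the expression matches here but not on the live response, the expression itself is fine, and the job links are most likely inserted by JavaScript after page load, so they are simply absent from the raw HTML that Scrapy downloads.

```python
# Hypothetical static fragment mirroring the structure the XPath expects.
import lxml.html

html = """
<article><dl><dd><h2>
  <a class="job-title" href="/job/12345">Example job</a>
</h2></dd></dl></article>
"""

doc = lxml.html.fromstring(html)
# Same XPath as in the spider, run against the static fragment:
urls = doc.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href')
print(urls)  # ['/job/12345'] -- the expression matches this markup
```

You can do the same check against the real page with `scrapy shell "http://www.seek.com.au/jobs/in-australia/"` and compare `response.body` with what a browser shows.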
【Discussion】:
Tags: python scrapy element web-crawler