【Question Title】: Scrapy and Selenium: only scrape two pages
【Posted】: 2014-10-02 15:34:45
【Question】:

I want to crawl a website that has more than 10 pages. Each page has 10 links; the spider picks the links up in def parse(): and follows each one to scrape the other data I want in def parse_detail():.

Please show me how to write this so it scrapes only two pages instead of all of them. THX. Here is my code; it only scrapes one page and then the spider closes.

def __init__(self):
    self.driver = webdriver.Firefox()
    dispatcher.connect(self.spider_closed, signals.spider_closed)

def parse(self, response):
    self.driver.implicitly_wait(20) 
    self.driver.get(response.url)
    sites = self.driver.find_elements_by_css_selector("")
    for site in sites:
        item = CItem()
        linkiwant = site.find_element_by_css_selector(" ") 
        start = site.find_element_by_css_selector(" ")  
        item['link'] = linkiwant.get_attribute("href") 
        item['start_date']  = start.text
        yield Request(url=item['link'], meta={'item':item}, callback=self.parse_detail)  

    #how to write to only catch 2 pages??
    i=0
    if i< 2:
        try:
            next = self.driver.find_element_by_xpath("/li[@class='p_next'][1]")   
            next_page = next.text
            if next_page == "next_page":  
                next.click()    
                self.driver.refresh()  
                yield Request(self.driver.current_url, callback=self.parse)
                i+=1
        except:
            print "page not found"
def parse_detail(self,response):
    item = response.meta['item']
    self.driver.implicitly_wait(20)  
    self.driver.get(response.url)
    sel = Selector(response)
    sites = sel.css("")            
    for site in sites:
        item['title'] = site.css(" ").extract()[0] 
        item['titleURL'] = site.css(" ").extract()[0]
        ..
        yield item   
def spider_closed(self, spider):
    self.driver.close()

【Comments】:

    Tags: python selenium-webdriver scrapy


    【Solution 1】:

    Make i persistent. As written, i is a local variable that is re-initialised to 0 on every call to parse(), so the counter never advances; store it on the instance instead:

    def __init__(self):
        self.page_num = 0
        self.driver = webdriver.Firefox()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    and replace the counter block at the end of parse() with:

        #how to write to only catch 2 pages??
        if self.page_num < 2:
            try:
                next = self.driver.find_element_by_xpath("/li[@class='p_next'][1]")
                next_page = next.text
                if next_page == "next_page":
                    next.click()
                    self.driver.refresh()
                    yield Request(self.driver.current_url, callback=self.parse)
                    self.page_num += 1
            except:
                print "page not found"
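    To see why this fixes the problem, here is a minimal standalone sketch (no Scrapy or Selenium involved; the class and method names are invented purely for illustration) contrasting a local counter, which is re-created on every call, with an instance attribute, which persists between calls:

    # Standalone illustration: local variable vs. instance attribute.
    # (Hypothetical names for illustration only -- not part of the spider.)
    class Counter(object):
        def __init__(self):
            self.page_num = 0            # created once, lives on the instance

        def step_local(self):
            i = 0                        # re-created as 0 on *every* call
            if i < 2:
                i += 1
            return i                     # always returns 1

        def step_persistent(self):
            if self.page_num < 2:
                self.page_num += 1       # survives into the next call
            return self.page_num

    c = Counter()
    print [c.step_local() for _ in range(4)]       # [1, 1, 1, 1]
    print [c.step_persistent() for _ in range(4)]  # [1, 2, 2, 2]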
    

    【Comments】:

    • I have a question. If a website has 10 pages but I set self.page_num …
    • Try it!! My guess is that it will stop, but don't take my word for it. You could add else: print 'no next page' after if next_page == "next_page": to see whether the function ends there.
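    For reference, the suggested else: branch would slot into the pagination block of parse() roughly like this (a fragment of the answer's code, not standalone; the XPath is the same placeholder from the question):

        if self.page_num < 2:
            try:
                next = self.driver.find_element_by_xpath("/li[@class='p_next'][1]")
                if next.text == "next_page":
                    next.click()
                    self.driver.refresh()
                    yield Request(self.driver.current_url, callback=self.parse)
                    self.page_num += 1
                else:
                    print 'no next page'    # the debug line suggested above
            except:
                print "page not found"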