【Question Title】: Scrapy: Unsuccessful iterating over a list and pagination
【Posted】: 2015-05-29 19:35:22
【Question】:

My goal is to extract all 25 rows (6 items per row) from each page and then iterate over each of the 40 pages.

Currently, my spider only extracts the first row from pages 1-3 (see the CSV output image).

I assumed the list_iterator() function would iterate over each row; however, there seems to be a bug in either my rules or my list_iterator() function that prevents all of the rows on each page from being scraped.

Any help or advice would be greatly appreciated!

propub_spider.py:

import scrapy 
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from propub.items import PropubItem
from scrapy.http import Request

class propubSpider(CrawlSpider):
    name = 'prop$'
    allowed_domains = ['https://projects.propublica.org']
    max_pages = 40
    start_urls = [
        'https://projects.propublica.org/docdollars/search?state%5Bid%5D=33',
        'https://projects.propublica.org/docdollars/search?page=2&state%5Bid%5D=33',
        'https://projects.propublica.org/docdollars/search?page=3&state%5Bid%5D=33']

    rules = (Rule(SgmlLinkExtractor(allow=('\\search?page=\\d')), 'parse_start_url', follow=True),)

    def list_iterator(self):
        for i in range(self.max_pages):
            yield Request('https://projects.propublica.org/docdollars/search?page=d' % i, callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//*[@id="payments_list"]/tbody'):
            item = PropubItem()
            item['payee'] = sel.xpath('tr[1]/td[1]/a[2]/text()').extract()
            item['link'] = sel.xpath('tr[1]/td[1]/a[1]/@href').extract()
            item['city'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['state'] = sel.xpath('tr[1]/td[3]/text()').extract()
            item['company'] = sel.xpath('tr[1]/td[4]').extract()
            item['amount'] =  sel.xpath('tr[1]/td[7]/span/text()').extract()
            yield item 

pipelines.py:

import csv

class PropubPipeline(object):

    def __init__(self):
        self.myCSV = csv.writer(open('C:\Users\Desktop\propub.csv', 'wb'))
        self.myCSV.writerow(['payee', 'link', 'city', 'state', 'company', 'amount'])

    def process_item(self, item, spider):
        self.myCSV.writerow([item['payee'][0].encode('utf-8'), 
        item['link'][0].encode('utf-8'), 
        item['city'][0].encode('utf-8'), 
        item['state'][0].encode('utf-8'),
        item['company'][0].encode('utf-8'),
        item['amount'][0].encode('utf-8')])
        return item

items.py:

import scrapy
from scrapy.item import Item, Field

class PropubItem(scrapy.Item):
    payee = scrapy.Field()
    link = scrapy.Field()
    city = scrapy.Field()
    state = scrapy.Field()
    company = scrapy.Field()
    amount =  scrapy.Field()
    pass

CSV output: (screenshot)

【Question Discussion】:

    Tags: python python-2.7 pagination web-scraping scrapy


    【Solution 1】:

    Several things need to be fixed:

    • Use the start_requests() method instead of list_iterator()
    • A % is missing here:

      yield Request('https://projects.propublica.org/docdollars/search?page=%d' % i, callback=self.parse)
      #                                                                 HERE^
      
    • You don't need CrawlSpider, since you are providing the pagination links yourself via start_requests() - use a regular scrapy.Spider instead
    • The XPath expressions would be more reliable if they matched cells by class attribute

    Fixed version:

    import scrapy
    
    from propub.items import PropubItem
    
    
    class propubSpider(scrapy.Spider):
        name = 'prop$'
        allowed_domains = ['projects.propublica.org']
        max_pages = 40
    
        def start_requests(self):
            for i in range(self.max_pages):
                yield scrapy.Request('https://projects.propublica.org/docdollars/search?page=%d' % i, callback=self.parse)
    
        def parse(self, response):
            for sel in response.xpath('//*[@id="payments_list"]//tr[@data-payment-id]'):
                item = PropubItem()
                item['payee'] = sel.xpath('td[@class="name_and_payee"]/a[last()]/text()').extract()
                item['link'] = sel.xpath('td[@class="name_and_payee"]/a[1]/@href').extract()
                item['city'] = sel.xpath('td[@class="city"]/text()').extract()
                item['state'] = sel.xpath('td[@class="state"]/text()').extract()
                item['company'] = sel.xpath('td[@class="company"]/text()').extract()
                item['amount'] = sel.xpath('td[@class="amount"]/text()').extract()
                yield item
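
    Note that the fixed version drops the state%5Bid%5D=33 filter that was present in the original start_urls, so it crawls the general (unfiltered) search. Below is a minimal sketch of a start_requests() that keeps the state filter while paginating - assuming the endpoint accepts both query parameters together and that pages are numbered from 1:

        def start_requests(self):
            # keep the state filter from the original start_urls while paginating;
            # 33 is the state id used in the question (assumption: the site
            # accepts page= and state[id]= together)
            for i in range(1, self.max_pages + 1):
                yield scrapy.Request(
                    'https://projects.propublica.org/docdollars/search'
                    '?page=%d&state%%5Bid%%5D=33' % i,
                    callback=self.parse)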
    

    【Discussion】:

    • Thanks alecxe! I ran the fixed version and the CSV output was 13 rows of correct items from the general search (as opposed to the state-specific results, which I assume the start_urls were filtering for). Also, I got the following error (see "EDIT: new error"). So it looks like I need to fix three things: 1. re-filter the start URLs for the state-specific results, 2. iterate over the 25 rows per page, 3. paginate through all 40 pages. Any ideas?
    • "EDIT: error message:" "File "C:\Users\Anaconda2\propub\propub\pipelines.py", line 22, in process_item item['amount'][0].encode('utf-8')]) exceptions.IndexError: list index out of range"
    • @EricJohn Okay, I've updated the parse() callback with fixed XPath expressions - check it out.
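
    The IndexError in the traceback above comes from indexing [0] into an empty list: extract() returns [] when an XPath expression matches nothing for a field. A minimal defensive rewrite of process_item() (a sketch, assuming empty fields should be written out as empty strings):

        def process_item(self, item, spider):
            # extract() returns a list; fall back to [u''] when a field is
            # missing or empty so that [0] never raises IndexError
            fields = ['payee', 'link', 'city', 'state', 'company', 'amount']
            self.myCSV.writerow(
                [(item.get(f) or [u''])[0].encode('utf-8') for f in fields])
            return item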