Making a web scraper yield more information
Posted: 2013-07-03 06:55:57
Question:

I have written a simple web scraper that scrapes a street name and that street's serial number from the addresses and zip codes in a csv file. I would like to save the street name, serial number and zip code together in a new csv file, but I don't know how to pass the zip code through to my parse() method, since I invoke the spider from the command line via:

scrapy crawl Geospider -o Scraped_data.csv -t csv

Here is my spider (the code doesn't actually run as-is, because the page I'm scraping requires a login and password. I won't share mine, but anyone can register as a user at http://download.kortforsyningen.dk//content/opret-mig-som-bruger — that part isn't relevant to my question):

from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from scrapy.item import Item, Field
import csv

class Road(Item):
    RoadNum = Field()
    RoadName = Field()
    PostNum = Field()

class Geospider(BaseSpider):
    name = 'Geospider'
    allowed_domains = ["http://kortforsyningen.kms.dk/"]

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    filename = 'AddressesAndZipcodes.csv'
    reader = unicode_csv_reader(open(filename))
    start_urls = []
    ZipCode = []
    for row in reader:
        Address = row[0]
        Zip = row[1]
        start_urls.append('http://kortforsyningen.kms.dk/service?ServiceName=geoV&soegemetode=0&vejnavn=%s&kommunepost=%s&format=XML&max_hits=10&login=xxx&password=xxx' % (Address, ZipCode))
        ZipCode.append(Zip)

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        sites = xxs.select('//dokument/forekomst')
        items = Road()
        items['RoadNum'] = sites.select("vejkode/text()").extract()
        items['RoadName'] = sites.select("vejnavn/text()").extract()
        items['PostNum'] = ZipCode
        yield items, ZipCode

Any ideas on how to pass the zip code to parse() so that it gets saved alongside the other results?

Thanks

Comments:

    Tags: python web-scraping scrapy


    Solution 1:

    Override start_requests(), read the csv file there, and pass the zip code along in request.meta — that will do what you want:

    from scrapy.http import Request
    from scrapy.spider import BaseSpider
    from scrapy.selector import XmlXPathSelector
    from scrapy.item import Item, Field
    import csv
    
    
    class Road(Item):
        RoadNum = Field()
        RoadName = Field()
        PostNum = Field()
    
    
    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]
    
    
    class Geospider(BaseSpider):
        name = 'Geospider'
        allowed_domains = ["http://kortforsyningen.kms.dk/"]
        start_urls = []
    
        def start_requests(self):
            reader = unicode_csv_reader(open('AddressesAndZipcodes.csv'))
            for row in reader:
                address, zip_code = row[:2]
                url = 'http://kortforsyningen.kms.dk/service?ServiceName=geoV&soegemetode=0&vejnavn=%s&kommunepost=%s&format=XML&max_hits=10&login=xxx&password=xxx' % (address, zip_code)
    
                yield Request(url=url, meta={'zip_code': zip_code})
    
        def parse(self, response):
            xxs = XmlXPathSelector(response)
            sites = xxs.select('//dokument/forekomst')
    
            item = Road()
            item['RoadNum'] = sites.select("vejkode/text()").extract()
            item['RoadName'] = sites.select("vejnavn/text()").extract()
            item['PostNum'] = response.meta['zip_code']
    
            yield item
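One aside worth noting (my observation, not part of the original answer): `sites.select(...).extract()` returns a list, so the `parse()` above yields a single item whose fields each hold every road on the page. If one CSV row per road is wanted, iterate over the `forekomst` nodes instead. A minimal sketch of that per-node pattern, using only the standard library's ElementTree on a made-up sample response — only the element names (`dokument`, `forekomst`, `vejkode`, `vejnavn`) come from the XPaths above; the sample data itself is hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of the XML the service might return; the element
# names mirror the XPaths used in the spider, the values are invented.
SAMPLE = """
<dokument>
  <forekomst><vejkode>0154</vejkode><vejnavn>Algade</vejnavn></forekomst>
  <forekomst><vejkode>0312</vejkode><vejnavn>Havnevej</vejnavn></forekomst>
</dokument>
"""

def roads(xml_text, zip_code):
    """Yield one record per <forekomst> node instead of one list-valued record."""
    root = ET.fromstring(xml_text)
    for node in root.findall('forekomst'):
        yield {
            'RoadNum': node.findtext('vejkode'),
            'RoadName': node.findtext('vejnavn'),
            'PostNum': zip_code,   # same zip applies to every road on the page
        }

rows = list(roads(SAMPLE, '4000'))
```

In the spider this would mean a `for site in sites:` loop in `parse()` yielding one `Road()` item per node, which gives the exporter one clean CSV row per street.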
    

    Hope that helps.
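A second side note (an assumption on my part, not from the answer): Danish street names often contain spaces and characters like æ, ø, å, so interpolating them straight into the query string with `%` formatting can produce invalid URLs. Percent-encoding the parameters avoids that. A sketch using `urlencode` — shown here with Python 3's `urllib.parse`; under the Python 2 Scrapy in the answer the equivalent lives in `urllib`:

```python
from urllib.parse import urlencode

BASE = 'http://kortforsyningen.kms.dk/service'

def build_url(address, zip_code, login='xxx', password='xxx'):
    """Build the geoV query URL with properly percent-encoded parameters."""
    params = {
        'ServiceName': 'geoV',
        'soegemetode': '0',
        'vejnavn': address,       # encoded, so spaces and æ/ø/å are safe
        'kommunepost': zip_code,
        'format': 'XML',
        'max_hits': '10',
        'login': login,
        'password': password,
    }
    return BASE + '?' + urlencode(params)

url = build_url('Søndre Allé', '4000')
```

The street name here is an invented example; the parameter names are the ones already used in the question's URL.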

    Comments:
