【Question title】:Error while running scrapy web crawler
【Posted】:2014-07-01 04:08:32
【Question description】:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

However, when I try to run the spider, I get the following error message:

[example] ERROR: Spider error processing <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Users/andy2/Documents/Python/tutorial/tutorial/spiders/example.py", line 18, in parse
        print title, link, desc
    exceptions.NameError: global name 'link' is not defined

Is there anything I can do to make this code work?

Can anyone help me?

Thanks!!!

【Question comments】:

    Tags: python web-scraping scrapy scrapy-spider


    【Solution 1】:

    You need to instantiate a Selector and pass the response to it as an argument. Your imports are also incorrect. Here is a fixed version of the spider:

    from scrapy.selector import Selector
    from scrapy.spider import Spider
    
    
    class ExampleSpider(Spider):
        name = "example"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]
    
        def parse(self, response):
            sel = Selector(response)
            for li in sel.xpath('//ul/li'):
                title = li.xpath('a/text()').extract()
                link = li.xpath('a/@href').extract()
                desc = li.xpath('text()').extract()
                print title, link, desc
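
The loop above runs one query for the list items and then relative queries inside each item. As a rough illustration of that pattern, here is a minimal sketch that runs without Scrapy installed. It swaps in the standard library's xml.etree.ElementTree for Scrapy's Selector, and the SAMPLE markup, the example.com links and the extract_items helper are all hypothetical, made up for the example rather than taken from dmoz.org.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a directory page.
SAMPLE = """
<html><body>
<ul>
  <li><a href="http://example.com/a">Book A</a> - first description</li>
  <li><a href="http://example.com/b">Book B</a> - second description</li>
</ul>
</body></html>
"""


def extract_items(markup):
    """Return one (title, link, desc) tuple per <li> found inside a <ul>."""
    root = ET.fromstring(markup.strip())
    items = []
    # Outer query: every <li> that is a direct child of some <ul>.
    for li in root.findall(".//ul/li"):
        # Inner query is relative to the current <li>, like li.xpath('a/...').
        a = li.find("a")
        if a is None:
            continue  # assumption: skip list items that carry no link
        title = a.text or ""
        link = a.get("href", "")
        # Text that follows the <a> inside the <li> plays the role of desc.
        desc = (a.tail or "").strip()
        items.append((title, link, desc))
    return items


for title, link, desc in extract_items(SAMPLE):
    print(title, link, desc)
```

Unlike the spider, this sketch skips list items without a link and returns plain strings instead of the lists produced by extract(). ElementTree also needs well-formed markup, so real pages are better handled by Scrapy's own Selector. For reference, Scrapy 0.24 and later accept response.xpath(...) directly, so the explicit Selector(response) is mainly needed on older releases.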
    

    【Comments】:
