[Title]: Scrapy - Non-ASCII character in file, but no encoding declared
[Posted]: 2014-03-04 23:04:50
[Description]:

I'm trying to scrape some basic data from this site, both as an exercise to learn more about Scrapy and as a proof of concept for a university project: http://steamdb.info/sales/

When I use the Scrapy shell, I can get the information I want with the following XPath:

sel.xpath('//tbody/tr[1]/td[2]/a/text()').extract()

which should return the name of the game in the first row of the table, for this structure:

<tbody>
     <tr>
          <td></td>
          <td><a>stuff I want here</a></td>
...

And in the shell, it does.

However, when I try to put it into a spider (steam.py):

1 from scrapy.spider import BaseSpider
2 from scrapy.selector import HtmlXPathSelector
3 from steam_crawler.items import SteamItem
4 from scrapy.selector import Selector
5 
6 class SteamSpider(BaseSpider):
7     name = "steam"
8     allowed_domains = ["http://steamdb.info/"]
9     start_urls = ['http://steamdb.info/sales/?displayOnly=all&category=0&cc=uk']
10     def parse(self, response):
11         sel = Selector(response)
12         sites = sel.xpath("//tbody")
13         items = []
14         count = 1
15         for site in sites:
16             item = SteamItem()
17             item ['title'] = sel.xpath('//tr['+ str(count) +']/td[2]/a/text()').extract().encode('utf-8')
18             item ['price'] = sel.xpath('//tr['+ str(count) +']/td[@class=“price-final”]/text()').extract().encode('utf-8')
19             items.append(item)
20             count = count + 1
21         return items

I get the following error:

    ricks-mbp:steam_crawler someuser$ scrapy crawl steam -o items.csv -t csv
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.20.0', 'scrapy')
  File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 492, in run_script

  File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 1350, in run_script
    for name in eagers:
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/commands/crawl.py", line 47, in run
    crawler = self.crawler_process.create_crawler()
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/crawler.py", line 87, in create_crawler
    self.crawlers[name] = Crawler(self.settings)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/crawler.py", line 25, in __init__
    self.spiders = spman_cls.from_crawler(self)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 35, in from_crawler
    sm = cls.from_settings(crawler.settings)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 31, in from_settings
    return cls(settings.getlist('SPIDER_MODULES'))
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 22, in __init__
    for module in walk_modules(name):
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/utils/misc.py", line 68, in walk_modules
    submod = import_module(fullpath)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/xxx/scrape/steam/steam_crawler/spiders/steam.py", line 18
SyntaxError: Non-ASCII character '\xe2' in file /xxx/scrape/steam/steam_crawler/spiders/steam.py on line 18, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

I have a feeling that all I need to do is somehow tell Scrapy that these characters follow UTF-8 rather than ASCII, since there are pound signs and the like. As far as I can tell, though, it should pick that up from the page it's crawling, whose head in this site's case contains:

<meta charset="utf-8">

This has me confused! I'd also be interested in any insight/reading that isn't the Scrapy documentation itself!
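For the record, the error message itself points at the fix: per PEP 263, a Python 2 source file that contains non-ASCII bytes needs an encoding declaration on its first or second line. A minimal sketch of what that looks like at the top of steam.py (the string value here is just an illustrative example):

```python
# -*- coding: utf-8 -*-
# With this declaration on the first line, Python 2 accepts
# non-ASCII bytes in the source, such as the pound sign below.
price = u"\u00a39.99"  # £9.99
print(price)
```

Note this only silences the SyntaxError about the file's own bytes; it is separate from how Scrapy decodes the page it downloads, which it does take from the response headers/meta charset.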

[Discussion]:

Tags: python encoding scrapy


[Solution 1]:

It seems you used typographic quotes (“ ”) instead of plain double quotes (").
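As a side note, this matches the traceback exactly: the `'\xe2'` byte Python complained about is the first UTF-8 byte of a typographic quote. A quick sketch to verify, using the XPath string from line 18 of the spider:

```python
# The '\xe2' byte from the SyntaxError is the first UTF-8 byte of a
# typographic quote; replacing the quotes restores plain-ASCII source.
curly_open, curly_close = u"\u201c", u"\u201d"   # the characters “ and ”
print(curly_open.encode("utf-8"))                # starts with 0xe2
line = u"td[@class=\u201cprice-final\u201d]/text()"
fixed = line.replace(curly_open, u'"').replace(curly_close, u'"')
print(fixed)
```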

By the way, a better practice for looping over all the table rows is:

    for tr in sel.xpath("//tr"):
        item = SteamItem()
    item['title'] = tr.xpath('td[2]/a/text()').extract()
    item['price'] = tr.xpath('td[@class="price-final"]/text()').extract()
        yield item
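One caveat about the original spider (independent of the quote problem): `extract()` returns a *list* of unicode strings, so calling `.encode('utf-8')` directly on its result would raise an AttributeError even once the quotes are fixed. A sketch of handling the list, with made-up example values standing in for what `extract()` might return:

```python
# Hypothetical values standing in for the result of extract().
titles = [u"Portal 2", u"Left 4 Dead 2"]

# Encode element-wise rather than calling .encode() on the list itself.
encoded = [t.encode("utf-8") for t in titles]

# Or take just the first match, guarding against an empty result.
first_title = titles[0] if titles else None
print(first_title)
```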
    

[Comments]:

• This looks so much simpler, it works like a dream. How did you learn Scrapy? Books/tutorials?