【Title】: Unicode error on Windows when writing to Microsoft Excel
【Posted】: 2017-01-23 22:52:06
【Question】:

I wrote a scraper in Python that scrapes player data from the futbin.com website and writes it to a .csv file. I get the following error, which occurs on page 214, www.futbin.com/17/player/214. Full traceback:

 Traceback (most recent call last):
  File "C:/Users/jona_/PycharmProjects/untitled2/futbin_scraper_2.py", line 94, in <module>
    writer.writerows([prices_attributes])
  File "C:\Program Files\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 145: character maps to <undefined>

I suspect the cause is this piece of data on the page: "Beşiktaş JK" (and other similar entries). My guess was that the odd 's' character is unreadable for the Windows console, so I tried changing my console encoding. It is currently set to utf-8, which I checked as follows:

>>> import sys
>>> print(sys.stdin.encoding)
utf-8
>>> print(sys.stdout.encoding)
cp437

I also tried setting it to utf-16 with the command set PYTHONIOENCODING=utf-16, and I have installed the win-unicode-console package, but neither solved my problem. For completeness, I will post the entire script below.
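Note that the console encoding is likely a red herring: the traceback shows the failure happens while encoding with cp1252, not with the console codec. The error can be reproduced without the scraper or the console at all (a minimal sketch using the string from the page):

```python
# Minimal reproduction: cp1252 (the Windows-1252 code page) has no mapping
# for U+015F (LATIN SMALL LETTER S WITH CEDILLA), the odd 's' in "Beşiktaş".
text = "Be\u015fikta\u015f JK"

try:
    text.encode("cp1252")
except UnicodeEncodeError as err:
    print(err)  # 'charmap' codec can't encode character '\u015f' ...

# UTF-8 can represent the character without any trouble:
assert text.encode("utf-8").decode("utf-8") == text
```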

The problem started when I added the line league = html_tree.xpath('//td/a[@href]/text()', smart_strings=False), which scrapes data from the "Information" table on the left side of the page.

There are some other questions about Unicode errors, and honestly I have tried every solution here that I was capable of understanding.

I am using the JetBrains PyCharm Community Edition IDE on Windows 10, with Python 3.5.

Any help would be greatly appreciated.


#
# This programme fetches price data and player attributes from the FIFA 17 Ultimate Team Market
# And writes them into a .csv file.

import csv
import requests
from lxml import html
import time
import os.path
import sys

#
# This creates a .csv file in a pre-specified directory to write the player data into
# Change: save_path and name_of_file
save_path = 'D:/Msc Finance/Thesis/Futbin Data/'
name_of_file = "futbin_data"
completeName = os.path.join(save_path, name_of_file + ".csv")
outfile = open(completeName, "w", newline='')

#
# This generates a list of futbin.com URLs to feed into the script
# Change: integers in range() to specify the amount of futbin.com player pages to parse
amount_of_players = 16300
list_of_urls = []
for i in range(amount_of_players):
    player_id = i + 1
    url = "https://www.futbin.com/17/player/{0}".format(player_id)
    list_of_urls.append(url)

#
# This loop finds all the player data from each url in list_of_urls and stores them into a list

for url in list_of_urls:
    responses = requests.get(url)
    html_tree = html.fromstring(responses.content)
    name = html_tree.xpath('//span[@class = "header_name"]/text()', smart_strings=False)
    prices = html_tree.xpath('//span[@class ="bin_text"]/text()', smart_strings=False)
    attributes = html_tree.xpath('//td[@class ="table-row-text"]/text()', smart_strings=False)
    league = html_tree.xpath('//td/a[@href]/text()', smart_strings=False)
    position = html_tree.xpath('//div[@class ="pcdisplay-pos"]/text()', smart_strings=False)
    rating = html_tree.xpath('//div[@class ="pcdisplay-rat"]/text()', smart_strings=False)
    pace = html_tree.xpath('//div[@class ="pcdisplay-ovr1"]/text()', smart_strings=False)
    shot = html_tree.xpath('//div[@class ="pcdisplay-ovr2"]/text()', smart_strings=False)
    passing = html_tree.xpath('//div[@class ="pcdisplay-ovr3"]/text()', smart_strings=False)
    dribble = html_tree.xpath('//div[@class ="pcdisplay-ovr4"]/text()', smart_strings=False)
    defense = html_tree.xpath('//div[@class ="pcdisplay-ovr5"]/text()', smart_strings=False)
    physique = html_tree.xpath('//div[@class ="pcdisplay-ovr6"]/text()', smart_strings=False)

    # This merges all the player data together into one big list
    prices_attributes = prices + attributes + league + position + rating + pace + shot + passing + dribble + defense + \
                        physique + name

    # This removes all instances of \n from the big list
    prices_attributes = [i.replace('\n', '') for i in prices_attributes]

    # This removes all blank spaces from the big list
    prices_attributes = [i.replace(' ', '') for i in prices_attributes]

    # In some instances the '//td[@class ="table-row-text"]/text()' XPath from attributes returns an extra empty element
    # This 'if' statement removes the extra elements to ensure all the columns in the .csv file still align properly
    if len(prices_attributes) > 40:
        prices_attributes.pop(25)
        prices_attributes.pop(30)

    #
    # This removes all the remaining empty elements from the big list. The indices are 11 and 20, not (12,13,14,24,25,26),
    # because index numbers shift as the script removes elements from the list
    if prices_attributes:
        prices_attributes.pop(11)
        prices_attributes.pop(11)
        prices_attributes.pop(11)
        prices_attributes.pop(20)
        prices_attributes.pop(20)
        prices_attributes.pop(20)

    # Some URLs from list_of_urls no longer exist. These URLs yield empty lists: []
    # The 'if' statement below makes sure only non-empty lists are written to the Excel file
    if prices_attributes:
        writer = csv.writer(outfile)
        writer.writerows([prices_attributes])

    # This fixes the delay between queries to 0.1 seconds
    time.sleep(0.1)

    # This prints the loop's % progress into the Python Console
    sys.stdout.write("\r%d%%" % ((100/amount_of_players)*(list_of_urls.index(url)+1)))
    sys.stdout.flush()

【Comments】:

  • (1) 提供整个追溯 (2) sys.stdout.encoding 可能比 sys.stdin.encoding 更相关;请提供它
  • >>> import sys >>> print(sys.stdout.encoding) 输出:cp437
  • 回溯(最近一次调用最后一次):文件“C:/Users/jona_/PycharmProjects/untitled2/futbin_scraper_2.py”,第 94 行,在 writer.writerows([prices_attributes]) 文件中“C:\Program Files\Anaconda3\lib\encodings\cp1252.py”,第 19 行,编码返回 codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' 在位置 145:字符映射到
  • 为了便于阅读,还在正文中进行了编辑
  • Traceback 显示它正在尝试使用 cp1252 进行编码,它涵盖了许多“奇怪”字符,但不是您数据中的字符。您需要找出它插入cp1252 的位置并改为插入utf8

Tags: python excel unicode web-scraping lxml


【Solution 1】:

From your traceback and code, you are using Python 3 and opening the output file with the default encoding. locale.getpreferredencoding(False) is what is used by default, and in your case that is cp1252. Use utf-8-sig instead, not utf8: Excel also assumes that a file without a byte order mark (BOM) signature is in the default encoding.

In your code, use:

outfile = open(completeName, 'w', newline='', encoding='utf-8-sig')
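A round-trip sketch of the fix (the file name here is made up for illustration): utf-8-sig writes a UTF-8 byte order mark at the start of the file, which is the signal Excel uses to detect UTF-8 instead of falling back to the local code page.

```python
import csv

row = ["Be\u015fikta\u015f JK", "85", "ST"]  # sample row with the problem character

# Writing with utf-8-sig prefixes the file with a BOM (b'\xef\xbb\xbf'),
# so Excel opens it as UTF-8; with plain utf-8 Excel would show mojibake.
with open("futbin_test.csv", "w", newline='', encoding="utf-8-sig") as f:
    csv.writer(f).writerow(row)

# Reading back with the same codec strips the BOM transparently:
with open("futbin_test.csv", newline='', encoding="utf-8-sig") as f:
    print(next(csv.reader(f)))  # ['Beşiktaş JK', '85', 'ST']
```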

【Discussion】:

  • This solved the problem. Thank you very much.