[Posted]: 2017-01-23 22:52:06
[Question]:
I wrote a scraper in Python that scrapes player data from the futbin.com website and writes it to a .csv file. I get the following error, which occurs at page 214, www.futbin.com/17/player/214. Full traceback:
Traceback (most recent call last):
  File "C:/Users/jona_/PycharmProjects/untitled2/futbin_scraper_2.py", line 94, in <module>
    writer.writerows([prices_attributes])
  File "C:\Program Files\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 145: character maps to <undefined>
I suspect it is caused by this piece of data on the page: "Beşiktaş JK" (and other entries like it). I'm guessing the odd 's' character is unreadable for the Windows console. I have tried changing my console encoding. It is currently set to utf-8, which I checked with:

>>> import sys
>>> print(sys.stdin.encoding)
utf-8
>>> print(sys.stdout.encoding)
cp437
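The error itself can be reproduced without the console or the scraper at all: it happens whenever a string containing 'ş' (U+015F) is passed to a charmap codec such as cp1252, while UTF-8 handles it fine. A minimal illustration:

```python
# 'ş' (U+015F) is outside the cp1252 character map, so encoding fails;
# UTF-8 can represent any Unicode character, so it succeeds.
text = "Beşiktaş JK"

try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)

print(text.encode("utf-8"))  # succeeds, yields the UTF-8 byte string
```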
I also tried setting it to utf-16 with the set PYTHONIOENCODING=utf-16 command, and I have installed the win-unicode-console package, but neither solved my problem. For completeness I will post the entire script below.
The problem started when I added the line league = html_tree.xpath('//td/a[@href]/text()', smart_strings=False). It scrapes data from the "Information" table on the left side of the page.
There are several other questions here about unicode errors, and honestly I have tried every solution I was capable of understanding.
I am using the JetBrains PyCharm Community Edition IDE on Windows 10, with Python 3.5.
Any help would be greatly appreciated.
#
# This programme fetches price data and player attributes from the FIFA 17 Ultimate Team Market
# And writes them into a .csv file.
import csv
import requests
from lxml import html
import time
import os.path
import sys
#
# This creates a .csv file in a pre-specified directory to write the player data into
# Change: save_path and name_of_file
save_path = 'D:/Msc Finance/Thesis/Futbin Data/'
name_of_file = "futbin_data"
completeName = os.path.join(save_path, name_of_file + ".csv")
outfile = open(completeName, "w", newline='')
#
# This generates a list of futbin.com URLs to feed into the script
# Change: integers in range() to specify the amount of futbin.com player pages to parse
amount_of_players = 16300
list_of_urls = []
for i in range(amount_of_players):
    id = i + 1
    url = "https://www.futbin.com/17/player/{0}".format(id)
    list_of_urls.append(url)
#
# This loop finds all the player data from each url in list_of_urls and stores them into a list
for url in list_of_urls:
    responses = requests.get(url)
    html_tree = html.fromstring(responses.content)
    name = html_tree.xpath('//span[@class = "header_name"]/text()', smart_strings=False)
    prices = html_tree.xpath('//span[@class ="bin_text"]/text()', smart_strings=False)
    attributes = html_tree.xpath('//td[@class ="table-row-text"]/text()', smart_strings=False)
    league = html_tree.xpath('//td/a[@href]/text()', smart_strings=False)
    position = html_tree.xpath('//div[@class ="pcdisplay-pos"]/text()', smart_strings=False)
    rating = html_tree.xpath('//div[@class ="pcdisplay-rat"]/text()', smart_strings=False)
    pace = html_tree.xpath('//div[@class ="pcdisplay-ovr1"]/text()', smart_strings=False)
    shot = html_tree.xpath('//div[@class ="pcdisplay-ovr2"]/text()', smart_strings=False)
    passing = html_tree.xpath('//div[@class ="pcdisplay-ovr3"]/text()', smart_strings=False)
    dribble = html_tree.xpath('//div[@class ="pcdisplay-ovr4"]/text()', smart_strings=False)
    defense = html_tree.xpath('//div[@class ="pcdisplay-ovr5"]/text()', smart_strings=False)
    physique = html_tree.xpath('//div[@class ="pcdisplay-ovr6"]/text()', smart_strings=False)
    # This merges all the player data together into one big list
    prices_attributes = prices + attributes + league + position + rating + pace + shot + \
        passing + dribble + defense + physique + name
    # This removes all instances of \n from the big list
    prices_attributes = [i.replace('\n', '') for i in prices_attributes]
    # This removes all blank spaces from the big list
    prices_attributes = [i.replace(' ', '') for i in prices_attributes]
    # In some instances the '//td[@class ="table-row-text"]/text()' XPath from attributes returns an extra empty element
    # This 'if' statement removes the extra element to ensure all the columns in the .csv file still align properly
    if len(prices_attributes) > 40:
        prices_attributes.pop(25)
        prices_attributes.pop(30)
    #
    # This removes all the remaining empty elements from the big list. Not (12, 13, 14, 24, 25, 26) because
    # index numbers shift dynamically as the script removes elements from the list
    if prices_attributes:
        prices_attributes.pop(11)
        prices_attributes.pop(11)
        prices_attributes.pop(11)
        prices_attributes.pop(20)
        prices_attributes.pop(20)
        prices_attributes.pop(20)
    # Some URLs from list_of_urls no longer exist. These URLs yield empty lists: []
    # The 'if' statement below makes sure only non-empty lists are written to the .csv file
    if prices_attributes:
        writer = csv.writer(outfile)
        writer.writerows([prices_attributes])
    # This fixes the delay between queries to 0.1 seconds
    time.sleep(0.1)
    # This prints the loop's % progress into the Python Console
    sys.stdout.write("\r%d%%" % ((100 / amount_of_players) * (list_of_urls.index(url) + 1)))
    sys.stdout.flush()
[Comments]:
- (1) Provide the entire traceback. (2) sys.stdout.encoding is probably more relevant than sys.stdin.encoding; please provide it.
- >>> import sys >>> print(sys.stdout.encoding) output: cp437
- Traceback (most recent call last): File "C:/Users/jona_/PycharmProjects/untitled2/futbin_scraper_2.py", line 94, in <module> writer.writerows([prices_attributes]) File "C:\Program Files\Anaconda3\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 145: character maps to <undefined>
- Also edited it into the question body for readability
- The traceback shows it is trying to encode with cp1252, which covers many "odd" characters, but not the one in your data. You need to find out where cp1252 is being introduced and use utf8 there instead.
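Following that hint: the traceback points at the csv writer, not the console. On Windows, Python 3's open() falls back to the locale encoding (cp1252 here) when no encoding argument is given, so the output file is where cp1252 sneaks in. A minimal sketch of the likely fix, assuming the same csv-writing pattern as the script (the file name below is illustrative):

```python
import csv

# Passing encoding='utf-8' makes the file accept any Unicode character,
# so names like "Beşiktaş JK" are written without a UnicodeEncodeError.
# Without it, Windows defaults to cp1252 and the write fails on U+015F.
with open("futbin_data.csv", "w", newline='', encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerows([["Beşiktaş JK", "1200"]])
```

Excel users may prefer encoding="utf-8-sig", which adds a byte-order mark so Excel detects UTF-8 when opening the file.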
Tags: python excel unicode web-scraping lxml