【Posted】: 2018-10-02 09:52:49
【Question】:
#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET

filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-ZZZ]*.xml")
out_lines = []
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')
            out_lines.append(w.text + "," + lemma + "," + pos + "," + tag)

with open("C:\\Users\\####\\Desktop\\bnc.txt", "w") as out_file:
    for line in out_lines:
        line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
        out_file.write("{}\n".format(line))
This gives the error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 0: character maps to <undefined>
I thought this line would solve the problem...
line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
【Comments】:
-
Have you tried open("C:\\Users\\####\\Desktop\\bnc.txt", "w", encoding='utf8')?
-
Please post the whole traceback. Python tells you which line has the problem... Pay it forward! Don't make us guess.
-
All that line = bytes(line, 'utf-8').decode('utf-8', 'ignore') does is encode to utf-8 and decode again. You get the original string back. I suspect the problem occurs when you try to write to an ascii file.
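-
To illustrate the two comments above, here is a minimal sketch rather than a definitive fix. It assumes the output file was being opened with a Windows code page such as cp1252 (the 'charmap' in the error message suggests a code-page codec, but the actual default on the asker's machine is not shown). The sample string and the temporary file name bnc_demo.txt are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical sample line containing U+2192, the character from the error.
line = "\u2192,arrow,SUBST,NN1"

# Encoding to UTF-8 and decoding again returns the same string,
# so this step cannot remove the problematic character.
round_tripped = bytes(line, "utf-8").decode("utf-8", "ignore")

# cp1252 is an assumed stand-in for the platform default encoding.
# It has no mapping for U+2192, so encoding fails.
try:
    line.encode("cp1252")
    encodable_in_cp1252 = True
except UnicodeEncodeError:
    encodable_in_cp1252 = False

# Passing encoding="utf-8" to open() avoids relying on the platform default.
path = os.path.join(tempfile.mkdtemp(), "bnc_demo.txt")
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("{}\n".format(line))

with open(path, "r", encoding="utf-8") as in_file:
    read_back = in_file.read()
```

If this assumption holds, adding encoding="utf-8" to the open() call that writes bnc.txt should be enough, and the bytes(...).decode(...) line can simply be dropped.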