Special national characters won't .split() in Python

【Question title】: Special national characters won't .split() in Python
【Posted】: 2014-06-12 20:30:52
【Question】:

I am having trouble in Python when reading special national characters from a text file.

with open("../Data/DKsnak.txt") as f:
    content = f.readlines()

str1 = content[0]
print "string:",str1

lst1 = str1.split()
print "list:",lst1

The output is:

string: Udtræk fra observatør på årstal
list: ['Udtr\xc3\xa6k', 'fra', 'observat\xc3\xb8r', 'p\xc3\xa5', '\xc3\xa5rstal']

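The first line is as expected, including the special Danish characters. But they do not survive being split into a list. I have tried all sorts of tricks with codecs and unicode, but cannot find the magic bullet.

Can anyone suggest how I can get these words into a list so that I can work with them?

Best regards, Martin

Running: Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2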

【Comments】:

【Solution 1】:

Use the for loop mentioned earlier if you want the words on the same line:

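string = ''  # start from an empty string before appending
for i in range(len(list1)):  # list1 is the list returned by split()
    string += list1[i] + ' '

print(string)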
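
The same result can be had more compactly with str.join, which also avoids the trailing space that the loop leaves behind. A small sketch, where list1 stands in for the list produced by split() and the sample words are taken from the question:

```python
# -*- coding: utf-8 -*-
# list1 stands in for the result of split(); the words come from the question
list1 = [u'Udtræk', u'fra', u'observatør', u'på', u'årstal']

# join() puts a single space between the words and none after the last one
joined = u' '.join(list1)

# Check that the original line is recovered
print(joined == u'Udtræk fra observatør på årstal')
```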

【Comments】:

【Solution 2】:

From https://docs.python.org/2.7/howto/unicode.html:

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')

This way you read unicode strings, and those can be split.
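
A minimal sketch of this approach applied to the question's data, assuming the file is UTF-8 encoded. It uses io.open, which accepts an encoding argument on Python 2.6+ as well as Python 3, in place of codecs.open. The temporary file (a stand-in for DKsnak.txt) and the helper name read_words are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return the words of the first line as unicode strings (hypothetical helper)."""
    # io.open decodes the bytes to text while reading
    with io.open(path, encoding=encoding) as f:
        first_line = f.readline()
    return first_line.split()


# Write a sample file so the sketch is self-contained (stand-in for DKsnak.txt)
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(u"Udtræk fra observatør på årstal\n")

words = read_words(sample_path)
os.remove(sample_path)

# Lengths are counted in characters, not bytes, because the words are decoded text
word_lengths = [len(word) for word in words]
print(word_lengths)
```

Each element of words is now text rather than raw bytes, so "Udtræk" has length 6 instead of the 7 bytes it takes up in the file.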

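【Comments】:

【Solution 3】:

Your code is fine. Python just displays the special characters this way when it prints a list: each element is shown through its repr(), which escapes non-ASCII bytes. If you print the words themselves, you still get the original text: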

s = 'Udtræk fra observatør på årstal'
s = s.split()

for i in s:
    print i

[OUTPUT]         #all fine
Udtræk
fra
observatør
på
årstal
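
The escaped entries in the question's list output are the UTF-8 bytes of the Danish letters, so both outputs describe the same data. A small sketch, assuming the file is UTF-8 encoded (the variable names are illustrative), that decodes those bytes back into text:

```python
# The byte strings exactly as they appear in the question's list output
raw_words = [b'Udtr\xc3\xa6k', b'fra', b'observat\xc3\xb8r', b'p\xc3\xa5', b'\xc3\xa5rstal']

# Decode each UTF-8 byte string into a unicode string
words = [w.decode('utf-8') for w in raw_words]

# The letter ae takes two bytes in UTF-8 but is one character once decoded
byte_len = len(raw_words[0])  # 7 bytes
char_len = len(words[0])      # 6 characters
print(byte_len)
print(char_len)
```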

【Comments】: