Unsupported characters in input (Python 2.7.9): answers

【Question title】: Unsupported characters in input (Python 2.7.9)
【Posted】: 2015-01-03 22:01:21
【Question description】:

A small question from a newbie. I'm trying to write a small function that randomizes the contents of a text.

#-*- coding: utf-8 -*-
import random

def glitch(text):
    new_text = ['']
    for x in text:
        new_text.append(x)
        random.shuffle(new_text)
    return ''.join(new_text)

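As you can see it's pretty simple, and given a plain string such as "Hey, how's it going?" it produces the expected randomized sentence. However, when I try to paste something like this: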

print glitch('Iàäï††n$§&0ñŒ≥Q¶µù`o¢y”—œº')

...Python 2.7.9 returns 'Unsupported characters in input'. I've looked around the forums and tried what I could, since I'm still very new to coding in general, but to no avail.

Any suggestions?

Thanks.

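【Question comments】:

Works fine for me in 2.7.5, both when printing from within a script and after importing into the console.
Could I be missing some preference, or do I perhaps need to download a package to use this kind of input? I'm on Mac OSX 10.10.1. I tried switching the preference between the three options (Locale, utf-8, none) several times, but nothing seemed to work.

Tags: python-2.7 utf-8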

【Solution 1】:

#-*- coding: utf-8 -*-
import random

def glitch(text):

    new_text = ['']
    for x in text:
        new_text.append(x)
        random.shuffle(new_text)
    return ''.join(new_text)

print (glitch(u'Iàäï†n$§&0ñŒ≥Q¶µù`o¢y”—œº'))

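This should work. From a quick Google search of my own, I found that you have to prefix the string literal with the letter "u" to mark the text that follows as unicode.

Source: Unsupported characters in input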
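To illustrate what the "u" prefix changes, here is a minimal sketch (not part of the original answer; the variable names are made up for illustration). It compares a unicode string with the UTF-8 bytes of the same text. Escapes are used instead of the literal characters so the snippet does not depend on the source file's encoding.

```python
# "I" followed by a-grave, a-umlaut, i-umlaut, written with escapes.
text = u"I\u00e0\u00e4\u00ef"

# The same text as UTF-8 bytes, which is what a plain '...' literal holds
# in Python 2 when the source file is saved as UTF-8.
raw = text.encode("utf-8")

# A unicode string is a sequence of code points; a byte string is a
# sequence of bytes, and each accented letter here takes two bytes.
print(len(text))
print(len(raw))
```

Shuffling the byte string can therefore split a two-byte character apart, whereas shuffling the unicode string moves whole code points. This only shows the difference between the two types; it does not by itself explain the 'Unsupported characters in input' message.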

【Comments】:

Thanks, but the u prefix doesn't seem to make any difference.

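【Solution 2】:

Your problem is with Python 2.x in general, not your specific version of Python 2. Python 2.x defaults to ascii rather than a Unicode encoding (this changed in Python 3), and your string is (likely) encoded as `utf-8`. See the following: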

import chardet
text = 'Iàäï†n$§&0ñŒ≥Q¶µù`o¢y”—œº'
print chardet.detect(text)['encoding'] # prints utf-8

If you download Python 3.x, your problem will probably be solved, since UTF-8 can handle any Unicode code point.
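As a small sketch of the difference between the two encodings mentioned here (illustrative only; the sample text and variable names are assumptions, not taken from the question), UTF-8 round-trips non-ASCII text, while the ascii codec either fails or drops characters:

```python
# Three non-ASCII characters from the question's sample, written with
# escapes: n-tilde, OE ligature, greater-than-or-equal sign.
text = u"\u00f1\u0152\u2265"

# UTF-8 can encode every one of them, and decoding restores the text.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# The ascii codec cannot represent any of them.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With the "ignore" error handler the characters are silently dropped.
lossy = text.encode("ascii", "ignore")

print(round_trip == text)
print(ascii_ok)
print(len(lossy))
```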

If you're interested, or for future 2.x users, you can do the following.

import random

def glitch(text):
    new_text = []
    for x in text:
        new_text.append(x)
    random.shuffle(new_text)  # note: shuffle once, not on every iteration
    new_line = ''.join(new_text)  # this line is where your encoding moves from `utf-8` to `ascii`
    # this becomes `ascii` because of the empty string you use to join your list; it defaults to `ascii`
    # if you tried to make it `unicode` by doing `u''.join(list)` you would get a `UnicodeDecodeError`
    return new_line.decode("ascii", "ignore").encode("utf-8")  # note the `ignore`: it skips bytes that cannot be decoded
    # now your code will run and return a string of utf-8 characters
    # (to which we encode new_line, and which is the default encoding of a string anytime you `decode()` it.)
    # note that you will return a shorter string, because (again) `ascii` can only represent
    # 128 characters by default, whereas some of your `utf-8` string is represented by
    # byte values between 128 and 255.

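I hope this helps and makes sense. There is plenty of material online discussing this issue (including several questions of my own, for example :))

【Comments】:

print chardet.detect(text)['encoding'] shows "UTF-8" because your text string was saved and encoded as UTF-8. If I save the file with a different encoding, I can make this show windows-1252.
A rare downvote from me, because your answer is too misleading for other users. Python 2.x is just as capable when it comes to Unicode support. Your glitch() method contains many fallacies that will keep users from understanding Unicode support properly.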
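Following up on the last comment, one possible sketch of a version that keeps every character is to decode to unicode first, shuffle, and return unicode. This is not the code of any answer above: the name glitch_unicode and the encoding parameter are hypothetical, and it assumes byte input really is UTF-8.

```python
import random

def glitch_unicode(text, encoding="utf-8"):
    """Shuffle the code points of text without dropping any of them."""
    # Byte strings (Python 2 str, Python 3 bytes) are decoded first so
    # that multi-byte characters are treated as single units.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    chars = list(text)
    random.shuffle(chars)  # shuffle once, after collecting all characters
    return u"".join(chars)

# Usage: the result has the same characters as the input, in random order.
sample = u"I\u00e0\u00e4\u00ef\u2020n$"
shuffled = glitch_unicode(sample.encode("utf-8"))
print(len(shuffled) == len(sample))
```

Whether the shuffled result can then be printed still depends on the terminal or IDE accepting those characters, which this sketch does not address.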