Unicode search not working: answers

【Question title】: Unicode search not working
【Posted】: 2016-01-06 23:22:34
【Question description】:

Consider this.

# -*- coding: utf-8 -*-
import re

data = "cdbsb \xe2\x80\xa6 abc"
print data
# prints: cdbsb … abc
#               ^
print re.findall(ur"[\u2026]", data)

Why can't re find this Unicode character? I have already checked that

\xe2\x80\xa6 === … === U+2026

【Question discussion】:

Tags: python regex python-2.7 python-unicode

【Solution 1】:

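I guess the problem is that data is a byte string. Your console is probably set to UTF-8, so when the string is printed the console interprets the bytes as UTF-8 and displays the result (you can check the encoding in sys.stdout.encoding). That is why you see the character ….

But re most likely does not do this decoding for you.

If you decode data from UTF-8 first, re.findall gives the desired result. Example: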

>>> import re
>>> data = "cdbsb \xe2\x80\xa6 abc"
>>> print re.findall(ur"[\u2026]", data.decode('utf-8'))
[u'\u2026']

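【Discussion】:

Do you mean that print does this automatically? And it still does not print …?
Actually, it is converted to the encoding the console uses: sys.stdout.encoding.
@vks I am trying to find a reference for this. So far I have only found wiki.python.org/moin/PrintFails.
@vks I believe it is the console that does the decoding/encoding, not Python.
Not sure whether this helps, but if u"\u2026".encode('utf8') in data: print True works for me ;)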

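【Solution 2】:

data is of type str and holds raw bytes written as hex escapes, whereas the search term is of type unicode. What print shows depends on sys.stdout.encoding, so printing data as-is gives different output from printing data.decode('utf-8'). I am using Python 2.7.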

import re
import sys

data = "cdbsb \xe2\x80\xa6 abc"
search = ur"[\u2026]"

print sys.stdout.encoding
## windows-1254

print data, type(data)
## cdbsb â€¦ abc <type 'str'>

print data.decode(sys.stdout.encoding)
## cdbsb â€¦ abc

print data.decode('utf-8')
## cdbsb … abc

print search, type(search)
## […] <type 'unicode'>

print re.findall(search, data.decode('utf-8'))
## [u'\u2026']
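
To see why the same bytes display differently, here is a minimal sketch that decodes them with both codecs. It uses cp1254, Python's name for the windows-1254 encoding reported above, and b""/u"" literals so that it runs on Python 3 as well as 2.7; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
raw = b"\xe2\x80\xa6"  # the UTF-8 encoding of U+2026 (the ellipsis)

as_utf8 = raw.decode("utf-8")     # the three bytes form one character
as_cp1254 = raw.decode("cp1254")  # each byte becomes its own character

print(len(as_utf8))                         # 1
print(len(as_cp1254))                       # 3
print(as_utf8 == u"\u2026")                 # True
print(as_cp1254 == u"\u00e2\u20ac\u00a6")   # True: the "â€¦" seen above
```

Only the UTF-8 decoding yields the character the pattern is looking for, which is why the data has to be decoded with the codec it was actually encoded in.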

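【Discussion】:

【Solution 3】:

If you follow the link provided by nhahtdh,

Solving Unicode Problems in Python 2.7

you can see that the original string is bytes while we are searching with unicode, so it should not be expected to work.

encode(): takes you from Unicode → bytes

decode(): takes you from bytes → Unicode

Based on these, we can solve it in two ways.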

# -*- coding: utf-8 -*-
import re

# Way 1: decode to unicode and search with a unicode pattern
data = "cdbsb \xe2\x80\xa6 abc".decode("utf-8")  # convert to unicode
print data
print re.findall(ur"[\u2026]", data)
print re.findall(ur"[\u2026]", data)[0].encode("utf-8")  # match in unicode, then re-encode to bytes for printing

# Way 2: keep everything as bytes
data1 = "cdbsb \xe2\x80\xa6 abc"  # let it remain bytes
print data1
print re.findall(r"\xe2\x80\xa6", data1)[0]  # search for the byte sequence
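
As a rough sketch of the encode()/decode() relationship and the two approaches side by side: the version below uses b"" and u"" literals instead of ur"", so it also runs on Python 3, where ur"" is a syntax error. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import re

raw = b"cdbsb \xe2\x80\xa6 abc"       # bytes
text = raw.decode("utf-8")            # decode(): bytes -> Unicode
roundtrip_ok = text.encode("utf-8") == raw  # encode(): Unicode -> bytes

# Way 1: Unicode pattern against Unicode text
text_hits = re.findall(u"\u2026", text)

# Way 2: bytes pattern against bytes
byte_hits = re.findall(b"\xe2\x80\xa6", raw)

print(roundtrip_ok)                                   # True
print(text_hits == [u"\u2026"])                       # True
print(byte_hits == [b"\xe2\x80\xa6"])                 # True
print(byte_hits[0].decode("utf-8") == text_hits[0])   # True
```

Both searches find the same character; they differ only in whether the pattern and the data are both text or both bytes.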

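【Discussion】:

Personally, I see no reason to use the second method. In Python 2, you should always use the unicode type when working with character data. Using str can hide bugs when literal Unicode characters appear in the regex source.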
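
One way a bytes pattern can hide a bug, offered as a sketch and not as the commenter's own example: inside a character class, the three UTF-8 bytes of … are three separate alternatives, so the class also matches a lone \xe2 that belongs to another character. The sample string containing a euro sign is an assumption chosen for illustration.

```python
# -*- coding: utf-8 -*-
import re

# "5 €" as UTF-8 bytes; the euro sign U+20AC encodes to \xe2\x82\xac
raw = u"5 \u20ac".encode("utf-8")

# Bytes character class: matches \xe2, \x80 or \xa6 individually
byte_hits = re.findall(b"[\xe2\x80\xa6]", raw)

# Unicode character class: matches only the single character U+2026
text_hits = re.findall(u"[\u2026]", raw.decode("utf-8"))

print(byte_hits == [b"\xe2"])  # True: a false positive, the first byte of the euro sign
print(text_hits == [])         # True: the text contains no ellipsis
```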

【Solution 4】:

Another solution:

>>> data = "cdbsb \xe2\x80\xa6 abc"
>>> print data 
cdbsb … abc
>>> if u"\u2026".encode('utf8') in data: print True
... 
True
>>> if u"\u2026" in data.decode('utf8'): print True
... 
True

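【Discussion】:

I posted a brief comment saying that I could not reproduce this. Shortly afterwards I found my mistake and deleted the comment. I apologize for the wrong information that was shown here for a few minutes.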