utf-8 search for word in list: answers

【Question title】: utf-8 search for word in list
【Posted】: 2018-04-06 00:02:56
【Question description】:

I have a lookup list generated from a UTF-8 file:

import itertools

with open('stop_word_Tiba.txt') as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # flatten the 2d list into a 1d list

When I open the file, I can see the word 'الو' in it, so it is in the list. But the list now looks like ['\xd8\xa7\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x84\xd9\x88', '\xd8\xa7\xd9\x88\xd9\x83\xd9\x8a', '\xd8\xa7\xd9\x84', '\xd8\xa7\xd9\x87', '\xd8\xa3\xd9\x87', '\xd9\x87\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x88\xd9\x83\xd9\x8a', '\xd9\x88']

Then I want to check whether a specific word is in newStopWords1d. The word 'الو' is '\xd8\xa7\xd9\x84\xd9\x88'.
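The escaped entries in the list are the UTF-8 bytes of the Arabic words. A minimal sketch of that relationship follows; the variable names are illustrative and not taken from the question, and the word is written with \u escapes only so that the snippet does not depend on the source file encoding.

```python
# Illustrative only: shows that the escaped entry is the UTF-8 encoding of the word.
word = u'\u0627\u0644\u0648'           # the text string 'الو' (alef, lam, waw)
raw = b'\xd8\xa7\xd9\x84\xd9\x88'      # the bytes as they appear in the list

encoded = word.encode('utf-8')         # text -> bytes
decoded = raw.decode('utf-8')          # bytes -> text

print(encoded == raw)                  # True
print(decoded == word)                 # True
```

A byte string and a text string holding the same word are different objects, so a comparison between the two needs one side converted first.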

word='الو'
for w in newStopWords1d:
    if word == w.encode("utf-8"):
        print 'found'

The word was not found. I also tried

    if word in newStopWords1d:
        print 'found'

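but again the word was not found. It looks like an encoding problem, but I cannot solve it. Can you help me?

【Discussion】:

Did you write this file yourself, by writing str(my_list)? If so, can you throw the file away and write one in a more useful format, such as JSON, or just one string per line?
If you have to use this file as-is, you may want to use ast.literal_eval to convert the input back into a list of strings, and then search that list rather than searching the string representation of a list of strings. But, again, if it is at all possible to rewrite the file properly, do that instead.
In Python 2.7, removing .encode("utf-8") and using just if word == w: produces found
OK, that is also a solution, thanks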
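
The two suggestions in the comments above (parsing with ast.literal_eval, or storing the list as JSON) could look roughly like this. It is only a sketch under the assumption that the file holds the text produced by str(my_list); the sample string and variable names are made up for illustration.

```python
import ast
import json

# Hypothetical file content: the repr of a list, as str(my_list) would write it.
text = "['abc', 'def', 'ghi']"

# literal_eval parses Python literals back into objects without executing code.
words = ast.literal_eval(text)
print('def' in words)

# One way to store the list in a more useful format: JSON.
as_json = json.dumps(words)
restored = json.loads(as_json)
print(restored == words)
```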

Tags: python search utf-8

【Solution 1】:

It is worth noting that you are using Python 2.7.

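word = u'الو'  # unicode literal, so it can be compared with the decoded text
for w in newStopWords1d:
    if word == w.decode("utf-8"):  # decode the UTF-8 bytes read from the file
        print 'found'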

A better solution is to use the open function from either the io module

import io

with io.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

or the codecs module

import codecs

with codecs.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

since the built-in open function in Python 2.7 does not support specifying an encoding.
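
A minimal end-to-end sketch of the io.open approach follows. It writes a small throwaway file so that it is self-contained; the file name and its contents are hypothetical stand-ins for stop_word_Tiba.txt, and itertools.chain.from_iterable is used here as a one-step alternative to the two-step flattening in the question.

```python
import io
import itertools
import os
import tempfile

# Build a small throwaway file so the sketch is self-contained;
# its name and contents are hypothetical stand-ins for stop_word_Tiba.txt.
path = os.path.join(tempfile.mkdtemp(), 'stop_words_demo.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u0627\u0644\u0648 \u0627\u0644\n\u0648\n')

# io.open decodes each line to text while reading.
with io.open(path, encoding='utf-8') as f:
    stop_words = list(itertools.chain.from_iterable(line.split() for line in f))

word = u'\u0627\u0644\u0648'   # the word 'الو'
found = word in stop_words     # text compared with text, no encode/decode needed
print(found)
```

Because both the list entries and the search word are text strings, a plain membership test is enough.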

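【Discussion】:

Yes, Python 2.7 is correct, and the problem was reading the file as utf-8. Thanks

【Solution 2】:

I solved the problem by changing the file-opening statement to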

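import codecs
import itertools

def find_word(word):  # hypothetical wrapper: the original snippet used return without showing its function
    with codecs.open("stop_word_Tiba.txt", "r", "utf-8") as f:  # lines are decoded to unicode
        newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
    newStopWords1d = list(itertools.chain(*newStopWords))  # flatten to a 1d list
    for w in newStopWords1d:
        if word.encode("utf-8") == w.encode("utf-8"):
            return 'found'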

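Thank you.

【Discussion】:

I believe if word.encode("utf-8") == w would be enough. w is already utf-8 encoded
If the file is opened as utf-8, neither is actually needed. I tested a run with neither, i.e. if word == w : return 'found', and the word was found. Thanks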
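
The comparison only succeeds when both sides have the same type. As a rough sketch, a small helper can normalize either bytes or text to text before comparing; to_text is a hypothetical name, not part of any library.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b'\xd8\xa7\xd9\x84\xd9\x88'      # UTF-8 bytes, as a plain Python 2 open would return them
text = u'\u0627\u0644\u0648'           # the same word ('الو') as decoded text

same = to_text(raw) == to_text(text)
print(same)
```

Decoding once at the file boundary, as codecs.open or io.open does, avoids the need for a helper like this at every comparison.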