How do I remove duplicate words from a list in Python without using sets? Answers

【Question title】: How do I remove duplicate words from a list in Python without using sets?
【Posted】: 2015-06-01 11:29:59
【Question】:

I have the following Python code that almost works for me (I'm so close!). I'm opening a text file from a Shakespeare play. The original text file:

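"But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief"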

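The result of the code I wrote is this:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']

So this is almost what I want: it's already in a list sorted the way I want it, but how do I remove the duplicate words? I'm trying to create a new ResultList and append the words to it, but it gives me the result above without getting rid of the duplicates. If I "print ResultList", it just dumps a ton of words. The way I have it now is close, but I want to get rid of the extra "and's", "is's", "sun's" and "the's".... I want to keep it simple and use append(), but I'm not sure how to get it to work. I don't want to do anything crazy with the code. What simple thing am I missing from my code in order to remove the duplicate words?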

fname = raw_input("Enter file name: ")  
fhand = open(fname)
NewList = list()      #create new list
ResultList = list()    #create new results list I want to append words to

for line in fhand:
    line.rstrip()       #strip white space
    words = line.split()    #split lines of words and make list
    NewList.extend(words)   #make the list from 4 lists to 1 list

    for word in line.split():   #for each word in line.split()
        if words not in line.split():    #if a word isn't in line.split
            NewList.sort()             #sort it
            ResultList.append(words)   #append it, but this doesn't work.


print NewList
#print ResultList (doesn't work the way I want it to)

【Comments on the question】:

Can you use a dictionary?
Why not use an OrderedSet (stackoverflow.com/questions/1653970/…)?

Tags: python list duplicates

【Solution 1】:

mylist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(set(mylist), key=lambda x:mylist.index(x))
print(newlist)
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']

newlist contains the unique values from mylist as a list, sorted by the index at which each item first appears in mylist.
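
To see what the `key` contributes, here is a minimal sketch on a short, deliberately unsorted list (the sample words are made up for illustration). A set has no defined order, so sorting by `mylist.index` is what restores the order of first appearance. Each `index` call scans the list, so this is likely to get slow on long inputs.

```python
# Hypothetical sample data, deliberately unsorted.
mylist = ['sun', 'the', 'sun', 'east', 'the', 'moon']

# set() removes duplicates but loses order;
# sorting by the first index of each word restores it.
newlist = sorted(set(mylist), key=mylist.index)
print(newlist)  # ['sun', 'the', 'east', 'moon']
```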

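【Comments】:

Thanks. I was wondering whether I could do this without using a set. Sticking with the list theme....?
Why do you want to avoid sets? They're ideal for unique members.
That's what I get for trying to save a few keystrokes here and there while testing. Thanks for pointing it out. :)
I don't want to avoid sets forever. I'm just new to Python and want to progress from where I understand the basics before moving on to more advanced things. But I appreciate your code. I can use it in the future. Thanks again!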

【Solution 2】:

The following function may help.

def remove_duplicate_from_list(temp_list):
    # Returns a new list without duplicates, keeping first-seen order.
    if temp_list:
        my_list_temp = []
        for word in temp_list:
            if word not in my_list_temp:
                my_list_temp.append(word)
        return my_list_temp
    else:
        return []

【Comments】:

【Solution 3】:

Your code does have some logic errors. I've fixed them; I hope this helps.

fname = "stuff.txt"
fhand = open(fname)
AllWords = list()      #create new list
ResultList = list()    #create new results list I want to append words to

for line in fhand:
    line = line.rstrip()   #strip trailing white space (rstrip returns a new string)
    words = line.split()    #split lines of words and make list
    AllWords.extend(words)   #make the list from 4 lists to 1 list

AllWords.sort()  #sort list

for word in AllWords:   #for each word in the sorted list
    if word not in ResultList:    #if the word isn't in ResultList yet
        ResultList.append(word)   #append it.


print(ResultList)

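Tested on Python 3.4, no imports.

【Comments】:

Thanks! This is what I was looking for. I wanted to keep it "simple", and I now understand what I was doing wrong. Thanks again, much appreciated.
If the same word appears in the file with different capitalization, this will report duplicates. For example, if both "is" and "Is" appear in the file, you'll get both "is" and "Is" in the final list. Since your sample data has no such case, I don't think it's much of a problem.

【Solution 4】:

This should work: it walks the list and adds an element to the new list only if it differs from the last element added to the new list.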

def unique(lst):
    """ Assumes lst is already sorted """
    unique_list = []
    for el in lst:
        # The emptiness check avoids an IndexError on the first element.
        if not unique_list or el != unique_list[-1]:
            unique_list.append(el)
    return unique_list

You can also use itertools.groupby, which works similarly:

from itertools import groupby

# lst must already be sorted 
unique_list = [key for key, _ in groupby(lst)]
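
As a small illustration of why the input has to be sorted first: `groupby` (which lives in `itertools`) only merges runs of equal neighbours. The sketch below uses a shortened, made-up sample of the question's words.

```python
from itertools import groupby

# Shortened sample, already sorted.
lst = ['and', 'and', 'and', 'is', 'is', 'sun', 'sun', 'the']

# groupby yields one (key, group) pair per run of equal neighbours.
unique_list = [key for key, _ in groupby(lst)]
print(unique_list)  # ['and', 'is', 'sun', 'the']

# On unsorted input, duplicates that are not adjacent survive.
unsorted_result = [key for key, _ in groupby(['is', 'and', 'is'])]
print(unsorted_result)  # ['is', 'and', 'is']
```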

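【Comments】:

Thanks, I can use this in the future.

【Solution 5】:

A good alternative to using a set is to use a dictionary. The collections module contains a class called Counter, a specialized dictionary that counts the number of times each key has been seen. With it you can do something like this: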

from collections import Counter

wordlist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and',
            'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is',
            'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun',
            'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']

newlist = sorted(Counter(wordlist), 
                 key=lambda w: w.lower())  # case insensitive sort
print(newlist)

Output:

['already', 'and', 'Arise', 'breaks', 'But', 'east', 'envious', 'fair',
 'grief', 'is', 'It', 'Juliet', 'kill', 'light', 'moon', 'pale', 'sick',
 'soft', 'sun', 'the', 'through', 'what', 'Who', 'window', 'with', 'yonder']
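
Since a plain dictionary also has unique keys, the same idea should work without `Counter` when the counts are not needed. A minimal sketch, assuming Python 3.7 or later (where dictionaries preserve insertion order); `dict.fromkeys` is a standard built-in, and the short sample list is made up for illustration.

```python
wordlist = ['But', 'soft', 'is', 'the', 'is', 'the', 'sun']

# Dictionary keys are unique; on Python 3.7+ they keep insertion order.
unique_in_order = list(dict.fromkeys(wordlist))
print(unique_in_order)  # ['But', 'soft', 'is', 'the', 'sun']

# Case-insensitive alphabetical order.
alphabetical = sorted(unique_in_order, key=str.lower)
print(alphabetical)  # ['But', 'is', 'soft', 'sun', 'the']
```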

【Comments】:

Thank you very much. I didn't know about Counter.

【Solution 6】:

There's a problem in your code. I think you meant:

for word in line.split():   #for each word in line.split()
    if word not in ResultList:    #if the word isn't in ResultList
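
For context, a sketch of how a membership test against `ResultList` might sit in the full loop. It reads from an in-memory list of lines instead of a file so that it runs as-is; `lines` is a stand-in for the open file handle and is not part of the original code.

```python
# Stand-in for the file handle: the four lines from the question.
lines = [
    "But soft what light through yonder window breaks\n",
    "It is the east and Juliet is the sun\n",
    "Arise fair sun and kill the envious moon\n",
    "Who is already sick and pale with grief\n",
]

ResultList = []
for line in lines:
    for word in line.split():       # split() already ignores the newline
        if word not in ResultList:  # test the single word, not the whole list
            ResultList.append(word)

ResultList.sort()  # sort once, after all words are collected
print(ResultList)
```

Sorting once after the loop, rather than on every iteration, keeps the loop body down to the test and the append.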

【Comments】:

Thanks, I had that mixed up.

【Solution 7】:

Uses plain old lists. Almost certainly less efficient than Counter.

fname = raw_input("Enter file name: ")  

Words = []
with open(fname) as fhand:
    for line in fhand:
        line = line.strip()
        # lines probably not needed
        #if line.startswith('"'):
        #    line = line[1:]
        #if line.endswith('"'):
        #    line = line[:-1]
        Words.extend(line.split())

UniqueWords = []
for word in Words:
    if word.lower() not in UniqueWords:
        UniqueWords.append(word.lower())

print Words
UniqueWords.sort()
print UniqueWords

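This always checks the lowercase version of a word, so that the same word in different capitalizations is not counted as two different words.

I added checks to remove double quotes at the start and end of the file, but if they aren't present in the actual file, those lines can be ignored.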
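
A minimal sketch of the lower-casing and quote removal on an in-memory sample, so it runs without a file. The two sample lines, the `lines` list, and the names `unique_words` and `lowered` are illustrative assumptions. Note that `strip('"')` only removes quotes at the very start or end of a line.

```python
# Made-up sample: quotes at the start and end, "is" in two capitalizations.
lines = ['"But soft what light\n', 'It is the east and Juliet Is the sun"\n']

unique_words = []
for line in lines:
    line = line.strip().strip('"')   # drop whitespace, then surrounding quotes
    for word in line.split():
        lowered = word.lower()       # compare and store the lowercase form
        if lowered not in unique_words:
            unique_words.append(lowered)

unique_words.sort()
print(unique_words)
```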

【Comments】:

Thanks, this is a better way to alphabetize.
You can also check for and remove the quotes with line = line.strip().strip('"')

【Solution 8】:

This should do the job:

fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
lst.sort()
print(lst)

【Comments】: