【Question title】: Removing duplicate records in a file [duplicate]
【Posted】: 2012-01-06 04:53:23
【Question】:

Possible duplicate:
How might I remove duplicate lines from a file?

I have a file containing duplicate records that I want to remove. This is what I have tried:

import sys  

for line in sys.stdin:  
    line = line.rstrip()  
    line = line.split()  
    idlist = []   
    if idlist == []:  
        idlist = line[1]  
    else:  
        idlist.append(line[1])  
    print line[0], idlist

# doesn't work

And this one:

for line in sys.stdin:  
    line = line.rstrip()  
    line = line.split()  
    lines_seen = set()  
    dup = line[1]  
    if dup not in lines_seen:  
        lines_seen = dup  
    else:  
        lines_seen.append(dup)  
    print line[0], lines_seen  
    
sys.stdin.close()

# doesn't work either!

This is what the input looks like:

BLE 1234
BLE 1223
LLE 3456
ELE 1223
BLE 4444
ELE 5555
BLE 4444

This is what I want the output to look like:

BLE 1234
BLE 1223
LLE 3456
BLE 4444
ELE 5555

Thanks! Edge

【Question comments】:

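What do you consider a "duplicate record"?
Why isn't "BLE 1223" in your desired output? And why is the order of "LLE 3456" and "ELE 1223" reversed in the desired output?
By duplicate records I mean, in this example, duplicates in the second column, i.e. "1223" and "4444".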

Tags: python

【Solution 1】:

import fileinput

ss = '''BLE 1234
BLE 1223
LLE 3456
ELE 1223
BLE 4444
ELE 5555
BLE 4444 
'''
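# Write the sample data to a file so the example is self-contained
with open('klmp.txt', 'w') as f:
    f.write(ss)

seen = set()  # second-column values already encountered
# With inplace=True, printed output is written back to the file being read
for line in fileinput.input('klmp.txt', inplace=True):
    b = line.split()[1]
    if b not in seen:
        seen.add(b)
        print(line.strip())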

Searching SO for the word "fileinput", I found:

How to delete all blank lines in the file with the help of python?
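
The in-place approach above can also be wrapped in a small reusable function. This is only a sketch: the name `dedupe_in_place` and its `column` parameter are hypothetical, and keeping lines that lack the key column unchanged is an assumption, not something the answer specifies.

```python
import fileinput


def dedupe_in_place(path, column=1):
    """Rewrite `path`, keeping the first line seen for each value in `column`."""
    seen = set()
    # With inplace=True, anything printed inside the loop goes into the file.
    with fileinput.input(files=(path,), inplace=True) as lines:
        for line in lines:
            fields = line.split()
            if len(fields) <= column:
                # Assumption: lines without the key column are kept untouched.
                print(line, end="")
                continue
            key = fields[column]
            if key not in seen:
                seen.add(key)
                print(line, end="")
```

Calling `dedupe_in_place('klmp.txt')` would then overwrite the file with the deduplicated records, so it is worth testing on a copy first.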

【Comments】:

【Solution 2】:

import sys

elem1_seen = set()                 # second-column values seen so far
lines_out = []                     # "unique" output lines, in input order
for line in sys.stdin:             # iterate over the input lines
    elems = line.rstrip().split()  # split the line into its two elements
    if elems[1] not in elem1_seen: # if the second element has not been seen before...
        lines_out.append(line)     # keep the whole line (newline included)
        elem1_seen.add(elems[1])   # and record its second element as seen
sys.stdout.writelines(lines_out)   # write the kept lines, one per line
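
The same seen-set idea can be sketched as a generator, which avoids holding all output lines in memory. The function name `unique_by_column` and the choice to skip lines that have too few fields are assumptions made for this illustration.

```python
def unique_by_column(lines, column=1):
    """Yield each line whose `column`-th whitespace-separated field is new."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) <= column:
            continue  # assumption: skip blank or short lines
        key = fields[column]
        if key not in seen:
            seen.add(key)
            yield line


# Sample records taken from the question
sample = [
    "BLE 1234", "BLE 1223", "LLE 3456", "ELE 1223",
    "BLE 4444", "ELE 5555", "BLE 4444",
]
result = list(unique_by_column(sample))
print("\n".join(result))
```

To filter standard input instead of a list, something like `sys.stdout.writelines(unique_by_column(sys.stdin))` should work, since the yielded lines keep their trailing newlines.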

【Comments】:

This works great and makes more sense than what I was trying :)

【Solution 3】:

The main problem is that you are changing the types of your variables, which causes some confusion:

import sys  

for line in sys.stdin:  
    line = line.rstrip()   # line is a string
    line = line.split()    # line is now a list
    idlist = []            # idlist is a list
    if idlist == []:  
        idlist = line[1]   # idlist is now a string
    else:  
        idlist.append(line[1])  # and now?
    print line[0], idlist
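
A minimal sketch of the effect described above, using made-up values rather than the asker's input: assigning to a name rebinds it to a new object, whereas `append` modifies the existing list. Note also that in the snippet `idlist = []` runs on every iteration, so the `else` branch is never reached.

```python
idlist = []                  # idlist refers to a list
print(type(idlist).__name__)

idlist = "1223"              # rebinding: idlist now refers to a str
print(type(idlist).__name__)

try:
    idlist.append("4444")    # a str has no append method
except AttributeError as exc:
    error = exc
    print("AttributeError:", exc)

# Using append instead of assignment keeps the list a list.
ids = []
ids.append("1223")
ids.append("4444")
print(ids)
```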

【Comments】:

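I thought that if I say idlist = [], idlist would be an empty list? (Since lists are denoted by square brackets.)
Yes, but when you say "idlist = line[1]", you are creating a new variable (a string, in this case) that overwrites the original definition.
I see, good to know! Thanks.
Wait, I thought I had already turned the line into a list with line = line.split(), so I assumed idlist = line[1] would be the first element of the list I created...?
At that point, line is a list, but line[1] is its second element (a string), not the first.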
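
The indexing point in the last comment can be checked with a short sketch, using one record from the question's input; the variable names `first` and `second` are only illustrative.

```python
line = "BLE 1223".rstrip().split()   # split() returns a list of strings

first = line[0]    # indexing starts at 0, so this is "BLE"
second = line[1]   # the second element, "1223", is a str and not a list

print(line, first, second)
```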