【Question Title】: Removing duplicate records in a file [duplicate]
【Posted】: 2012-01-06 04:53:23
【Question】:

Possible duplicate:
How might I remove duplicate lines from a file?

I have a file containing duplicate records that I want to remove. This is what I have tried:

import sys  

for line in sys.stdin:  
    line = line.rstrip()  
    line = line.split()  
    idlist = []   
    if idlist == []:  
        idlist = line[1]  
    else:  
        idlist.append(line[1])  
    print line[0], idlist  

# doesn't work

And this one:

for line in sys.stdin:  
    line = line.rstrip()  
    line = line.split()  
    lines_seen = set()  
    dup = line[1]  
    if dup not in lines_seen:  
        lines_seen = dup  
    else:  
        lines_seen.append(dup)  
    print line[0], lines_seen  
    
sys.stdin.close()

# doesn't work either!

This is what the input looks like:

BLE 1234
BLE 1223
LLE 3456
ELE 1223
BLE 4444
ELE 5555
BLE 4444

This is what I want the output to look like:

BLE 1234
BLE 1223
LLE 3456
BLE 4444
ELE 5555

Thanks! Edge

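【Comments】:

  • What do you consider a "duplicate record"?
  • Why is "BLE 1223" not in your desired output? Why is the order of "LLE 3456" and "ELE 1223" reversed in the desired output?
  • For duplicate records in this example, I am focusing on the second column, i.e. "1223" and "4444".

Tags: python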


【Solution 1】:
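import fileinput

# Write the sample data to a file so the example is self-contained.
ss = '''BLE 1234
BLE 1223
LLE 3456
ELE 1223
BLE 4444
ELE 5555
BLE 4444
'''
with open('klmp.txt', 'w') as f:
    f.write(ss)

seen = set()  # second-column values already encountered
# With inplace=True, whatever is printed inside the loop is written back to the file.
for line in fileinput.input('klmp.txt', inplace=True):
    b = line.split()[1]
    if b not in seen:
        seen.add(b)
        print(line.strip())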

Searching SO for the word "fileinput", I found:

How to delete all blank lines in the file with the help of python?
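
As a rough sketch of how the in-place approach above could be packaged for reuse, the function below wraps the same `fileinput` loop. The name `dedupe_file_in_place` and the `field` parameter are illustrative assumptions, not part of the original answer. It is written for Python 3 and assumes every line has at least `field + 1` whitespace-separated columns.

```python
import fileinput


def dedupe_file_in_place(path, field=1):
    """Rewrite `path`, keeping the first line seen for each value in column `field`.

    Hypothetical helper for illustration; assumes every line has that column.
    """
    seen = set()
    # With inplace=True, fileinput redirects print() into the file being read.
    with fileinput.input(path, inplace=True) as lines:
        for line in lines:
            key = line.split()[field]
            if key not in seen:
                seen.add(key)
                print(line.rstrip("\n"))
```

Passing `backup='.bak'` to `fileinput.input` keeps a copy of the original file, which may be worth doing before overwriting real data.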

【Comments】:

    【Solution 2】:
    import sys

    elem1_seen = set()                 # second-column values seen so far
    lines_out = []                     # list of "unique" output lines
    for line in sys.stdin:             # iterate over input
        elems = line.rstrip().split()  # split line into its fields
        if elems[1] not in elem1_seen: # if second element not seen before...
            lines_out.append(line)     # keep the whole line (newline included)
            elem1_seen.add(elems[1])   # remember this second element
    sys.stdout.writelines(lines_out)   # write the kept lines, not the list's repr
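
The same seen-set idea can also be written as a generator, so lines are emitted as they are read instead of being collected in a list first. This is only a sketch in Python 3: `unique_by_field` and its `field` parameter are hypothetical names, and skipping lines with too few columns is an assumption rather than something the question specifies.

```python
def unique_by_field(lines, field=1):
    """Yield each line whose `field`-th column has not been seen before.

    Hypothetical helper; blank lines and lines with too few columns are skipped.
    """
    seen = set()
    for line in lines:
        parts = line.split()
        if len(parts) <= field:
            continue  # assumption: malformed lines are dropped rather than kept
        if parts[field] not in seen:
            seen.add(parts[field])
            yield line
```

To filter standard input, it could be driven with `sys.stdout.writelines(unique_by_field(sys.stdin))` after an `import sys`.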
    

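    【Comments】:

    • This one works great and makes much more sense than what I tried :)
    【Solution 3】:

    The main problem is that you are changing the variable's type, which causes some confusion: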

    import sys  
    
    for line in sys.stdin:  
        line = line.rstrip()   #Line is a string  
        line = line.split()    #Line is a list
        idlist = []            #idlist is a list
        if idlist == []:  
            idlist = line[1]   #id list is a string
        else:  
            idlist.append(line[1])  #and now?
        print line[0], idlist 
    

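    【Comments】:

    • I thought that if I said idlist = [], idlist would be an empty list? (Since lists are denoted by square brackets.)
    • Yes, but when you say "idlist=line[1]" you are creating a new variable (a string in this case) that overrides the original definition
    • I see, good to know! Thanks.
    • Wait, I thought I had already turned the line into a list with line = line.split(), so I assumed idlist = line[1] would be the first element of the list I created...?
    • At that point, line is a list, but line[1] is the second element (a string), not the first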
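
The exchange above can be checked directly in an interpreter. The following is a small sketch (Python 3, using one sample line from the question) that asserts the type at each step. The names `fields` and `error_message` are illustrative; `idlist` mirrors the question's code.

```python
line = "BLE 1223\n"
fields = line.rstrip().split()   # ['BLE', '1223'], a list of strings
idlist = []                      # the name idlist refers to a list

assert isinstance(fields, list)
assert fields[0] == "BLE"        # index 0 is the first element
assert fields[1] == "1223"       # index 1 is the second element, a str

idlist = fields[1]               # rebinds idlist to a str; the list is discarded
assert isinstance(idlist, str)

try:
    idlist.append("4444")        # str has no append method
except AttributeError as exc:
    error_message = str(exc)

idlist = []                      # what was probably intended:
idlist.append(fields[1])         # keep the list and add to it
```

The second attempt in the question runs into the same kind of error, since a `set` provides `add` rather than `append`.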