[Title]: Remove Duplicates from Text File
[Posted]: 2013-03-27 15:41:27
[Question]:

I want to remove duplicate lines from a text file.

I have some text files containing content like this:

None_None

ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624

None_None

ColumnConverter_56963312
ColumnConverter_56963312

PredicatesFactory_56963424
PredicatesFactory_56963424

PredicateConverter_56963648
PredicateConverter_56963648

ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888

The desired output is:

None_None

ConfigHandler_56663624

ColumnConverter_56963312

PredicatesFactory_56963424

PredicateConverter_56963648

ConfigHandler_80134888

I only tried this command: en=set(open('file.txt') but it doesn't work.

Can anyone help me extract the set of unique entries from the file?

Thanks

[Question comments]:

Tags: python string duplicates


[Solution 1]:

Here is a simple solution that uses a set to remove duplicates from a text file:

# Read everything first, then rewrite the same file with duplicates removed.
with open('workfile.txt') as f:
    lines_set = set(f.readlines())

with open('workfile.txt', 'w') as out:
    for line in lines_set:
        out.write(line)
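Note that a set discards both the original order and the blank-line grouping shown in the desired output. Since the duplicates in this file are always adjacent, a sketch that only collapses *consecutive* duplicates (the same idea as Unix `uniq`) keeps the blank separator lines intact; `dedupe_consecutive` and the sample data here are illustrative, not from the original answer:

```python
from itertools import groupby

def dedupe_consecutive(lines):
    """Collapse runs of identical lines, like Unix `uniq`.

    Blank separator lines are kept, so the grouping in the
    original file survives.
    """
    # groupby yields one (key, run) pair per run of equal lines
    return [key for key, _run in groupby(lines)]

sample = ["ConfigHandler_56663624\n"] * 4 + ["\n"] + ["None_None\n"]
print(dedupe_consecutive(sample))
# → ['ConfigHandler_56663624\n', '\n', 'None_None\n']
```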

[Discussion]:

[Solution 2]:

Here is an option that preserves order (unlike a set) but otherwise behaves the same (note that the EOL characters are deliberately stripped and blank lines are ignored)...

from collections import OrderedDict

with open('/home/jon/testdata.txt') as fin:
    lines = (line.rstrip() for line in fin)
    unique_lines = OrderedDict.fromkeys(line for line in lines if line)

print(list(unique_lines))
# ['None_None', 'ConfigHandler_56663624', 'ColumnConverter_56963312', 'PredicatesFactory_56963424', 'PredicateConverter_56963648', 'ConfigHandler_80134888']


Then you just need to write the above to your output file.
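That last step could look like the sketch below, which also restores the blank line after each entry to match the desired output. It assumes Python 3.7+, where a plain dict keeps insertion order and can stand in for OrderedDict; the file names are placeholders:

```python
# 'testdata.txt' and 'out.txt' are placeholder paths; the sample
# data mirrors the question's input format.
sample = "None_None\n\nConfigHandler_56663624\nConfigHandler_56663624\n"
with open('testdata.txt', 'w') as f:
    f.write(sample)

# dict keys keep insertion order in Python 3.7+
with open('testdata.txt') as fin:
    unique_lines = dict.fromkeys(line.rstrip() for line in fin if line.strip())

with open('out.txt', 'w') as fout:
    for line in unique_lines:
        fout.write(line + '\n\n')   # blank line after each unique entry
```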

[Discussion]:

• (line for line in lines if line)? Wow! :)

[Solution 3]:

Here is how to do it using a set (unordered result):

from pprint import pprint

with open('input.txt') as f:
    pprint(set(f.readlines()))


Also, you may want to get rid of the newline characters.
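Stripping the newlines matters because otherwise a line with and without a trailing `\n` would count as two different entries. A minimal sketch, assuming Python 3, with illustrative sample data:

```python
# Without .strip(), 'ConfigHandler_56663624\n' and
# 'ConfigHandler_56663624' would be two distinct set members.
lines = ["ConfigHandler_56663624\n", "ConfigHandler_56663624", "\n"]
unique = {line.strip() for line in lines if line.strip()}
print(unique)
# → {'ConfigHandler_56663624'}
```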

[Discussion]:

[Solution 4]:
def remove_duplicates(infile):
    storehouse = set()
    with open(infile) as src, open('outfile.txt', 'w') as out:
        for line in src:
            if line not in storehouse:   # keep only the first occurrence
                out.write(line)
                storehouse.add(line)

remove_duplicates('infile.txt')


[Discussion]:

• While this code may answer the question, providing additional context on why and/or how it answers the question improves its long-term value. Code-only answers are discouraged.

[Solution 5]:

If you just want output without duplicates, you can use sort and uniq:

hvn@lappy: /tmp () $ sort -nr dup | uniq
PredicatesFactory_56963424
PredicateConverter_56963648
None_None
ConfigHandler_80134888
ConfigHandler_56663624
ColumnConverter_56963312


For Python:

In [2]: with open("dup", 'rt') as f:
   ...:     lines = f.readlines()
   ...:

In [3]: lines
Out[3]:
['None_None\n',
 '\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 '\n',
 'None_None\n',
 '\n',
 'ColumnConverter_56963312\n',
 'ColumnConverter_56963312\n',
 '\n',
 'PredicatesFactory_56963424\n',
 'PredicatesFactory_56963424\n',
 '\n',
 'PredicateConverter_56963648\n',
 'PredicateConverter_56963648\n',
 '\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n']

In [4]: set(lines)
Out[4]:
set(['ColumnConverter_56963312\n',
     '\n',
     'PredicatesFactory_56963424\n',
     'ConfigHandler_56663624\n',
     'PredicateConverter_56963648\n',
     'ConfigHandler_80134888\n',
     'None_None\n'])


[Discussion]:

[Solution 6]:
uniq = set()
with open('yourfile') as f:   # plain text file, one entry per line
    for line in f:
        if line in uniq:
            print("duplicate : " + line.strip())
        else:
            uniq.add(line)
print(uniq)


[Discussion]:

[Solution 7]:

This way you get back the same file you put in, with duplicates removed:

import os
import uuid

def _remove_duplicates(filePath):
    with open(filePath) as f:
        lines_set = set(f.readlines())
    tmp_file = str(uuid.uuid4())
    with open(tmp_file, 'w') as out:
        for line in lines_set:
            out.write(line)
    os.rename(tmp_file, filePath)   # atomically replace the original


[Discussion]:
