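【Question title】:Remove lines that contain certain string
【Posted】:2012-08-15 12:08:54
【Question】:

I am trying to read a text file line by line and delete the lines that contain certain strings (in this case 'bad' and 'naughty'). The code I wrote looks like this: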

infile = file('./oldfile.txt')

newopen = open('./newfile.txt', 'w')
for line in infile :

    if 'bad' in line:
        line = line.replace('.' , '')
    if 'naughty' in line:
        line = line.replace('.', '')
    else:
        newopen.write(line)

newopen.close()

I wrote it like this, but it doesn't work.

One important point: if the content of the text is like this:

good baby
bad boy
good boy
normal boy

I don't want the output to contain empty lines. So not like this:

good baby

good boy
normal boy

but like this:

good baby
good boy
normal boy

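What should I change in the code above?

【Comments】:

  • Why are you replacing periods with an empty string in the lines you want to ignore?
  • @Wooble Maybe the OP expects it to behave like a regular expression, so that replace would replace every occurrence of anything in line with nothing.

Tags: python line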


【Solution 1】:

You can make your code simpler and more readable like this:

bad_words = ['bad', 'naughty']

with open('oldfile.txt') as oldfile, open('newfile.txt', 'w') as newfile:
    for line in oldfile:
        if not any(bad_word in line for bad_word in bad_words):
            newfile.write(line)

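This uses a context manager (the with statement) and any().

【Comments】:

    【Solution 2】:

    Instead of doing a replacement, you can simply not write the line to the new file.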

    for line in infile:
        if 'bad' not in line and 'naughty' not in line:
            newopen.write(line)
    

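    【Comments】:

    • I think you want "or" instead of "and".
    • I want to delete lines that contain either one of bad or naughty, too. Which one is right?
    • @H.Choi It is either not ('bad' in line or 'naughty' in line) or not 'bad' in line and not 'naughty' in line, so the and here is correct.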
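The equivalence described in the last comment is De Morgan's law. The following is only a small sketch that checks it on a few sample lines; the helper names (keep_with_and, keep_with_not_or, keep_with_or) and the sample list are made up for illustration and are not part of the original answer.

```python
def keep_with_and(line):
    # Keep the line only if neither word occurs in it
    return 'bad' not in line and 'naughty' not in line


def keep_with_not_or(line):
    # Same condition, written as the negation of "either word occurs"
    return not ('bad' in line or 'naughty' in line)


def keep_with_or(line):
    # The suggested "or" version: it only drops lines containing BOTH words
    return 'bad' not in line or 'naughty' not in line


samples = ['good baby\n', 'bad boy\n', 'naughty girl\n', 'bad and naughty\n']

# The two correct spellings agree on every sample line
for line in samples:
    assert keep_with_and(line) == keep_with_not_or(line)

kept_and = [line for line in samples if keep_with_and(line)]
kept_or = [line for line in samples if keep_with_or(line)]
print(kept_and)
print(kept_or)
```

With "or", a line is kept as soon as one of the two words is missing, which is why lines containing only bad or only naughty would slip through.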
    【Solution 3】:

    I use this to remove lines containing unwanted words from a text file:

    bad_words = ['abc', 'def', 'ghi', 'jkl']
    
    with open('List of words.txt') as badfile, open('Clean list of words.txt', 'w') as cleanfile:
        for line in badfile:
            clean = True
            for word in bad_words:
                if word in line:
                    clean = False
            if clean:
                cleanfile.write(line)
    

    Or do the same for all files in a directory:

    import os
    
    bad_words = ['abc', 'def', 'ghi', 'jkl']
    
    for root, dirs, files in os.walk(".", topdown = True):
        for file in files:
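            if file.endswith('.txt'):
                # Join with root so files in subdirectories are opened correctly
                src = os.path.join(root, file)
                dst = os.path.join(root, 'clean ' + file)
                with open(src) as filename, open(dst, 'w') as cleanfile:
                    for line in filename:
                        clean = True
                        for word in bad_words:
                            if word in line:
                                clean = False
                        if clean:
                            cleanfile.write(line)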
    

    I'm sure there must be a more elegant way to do this, but this does what I need.
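One possible tidier variant, offered only as a sketch: it relies on pathlib and any() instead of nested loops and a flag. The function name clean_tree, its prefix parameter, and the temporary-directory demo are assumptions made for illustration, not something from the original answer.

```python
import tempfile
from pathlib import Path


def clean_tree(top, bad_words, prefix='clean '):
    """Write a filtered copy of every .txt file under top; return the new paths."""
    written = []
    # sorted() materialises the listing first, so copies created below
    # are not picked up during the same run
    for src in sorted(Path(top).rglob('*.txt')):
        if src.name.startswith(prefix):
            continue  # skip copies left over from an earlier run
        dst = src.with_name(prefix + src.name)
        with src.open() as infile, dst.open('w') as outfile:
            for line in infile:
                if not any(word in line for word in bad_words):
                    outfile.write(line)
        written.append(dst)
    return written


# Small demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'a.txt').write_text('abc one\nkeep me\nxx def\n')
    out = clean_tree(tmp, ['abc', 'def'])
    result = out[0].read_text()
print(result)
```

Skipping files that already start with the prefix is a design choice of this sketch; it avoids producing "clean clean ..." files when the function is run twice on the same tree.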

    【Comments】:

      【Solution 4】:

      I needed to do a similar task today, so I wrote a gist that does the job, based on some research I did. I hope someone finds it useful!

      import os
      
      os.system('cls' if os.name == 'nt' else 'clear')
      
      oldfile = input('{*} Enter the file (with extension) you would like to strip domains from: ')
      newfile = input('{*} Enter the name of the file (with extension) you would like me to save: ')
      
      emailDomains = ['windstream.net', 'mail.com', 'google.com', 'web.de', 'email', 'yandex.ru', 'ymail', 'mail.eu', 'mail.bg', 'comcast.net', 'yahoo', 'Yahoo', 'gmail', 'Gmail', 'GMAIL', 'hotmail', 'comcast', 'bellsouth.net', 'verizon.net', 'att.net', 'roadrunner.com', 'charter.net', 'mail.ru', '@live', 'icloud', '@aol', 'facebook', 'outlook', 'myspace', 'rocketmail']
      
      print("\n[*] This script will remove records that contain the following strings: \n\n", emailDomains)
      
      input("\n[!] Press Enter to start...\n")
      
      linecounter = 0
      
      with open(oldfile) as oFile, open(newfile, 'w') as nFile:
          for line in oFile:
              # Keep only the records that contain none of the listed strings
              if not any(domain in line for domain in emailDomains):
                  nFile.write(line)
                  linecounter = linecounter + 1
                  print('[*] - {%s} Writing verified record to %s ---{ %s' % (linecounter, newfile, line))
      
      print('[*] === COMPLETE === [*]')
      print('[*] %s was saved' % newfile)
      print('[*] There are %s records in your saved file.' % linecounter)
      

      Gist link: emailStripper.py

      Best, Az

      【Comments】:

        【Solution 5】:

        The else only attaches to the last if. You want elif:

        if 'bad' in line:
            pass
        elif 'naughty' in line:
            pass
        else:
            newopen.write(line)
        

        Also note that I removed the line.replace calls, since you don't write those lines anyway.
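To see why the if/if/else structure from the question lets 'bad' lines through, here is a small sketch that runs both branch structures over the sample text from the question. It collects the output in a list instead of writing a file, and the function names original_logic and elif_logic are invented for this illustration.

```python
def original_logic(lines):
    out = []
    for line in lines:
        if 'bad' in line:
            line = line.replace('.', '')
        if 'naughty' in line:
            line = line.replace('.', '')
        else:
            # Runs for every line without 'naughty', including 'bad' ones
            out.append(line)
    return out


def elif_logic(lines):
    out = []
    for line in lines:
        if 'bad' in line:
            pass
        elif 'naughty' in line:
            pass
        else:
            # Only reached when neither word is in the line
            out.append(line)
    return out


sample = ['good baby\n', 'bad boy\n', 'good boy\n', 'normal boy\n']
print(original_logic(sample))
print(elif_logic(sample))
```

In the first version the else belongs to the 'naughty' test alone, so 'bad boy' is still written; in the second version the three branches form one chain.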

        【Comments】:

          【Solution 6】:

          Using the python-textops package:

          from textops import *
          
          'oldfile.txt' | cat() | grepv('bad') | tofile('newfile.txt')
          

          【Comments】:

            【Solution 7】:

            Try this; it works well.

            import re
            
            text = "this is bad!"
            text = re.sub(r"(.*?)bad(.*?)$|\n", "", text)
            text = re.sub(r"(.*?)naughty(.*?)$|\n", "", text)
            print(text)
            
            

            【Comments】:

              【Solution 8】:

              Regex is a little faster than the accepted answer (on my 23 MB test file), but not by much.

              import re
              
              bad_words = ['bad', 'naughty']
              
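              # (?:...) is a non-capturing group; re.escape guards against special characters
              regex = f"^.*(?:{'|'.join(map(re.escape, bad_words))}).*\n?"
              subst = ""
              
              with open('oldfile.txt') as oldfile:
                  lines = oldfile.read()
              
              # Pass flags by keyword: the fourth positional argument of re.sub is count
              result = re.sub(regex, subst, lines, flags=re.MULTILINE)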
              
              with open('newfile.txt', 'w') as newfile:
                  newfile.write(result)
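One detail worth checking when adapting this approach: in re.sub the fourth positional parameter is count, not flags, so re.MULTILINE has to be passed by keyword. The following is a minimal sketch on an in-memory string built from the question's sample text; the variable names pattern, wrong, and right are illustrative only.

```python
import re

bad_words = ['bad', 'naughty']
pattern = f"^.*(?:{'|'.join(map(re.escape, bad_words))}).*\n?"
text = "good baby\nbad boy\ngood boy\nnaughty boy\nnormal boy\n"

# re.sub(pattern, "", text, re.MULTILINE) puts the flag into count,
# which amounts to this call: count=8 and no flags at all,
# so ^ only matches at the very start of the string
wrong = re.sub(pattern, "", text, count=int(re.MULTILINE))

# Keyword form: ^ now matches at the start of every line
right = re.sub(pattern, "", text, flags=re.MULTILINE)

print(repr(wrong))
print(repr(right))
```

The trailing \n? in the pattern is an assumption of this sketch; it lets the last line be removed even when the file does not end with a newline.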
              
              

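              【Comments】:

              • @JamesGeddes, thanks. I had never thought of that as an issue before. I have read the link and it makes sense. In this case I did post my code as text, but I suppose it would be better to show the output as text as well.
              【Solution 9】: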
              to_skip = ("bad", "naughty")
              out_handle = open("testout", "w")
              
              with open("testin", "r") as handle:
                  for line in handle:
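                      # split() with no argument splits on any whitespace, so the trailing newline is not glued to the last word
                      if set(line.split()).intersection(to_skip):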
                          continue
                      out_handle.write(line)
              out_handle.close()
              

              【Comments】:

              • This will not work if the input file contains something like this is bad!.
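One way to cope with punctuation such as this is bad! is to match whole words with a regular expression instead of splitting on spaces. This is only a sketch, and it assumes whole-word matching is what is wanted (so badly would not count as bad); the names skip_re and filter_lines are made up for illustration.

```python
import re

to_skip = ("bad", "naughty")
# \b marks a word boundary, so "bad!" matches but "badly" does not
skip_re = re.compile(r"\b(?:" + "|".join(map(re.escape, to_skip)) + r")\b")


def filter_lines(lines):
    # Keep only the lines in which none of the words occurs as a whole word
    return [line for line in lines if not skip_re.search(line)]


sample = ["good baby\n", "this is bad!\n", "badly drawn boy\n", "so naughty\n"]
print(filter_lines(sample))
```

If substring matching is preferred instead, the plain `word in line` test from the other answers already covers the punctuation case.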
              【Solution 10】:
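              bad_words = ['doc:', 'strickland:']
              
              with open('linetest.txt') as oldfile, open('linetestnew.txt', 'w') as newfile:
                  for line in oldfile:
                      # Drop blank lines explicitly; putting '\n' in bad_words would match every line
                      if line.strip() and not any(bad_word in line for bad_word in bad_words):
                          newfile.write(line)
              

              \n is the escape sequence for the newline character. Lines read from a file normally end with it, so it cannot serve as a "bad word" for detecting blank lines.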

              【Comments】:
