Find all text files not containing some text string (answers)

【Question title】: Find all text files not containing some text string
【Posted】: 2013-12-13 06:58:00
【Question】:

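I'm using Python 2.7.1 and I'm trying to identify all text files that don't contain some text string.

The program seemed to work at first, but whenever I add the text string to a file, the file keeps being reported as if it did not contain the string (a false positive). When I check the contents of the text file, the string is clearly there.

The code I tried to write is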

def scanFiles2(rdir,sstring,extens,start = '',cSens = False): 
    fList = []
    for fol,fols,fils in os.walk(rdir): 
        fList.extend([os.path.join(rdir,fol,fil) for fil in fils if fil.endswith(extens) and fil.startswith(start)]) 
    if fList: 
        for fil in fList: 
            rFil = open(fil) 
            for line in rFil: 
                if not cSens: 
                    line,sstring = line.lower(), sstring.lower() 
                if sstring in line:
                    fList.remove(fil) 
                    break
            rFil.close() 
    if fList:
        plur = 'files do' if len(fList) > 1 else 'file does'
        print '\nThe following %d %s not contain "%s":\n'%(len(fList),plur,sstring) 
        for fil in fList: 
            print fil 
    else: 
        print 'No files were found that don\'t contain %(sstring)s.'%locals() 
scanFiles2(rdir = r'C:\temp',sstring = '!!syn',extens = '.html', start = '#', cSens = False)

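I guess there is a flaw in the code, but I really don't see it.

UPDATE

The code still comes up with many false positives: files that do contain the search string but are identified as not containing it.

Could text encoding be an issue here? I prefixed the search string with U to account for Unicode encoding, but it didn't make any difference.

Does Python somehow cache file contents? I don't think so, but that would account for files still popping up after they have been corrected.

Could some kind of malware cause symptoms like these? It seems highly unlikely to me, but I'm kind of desperate to get this fixed.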

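【Comments】:

I have tried it and it worked for me (I only changed "extens" and "rdir" to match my current environment)
@le_vine: that's nice, but for me it still includes some files that do contain the search string. I should add that the search string was added to them recently. Any idea what could be going on? It's as if Python were getting the file contents from a cache instead of from disk or something...
The naming conventions used in the code are not the best. There are too many fil, fLi in the code. Try reading the code aloud. Try using the names from the documentation for the corresponding functions, e.g. dirpath, dirnames, filenames instead of fol, fols, fils

Tags: python list python-2.7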

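【Solution 1】:

Removing elements from a list while iterating over it causes unexpected results, because the loop skips the element that follows each removal:

For example: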

>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst:
...     if n % 2 == 0:
...         lst.remove(n)
...
>>> lst
[1, 4, 3, 0, 5]

Workaround: iterate over a copy of the list

>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst[:]:
...     if n % 2 == 0:
...         lst.remove(n)
...
>>> lst
[1, 3, 5]

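Alternatively, you can append the valid file paths to a new list instead of removing entries from the full file list.

Modified version (appends the files that do not contain sstring instead of removing the ones that do):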

import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    if not cSens:
        # Lowercase the search string once, up front.
        sstring = sstring.lower()
    fList = []
    for fol, fols, fils in os.walk(rdir): 
        for fil in fils: 
            if not (fil.startswith(start) and fil.endswith(extens)):
                continue
            fil = os.path.join(fol, fil)
            with open(fil) as rFil:
                for line in rFil: 
                    if not cSens: 
                        line = line.lower()
                    if sstring in line:
                        break
                else:
                    fList.append(fil)
    ...

list.remove takes O(n) time, while list.append takes O(1). See Time Complexity.
Use the with statement whenever possible.

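【Comments】:

@Ray, glob.glob is not recursive. It is possible to use glob.glob, but the file list is already obtained from os.walk, so it seems unnecessary. Did you mean fnmatch.fnmatch?
@falsetru: of course! You shouldn't modify a list while iterating over it! The other suggestions are very helpful too. I'll follow all of them. Does the with statement make rFil.close() obsolete? Thx a million, you have no idea how badly I needed this to work!
@RubenGeert, the with statement takes care of closing the file object. See PEP 343: The 'with' statement.
I would use yield fil instead of fList.append, and extract the code that decides whether a file should be kept into a separate function. The current scanFiles2 would then do one simple job: take start, extens, match and yield the corresponding paths. Or match() could be moved outside. Why do you use path.join(rdir, fol, fil) instead of path.join(fol, fil)?
@J.F.Sebastian, thank you for the comment. I updated the code to use path.join(fol, fil).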
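The generator suggestion in the comments can be sketched roughly as follows. This is only an illustration, not code from the thread: the names file_contains and files_without are made up here, and the snippet assumes the files can be read with the platform's default text encoding.

```python
import os


def file_contains(path, needle, case_sensitive=False):
    """Return True if any line of the file at `path` contains `needle`."""
    if not case_sensitive:
        needle = needle.lower()
    with open(path) as handle:
        for line in handle:
            if not case_sensitive:
                line = line.lower()
            if needle in line:
                return True
    return False


def files_without(rdir, sstring, extens, start='', cSens=False):
    """Yield paths under `rdir` whose name matches and whose text lacks `sstring`."""
    for dirpath, dirnames, filenames in os.walk(rdir):
        for name in filenames:
            if name.startswith(start) and name.endswith(extens):
                # dirpath already starts with rdir, so rdir is not joined again.
                path = os.path.join(dirpath, name)
                if not file_contains(path, sstring, cSens):
                    yield path
```

Because nothing is ever removed from a list that is being iterated, the skipped-element problem cannot occur. Wrap the call in list(...) if a concrete list is needed, or loop over it directly to print each path as it is found.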

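【Solution 2】:

Falsetru has already shown you why you should not remove items from a list while looping over it; the list iterator does not and cannot update its counter when the list is shortened, so if item 3 was processed and you then removed it, the item handled in the next iteration (position 4) is the one that was previously at index 5, and the item that was at index 4 is skipped.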
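The skipping can be made visible with a small experiment. This is an illustrative sketch only; the list contents and the variable names items and visited are arbitrary.

```python
items = ['a', 'b', 'c', 'd', 'e']
visited = []

for item in items:
    visited.append(item)
    if item == 'b':
        # Everything after 'b' shifts one slot to the left.
        items.remove(item)

# 'c' moved into the slot the iterator had already passed, so it is never visited.
print(visited)  # ['a', 'b', 'd', 'e']
print(items)    # ['a', 'c', 'd', 'e']
```

In the question's code the same thing happens to file paths: the file that follows each removed file is never opened, so it stays in the list even if it contains the search string.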

A list comprehension version, using fnmatch.filter() and any(), plus a filter lambda for case-insensitive matching:

import fnmatch
import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    # Substring test per line; both sides are lowercased when case-insensitive.
    lfilter = (lambda l: sstring in l) if cSens else (lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    return [os.path.join(r, fname)
            for r, _, f in os.walk(rdir)
            for fname in fnmatch.filter(f, ffilter)
            if not any(lfilter(l) for l in open(os.path.join(r, fname)))]

But perhaps you are better off sticking with a more readable loop:

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    # Substring test per line; both sides are lowercased when case-insensitive.
    lfilter = (lambda l: sstring in l) if cSens else (lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    result = []
    for root, _, files in os.walk(rdir):
        for fname in fnmatch.filter(files, ffilter):
            fname = os.path.join(root, fname)
            with open(fname) as infh:
                if not any(lfilter(l) for l in infh):
                    result.append(fname)
    return result

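【Comments】:

【Solution 3】:

Another alternative, which opens the search up to regular expressions (although just using grep with the appropriate options would still be better):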

import mmap
import os
import re
import fnmatch

def scan_files(rootdir, search_string, extension, start='', case_sensitive=False):
    rx = re.compile(re.escape(search_string), flags=re.I if not case_sensitive else 0)
    name_filter = start + '*' + extension
    for root, dirs, files in os.walk(rootdir):
        for fname in fnmatch.filter(files, name_filter):
            with open(os.path.join(root, fname)) as fin:
                try:
                    mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
                except ValueError:
                    continue # empty files etc.... include this or not?
                if not next(rx.finditer(mm), None):
                    yield fin.name

Call list on the result if you want the names materialised, or treat it like any other generator...
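The function above targets Python 2, where the pattern and the mapped file are both byte strings. On Python 3 a str pattern cannot be applied to an mmap object, so a port might look roughly like the sketch below. The name scan_files_py3, the encoding parameter and its UTF-8 default are assumptions made for this illustration; note also that case-insensitive matching on a bytes pattern only folds ASCII letters.

```python
import fnmatch
import mmap
import os
import re


def scan_files_py3(rootdir, search_string, extension, start='',
                   case_sensitive=False, encoding='utf-8'):
    """Yield files under `rootdir` whose bytes do not contain `search_string`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Encode the search string so the pattern is bytes, like the mapped file.
    rx = re.compile(re.escape(search_string.encode(encoding)), flags)
    name_filter = start + '*' + extension
    for root, dirs, files in os.walk(rootdir):
        for fname in fnmatch.filter(files, name_filter):
            path = os.path.join(root, fname)
            with open(path, 'rb') as fin:
                try:
                    mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
                except ValueError:
                    # An empty file cannot be mapped; it is skipped here,
                    # as in the version above.
                    continue
                try:
                    found = rx.search(mm) is not None
                finally:
                    mm.close()
            if not found:
                yield path
```

As with the original, whether empty files should be reported is a design decision; this sketch leaves them out.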

【Comments】:

【Solution 4】:

Please don't write a Python program for this. This program already exists. Use grep:

grep * -Ilre 'main' 2> /dev/null
99client/.git/COMMIT_EDITMSG
99client/taxis-android/build/incremental/mergeResources/production/merger.xml
99client/taxis-android/build/incremental/mergeResources/production/inputs.data
99client/taxis-android/build/incremental/mergeResources/production/outputs.data
99client/taxis-android/build/incremental/mergeResources/release/merger.xml
99client/taxis-android/build/incremental/mergeResources/release/inputs.data
99client/taxis-android/build/incremental/mergeResources/release/outputs.data
99client/taxis-android/build/incremental/mergeResources/debug/merger.xml
99client/taxis-android/build/incremental/mergeResources/debug/inputs.data
(...)

http://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html#Introduction

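If you need the list in Python, just run grep from Python and collect the results. Note that -l lists the files that do contain a match; to list the files that do not contain the string, use -L (--files-without-match) instead.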
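Running grep from Python could look something like the following. This is a sketch under several assumptions: a grep that supports -r, -L, -F and -i is on the PATH, file names contain no newlines, and the helper name grep_files_without is invented for this example. Filtering by file name (for instance with GNU grep's --include option) is left out.

```python
import subprocess


def grep_files_without(rootdir, search_string, ignore_case=True):
    """Return files under `rootdir` that grep lists as not containing the string."""
    # -r: recurse, -L: list files without a match, -F: treat the pattern as a fixed string
    options = '-rLF'
    if ignore_case:
        options += 'i'
    cmd = ['grep', options, '-e', search_string, rootdir]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    # The exit status is deliberately not checked; only the printed names are used.
    return out.decode().splitlines()
```

Passing the command as a list avoids shell quoting issues with search strings such as '!!syn'.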

【Comments】: