Python, loop through files in a folder and do a word count

【Question title】: Python, loop through files in a folder and do a word count
【Posted】: 2012-01-31 15:25:05
【Question】:

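I'm new to Python and need to write a script that counts all the words in all the txt files in a directory. This is what I have so far. It works when I open a single txt file, but it fails when I point it at a directory. I know I need to append somewhere, and I've tried a few different approaches without any luck.

*edit: I'd like the results combined. So far it gives 2 separate results. I tried making a new list and appending the counter to it, but it broke. Thanks again, this is a great community.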

import re
import os
import sys
import os.path
import fnmatch
import collections

def search( file ):

    if os.path.isdir(path) == True:
        for root, dirs, files in os.walk(path):
            for file in files:
                words = re.findall('\w+', open(file).read().lower())
                ignore = ['the','a','if','in','it','of','or','on','and','to']
                counter=collections.Counter(x for x in words if x not in ignore)
                print(counter.most_common(10))

    else:
        words = re.findall('\w+', open(path).read().lower())
        ignore = ['the','a','if','in','it','of','or','on','and','to']
        counter=collections.Counter(x for x in words if x not in ignore)
        print(counter.most_common(10))

path = input("Enter file and path, place ' before and after the file path: ")
search(path)

raw_input("Press enter to close: ")

【Comments on the question】:

What do you mean by "fails"? Apart from that, I don't see a .txt restriction anywhere.
if os.path.isdir(path) == True can be shortened to if os.path.isdir(path)

Tags: python

【Solution 1】:

Change line 14 (the words = re.findall(...) line inside the os.walk loop) to:

words = re.findall('\w+', open(os.path.join(root, file)).read().lower())

Also, if you replace the input line with

path = raw_input("Enter file and path")

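then you won't need to put ' before and after the path (in Python 2, input evaluates what is typed as a Python expression, whereas raw_input returns it as a plain string).

【Comments】:

Thanks so much, I knew it was something small. I've looked at this. Should I add another list, append counter=collections.Counter(x for x in words if x not in ignore) to the new list, and then print it?
That depends on what you want to do. Do you just want to print how often each word appears in each file? Or do you want to find the most common words across all files?
atm it prints the 10 most common words for each file, separately. I'd like it to give me the 10 most common words across all files. ty
Then don't append to a list, just add the counters. That is, create an empty counter at the start: total_counter = collections.Counter(), then for each file do total_counter += counter. (Counters are designed so that they can be added together.)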
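
The suggestion in the last comment can be sketched roughly as follows. This is a minimal Python 3 illustration, not the asker's final script: the helper name top_words and the sample strings are made up for the example, and it works on in-memory strings rather than files to keep it short.

```python
import collections
import re

IGNORE = {'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'}


def top_words(texts, n=10):
    """Return the n most common words across all texts (hypothetical helper)."""
    total_counter = collections.Counter()  # starts out empty
    for text in texts:
        words = re.findall(r'\w+', text.lower())
        counter = collections.Counter(w for w in words if w not in IGNORE)
        total_counter += counter  # adding Counters sums the counts per word
    return total_counter.most_common(n)


if __name__ == '__main__':
    print(top_words(["the cat sat on the mat", "a cat and a dog"], n=1))
```

In the real script each text would be the contents of one file, and the single total_counter replaces the per-file print.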

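【Solution 2】:

When iterating over the results of os.walk, file contains only the file name, not the directory that contains it. You need to join the directory name with the file name: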

for root, dirs, files in os.walk(path):
    for name in files:
        file_path = os.path.join(root, name)
        #do processing on file_path here

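I'd suggest moving the code that processes a file into its own function. That way you don't need to write it twice, and problems are easier to debug.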
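
One possible shape for that refactoring, as a Python 3 sketch. The function name count_words and the .txt filter are assumptions added for illustration (the question asks about txt files, but its code does not filter), and the functions return a Counter instead of printing.

```python
import collections
import os
import re

IGNORE = {'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'}


def count_words(file_path):
    """Count the non-ignored words in a single file (hypothetical helper)."""
    with open(file_path) as handle:
        words = re.findall(r'\w+', handle.read().lower())
    return collections.Counter(w for w in words if w not in IGNORE)


def search(path):
    """Count words in one file, or in every .txt file below a directory."""
    if not os.path.isdir(path):
        return count_words(path)
    total = collections.Counter()
    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith('.txt'):  # assumption: only .txt files count
                total += count_words(os.path.join(root, name))
    return total
```

With this split, search(path).most_common(10) gives the combined top ten, and count_words can be tested on a single file by itself.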

【Comments】:

【Solution 3】:

It looks like the function definition has the wrong parameter. It should be:

def search(path):

ignore is correct, but it can be sped up by using a set instead of a list:

ignore = set(['the','a','if','in','it','of','or','on','and','to'])

Otherwise, this is nice-looking code :-)
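
A small Python 3 sketch of the set suggestion, using a made-up sample sentence. Membership tests on a set are hash lookups, while a list is scanned element by element; with only ten ignored words the gain is likely small, but the counts are the same either way.

```python
import collections

ignore_list = ['the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to']
ignore_set = set(ignore_list)

# Made-up sample input for illustration.
words = 'the quick fox and the lazy dog'.split()

# "w not in ignore_list" scans the list; "w not in ignore_set" hashes w.
by_list = collections.Counter(w for w in words if w not in ignore_list)
by_set = collections.Counter(w for w in words if w not in ignore_set)
```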

【Comments】:

【Solution 4】:

Change it to:

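for file in files:
    # use root (the directory os.walk is currently in), not path,
    # so that files in subdirectories resolve correctly
    fullPath = "%s/%s" % (root, file)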

【Comments】:

【Solution 5】:

This is because the "files" list contains only file names, not full paths. You have to use:

import os.path

...

and replace "open(file)" with "open(os.path.join(root,file))".

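【Comments】:

【Solution 6】:

I'd suggest looking at Generator Tricks for Systems Programmers by David M. Beazley. It shows how to build small generator loops that do everything you are doing here. Basically, use the gengrep example, but replace the grep with a word count: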

# gencount.py
#
# Count the words in  a sequence of lines

import re, collections
def gen_count(lines):
    patc = re.compile('\w+')
    ignore = ['the','a','if','in','it','of','or','on','and','to']
    for line in lines:
        words = patc.findall(line)
        counter=collections.Counter(x for x in words if x not in ignore)
        for count in counter.most_common(10):
            yield count

# Example use

if __name__ == '__main__':
    from genfind import  gen_find
    from genopen import  gen_open
    from gencat  import  gen_cat
    path = raw_input("Enter file and path, place ' before and after the file path: ")

    findnames = gen_find("*.txt",path)
    openfiles = gen_open(findnames)
    alllines = gen_cat(openfiles)

    currcount = gen_count(alllines)
    for c in currcount:
        print c
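
The example above depends on the genfind, genopen and gencat modules from that material. For readers without them, here is a self-contained Python 3 sketch of the same pipeline idea. The helpers find_files, read_lines and count_words are simplified stand-ins written for this illustration, not Beazley's originals, and unlike the code above this version lowercases the words and keeps one running total instead of yielding a top ten per line.

```python
import collections
import fnmatch
import os
import re

IGNORE = frozenset(['the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'])
WORD = re.compile(r'\w+')


def find_files(pattern, top):
    """Yield paths below top whose file name matches the glob pattern."""
    for root, dirs, files in os.walk(top):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)


def read_lines(paths):
    """Yield every line of every file, opening one file at a time."""
    for path in paths:
        with open(path) as handle:
            for line in handle:
                yield line


def count_words(lines):
    """Consume the lines and return a single Counter covering all of them."""
    counter = collections.Counter()
    for line in lines:
        counter.update(w for w in WORD.findall(line.lower()) if w not in IGNORE)
    return counter
```

Chaining the stages as count_words(read_lines(find_files('*.txt', path))).most_common(10) gives the ten most common words across all matching files, while only one line is held in memory at a time.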

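【Comments】:

【Solution 7】:

You should have two functions: one that goes through a file and counts the words, and another that goes through the files in a directory and calls itself recursively when it finds a directory. The per-file function should take the full path of the file and open the file itself.
Reading an entire file at once may make you run out of memory. A line-by-line approach is better. Better still would be a generator function that reads the file, say, 4K at a time and yields individual words, but that is probably overkill for this task.
Look at os.path.walk() (Python 2 only; it was removed in Python 3, where os.walk() is the replacement).
If you're using Python 2, use raw_input. People will ignore the "quote the path" hint.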
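
The chunked generator mentioned in the second point could look roughly like the Python 3 sketch below. The name iter_words and the handling of words that straddle a chunk boundary are assumptions made for this illustration; as the answer says, reading line by line is probably enough for this task.

```python
import re

WORD = re.compile(r'\w+')
TRAILING_WORD = re.compile(r'\w+\Z')


def iter_words(handle, chunk_size=4096):
    """Yield lowercased words from a text file object, chunk_size characters at a time."""
    tail = ''  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        text = tail + chunk
        # A run of word characters at the very end may continue in the
        # next chunk, so hold it back instead of yielding it now.
        match = TRAILING_WORD.search(text)
        if match:
            tail = match.group()
            text = text[:match.start()]
        else:
            tail = ''
        for word in WORD.findall(text):
            yield word.lower()
    if tail:
        yield tail.lower()  # last word of the file
```

Because the generator yields plain words, it can be fed straight into a counter, for example collections.Counter(iter_words(open_file)).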

【Comments】: