Python：使用许多文件作为输入答案

【问题标题】：Python: Using many files as inputPython：使用许多文件作为输入
【发布时间】：2017-07-21 03:34:51
【问题描述】：

我有一个程序，它从网页中获取 URL，并将网页作为 .html 文件保存在我的桌面上的文件夹中。现在我需要使用这些相同的 .html 文件并将它们设置为下一个程序的输入。我的问题是如何将所有这些大约 400+ 的文件作为输入到将完成其余工作的函数中？我目前也在使用 python 2.7，但如果我需要使用它，我有最新的 python 可用。

【问题讨论】：

import os; os.listdir()?
使用glob.glob("/path/to/*.html")列出所有文件

标签： python python-2.7 file input indexing

【解决方案1】：

这应该可以解决您的问题 yourpath = '路径//到//文件'

import os
for root, dirs, files in os.walk(yourpath, topdown=False):#topdown traversing
    for name in files:
        print(os.path.join(root, name))
        stuff
    for name in dirs:
        print(os.path.join(root, name))
        stuff

【讨论】：

【解决方案2】：

可以使用glob.glob()返回所有匹配模式的文件路径名，然后遍历所有文件，一个一个地处理

html_files = glob.glob("/path/to/*html")

for html_file in html_files:
    with open(html_file) as inputs:
        for line in inputs:
            # do your work on the line

【讨论】：

【解决方案3】：

您的第二个函数可以采用如下文件名列表：

def process(files):
    for f in files:
        # do stuff

你可以通过做得到文件列表

import os
files = os.listdir('/path/to/files')

【讨论】：

如果我的文件是 html，我还能用这个吗？我将不得不打开并阅读它们，所以我还需要一个 urlopen(files) 吗？
Python 不会关心你的文件内容是什么。问题是那些 html 文件是否存储在您的机器上？如果是这样，urlopen 是不必要的，因为您可以使用 open 来阅读它们。
是的，我将文件存储在一个文件夹中，所以我正在做 file =open(path) 现在我当前的问题是我试图只获取 parapraphs 文本并对其进行标记。我使用 ntlk 进行标记，但首先我只需要段落标签中的正确文本。我正在尝试美丽的soup.find_all('p')