Python - Reuse a list of files as an input

【Question title】: Python - Reuse a list of files as an input
【Posted】: 2018-11-18 11:29:39
【Question】:

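I use os.walk to recursively find html files in a folder.
These html files contain strings. Once os.walk has built a list, I want to
extract those strings with BeautifulSoup. I tried the following code, but it does not work: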

import os 
from bs4 import BeautifulSoup
for root, dirs, files in os.walk ("mydir"):
    for file in files:
        if file.endswith (".html"):
           print(os.path.join(root, file))
soup = BeautifulSoup(os.path.join(root, file), "html.parser")
soup.find all('a')

How can I use the list of files as input to BeautifulSoup (and write the output to a txt file)?

【Comments】:

In your second call to os.path.join, you left out root.
I edited it, but nothing changed.

Tags: python list beautifulsoup extract

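【Solution 1】:

os.path.join returns a file path, not the file's contents, so BeautifulSoup ends up parsing the path string itself. You need to open() the file and pass in what you read from it.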

import os
from bs4 import BeautifulSoup

for root, dirs, files in os.walk("mydir"):
    for file in files:
        if file.endswith(".html"):
            currentFile = os.path.join(root, file)
            print(currentFile)
            # Open the file and pass its contents, not its path, to BeautifulSoup
            with open(currentFile, 'r') as html:
                soup = BeautifulSoup(html.read(), "html.parser")
                links = soup.find_all('a')
                for link in links:
                    # .get() returns None instead of raising KeyError when an <a> has no href
                    print(link.get('href'))
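
The question also asks how to reuse the list of files and write the output to a txt file, which the loop above does not cover. One possible sketch follows. The helper names find_html_files and write_links, the output name links.txt, and the UTF-8 encoding are all assumptions for illustration and are not from the original post:

```python
import os
from bs4 import BeautifulSoup


def find_html_files(top):
    """Return the paths of all .html files under `top`, found recursively."""
    paths = []
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.endswith(".html"):
                paths.append(os.path.join(root, name))
    return sorted(paths)


def write_links(paths, out_path):
    """Parse each file in `paths` and write every href found to `out_path`."""
    count = 0
    # Assumption: input and output files are UTF-8; adjust if yours differ
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, "r", encoding="utf-8") as html:
                soup = BeautifulSoup(html.read(), "html.parser")
            # href=True skips <a> tags that have no href attribute
            for link in soup.find_all("a", href=True):
                out.write(f"{path}\t{link['href']}\n")
                count += 1
    return count


if __name__ == "__main__":
    html_files = find_html_files("mydir")          # build the list once
    total = write_links(html_files, "links.txt")   # reuse it as input
    print(f"{total} links from {len(html_files)} files")
```

Building the list first keeps the directory walk separate from the parsing, so the same list can be passed to other functions later without walking the folder again.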

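【Comments】:

The code runs slowly and seems to read the contents, but no strings are extracted.
Edited: I replaced the space in soup.find all('a'). Try again.
No result.
Do you want soup.find_all('a') to save the results to a file or print them to the console? You should loop over the result as shown above, and check your html input.
It works. As a newbie, could you explain what the 'r' in open(currentFile, 'r') is?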
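
The last question goes unanswered in the thread. For reference, the second argument to open() is the file mode: 'r' opens the file for reading (and is the default), 'w' opens it for writing, and 'a' opens it for appending. A small sketch, using a temporary file and a made-up name demo.txt so that nothing in mydir is touched:

```python
import os
import tempfile

# Hypothetical scratch file, for illustration only
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# 'w' opens the file for writing, creating it or truncating it if it exists
with open(path, "w") as f:
    f.write("first line\n")

# 'a' opens for appending: new text goes after the existing content
with open(path, "a") as f:
    f.write("second line\n")

# 'r' opens for reading
with open(path, "r") as f:
    content = f.read()

# Omitting the mode is the same as passing 'r'
with open(path) as f:
    same_content = f.read()

print(content == same_content)
```

In the answer's code, open(currentFile, 'r') therefore just means "open this html file read-only", and the 'r' could be left out.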