Parsing XML files in subdirectories using BeautifulSoup in Python: answers

【Question title】: parse xml files in subdirectories using beautifulsoup in python
【Posted】: 2017-02-22 11:12:10
【Question】:

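I have more than 5000 XML files spread across several subdirectories named f1, f2, f3, f4, ..., and each folder contains more than 200 files. For now I want to read all of the files using BeautifulSoup only. I have already tried lxml, ElementTree and minidom, but I am struggling to get it done with BeautifulSoup.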

I can read a single file in a subdirectory, but I cannot get all of the files through BeautifulSoup.

I have looked at the following posts:

XML parsing in Python using BeautifulSoup (reads a single file)

Parsing all XML files in directory and all subdirectories (this one uses minidom)

Reading 1000s of XML documents with BeautifulSoup (I could not get the files with this post)

Here is the code I wrote to read a single file:

from bs4 import BeautifulSoup

file = BeautifulSoup(open('./Folder/SubFolder1/file1.XML'),'lxml-xml') 

print(file.prettify())

When I try to get all files in all folders, I use the following code:

from bs4 import BeautifulSoup

file = BeautifulSoup('//Folder/*/*.XML','lxml-xml') 

print(file.prettify())

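With that I only get the XML version declaration and nothing else. I know I have to use a for loop, but I am not sure how to use it to parse every file inside the loop.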

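I know it will be very, very slow, but for the sake of learning I want to parse all the files with BeautifulSoup. If a for loop is not recommended, I would appreciate a better solution, as long as it uses BeautifulSoup only.

Regards,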

Tags: python xml-parsing beautifulsoup

【Solution 1】:

If I understand you correctly, you do need to loop over the files, as you already suspected:

from bs4 import BeautifulSoup
from pathlib import Path

for filepath in Path('./Folder').glob('*/*.XML'):
    with filepath.open() as f:
        soup = BeautifulSoup(f,'lxml-xml')
    print(soup.prettify())

pathlib is just one way of handling paths, working with objects at a higher level. You can achieve the same result with glob and string paths.
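
To illustrate that last point, here is a minimal sketch comparing the two ways of collecting the file paths. It builds a throwaway directory tree (the folder names f1, f2 and the file names are made up for the demonstration) and checks that pathlib and glob find the same files. Only the file discovery is shown, not the parsing.

```python
import glob
import os
import tempfile
from pathlib import Path

# Build a small throwaway tree: <tmp>/Folder/f1/a.XML and <tmp>/Folder/f2/b.XML
# (hypothetical names, used only for this demonstration).
root = Path(tempfile.mkdtemp()) / "Folder"
for sub, name in [("f1", "a.XML"), ("f2", "b.XML")]:
    (root / sub).mkdir(parents=True)
    (root / sub / name).write_text("<root/>")

# pathlib: the pattern is relative to the Path object and yields Path objects.
with_pathlib = sorted(str(p) for p in root.glob("*/*.XML"))

# glob: the whole pattern is a single string and the results are plain strings.
with_glob = sorted(glob.glob(os.path.join(str(root), "*", "*.XML")))

# Both approaches should list the same two files.
print(with_pathlib == with_glob)
```

Either list can then be fed to the parsing loop; the choice is mostly a matter of whether you prefer Path objects or strings.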

【Solution 2】:

Use glob.glob to find the XML documents:

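import glob

from bs4 import BeautifulSoup

# The pattern mirrors the './Folder/SubFolder1/file1.XML' layout from the question.
for filename in glob.glob('./Folder/*/*.XML'):
    # Pass the open file, not the path string, so the file contents get parsed.
    with open(filename) as f:
        content = BeautifulSoup(f, 'lxml-xml')
    print(content.prettify())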

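Note: avoid naming a variable file. In Python 2 that shadows the built-in file type; Python 3 no longer has that built-in, but the name is still easy to confuse with a file object.

See the BeautifulSoup Quick Start.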
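
Combining the points from both answers, the loop can be wrapped in a small generator. This is only a sketch under stated assumptions: iter_soups is a hypothetical helper name (not part of BeautifulSoup), and opening the files in binary mode ('rb') is an assumption made here so that the parser can use the encoding declared inside each XML file instead of the platform default. The 'lxml-xml' parser requires the lxml package to be installed.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def iter_soups(folder, pattern="*/*.XML"):
    """Yield (path, soup) for every file under folder that matches pattern.

    iter_soups is a hypothetical helper, not part of BeautifulSoup.
    """
    for filepath in sorted(Path(folder).glob(pattern)):
        # Binary mode (an assumption): lets the parser read the encoding
        # declared in the XML rather than relying on the platform default.
        with filepath.open("rb") as f:
            soup = BeautifulSoup(f, "lxml-xml")
        yield filepath, soup


# Usage with the folder layout from the question; prints nothing if
# './Folder' does not exist or contains no matching files.
for filepath, soup in iter_soups("./Folder"):
    print(filepath)
    print(soup.prettify())
```

Because it is a generator, only one parsed document is held at a time, which may matter with several thousand files.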
