Parsing HTML nested within XML file (using BeautifulSoup): answers

[Question title]: Parsing HTML nested within XML file (using BeautifulSoup)
[Posted]: 2018-10-30 01:12:12
[Question]:

I'm trying to parse some data from an XML file that contains HTML in its description field.

For example, the data looks like this:

<xml>
    <description>
        <body>
           HTML I want
        </body>
    </description>
    <description>
        <body>
           - more data I want -
        </body>
    </description>
</xml>

So far, this is what I've come up with:

from bs4 import BeautifulSoup

soup = BeautifulSoup(myfile, 'html.parser')
descContent = soup.find_all('description')
for i in descContent:
    bodies = i.find_all('body')
    # This will return an object of type 'ResultSet'
    for n in bodies:
        print(n)
        # Nothing prints here.

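I'm not sure where I'm going wrong; when I enumerate the entries in descContent, it shows the content I'm looking for; the tricky part is getting into the nested <body> entries. Thanks for looking!

Edit: after further experimentation, it seems BeautifulSoup doesn't recognize the HTML inside the <description> tag. It's just text, hence the problem. I'm considering saving the result as an HTML file and re-parsing it, but I'm not sure that would work, since I'd be saving a literal string that includes all the carriage returns and newlines...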

Tags: python html xml parsing beautifulsoup

[Solution 1]:

Use the xml parser from lxml.
You can install the lxml parser with:
pip install lxml

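from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed
with open("file.html") as fp:
    soup = BeautifulSoup(fp, 'xml')

for description in soup.find_all('description'):
    for body in description.find_all('body'):
        # Strip the dashes, newlines, and leading spaces from the text
        print(body.text.replace('-', '').replace('\n', '').lstrip(' '))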

Or, to keep the text as-is, you can simply use

print(body.text)
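
If the HTML inside <description> turns out to be plain text, as the edit to the question suggests, one option is to re-parse that string in memory rather than saving it to a file first. The following is only a rough sketch under assumptions: the sample data is hypothetical and assumes the nested HTML is entity-escaped, the helper name extract_bodies is made up for illustration, and it uses the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the HTML inside <description> is entity-escaped,
# so the first parse sees it as text rather than as tags.
xml_data = """<xml>
    <description>&lt;body&gt;HTML I want&lt;/body&gt;</description>
    <description>&lt;body&gt;- more data I want -&lt;/body&gt;</description>
</xml>"""


def extract_bodies(markup):
    """Return the text of every <body> found inside the <description> strings."""
    outer = BeautifulSoup(markup, 'html.parser')
    results = []
    for description in outer.find_all('description'):
        # get_text() yields the unescaped string, e.g. "<body>HTML I want</body>"
        inner = BeautifulSoup(description.get_text(), 'html.parser')
        for body in inner.find_all('body'):
            results.append(body.get_text().strip())
    return results


print(extract_bodies(xml_data))
```

Newlines and carriage returns in the string are not a problem here, because the second parse works on the string directly and nothing is written to disk.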
