Beautiful Soup XML formatting in Python: answers

【Question title】: Beautiful Soup XML formatting in Python
【Posted】: 2017-08-24 19:35:41
【Question】:

I have an XML dataset whose tags are formatted like this:

<catchphrase "id=c0">unconscionable conduct</catchphrase>

I think that whoever built the dataset did not format the id attribute the way it should be:

<catchphrase id="c0">unconscionable conduct</catchphrase>

However, when I run it through the Beautiful Soup library in Python, I get the following:

 soup = BeautifulSoup(content, 'xml')

results in

 <catchphrase>
   "id=c0"&gt;application for leave to appeal
  </catchphrase>

or

soup = BeautifulSoup(content, 'lxml')

results in

<html>
   <body>
    ...
     <catchphrase>
         application for leave to appeal
     </catchphrase>
    ....

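I want the output to look like the second one, but without the html and body tags (this is an XML document). I don't need the id attribute. I also call soup.prettify('utf-8') before writing the result to a file, but I think the markup is already wrong by the time I do that.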

【Comments】:

Tags: python xml beautifulsoup

【Solution 1】:

There is no standard way to do this, but you can replace the faulty part with the correct form before parsing, like this:

from bs4 import BeautifulSoup

content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

# Move the misplaced quote so that "id=c0" becomes id="c0" before parsing.
soup = BeautifulSoup(content.replace('"id=', 'id="'), 'xml')
print(soup)

This results in:

<catchphrase id="c0">unconscionable conduct</catchphrase>

This is definitely a bit of a hack. There is no standard way to handle this, mainly because the XML should already be well-formed before BeautifulSoup parses it.
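Since the question says the id attribute is not needed, the same pre-processing idea can also drop the attribute instead of repairing it. Below is a minimal sketch using only the standard library. The names BROKEN_ATTR, fix_attribute and drop_attribute are made up for illustration, and the pattern assumes the malformed attribute comes right after the tag name and that its value contains no quotes, whitespace, or angle brackets. The cleaned string could then be handed to BeautifulSoup(cleaned, 'xml') as in the answer above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper pattern: matches a tag name followed by a
# misplaced-quote attribute such as  <catchphrase "id=c0"
# Assumption: the value has no quotes, whitespace, or angle brackets.
BROKEN_ATTR = re.compile(r'(<[\w:.-]+)\s+"([\w:.-]+)=([^"\s<>]*)"')


def fix_attribute(xml_text):
    """Rewrite  <tag "id=c0">  as  <tag id="c0">."""
    return BROKEN_ATTR.sub(r'\1 \2="\3"', xml_text)


def drop_attribute(xml_text):
    """Rewrite  <tag "id=c0">  as  <tag>, discarding the attribute."""
    return BROKEN_ATTR.sub(r'\1', xml_text)


content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

fixed = fix_attribute(content)
stripped = drop_attribute(content)
print(fixed)
print(stripped)

# Both strings are now well-formed, so a strict XML parser accepts them.
ET.fromstring(fixed)
ET.fromstring(stripped)
```

Anchoring the pattern on the opening tag name keeps it from touching quoted text inside element content, at the cost of only handling one malformed attribute per tag.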

【Comments】:

I can't believe I didn't think of that. Thanks.