Removing attributes from HTML tags [duplicate]: Answers

【Question title】: Removing attributes from HTML tags [duplicate]
【Posted】: 2023-04-02 12:48:01
【Question description】:

Possible duplicates:
php: how can I remove attributes from an html tag?
How do I iterate over the HTML attributes of a Beautiful Soup element?

I have some HTML that looks like this:

<div class="foo">
  <p id="first">Hello, world!</p>
  <p id="second">Stack Overflow</p>
</div>

It needs to come back like this:

<div>
  <p>Hello, world!</p>
  <p>Stack Overflow</p>
</div>

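I would prefer a Python solution, since I already use BeautifulSoup in the program that needs this. However, I am open to PHP if it offers a better solution. I don't think a sed regex would be enough, especially given what the text may contain in the future.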

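【Comments on the question】:

See also how-do-i-iterate-over-the-html-attributes-of-a-beautiful-soup-element, python-how-to-search-and-correct-html-tags-and-attributes, and python-extracting-html-tag-attributes-without-regular-expressions.
What have you tried so far? (Please don't try to use regular expressions, especially if you already know how to use an HTML parser such as Beautiful Soup.)
I tried using a regular expression, but it got very long and goes wrong somewhere.
I would still suggest XSLT for this! An identity template all the way. Similar to stackoverflow.com/questions/7119923/removing-styling-from-html/…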

Tags: php python html regex beautifulsoup

【Solution 1】:

This also works with sed, ]+> and then just replace with the first group, e.g.,
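The sed expression in this answer is incomplete, so what follows is only a rough sketch of the general idea (capture the tag name as the first group, discard everything else up to the closing `>`), written in Python rather than sed. The pattern `OPEN_TAG` and the helper `strip_attributes_regex` are assumptions made for illustration, not the answerer's actual expression.

```python
import re

# Assumed pattern: "<", a tag name captured as group 1, then anything up to ">".
# Closing tags such as </p> are left alone because "/" is not a letter.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_attributes_regex(html):
    """Replace every opening tag with just its name (the first group)."""
    return OPEN_TAG.sub(r"<\1>", html)


sample = '<div class="foo"><p id="first">Hello, world!</p></div>'
print(strip_attributes_regex(sample))
```

As the question and its comments anticipate, this kind of pattern is fragile: a `>` inside a quoted attribute value ends the match early and leaves stray text behind, which is one reason a real HTML parser is usually the safer choice.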

【Discussion】:

【Solution 2】:

This is easy to do in Python with lxml.

First install lxml, then try the following code:

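from lxml.html import tostring, fromstring

html = '''
<div class="foo">
  <p id="first">Hello, world!</p>
  <p id="second">Stack Overflow</p>
</div>'''

htmlElement = fromstring(html)
# '*' selects every element; cssselect() needs the separate cssselect package
for element in htmlElement.cssselect('*'):
    # keys() returns a list copy, so removing entries while looping is safe
    for key in element.keys():
        element.attrib.pop(key)

# encoding='unicode' returns str instead of bytes
result = tostring(htmlElement, encoding='unicode')

print(result)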
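If installing a third-party parser is not an option, roughly the same result can be had with the standard library's `html.parser`. The sketch below is not part of the original answer; the class `AttributeStripper` and the function `strip_attributes` are hypothetical names. It re-emits each piece of the document as it is parsed and simply leaves the attributes out.

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit HTML while dropping every attribute from every tag."""

    def __init__(self):
        # convert_charrefs=False passes entities such as &amp; through unchanged
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")  # attrs are deliberately ignored

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_attributes(html):
    """Return html with all tag attributes removed."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<div class="foo">\n  <p id="first">Hello, world!</p>\n</div>'
print(strip_attributes(sample))
```

Because the parser understands quoted attribute values, a `>` inside an attribute does not confuse it. Note that `html.parser` lowercases tag names and does not repair malformed markup the way lxml or BeautifulSoup do.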

【Discussion】: