Remove redundant BeautifulSoup HTML tags: answers

【Question title】: Remove redundant beautifulsoup html tags
【Posted】: 2021-03-29 23:05:50
【Question description】:

How can I remove "redundant" HTML tags from a BeautifulSoup object?

Take the following as an example:

<html>
 <body>
  <div>
   <div>
    <div>
     <div>
      <div>
       <div>
        Close
       </div>
      </div>
     </div>
    </div>
   </div>
   <div>
    <div>
     <div style="width:80px">
      <div>
      </div>
      <div>
       <button>
        Close
       </button>
      </div>
     </div>
    </div>
   </div>
  </div>
  <div>
  </div>
 </body>
</html>

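How can I remove the redundant <div> tags (redundant because they only add depth and carry no additional information or attributes) to arrive at the following structure: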

<html>
 <body>
       <div>
        Close
       </div>
     <div style="width:80px">
       <button>
        Close
       </button>
     </div>
 </body>
</html>

In graph-algorithm terms, I am trying to merge away the nodes of the BeautifulSoup tree that contain neither strings nor attributes.

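【Question comments】:

Just to make sure the answer isn't overfitted to the example HTML: you have one element with text (<button>) that has an ancestor with attributes, and another text element without such an ancestor. But could a text element have two nested ancestors with attributes? Could an element with attributes have no child with text?
Yes, ideally the snippet should be as generic as possible :)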

Tags: python html beautifulsoup tree xml-parsing

【Solution 1】:

You can use unwrap() to replace every div that has no attributes (i.e. div.attrs == {}) with its children:

for div in soup.find_all('div'):
    if not div.attrs:
        div.unwrap()

Output of print(soup.prettify()):

<html>
 <body>
  <button>
   Close
  </button>
  <div style="width:80px">
   <button>
    Close
   </button>
  </div>
 </body>
</html>

For the updated example (see the comments), it would be:

for div in soup.find_all('div'):
    if not div.attrs and div.div:
        div.unwrap()

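That is, unwrap a div if it has no attributes and contains another div (div.div is the first div nested inside it).

【Comments】:

Thanks a lot! I just realized my example case wasn't quite right and updated the question. Sorry for the confusion.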
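
To see how the two loops above differ, here is a minimal sketch that runs both on a compact, whitespace-free copy of the question's updated example. The `HTML` constant and the two function names are my own, chosen for illustration; only `find_all()`, `attrs` and `unwrap()` come from the answer.

```python
from bs4 import BeautifulSoup

# Compact copy of the question's updated example (hypothetical constant name).
HTML = (
    "<html><body>"
    "<div><div><div><div><div><div>Close</div></div></div></div></div>"
    '<div><div><div style="width:80px"><div></div>'
    "<div><button>Close</button></div></div></div></div></div>"
    "<div></div>"
    "</body></html>"
)


def unwrap_all_plain_divs(html):
    """First loop: unwrap every div that has no attributes."""
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div"):
        if not div.attrs:
            div.unwrap()
    return str(soup)


def unwrap_plain_wrapper_divs(html):
    """Second loop: unwrap attribute-less divs that contain another div."""
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div"):
        if not div.attrs and div.div:
            div.unwrap()
    return str(soup)


first = unwrap_all_plain_divs(HTML)
second = unwrap_plain_wrapper_divs(HTML)
print(first)
print(second)
```

On this input the first loop also strips the div around the bare "Close" text, while the second keeps it but leaves the empty divs and the div around the button in place, so neither reproduces the requested structure exactly.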

【Solution 2】:

I just put together a code snippet that seems to do the job:

for x in reversed(soup()):
    if not x.string and not x.attrs and len(x.findChildren(recursive=False)) <= 1:
        x.unwrap()

Iterating in reverse is required; otherwise empty tags would still be counted as siblings, which would prevent the unwrapping.

【Comments】:
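
Note that x.string also returns whitespace-only strings, so the check above can behave differently depending on how the document is indented. As an alternative (my own sketch, not the answerer's code), the rule stated in the question can be applied directly: unwrap every div that has neither attributes nor non-whitespace text of its own. The helper names `has_own_text` and `unwrap_redundant_divs`, and the shortened `HTML` sample, are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened version of the question's example (fewer nesting levels).
HTML = """\
<html>
 <body>
  <div>
   <div>
    <div>Close</div>
   </div>
   <div>
    <div style="width:80px">
     <div>
     </div>
     <div>
      <button>Close</button>
     </div>
    </div>
   </div>
  </div>
  <div>
  </div>
 </body>
</html>
"""


def has_own_text(tag):
    """True if the tag directly contains non-whitespace text."""
    return any(
        isinstance(child, NavigableString) and child.strip()
        for child in tag.contents
    )


def unwrap_redundant_divs(soup):
    """Unwrap every div that has neither attributes nor text of its own."""
    for div in soup.find_all("div"):
        if not div.attrs and not has_own_text(div):
            div.unwrap()
    return soup


soup = unwrap_redundant_divs(BeautifulSoup(HTML, "html.parser"))
remaining = [tag.name for tag in soup.find_all(True)]
print(remaining)
```

Only div elements are touched here, so html and body always survive. HTML comments are NavigableString subclasses and therefore count as text in this sketch; filter them out in has_own_text if that is not what you want.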