Wrap the contents of a tag with BeautifulSoup

【Question title】: Wrap the contents of a tag with BeautifulSoup
【Posted】: 2014-05-03 03:56:58
【Question】:

I want to wrap the contents of a tag with BeautifulSoup. This:

<div class="footnotes">
    <p>Footnote 1</p>
    <p>Footnote 2</p>
</div>

should become this:

<div class="footnotes">
  <ol>
    <p>Footnote 1</p>
    <p>Footnote 2</p>
  </ol>
</div>

So I use the following code:

footnotes = soup.findAll("div", { "class" : "footnotes" })
footnotes_contents = ''
new_ol = soup.new_tag("ol") 
for content in footnotes[0].children:
    new_tag = soup.new_tag(content)
    new_ol.append(new_tag)

footnotes[0].clear()
footnotes[0].append(new_ol)

print footnotes[0]

But I get the following output:

<div class="footnotes"><ol><
    ></
    ><<p>Footnote 1</p>></<p>Footnote 1</p>><
    ></
    ><<p>Footnote 2</p>></<p>Footnote 2</p>><
></
></ol></div>

Any suggestions?

【Question comments】:

Can I try this with lxml for you?
Please. No problem.

Tags: python beautifulsoup lxml

【Solution 1】:

Using lxml:

import lxml.html as LH
import lxml.builder as builder
E = builder.E

doc = LH.parse('data')
footnote = doc.find('//div[@class="footnotes"]')
ol = E.ol()
for tag in footnote:
    ol.append(tag)
footnote.append(ol)
print(LH.tostring(doc.getroot()))

prints

<html><body><div class="footnotes">
    <ol><p>Footnote 1</p>
    <p>Footnote 2</p>
</ol></div></body></html>

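Note that with lxml, an element (tag) can be in only one place in the tree (since each element has only one parent), so appending tag to ol also removes it from footnote. So, unlike with BeautifulSoup, you do not need to iterate over the contents in reverse order, nor use insert(0, ...). You can simply append in order.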
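
The "one parent only" behavior described above can be checked with a small sketch. The inline markup string and the variable names children_left, new_parent and result are made up for illustration; list(div) takes a snapshot of the remaining children, which is a defensive choice rather than something the answer requires.

import lxml.html as LH
import lxml.builder as builder

# Made-up sample markup; a single-element fragment comes back as that element.
div = LH.fromstring('<div class="footnotes"><p>Footnote 1</p><p>Footnote 2</p></div>')
ol = builder.E.ol()

first = div[0]
ol.append(first)                    # moves the <p> into <ol>, it is not copied
children_left = len(div)            # only "Footnote 2" is still a child of the div
new_parent = first.getparent().tag  # the moved element now reports <ol> as its parent

for tag in list(div):               # snapshot of the remaining children
    ol.append(tag)
div.append(ol)

result = LH.tostring(div, encoding="unicode")
print(children_left, new_parent)
print(result)

Because append() relocates the element, there is no separate "remove from the old parent" step.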

Using BeautifulSoup:

import bs4 as bs
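with open('data', 'r') as f:
    # Name the parser explicitly; the <html><body> wrapper in the output below suggests lxml was used.
    soup = bs.BeautifulSoup(f, 'lxml')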

footnote = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnote.contents):
    new_ol.insert(0, content.extract())

footnote.append(new_ol)
print(soup)

prints

<html><body><div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div></body></html>

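【Comments】:

【Solution 2】:

Just move the tag's .contents over using tag.extract(); don't try to recreate them with soup.new_tag (which takes only a tag name, not a whole tag object). Don't call .clear() on the original tag either; .extract() has already removed the elements.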
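
As a rough illustration of the point about .extract(), here is a minimal sketch. It assumes the built-in html.parser and a made-up one-footnote document; the names moved, detached, emptied and result are only for this example.

from bs4 import BeautifulSoup

# Made-up sample markup, parsed with the standard-library parser.
soup = BeautifulSoup('<div class="footnotes"><p>Footnote 1</p></div>', "html.parser")
div = soup.find("div", {"class": "footnotes"})

p = div.p
moved = p.extract()              # detaches <p> from the div and returns the same object
detached = p.parent is None      # the extracted tag no longer has a parent
emptied = div.contents == []     # the div is already empty, so .clear() is unnecessary

new_ol = soup.new_tag("ol")      # new_tag() is given a tag *name*, not a Tag object
new_ol.append(moved)
div.append(new_ol)

result = str(div)
print(result)

The existing Tag object is reused as-is, so nothing has to be rebuilt with new_tag.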

Move the items over in reverse, because the contents are modified in place, which leads to skipped elements if you don't watch out.
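
To make the skipping concrete, here is a small sketch comparing a forward loop with the reverse loop. The three-paragraph markup and the helper texts() are made up for illustration, and html.parser is assumed.

from bs4 import BeautifulSoup

HTML = '<div><p>1</p><p>2</p><p>3</p></div>'  # made-up sample markup

def texts(tag):
    """Return the text of each direct child of tag."""
    return [child.get_text() for child in tag.contents]

# Forward iteration while extracting: the list shrinks underneath the loop.
soup = BeautifulSoup(HTML, "html.parser")
div = soup.div
forward_ol = soup.new_tag("ol")
for content in div.contents:
    forward_ol.append(content.extract())
forward_moved = texts(forward_ol)   # the middle paragraph is skipped
forward_left = texts(div)           # and stays behind in the div

# Reverse iteration with insert(0, ...): every child is visited, order is preserved.
soup = BeautifulSoup(HTML, "html.parser")
div = soup.div
reverse_ol = soup.new_tag("ol")
for content in reversed(div.contents):
    reverse_ol.insert(0, content.extract())
reverse_moved = texts(reverse_ol)
reverse_left = texts(div)

print(forward_moved, forward_left)
print(reverse_moved, reverse_left)

In the forward loop, each extract() shifts the remaining items one position to the left while the loop index keeps advancing, which is why every second child is missed.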

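Finally, use .find() when you only need to do this for one tag.

You do need to guard against the contents list changing while you loop over it (walk it in reverse as below, or loop over a copy such as list(footnotes.contents)), because it is modified in place: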

footnotes = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnotes.contents):
    new_ol.insert(0, content.extract())

footnotes.append(new_ol)
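
For comparison, here is a sketch of the copy-based variant, packaged as a small function. wrap_contents is a hypothetical helper name, not part of BeautifulSoup, and the sample markup and html.parser are assumptions for the example.

from bs4 import BeautifulSoup

def wrap_contents(soup, tag, wrapper_name):
    """Move all children of tag into a new wrapper element and return the wrapper.

    Hypothetical helper for illustration; BeautifulSoup does not ship it.
    """
    wrapper = soup.new_tag(wrapper_name)
    # list(...) copies the child list, so extracting during the loop is safe
    # and the children can be appended in their original order.
    for child in list(tag.contents):
        wrapper.append(child.extract())
    tag.append(wrapper)
    return wrapper

soup = BeautifulSoup(
    '<div class="footnotes"><p>Footnote 1</p><p>Footnote 2</p></div>',
    "html.parser",
)
footnotes = soup.find("div", {"class": "footnotes"})
wrap_contents(soup, footnotes, "ol")

result = str(footnotes)
print(result)

Looping over a copy trades a little extra memory for a plain append() in document order; the reverse loop above avoids the copy but needs insert(0, ...).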

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <div class="footnotes">
...     <p>Footnote 1</p>
...     <p>Footnote 2</p>
... </div>
... ''')
>>> footnotes = soup.find("div", { "class" : "footnotes" })
>>> new_ol = soup.new_tag("ol")
>>> for content in reversed(footnotes.contents):
...     new_ol.insert(0, content.extract())
... 
>>> footnotes.append(new_ol)
>>> print(footnotes)
<div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div>

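【Comments】:

@lorussian: Fixed; I had it working, but a small error crept in while I was copy-pasting the demo. I missed it because something else distracted me from the crucial last step: rereading the post.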