Beautiful Soup: Extracting all the <br/> from the <strong> (Answers)

【Question title】: Beautiful Soup: Extracting all the <br/> from the <strong>
【Posted】: 2017-09-25 21:29:40
【Question】:

I have a really stupid and annoying problem: I am trying to convert HTML to Markdown, but my HTML is badly formatted. I keep getting things like this:

<strong>Ihre Aufgaben:<br/></strong>

or

<strong> <br/>Über die XXXX GmbH:<br/></strong>

This is perfectly valid HTML.

But the library I use to convert it to Markdown (HTML2Text) turns it into:

**Ihre Aufgaben:\n**

and

** \nÜber die XXXX GmbH:\n**

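This is an already reported issue, since the resulting Markdown is invalid and does not render correctly.

My approach to solving this is as follows:

Use BeautifulSoup to find all the strong elements that cause this problem.
Separate the <br/> tags into two groups: those before the text and those after it.
Unwrap the ones after the text so that they are pushed out.

My code (not well formatted yet):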

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(html)  # html holds the input markup
emphased = soup.find_all('strong')
for single in emphased:
    children = single.children
    before = 0  # <br/> tags seen before the text
    foundText = None
    after = 0  # <br/> tags seen after the text
    for child in children:
        if not isinstance(child, NavigableString):
            if foundText:
                after += 1
                child.unwrap()
            else:
                before += 1
                # DOES NOT WORK
                child.unwrap()
        else:
            foundText = single.get_text().strip()

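What is my current problem?

I want to unwrap the <br/> tags that come before the content and move them in front of the <strong> element, but I can't get it to work (and I haven't found how to do it in the documentation).

What do I want to achieve more generally?

I want to turn this: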

<strong> <br/>Über die XXXX GmbH: </strong>

into

# Note the space
(whitespace)<br/><strong>Über die XXXX GmbH:</strong>(whitespace)

It doesn't have to be Beautiful Soup; I just don't know of any other solution.

Thanks in advance!

【Question comments】:

Tags: python html parsing beautifulsoup dom-manipulation

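【Solution 1】:

Based on your examples, you can extract all the br tags from each strong, then replace the strong with new markup that puts a single br in front of it.

Here is a snippet: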

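from bs4 import BeautifulSoup

soup = BeautifulSoup("<strong>Ihre Aufgaben:<br/></strong>", "html.parser")
for strong in soup.find_all("strong"):
    # Remove every <br/> nested inside this <strong>
    for br in strong.find_all("br"):
        br.extract()
    # Collapse what is left to its text, stripped of surrounding whitespace
    strong.string = strong.get_text(strip=True)
    # Replace the <strong> with " <br/><strong>...</strong> " parsed as new markup
    strong.replace_with(BeautifulSoup(" %s%s " % ("<br/>", strong), "html.parser"))
print(soup)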

Which outputs:

 <br/><strong>Ihre Aufgaben:</strong> 
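The snippet above always emits exactly one <br/> in front of the strong, wherever the original tags sat. If the position and number of the tags matter, as in the "before" and "after" groups described in the question, something along these lines might work. It is only a sketch: hoist_breaks and is_padding are hypothetical names, not part of Beautiful Soup, and it assumes the only things to push out are <br/> tags and whitespace at the edges of each strong.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def is_padding(node):
    """True for <br/> tags and whitespace-only strings (assumed 'padding')."""
    if isinstance(node, Tag):
        return node.name == "br"
    return isinstance(node, NavigableString) and not node.strip()


def hoist_breaks(html):
    """Hypothetical helper: push edge <br/> and whitespace out of <strong>."""
    soup = BeautifulSoup(html, "html.parser")
    for strong in soup.find_all("strong"):
        children = list(strong.contents)  # copy, because the tree is mutated below

        # Count padding nodes at the start and find where trailing padding begins
        lead = 0
        while lead < len(children) and is_padding(children[lead]):
            lead += 1
        trail = len(children)
        while trail > lead and is_padding(children[trail - 1]):
            trail -= 1

        # Leading padding goes before the <strong>, in its original order
        for node in children[:lead]:
            strong.insert_before(node)
        # Trailing padding goes after it; reversed() keeps the original order
        for node in reversed(children[trail:]):
            strong.insert_after(node)

        if not strong.contents:
            strong.decompose()  # nothing but padding was inside
            continue

        # Move whitespace at the edges of the remaining text outside as well
        first = strong.contents[0]
        if isinstance(first, NavigableString) and first != first.lstrip():
            strong.insert_before(NavigableString(" "))
            first.replace_with(NavigableString(first.lstrip()))
        last = strong.contents[-1]
        if isinstance(last, NavigableString) and last != last.rstrip():
            strong.insert_after(NavigableString(" "))
            last.replace_with(NavigableString(last.rstrip()))
    return str(soup)


print(hoist_breaks("<strong> <br/>Über die XXXX GmbH: </strong>"))
print(hoist_breaks("<strong>Ihre Aufgaben:<br/></strong>"))
```

With the two examples from the question, the leading space and <br/> end up in front of the strong and the trailing ones behind it, so the converter no longer sees a line break next to the closing **.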

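【Comments】:

I think this solution doesn't account for the whitespace requirement, although adding it shouldn't be too hard. (Maybe even a simple regex?)
My mistake: after extracting the br tags, we can strip the text to get rid of those characters. Code updated.
Hey @Zroq, thanks for taking the time. I'll test it and get back to you :)
OK, you guys helped me a lot, I just solved my problem!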
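For reference, the "simple regex" idea raised in the comments could look roughly like the following. This is a sketch under strong assumptions: the strong tags carry no attributes, are never nested, and only whitespace and br tags need to be moved. STRONG_RE and hoist_breaks_regex are hypothetical names, and a regex is generally more fragile on real-world HTML than a parser.

```python
import re

# Assumed shape: <strong>, optional leading padding, body, optional trailing
# padding, </strong>. "Padding" means whitespace or <br>, <br/>, <br />.
STRONG_RE = re.compile(
    r"<strong>"
    r"(?P<lead>(?:\s|<br\s*/?>)*)"
    r"(?P<body>.*?)"
    r"(?P<trail>(?:\s|<br\s*/?>)*)"
    r"</strong>",
    re.DOTALL,
)


def hoist_breaks_regex(html):
    """Hypothetical helper: move edge padding outside each <strong>."""

    def repl(match):
        lead, body, trail = match.group("lead", "body", "trail")
        if not body:
            # The element held only padding, so drop the empty <strong>
            return lead + trail
        return "%s<strong>%s</strong>%s" % (lead, body, trail)

    return STRONG_RE.sub(repl, html)


print(hoist_breaks_regex("<strong> <br/>Über die XXXX GmbH: </strong>"))
print(hoist_breaks_regex("<strong>Ihre Aufgaben:<br/></strong>"))
```

The lazy body group stops at the earliest point where only padding remains before the closing tag, which is what keeps line breaks in the middle of the text inside the strong.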