Strip HTML tags to get strings in Python

【Question title】: Strip HTML tags to get strings in Python
【Posted】: 2014-04-07 13:48:07
【Question】:

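I am trying to use BeautifulSoup to pull some strings out of an HTML file, but every attempt gives me only part of what I want.

I want the string inside each li element. So far I have managed to get everything inside the ul, like this: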

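#!/usr/bin/python
from bs4 import BeautifulSoup

# Name the parser explicitly and close the file once it has been parsed.
with open("page.html") as page:
    soup = BeautifulSoup(page, "html.parser")

# select() returns a list of Tag objects, not plain strings.
source = soup.select(".sidebar li")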

What I get looks like this:

[<li class="first">
        Def Leppard -  Make Love Like A Man<span>Live</span> </li>, <li>
        Inxs - Never Tear Us Apart        </li>, <li>
        Gary Moore - Over The Hills And Far Away        </li>, <li>
        Linkin Park -  Numb        </li>, <li>
        Vita De Vie -  Basul Si Cu Toba Mare        </li>, <li>
        Nazareth - Love Hurts        </li>, <li>
        U2 - I Still Haven't Found What I'm L        </li>, <li>
        Blink 182 -  All The Small Things        </li>, <li>
        Scorpions -  Wind Of Change        </li>, <li>
        Iggy Pop - The Passenger        </li>]

I only want the strings out of this.

【Comments】:

Was the problem solved? Did any of the answers help? If so, please pick one and accept it. Thanks.

Tags: python html html-parsing beautifulsoup strip

【Solution 1】:

Use BeautifulSoup's .stripped_strings generator (or .strings if you want the whitespace kept).

for string in soup.stripped_strings:
    print(repr(string))

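From the documentation:

If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

or

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead: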
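The documentation's own snippets are not reproduced here, so the following is only a minimal sketch of the difference between the two generators. The `html` fragment is hand-written to resemble the question's markup, and the `html.parser` choice is an assumption, not something the original post specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the question's sidebar markup.
html = """
<ul class="sidebar">
    <li class="first">
        Def Leppard -  Make Love Like A Man<span>Live</span> </li>
    <li>
        Inxs - Never Tear Us Apart        </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node as-is, including whitespace-only ones.
raw = list(soup.strings)

# .stripped_strings strips each text node and skips those left empty.
clean = list(soup.stripped_strings)

for s in clean:
    print(repr(s))
```

Note that the nested `<span>` text comes out as its own string, because each text node is yielded separately.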

【Comments】:

【Solution 2】:

Iterate over the results and read each element's text attribute:

for element in soup.select(".sidebar li"):
    print(element.text)

Example:

from bs4 import BeautifulSoup


data = """
<body>
    <ul>
        <li class="first">Def Leppard -  Make Love Like A Man<span>Live</span> </li>
        <li>Inxs - Never Tear Us Apart        </li>
    </ul>
</body>
"""

soup = BeautifulSoup(data, "html.parser")
for element in soup.select('li'):
    print(element.text)

Prints:

Def Leppard -  Make Love Like A ManLive 
Inxs - Never Tear Us Apart

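【Comments】:

This works well, but the first line also contains "Live", and I want to get rid of that.
@cbomb text can handle this and extracts the text from all nested tags; see the example I provided. Hope that helps.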
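If the text of the nested `<span>` is unwanted, one possible approach (a sketch, not necessarily what the answerer had in mind) is to keep only the text nodes that are direct children of each `li`. The helper name `own_text` is hypothetical, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup, NavigableString

data = """
<ul>
    <li class="first">Def Leppard -  Make Love Like A Man<span>Live</span> </li>
    <li>Inxs - Never Tear Us Apart        </li>
</ul>
"""


def own_text(tag):
    """Join only the text nodes directly under `tag`, ignoring nested tags."""
    # tag.children yields direct children only; nested Tag objects are skipped.
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()


soup = BeautifulSoup(data, "html.parser")
titles = [own_text(li) for li in soup.select("li")]
print(titles)
```

An alternative is to call `decompose()` on each nested `span` before reading `.text`; that removes the span from the tree, at the cost of mutating the parsed document.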

【Solution 3】:

This example from the documentation gives a very neat one-liner.

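# `source` must be the HTML markup itself (a string), not the list returned by select().
''.join(BeautifulSoup(source, "html.parser").find_all(string=True))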
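The one-liner above concatenates every text node in the whole document. As a rough per-item variant, `get_text()` with `strip=True` can be applied to each selected `li` instead. The `html` sample, the `html.parser` choice, and the single-space separator are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the question's sidebar markup.
html = """
<div class="sidebar"><ul>
    <li class="first">
        Def Leppard -  Make Love Like A Man<span>Live</span> </li>
    <li>
        Linkin Park -  Numb        </li>
</ul></div>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(separator, strip=True) strips each text node, drops the empty
# ones, and joins what is left with the separator.
songs = [li.get_text(" ", strip=True) for li in soup.select(".sidebar li")]
print(songs)
```

This gives one clean string per list item, with the nested `<span>` text separated from the title by a space rather than glued onto it.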

【Comments】: