BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

【Question title】: BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are
【Posted】: 2011-02-26 18:32:54
【Question description】:

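I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are inner tags, but I don't care about them; I just want to get the inner text.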

For example, for:

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

how do I extract:

Red
Blue
Yellow
Light green

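.string and .contents[0] don't do what I need. Neither does .extract(), because I don't want to have to specify the inner tags in advance; I want to handle whatever may occur.

Is there a "just get the visible HTML" type of method in BeautifulSoup?

----UPDATE-----

On advice, I tried: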

soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags): 
    print str(i) + p_tag

But that doesn't help; it prints out:

0Red
1

2Blue
3

4Yellow
5

6Light 
7green
8

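【Question discussion】:

Tags: python beautifulsoup

【Solution 1】:

Short answer: soup.findAll(text=True)

This has already been answered here on StackOverflow and in the BeautifulSoup documentation.

Update:

To clarify, here is a working piece of code: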

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green

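【Discussion】:

Thanks! I had seen both of those but failed to extract the important part of the StackOverflow question - I find the BeautifulSoup documentation is only really useful when you already know what you're doing. Or maybe I just need more coffee.
print ''.join(soup.findAll(text=True))
I added a working code example to show how to use .findAll(text=True) to get what you want.
You can also use node.findAll(text=True)[0]
Consider this: '<a href="http://abc.xyz.com/">Business</a>' as the data passed to BeautifulSoup(). It doesn't work any more.

【Solution 2】:

The accepted answer is great, but it is 6 years old now, so here is the current Beautiful Soup 4 version of this answer: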

>>> txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Red
Blue
Yellow
Light green
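
If the text is wanted paragraph by paragraph rather than for the whole document at once, the same .strings generator can be joined per tag. The following is a minimal sketch, assuming Beautiful Soup 4 is installed; the helper name paragraph_texts is made up for illustration and is not part of the library.

```python
from bs4 import BeautifulSoup

HTML = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""


def paragraph_texts(markup):
    """Return the text of each <p>, ignoring any nested tags.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Tag.strings yields every text node below the tag, at any depth,
    # so joining them gives the full inner text of that one paragraph.
    return ["".join(p.strings) for p in soup.find_all("p")]


if __name__ == "__main__":
    for text in paragraph_texts(HTML):
        print(text)
```

Joining per tag keeps one result per paragraph, which mirrors the per-node loop in the accepted answer.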

【Discussion】:

【Solution 3】:

First, convert the HTML to a string using str. Then use the following code in your program:

import re
x = str(soup.find_all('p'))
content = str(re.sub("<.*?>", "", x))

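This is called a regex (regular expression). This one removes every HTML tag, that is, everything from a < up to the next >, including the angle brackets themselves, and leaves the text between the tags in place. Note that str() applied to the list returned by find_all() produces the list's printed form, so the result still contains the surrounding square brackets and the commas between elements.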
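
To see what the pattern does in isolation, here is a small sketch that uses only the standard library. The function name strip_tags is an assumption made for this example. The pattern is the one from the answer; it does not handle tags that span several lines or a > inside an attribute value, so it is best treated as a quick approximation and not as an HTML parser.

```python
import re

# Non-greedy match from "<" to the nearest ">" on the same line.
TAG_RE = re.compile(r"<.*?>")

HTML = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""


def strip_tags(markup):
    """Remove anything that looks like a tag; keep the text in between.

    Hypothetical helper for illustration only.
    """
    return TAG_RE.sub("", markup)


if __name__ == "__main__":
    print(strip_tags(HTML))
    # The list brackets and commas survive when the input is the
    # printed form of a list of tags.
    print(strip_tags("[<p>Red</p>, <p><i>Blue</i></p>]"))
```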

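【Discussion】:

【Solution 4】:

Data scraped from a website usually contains tags. To drop the tags and show only the text content, you can use the text attribute.

For example,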

    from BeautifulSoup import BeautifulSoup

    import urllib2 
    url = urllib2.urlopen("https://www.python.org")

    content = url.read()

    soup = BeautifulSoup(content)

    title = soup.findAll("title")

    paragraphs = soup.findAll("p")

    print paragraphs[1]  # Second paragraph with tags

    print paragraphs[1].text  # Second paragraph without tags

In this example, I collect all the paragraphs from the Python site and display one of them both with and without tags.
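
The difference between printing a tag and printing its text attribute can also be checked without a network connection. The sketch below assumes Python 3 and Beautiful Soup 4 (the answer itself targets Python 2 and BeautifulSoup 3) and reuses the markup from the question in place of a downloaded page.

```python
from bs4 import BeautifulSoup

HTML = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(HTML, "html.parser")
paragraphs = soup.find_all("p")

# str() on a tag gives its markup, nested tags included.
with_tags = str(paragraphs[3])

# .text gives only the text nodes, concatenated in document order.
without_tags = paragraphs[3].text

if __name__ == "__main__":
    print(with_tags)
    print(without_tags)
```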

【Discussion】:

【Solution 5】:

I stumbled upon the same problem and wanted to share a 2019 version of this solution. Maybe it helps somebody.

# importing the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# setting up your BeautifulSoup Object
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup( webpage.read(), features="lxml")
p_tags = soup.find_all('p')


for each in p_tags: 
    print (str(each.get_text()))

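Note that we loop over the list of tags one element at a time and call the get_text() method on each element to strip the tags, so that only the text is printed.

Also:

It is better to use the newer 'find_all()' in bs4 than the older findAll()
urllib2 was replaced by urllib.request and urllib.error, see here

For the example HTML from the question, the output should be: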

Red
Blue
Yellow
Light green
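
Real pages often put line breaks and indentation inside a paragraph, and get_text() returns that whitespace as it stands. A hedged sketch of one way to tidy it is shown below, using the separator and strip arguments of get_text(). It assumes Beautiful Soup 4 with the built-in "html.parser" in place of lxml, and the helper name clean_paragraphs is invented for this example.

```python
from bs4 import BeautifulSoup

HTML = """\
<p>Red</p>
<p>
  Light
  <b>green</b>
</p>
"""


def clean_paragraphs(markup):
    """Return the text of each <p> with surrounding whitespace removed.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # strip=True trims every text node and skips the empty ones;
    # the separator is then placed between the remaining pieces.
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]


if __name__ == "__main__":
    for text in clean_paragraphs(HTML):
        print(text)
```

Whether a space is the right separator depends on the page: with a space, markup such as `<b>green</b>s` would come out as "green s".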

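I hope this helps anyone looking for an updated solution.

【Discussion】:

Note that // is not how comments work in Python. Try to add working code :)
Yes, you are right. Thanks for pointing it out. I just changed it :)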