【Title】: Find all end-nodes that contain text using BeautifulSoup4
【Posted】: 2019-06-13 09:25:02
【Question】:

I am new to Python and BeautifulSoup4.

I'm trying to extract the text content of only those tags that are 'div', 'p', or 'li', and only from the node itself, not from its child nodes, so there are two conditions.

These are my attempts:

content = soup.find_all("b", "div", "p", text=True, recursive=False)

tags = ["div", "p", "li"]
content = soup.find_all(tags, text=True, recursive=False)

Neither of these gives me any output. Do you know what I'm doing wrong?

Edit: adding more code and the sample document I'm testing against; print(content) is empty

import requests
from bs4 import BeautifulSoup

url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(response.text, "html.parser")

tags = ["div", "p", "li"]
content = soup.find_all(tags, text=True, recursive=False)

print(content)

【Comments】:

  • What does your HTML document look like?
  • I plan to use this on many different HTML documents. I've done something similar with JSoup in Java... but this is a different way of thinking. Will add more code now.
  • You should set recursive=True (since there are no direct nodes), or use methods such as find_all_previous/find_all_next; also, lxml is preferable to html.parser.
  • Thanks William. recursive=False is intentional, but I take your point: with no matching direct nodes, recursive=False will find nothing.
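The point in the last two comments can be checked on a tiny invented document (the HTML string below is made up for illustration): the soup object's only direct child is normally the <html> tag, so a top-level search for div/p/li with recursive=False matches nothing.

```python
from bs4 import BeautifulSoup

# Invented snippet for illustration.
html = "<html><body><p>direct text</p><div><p>nested</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# recursive=False on the soup object only inspects the document's
# direct children -- normally just the <html> tag -- so this is empty.
print(soup.find_all(["div", "p", "li"], recursive=False))  # []

# The same filter with the default recursive=True does find the tags.
print([t.name for t in soup.find_all(["div", "p", "li"])])  # ['p', 'div', 'p']
```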

Tags: python python-3.x beautifulsoup


【Solution 1】:

From your question and the comments on the previous answer, I think you are trying to find tags that:

  • are the innermost tags

  • are 'p', 'li', or 'div'

  • contain some text

import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(response.text, "html.parser")

def end_node(tag):
    if tag.name not in ["div", "p", "li"]:
        return False
    if isinstance(tag, NavigableString):  # skip bare strings
        return False
    if not tag.text:  # skip tags with no text
        return False
    if len(tag.find_all(text=False)) > 0:  # skip tags that contain other tags
        return False
    return True  # an innermost div/p/li that holds text

content = soup.find_all(end_node)
print(content)  # all end nodes matching our criteria

Sample output

[<p>These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.</p>, <p>The examples in this documentation should work the same way in Python
2.7 and Python 3.2.</p>, <p>This documentation has been translated into other languages by
Beautiful Soup users:</p>, <p>Here are some simple ways to navigate that data structure:</p>, <p>One common task is extracting all the URLs found within a page’s &lt;a&gt; tags:</p>, <p>Another common task is extracting all the text from a page:</p>, <p>Does this look like what you need? If so, read on.</p>, <p>If you’re using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:</p>, <p>I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.</p>, <p>Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it’s automatically converted to Python 3 code. If
you don’t install the package, the code won’t be converted. There have
also been reports on Windows machines of the wrong version being
installed.</p>, <p>In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.</p>, <p>This table summarizes the advantages and disadvantages of each parser library:</p>, <li>Batteries included</li>, <li>Decent speed</li>, 
....
]
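To see which nodes this filter keeps and which it drops, here is a condensed version of end_node run against a small invented snippet, with no network call (the HTML string is made up for illustration, and this version additionally ignores whitespace-only text):

```python
from bs4 import BeautifulSoup

# Invented snippet for illustration.
html = """
<div>
  <p>plain paragraph</p>
  <p>has <b>nested</b> tag</p>
  <li>item</li>
  <div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def end_node(tag):
    """Condensed version of the filter above."""
    if tag.name not in ["div", "p", "li"]:
        return False
    if not tag.text.strip():      # drop empty / whitespace-only nodes
        return False
    if tag.find_all(text=False):  # drop nodes that still contain tags
        return False
    return True

print([t.get_text() for t in soup.find_all(end_node)])
# ['plain paragraph', 'item']
```

The outer <div>, the <p> with a nested <b>, and the empty <div> are all filtered out.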

【Comments】:

    【Solution 2】:

    You can loop over your tags and apply soup.find_all() to each one:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    
    soup = BeautifulSoup(response.text, features="lxml")
    
    tags = ["div", "p", "li"]
    
    for tag in tags:
        content = soup.find_all(tag, recursive=True)
    
        for x in content:
            print(x)
    

    This prints every <div>, <p>, and <li> tag on the HTML page.

    You can also set recursive=True to walk the document recursively and extract all nested child tags. If you don't want those nested children, keep recursive=False.
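The difference is easy to see on a single tag (the snippet below is invented for illustration): with the default recursive=True a search descends through every level, while recursive=False only inspects that tag's direct children. On the soup object itself, whose only direct child is usually <html>, this is why the question's recursive=False search came back empty.

```python
from bs4 import BeautifulSoup

html = "<div><p>child</p><section><p>grandchild</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default: descend through all levels.
print([p.string for p in div.find_all("p")])                   # ['child', 'grandchild']

# Direct children only.
print([p.string for p in div.find_all("p", recursive=False)])  # ['child']
```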

    You can also use lxml instead of html.parser; it is faster (you can see the difference in this answer). This can be beneficial if the HTML document is very large.

    【Comments】:

      【Solution 3】:

      Making some changes based on @Bitto Bennichan's answer: by checking the child nodes, this version also extracts tags whose children hold only NavigableStrings. For example, it extracts <p>test<b>Strong</b></p> instead of ignoring it.

      from bs4 import NavigableString  # needed for the isinstance check

      def child_contains_navstr(children_tags):
        func = lambda child: isinstance(child.string, NavigableString)
        return map(func, children_tags)

      def end_node_with_text(tag, incl_embed_navstr=True):
        if tag.name not in ["div", "p", "li"]:
          return False

        # check whether the children hold only NavigableStrings
        # wrapped in inline tags such as <em>, <b>, <strong>, ...
        nav_str_res = list(child_contains_navstr(tag.findChildren()))
        if nav_str_res and all(nav_str_res):
          return True if incl_embed_navstr else False

        # skip empty text nodes
        if not tag.text.replace('\n', ''):
          return False

        # skip tags that still wrap other tags
        if len(tag(text=False)) > 0:
          return False

        # innermost node with text
        return True
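A quick self-contained check of this behavior (the HTML snippet is invented, and the functions are restated compactly so the example runs on its own):

```python
from bs4 import BeautifulSoup, NavigableString

def child_contains_navstr(children_tags):
    return [isinstance(child.string, NavigableString) for child in children_tags]

def end_node_with_text(tag, incl_embed_navstr=True):
    if tag.name not in ["div", "p", "li"]:
        return False
    nav_str_res = child_contains_navstr(tag.findChildren())
    if nav_str_res and all(nav_str_res):          # only inline wrappers inside
        return incl_embed_navstr
    if not tag.text.replace('\n', ''):            # empty text node
        return False
    if len(tag(text=False)) > 0:                  # still wraps other tags
        return False
    return True

html = "<body><p>test<b>Strong</b></p><p>plain</p><p></p></body>"
soup = BeautifulSoup(html, "html.parser")

print([t.get_text() for t in soup.find_all(end_node_with_text)])
# ['testStrong', 'plain']

# Exclude tags that only wrap inline formatting:
print([t.get_text() for t in
       soup.find_all(lambda t: end_node_with_text(t, incl_embed_navstr=False))])
# ['plain']
```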
      

      【Comments】:
