Beautiful Soup Nested Tag Search: Answers

【Question title】: Beautiful Soup Nested Tag Search
【Posted】: 2018-03-12 15:58:54
【Question】:

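I'm trying to write a Python program that counts the words on a web page. I'm using Beautiful Soup 4 to scrape the page, but I can't access nested HTML tags (for example, a <p class="hello"> inside a <div>).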

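Every time I try to find such a tag with the page.findAll() method (page is the Beautiful Soup object containing the whole page), it finds no tags at all, even though they are there. Is there a simple way, or some other way, to do this?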

【Question comments】:

Please show some of the code you have tried, along with the page you are trying to scrape.

【Solution 1】:

You don't need to write a for loop. You can try this:

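from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'html.parser')
# find_all() returns a ResultSet (a list), which has no find_all() of its own,
# so the calls cannot be chained; a CSS selector expresses the nesting instead.
paragraphs = soup.select('div p.hello')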
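As a rough illustration of the one-call approach, here is a minimal sketch that uses select() with a descendant CSS selector. The sample markup and variable names are invented for the example and do not come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for illustration only.
html = """
<div>
  <p class="hello">first nested paragraph</p>
  <span><p class="hello">second, one level deeper</p></span>
</div>
<p class="hello">outside any div</p>
"""

soup = BeautifulSoup(html, "html.parser")

# "div p.hello" matches every <p class="hello"> that is a descendant of a <div>,
# no matter how many wrappers sit in between.
nested = soup.select("div p.hello")
texts = [p.get_text() for p in nested]
print(texts)
```

The last paragraph is skipped because it has no <div> ancestor.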

【Comments】:

【Solution 2】:

Try this:

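data = []
# 'xyz' is the outer tag name and 'abc' the inner one; substitute your own.
for nested_soup in soup.find_all('xyz'):
    data.extend(nested_soup.find_all('abc'))
# data now holds every 'abc' tag found inside an 'xyz' tag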

Maybe you could turn this into a lambda and make it look clever, but this works. Thanks.
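For what it's worth, the same loop can be written as a single list comprehension instead of a lambda. This is only a sketch: the placeholder names 'xyz' and 'abc' are swapped for div and p, and the markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three <p> tags sit inside <div>s, one does not.
html = "<div><p>a</p><p>b</p></div><div><p>c</p></div><p>d</p>"
soup = BeautifulSoup(html, "html.parser")

# For every outer <div>, collect every <p> found anywhere inside it.
data = [inner for outer in soup.find_all("div") for inner in outer.find_all("p")]
texts = [tag.get_text() for tag in data]
print(texts)
```

One caveat applies to the loop and the comprehension alike: if the outer tags are nested inside each other, an inner tag is collected once per enclosing outer tag, so duplicates may need to be removed.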

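【Comments】:

【Solution 3】:

Update: I noticed that .text does not always return the expected result, and I also realized there is a built-in way to get the text. Sure enough, the docs describe a method called get_text(). Use it like this: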

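from bs4 import BeautifulSoup

# Read the local HTML file; the with block closes it automatically.
with open('index.html', 'r') as fd:
    website = fd.read()

soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
# split() with no argument collapses runs of whitespace, so empty strings are not counted
print("number of words %d" % len(contents.split()))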

Original answer (not correct, see the update above). Assuming you have your HTML file saved locally as index.html, you can do:

from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

with open('index.html', 'r') as fd:
    website = fd.read()

soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find every tag
print("there are %d tags" % len(tags))

count = 0
# Non-capturing group, so split() does not return the separators themselves
matcher = re.compile(r"(?:\s|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # split on whitespace runs
    temp = [word for word in temp if word]  # remove empty elements in the list
    count += len(temp)
print("number of words in the document %d" % count)

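Note that the count may be inaccurate, for example because of malformed markup, false positives (it counts any word, even code), text that is shown dynamically through JavaScript or CSS, or other reasons.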
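One possible way to reduce the code-related false positives is to drop <script> and <style> elements before extracting the text. The sketch below is not part of the original answer; it uses an invented HTML string, and it still counts the page title.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for illustration only.
html = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><p>Hello brave new world</p><script>var x = 1;</script></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove elements whose contents are code rather than visible text.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

words = soup.get_text(separator=" ").split()
print(len(words), words)
```

Text injected at runtime by JavaScript is still invisible to this approach, since Beautiful Soup only sees the HTML it is given.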

【Comments】:

Thanks, but I only want to count part of the text: the text inside <p> tags of a particular class, not all the text on the page.

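【Solution 4】:

My guess is that you want to look inside a specific div tag first, then search for all the p tags within it and count them, or do whatever else you need. For example: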

import bs4

# content is the HTML of the page, as a string
soup = bs4.BeautifulSoup(content, 'html.parser')

# This will get the div
div_container = soup.find('div', class_='some_class')

# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)

Hope this helps.
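To tie this back to the original goal of counting words, here is a minimal sketch that counts only the words inside the matching p tags. The markup is invented for the example, and the class names some_class and hello are the placeholders used in the answer.

```python
import bs4

# Hypothetical markup, invented for illustration only.
content = """
<div class="some_class">
  <p class="hello">one two three</p>
  <p class="other">not counted</p>
  <p class="hello">four   five</p>
</div>
"""

soup = bs4.BeautifulSoup(content, "html.parser")
div_container = soup.find("div", class_="some_class")

word_count = 0
if div_container is not None:  # find() returns None when nothing matches
    for ptag in div_container.find_all("p", class_="hello"):
        # split() with no argument treats any run of whitespace as one separator
        word_count += len(ptag.get_text().split())

print(word_count)
```

The None check matters in practice: if the div is missing, calling find_all() on the result of find() raises an AttributeError.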

【Comments】:

What if there are other tags, such as <div> or <span>, between <div class="someclass"> and <p class="hello">?
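@Heinz In your case, if the <p> has a class, that is not a problem. But if you mean a <p> that is deeply nested in divs and spans and has no class or static ID you can use, then you can descend recursively until you reach the <p> you want. That is a very specific case.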
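As a side note, find_all() searches all descendants by default (its recursive argument defaults to True), so intermediate wrappers normally need no special handling. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one target <p> is wrapped in extra tags, one is a direct child.
html = """
<div class="someclass">
  <div><span><p class="hello">deeply nested</p></span></div>
  <p class="hello">direct child</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="someclass")

# Default behaviour: look through every descendant of the container.
all_levels = [p.get_text() for p in container.find_all("p", class_="hello")]

# recursive=False restricts the search to direct children only.
direct_only = [
    p.get_text() for p in container.find_all("p", class_="hello", recursive=False)
]

print(all_levels)
print(direct_only)
```

Passing recursive=False is the exception, useful when only the immediate children should match.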