BeautifulSoup find_all() captures too much text

【Question title】: BeautifulSoup find_all() captures too much text
【Posted】: 2020-07-09 09:47:22
【Question】:

I parsed some HTML in Python with the BeautifulSoup package. Here is the HTML:

<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>

I am capturing the results with this block of code:

names = soup3.find_all('div', {'class': "n"}) 
contact = soup3.find_all('div', {'class': "x"})  
other = soup3.find_all('div', {'class': "x c"})

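Right now, both the "x" and "x c" classes are captured in the "contact" variable. How can I prevent this?

【Comments】:

Class x appears three times, on Address, Phone, and Other. Why not just pick the elements by position?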

Tags: python html web-scraping beautifulsoup

【Solution 1】:

Try:

soup.select('div[class="x"]')

Output:

[<div class="x">Address</div>, <div class="x">Phone</div>]
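The reason this works is that `class` is a multi-valued attribute: `find_all` with a single class name matches every tag that has that name among its classes, while the attribute selector `[class="x"]` compares the whole attribute value. A minimal sketch of the difference, reusing the HTML from the question (the variable names `loose` and `exact` are only for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Single class name: matches any div that has "x" among its classes.
loose = soup.find_all("div", class_="x")

# Attribute selector: the whole class attribute must equal "x".
exact = soup.select('div[class="x"]')

loose_text = [tag.get_text() for tag in loose]
exact_text = [tag.get_text() for tag in exact]
print(loose_text)  # includes "Other"
print(exact_text)  # "Address" and "Phone" only
```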

【Discussion】:

【Solution 2】:

from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all("div", class_="x") still returns all three divs; index 1 picks the Phone entry.
contact = soup.find_all("div", class_="x")[1]

print(contact)

Output:

<div class="x">Phone</div>
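Indexing depends on the order of the elements in the page, so it breaks as soon as the layout changes. If you would rather stay with `find_all`, one option is to pass a function as the filter and compare the full class list. This is only a sketch; `has_only_class_x` is a hypothetical helper name, and it assumes the parser's default behaviour of exposing `class` as a list:

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_only_class_x(tag):
    # Hypothetical helper: true only for <div> tags whose class list is exactly ["x"].
    return tag.name == "div" and tag.get("class") == ["x"]


contact = soup.find_all(has_only_class_x)
contact_text = [tag.get_text() for tag in contact]
print(contact_text)
```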

【Discussion】:

【Solution 3】:

How about using sets?

others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others

others will be {<div class="x c">Other</div>} and contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}

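Note that this only works for this particular combination of classes. It may not work correctly, depending on which class combinations appear in your HTML.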
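A set also does not keep document order, and tags that compare equal may collapse into a single entry. If order matters, a list comprehension that checks the class list gives the same split while keeping the elements in sequence. A rough sketch, assuming the same HTML as in the question:

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Every div that carries class "x", in document order.
all_x = soup.find_all("div", class_="x")

# Split on whether the extra class "c" is present.
contacts = [tag for tag in all_x if "c" not in tag.get("class", [])]
others = [tag for tag in all_x if "c" in tag.get("class", [])]

contacts_text = [tag.get_text() for tag in contacts]
others_text = [tag.get_text() for tag in others]
print(contacts_text, others_text)
```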

For more details on how .find_all() works, see "BeautifulSoup webscraping find_all( ): finding exact match".

【Discussion】:
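One detail worth knowing here: when the class value contains a space, `find_all` compares it against the exact string of the `class` attribute, so the order of the class names matters. A CSS selector such as `div.c.x` does not care about order. A small sketch to illustrate (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup

html = "<div class='x'>Address</div><div class='x c'>Other</div>"
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: "x c" matches, "c x" does not.
same_order = soup.find_all("div", class_="x c")
swapped = soup.find_all("div", class_="c x")

# CSS class selectors match both classes regardless of their order.
by_selector = soup.select("div.c.x")

print(len(same_order), len(swapped), len(by_selector))
```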