【Question title】: beautifulsoup4 not working for strings inside nested tags
【Posted】: 2023-01-12 15:14:15
【Question】:

I want to get the string "AKA" from a listings site, but the find_all function does not return any values.

import requests
from bs4 import BeautifulSoup

# Set the URL you want to scrape
url = 'https://classified.azcentral.com/azcentral-marketplace/category/Legals/Maricopa%20County'

# Use requests to get the contents
r = requests.get(url)

# Get the text of the contents
html_content = r.text

# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, "html.parser")

# Find all the sections containing the string "SHERIFF'S NOTICE OF SALE OF REAL PROPERTY"
sections = soup.find_all(string="NOTICE OF SALE")
print(sections)

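I searched previous answers and spent about an hour trying to implement their solutions, but none of them have worked so far. I also read the documentation for the string argument, but maybe I don't understand it.

I expect 15 "AKA" strings, but no matter what I do, I get zero. This is Python 3 on Ubuntu 18.04.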

【Comments】:

    Tags: python web-scraping beautifulsoup


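    【Solution 1】:

    Passing a plain string to find_all() via string= only matches text nodes whose entire content equals that string, so it finds nothing when the phrase is part of a longer run of text. When a tag name is given as well, the filter is tested against the tag's .string, which is None for any tag with more than one child. You could broaden the search to find <div> tags containing any child that mentions the phrase of interest, but the problem is that it would also match the <div> that contains the entire page.

    Instead, I would suggest using CSS classes. Looking at the page's HTML, the .panel-body class appears on every ad. This code searches for all matches of .panel-body: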
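
    To make the matching rules concrete, here is a minimal sketch that runs offline. The HTML snippet is hypothetical and only stands in for one ad; it is not copied from the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a single ad (not the real page).
html = """
<div class="panel-body">
  <p>CV2021-051400 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY</p>
  <p>NOTICE OF SALE</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so only the second <p>
# (whose text is exactly the phrase) is found.
exact = soup.find_all(string="NOTICE OF SALE")

# A compiled regex is applied with search(), so substrings match as well.
partial = soup.find_all(string=re.compile("NOTICE OF SALE"))

# With a tag name, the filter is tested against tag.string, which is None
# for a tag that has several children, so the outer <div> is not matched.
divs = soup.find_all("div", string=re.compile("NOTICE OF SALE"))

print(len(exact), len(partial), len(divs))
```

    The regex form returns the matching text nodes themselves, so you would still need to walk up to the enclosing ad from each one.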

    for section in soup.find_all("div", class_="panel-body"):
        print(section.text.strip()[:80])  # print just the first 80 characters of each match
    

    Output:

    MarketPlace is where you can find anything you need! Simply choose a category fo
    MARICOPA COUNTY NOTICE OF CALL FOR BIDS   NOTICE IS HEREBY GIVEN that sealed bid
    CV2021-051400 C22011672 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    NO. PB2016-051918 NOTICE OF INITIAL HEARING  REGARDING: PETITION FOR  APPROVAL O
    CV2022-003436 C22011714 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    CV2021011535 C22011653 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXEC
    CV2022-091920 C22011708 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    CV2020-055896 C22011668 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    CV2022-050418 C22011669 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    CV2020-009284 C22011711 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    CV2021-014484 C22011666 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    NO. PB2022-050058 NOTICE TO CREDITORS (PUBLICATION) (Assigned to Honorable Vanes
    CV2021015245 C22011660 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXEC
    Case No. PB1992-004227 NOTICE OF INITIAL HEARING  REGARDING: PETITION TO  TERMIN
    Case No. PB2020-005222 NOTICE OF INITIAL HEARING  REGARDING: PETITION TO  TERMIN
    Case No. PB2020-000142 NOTICE OF INITIAL HEARING  REGARDING:PETITION TO  TERMINA
    Case No. PB2021-005139 NOTICE OF INITIAL HEARING  REGARDING: PETITION TO  TERMIN
    CV2022-010475 C22011118 SHERIFF'S NOTICE OF SALE OF REAL ESTATE ON EXECUTION  IN
    Case No. PB2022-005749 NOTICE OF INITIAL HEARING  REGARDING: PETITION FOR  APPOI
    CV2022-001756 C22010874 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    CV2022-001946 C22010896 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    Case No. PB2015-003466 NOTICE OF INITIAL HEARING  REGARDING: PETITION TO  TERMIN
    Case No. PB2016-001049 NOTICE OF INITIAL HEARING  REGARDING: PETITION FOR APPROV
    CV2021-093163 C22010867 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    CV2022-051687 C22010863 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
    Case No. PB2022-005813 NOTICE OF INITIAL HEARING  REGARDING: PETITION FOR APPOIN
    

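    Hmm, this looks mostly right, except for the first one. There is a block of text at the top of the page that uses the same CSS class. You can filter it out by always dropping the first match: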

    for section in soup.find_all("div", class_="panel-body")[1:]:
        print(section.text.strip()[:80])
    

    Or you can leave it in. The next step will get rid of it anyway.

    Next, you only care about the ads that contain "NOTICE OF SALE".

    for section in soup.find_all("div", class_="panel-body"):
        if "NOTICE OF SALE" in section.text:
            print(section.text.strip()[:80])
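
    One caveat: the output above shows runs of double spaces inside some ads, so a plain `in` test could miss a phrase that is split by extra whitespace or a line break. Whether that actually happens to "NOTICE OF SALE" on this page is an assumption. The sketch below uses hypothetical markup and a hypothetical helper, normalized_text, to collapse whitespace before testing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the phrase is split by a <br> and extra whitespace.
html = """
<div class="panel-body">CV0000-000000 SHERIFF'S NOTICE<br>
   OF  SALE OF REAL PROPERTY</div>
<div class="panel-body">NOTICE OF INITIAL HEARING</div>
"""
soup = BeautifulSoup(html, "html.parser")


def normalized_text(tag):
    """Return the tag's text with every whitespace run collapsed to one space."""
    # get_text(" ") joins the text nodes with a space so that words on
    # either side of a <br> do not run together.
    return " ".join(tag.get_text(" ").split())


matches = [
    normalized_text(div)
    for div in soup.find_all("div", class_="panel-body")
    if "NOTICE OF SALE" in normalized_text(div)
]
print(matches)
```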
    

    Next, you probably want to save each full ad as a string.

    notice_of_sale_ads = []
    for section in soup.find_all("div", class_="panel-body"):
        if "NOTICE OF SALE" in section.text:
            notice_of_sale_ads.append(section.text.strip())
    

    When I run this, I get 14 matches. (That is slightly different from the 15 you expected, but I get the same number in my browser.)
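
    Since the question was about finding "AKA", a possible last step is to count its occurrences in each saved ad. This is only a sketch: the ad texts below are made up to stand in for notice_of_sale_ads, and the choice to ignore case and to require word boundaries is an assumption about how the ads are written.

```python
import re

# Hypothetical ad texts standing in for the contents of notice_of_sale_ads.
notice_of_sale_ads = [
    "SHERIFF'S NOTICE OF SALE ... JOHN DOE AKA JOHNNY DOE ...",
    "SHERIFF'S NOTICE OF SALE ... 123 OSAKA COURT ...",
    "SHERIFF'S NOTICE OF SALE ... JANE ROE aka JANE R. ROE AKA J. ROE ...",
]

# \b keeps "AKA" from matching inside longer words such as "OSAKA".
# Ignoring case is an assumption; drop the flag to match upper case only.
aka_pattern = re.compile(r"\bAKA\b", re.IGNORECASE)

# Number of "AKA" occurrences in each ad, in the same order as the list.
counts = [len(aka_pattern.findall(ad)) for ad in notice_of_sale_ads]

# Number of ads that mention "AKA" at least once.
ads_with_aka = sum(1 for n in counts if n > 0)

print(counts, ads_with_aka)
```

    Variants such as "A.K.A." would need a broader pattern.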
