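【Question title】: Unable to scoop out desired portion of address out of long ones
【Posted】: 2020-11-16 08:48:09
【Question】:

I am trying to scrape addresses from some HTML elements using the BeautifulSoup library. My intention is to get each address up to, but not including, the last County. The problem is that every address contains the word County twice, so I cannot get my script to work.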

Source of the three addresses:

<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&amp;P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&amp;P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&amp;P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>

This is what their strings look like:

['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',', 'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', '']

['', 'Business Address:', '28 County Road 884', 'Rainsville', ',', 'AL', '35986', 'DeKalb County', '']

['', 'Business Address:', '650 County Road 375', 'Jarrell', ',', 'TX', '76537', 'Williamson County', 'YOUnity Clothing Website', '']

Expected output:

Business Address: 39829 County Road 452 Leesburg , FL 32788
Business Address: 28 County Road 884 Rainsville , AL 35986
Business Address: 650 County Road 375 Jarrell , TX 76537

What I have tried so far:

from bs4 import BeautifulSoup

html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
"""
soup = BeautifulSoup(html,"lxml")
address = []
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip():break
    address.append(i.string.strip())
print(' '.join(address).strip())

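Unfortunately, the attempt above only produces Business Address:, because it hits the first County (the one in the street name) and breaks out of the loop, whereas my goal is to stop at the last County.

How can I get the desired portion of the address?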

【Comments】:

  • Try soup.select_one(".bizgrid_hdr_address").text.replace('\n', '')
  • Either you did not read the description or you did not understand what I am after, @JaSON. Thanks anyway.

Tags: python python-3.x web-scraping beautifulsoup


【Solution 1】:

I have not tested the code, but the idea is to use a flag. The first County encountered sets the flag to 1; the second one breaks the loop.

...
soup = BeautifulSoup(html,"lxml")
address = []

flag = 0  # becomes 1 once the first 'County' has been seen
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip() and flag:
        break  # second 'County': stop before the county name
    if 'County' in i.string.strip():
        flag = 1
    address.append(i.string.strip())
print(' '.join(address).strip())
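
If the number of County occurrences can vary, one option is to locate the last fragment that mentions County and keep everything before it. The following is only a sketch under assumptions: the helper name address_before_last_county is hypothetical, the markup is a trimmed copy of one block from the question, and the built-in "html.parser" is used in place of "lxml" so that no extra parser is required.

```python
from bs4 import BeautifulSoup

html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a href="/fl/lake/leesburg">Leesburg</a>, <a href="/fl">FL</a> 32788<br><a href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a href="http://www.ecosciencesllc.com/">Eco Sciences, LLC Website</a></div>
</div>
"""


def address_before_last_county(div):
    """Join the text fragments that come before the last one containing 'County'."""
    # Non-empty, whitespace-stripped text fragments in document order.
    parts = list(div.stripped_strings)
    # Position of the last fragment mentioning 'County'; keep everything if none does.
    last = max((i for i, p in enumerate(parts) if "County" in p), default=len(parts))
    return " ".join(parts[:last])


soup = BeautifulSoup(html, "html.parser")
result = address_before_last_county(soup.select_one(".bizgrid_hdr_address"))
print(result)
```

Note that if a block has no county line at all and the street name is the only fragment containing County, this sketch would cut the address at the street, so that case would need its own check.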

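【Comments】:

  • What if a long address has only one County, at the end? I would still want the address up to that County.
  • The semicolon in flag = 0; is not needed. Also, in Python it is better to put continue and break on their own indented lines.
  • @MITHU Are there cases with 1, 2, or more occurrences of County?
【Solution 2】: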
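# `soup` is the parsed document containing all three address blocks.
# Positions of the wanted strings in each block: street, city, state, ZIP code.
wanted = (2, 3, 5, 6)
for div in soup.select(".col_biz"):
    strings = [s.strip() for s in div.strings]
    print(*(strings[i] for i in wanted))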

Output:

650 County Road 375 Jarrell TX 76537
39829 County Road 452 Leesburg FL 32788
28 County Road 884 Rainsville AL 35986

Or

goal = [(x.contents[3].strip(), x.contents[5]['title'].split("in ")[-1].strip())
        for x in soup.select(".col_biz")]

Output:

[('650 County Road 375', 'Jarrell, TX'), ('39829 County Road 452', 'Leesburg, FL'), ('28 County Road 884', 'Rainsville, AL')]
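
To see where the positions 2, 3, 5 and 6 come from, it can help to print every string of one block together with its index. This is a small illustrative sketch, assuming a trimmed copy of the third block from the question and the built-in "html.parser"; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

html = """<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.select_one(".col_biz")

# .strings yields every text node, including whitespace-only ones,
# so the stripped list contains empty strings at both ends.
strings = [s.strip() for s in div.strings]
for index, text in enumerate(strings):
    print(index, repr(text))

# Street, city, state and ZIP code sit at positions 2, 3, 5 and 6.
picked = [strings[i] for i in (2, 3, 5, 6)]
print(*picked)
```

Because the approach depends on fixed positions, any block with an extra or missing text node (for example a second address line) would shift the indices, so it is worth checking the listing for a few real blocks first.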

【Comments】:

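【Solution 3】:

I'm not sure whether this will hold up for a larger chunk of HTML, but the link at the end of each block contains the word Website in its anchor text, so you can filter on that.

For example: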

    from bs4 import BeautifulSoup
    
    html = """<div class="col_biz bizgrid_hdr_address">
    <strong>Business Address:</strong><br>
    650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
    </div>
    
    
    <div class="col_biz bizgrid_hdr_address">
    <strong>Business Address:</strong><br>
    39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
    </div>
    
    
    <div class="col_biz bizgrid_hdr_address">
    <strong>Business Address:</strong><br>
    28 County Road 884<br><a title="R&amp;P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&amp;P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&amp;P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
    </div>
    """
    output = []
    for div in BeautifulSoup(html, "lxml").select(".bizgrid_hdr_address"):
        for item in div:
            if item.string and item.string.strip():
                text = item.string.strip()
                if "Website" in text:
                    continue
                output.append(text)
    
    addresses = [output[i:i+7] for i in range(0, len(output), 7)]
    for address in addresses:
        print(" ".join(address).replace(" ,", ","))
    

This gives you:

    Business Address: 650 County Road 375 Jarrell, TX 76537 Williamson County
    Business Address: 39829 County Road 452 Leesburg, FL 32788 Lake County
    Business Address: 28 County Road 884 Rainsville, AL 35986 DeKalb County
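
Splitting the flat list into chunks of 7 relies on every block producing exactly seven strings. A possible variation, shown here only as a sketch, builds one line per div instead and removes the nested .bizbtn block before reading the text; it assumes trimmed copies of two blocks from the question and the built-in "html.parser" rather than "lxml".

```python
from bs4 import BeautifulSoup

html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a href="http://www.younityclothing.com">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
lines = []
for div in soup.select(".bizgrid_hdr_address"):
    # Remove the nested button block so its "... Website" link text never appears.
    for button in div.select(".bizbtn"):
        button.decompose()
    # One list of non-empty text fragments per block, whatever its length.
    parts = list(div.stripped_strings)
    lines.append(" ".join(parts).replace(" ,", ","))

for line in lines:
    print(line)
```

Since each block is handled on its own, a block with more or fewer fragments no longer shifts the boundaries of the blocks that follow it.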
    

【Comments】:
