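【Posted】: 2020-11-16 08:48:09
【Question】:
I am trying to scrape addresses from some HTML elements using the BeautifulSoup library. My intention is to get each address up to (but not including) the last "County". The problem is that "County" occurs twice in every address, once in the street line (e.g. "County Road 375") and once in the county name (e.g. "Williamson County"), so I can't get my script to work.
Source of the three addresses: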
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>
This is what their text nodes look like once extracted:
['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',', 'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', '']
['', 'Business Address:', '28 County Road 884', 'Rainsville', ',', 'AL', '35986', 'DeKalb County', '']
['', 'Business Address:', '650 County Road 375', 'Jarrell', ',', 'TX', '76537', 'Williamson County', 'YOUnity Clothing Website', '']
Expected output:
Business Address: 39829 County Road 452 Leesburg , FL 32788
Business Address: 28 County Road 884 Rainsville , AL 35986
Business Address: 650 County Road 375 Jarrell , TX 76537
What I have tried so far:
from bs4 import BeautifulSoup
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
"""
soup = BeautifulSoup(html,"lxml")
address = []
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip():break
    address.append(i.string.strip())
print(' '.join(address).strip())
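Unfortunately, the attempt above only produces Business Address:, because it breaks out of the loop at the first "County" it encounters ("County Road 452"), whereas my goal is to stop at the last one.
How can I get the desired portion of the address?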
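One possible approach, offered only as a sketch: collect all of the text fragments first, find the position of the last fragment containing "County", and keep everything before it. The helper name address_before_last_county is hypothetical, and the sketch uses the built-in "html.parser" instead of "lxml" purely to avoid the extra dependency. It assumes the county name is always the last fragment that contains the word "County".

```python
from bs4 import BeautifulSoup

# Two of the blocks from the question: one with a website button, one without.
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>
"""


def address_before_last_county(div):
    """Join the text fragments of `div` that precede the last 'County' fragment."""
    # stripped_strings yields every text node, stripped, skipping empty ones.
    parts = list(div.stripped_strings)
    county_positions = [i for i, part in enumerate(parts) if "County" in part]
    if county_positions:
        # Cut at the LAST match, so "County Road ..." in the street line survives.
        parts = parts[:county_positions[-1]]
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
addresses = [
    address_before_last_county(div)
    for div in soup.select(".bizgrid_hdr_address")
]
for address in addresses:
    print(address)
```

Because the cut happens at the last match rather than the first, the trailing website link text is dropped along with the county name, and a block with no "County" at all is returned unchanged.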
【Comments】:
- Try soup.select_one(".bizgrid_hdr_address").text.replace('\n', '')
- Either you didn't read the description or you didn't understand what I'm after, @JaSON. Thanks anyway.
Tags: python python-3.x web-scraping beautifulsoup