Unable to scoop out desired portion of address out of long ones

【Question title】: Unable to scoop out desired portion of address out of long ones
【Posted】: 2020-11-16 08:48:09
【Question】:

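I am trying to scrape addresses from some HTML elements using the BeautifulSoup library. My intention is to grab each address up to the last County (as the expected output below shows, the county line itself is left out). The problem I am facing is that County appears twice in every address, once in the street name and once in the county name, so I can't get my script to work.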

Source for the three addresses:

<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&amp;P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&amp;P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&amp;P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>

This is what their strings look like:

['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',', 'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', '']

['', 'Business Address:', '28 County Road 884', 'Rainsville', ',', 'AL', '35986', 'DeKalb County', '']

['', 'Business Address:', '650 County Road 375', 'Jarrell', ',', 'TX', '76537', 'Williamson County', 'YOUnity Clothing Website', '']

Expected output:

Business Address: 39829 County Road 452 Leesburg , FL 32788
Business Address: 28 County Road 884 Rainsville , AL 35986
Business Address: 650 County Road 375 Jarrell , TX 76537

What I have tried so far:

from bs4 import BeautifulSoup

html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
"""
soup = BeautifulSoup(html,"lxml")
address = []
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip():break
    address.append(i.string.strip())
print(' '.join(address).strip())

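Unfortunately, the attempt above produces only Business Address:, because it hits the first County (in the street name) and breaks out of the loop, whereas my goal is to stop at the last County.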
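A quick way to confirm where the loop stops is to list the text of each direct child together with the positions that mention County. This is only a diagnostic sketch: the HTML is a trimmed copy of one block above (title attributes dropped), and it uses the built-in "html.parser" instead of "lxml" so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one sample block; title attributes removed for brevity
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a href="/fl/lake/leesburg">Leesburg</a>, <a href="/fl">FL</a> 32788<br><a href="/fl/lake">Lake County</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.select_one(".bizgrid_hdr_address")

# Stripped text of every direct child that has a .string (<br> has none)
texts = [child.string.strip() for child in div if child.string]
# Positions of the children whose text mentions 'County'
county_positions = [i for i, text in enumerate(texts) if "County" in text]

for i, text in enumerate(texts):
    print(i, repr(text))
print("County found at:", county_positions)
```

The first match is the street line, which is why breaking on the first County leaves only the Business Address: label.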

How can I get the desired portion of the address?

【Comments】:

Try soup.select_one(".bizgrid_hdr_address").text.replace('\n', '')
Either you didn't read the description or you didn't understand what I'm after, @JaSON. Thanks anyway.

Tags: python python-3.x web-scraping beautifulsoup

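【Solution 1】:

I haven't tested the code, but the idea is to use some kind of flag. The first time County is encountered, the flag is set to 1. The second time, the loop breaks.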

...
soup = BeautifulSoup(html,"lxml")
address = []

flag = 0
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip() and flag:
        break
    if 'County' in i.string.strip(): 
        flag = 1
    address.append(i.string.strip())
print(' '.join(address).strip())
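The flag approach assumes County occurs exactly twice. A minimal sketch that does not depend on the count is to collect all strings first and cut at the last one that mentions County. The helper name address_before_last_county is hypothetical, the HTML is a trimmed copy of one sample block, and "html.parser" is used instead of "lxml" only so the sketch runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one sample block; title attributes removed for brevity
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def address_before_last_county(div):
    """Join the div's strings up to, but excluding, the last one containing 'County'."""
    # stripped_strings yields every non-empty text node, already stripped
    parts = list(div.stripped_strings)
    # Indices of all parts mentioning 'County', however many there are
    hits = [i for i, part in enumerate(parts) if "County" in part]
    if hits:
        parts = parts[:hits[-1]]
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
result = address_before_last_county(soup.select_one(".bizgrid_hdr_address"))
print(result)
```

This also covers an address with a single County at the end. It would cut too early if an address had a County Road street but no county line at all, so that case would need a separate check.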

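【Comments】:

What if there is only one County, at the end of the long address? I would still want the address up to that County.
You don't need that semicolon in flag = 0; Also, in Python it's nicer to put continue and break on their own indented lines.
@MITHU Are there cases with 1, 2, or more occurrences of County?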

【Solution 2】:

# Positions of street, city, state and ZIP code in each div's list of strings
b = "2356"
for x in soup.select(".col_biz"):
    x = [i.strip() for i in list(x.strings)]
    goal = [x[int(c)] for c in b]
    print(*goal)

Output:

650 County Road 375 Jarrell TX 76537
39829 County Road 452 Leesburg FL 32788
28 County Road 884 Rainsville AL 35986

Or:

goal = [(x.contents[3].strip(), x.contents[5]['title'].split("in ")[-1].strip())
        for x in soup.select(".col_biz")]

Output:

[('650 County Road 375', 'Jarrell, TX'), ('39829 County Road 452', 'Leesburg, FL'), ('28 County Road 884', 'Rainsville, AL')]
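Both variants above rely on fixed positions inside each div. As an alternative sketch, one can rely on the tag type instead: in the sample HTML the county is always a link whose text ends with County, while County Road in the street is plain text. The helper name address_until_county_link is hypothetical, the HTML is trimmed from the samples in the question, and "html.parser" stands in for "lxml" so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copies of two sample blocks; title attributes removed for brevity
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a href="http://www.younityclothing.com">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def address_until_county_link(div):
    """Collect the div's text parts until the first <a> whose text ends with 'County'."""
    parts = []
    for child in div.children:
        if isinstance(child, Tag):
            text = child.get_text(strip=True)
            # The county is a link; 'County Road' in the street is a bare text node
            if child.name == "a" and text.endswith("County"):
                break
        else:
            text = str(child).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
results = [address_until_county_link(div) for div in soup.select(".col_biz")]
for line in results:
    print(line)
```

This only holds as long as the page keeps rendering the county as a link, which is an assumption drawn from the three samples.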

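【Comments】:

【Solution 3】:

I'm not sure whether this holds for most of the HTML, but the unwanted link text after the address contains the word Website, so you can filter on that.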

For example:

from bs4 import BeautifulSoup

html = """<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&amp;P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&amp;P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&amp;P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>
"""
output = []
for div in BeautifulSoup(html, "lxml").select(".bizgrid_hdr_address"):
    for item in div:
        if item.string and item.string.strip():
            text = item.string.strip()
            if "Website" in text:
                continue
            output.append(text)

addresses = [output[i:i+7] for i in range(0, len(output), 7)]
for address in addresses:
    print(" ".join(address).replace(" ,", ","))

This gives you:

Business Address: 650 County Road 375 Jarrell, TX 76537 Williamson County
Business Address: 39829 County Road 452 Leesburg, FL 32788 Lake County
Business Address: 28 County Road 884 Rainsville, AL 35986 DeKalb County
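Chunking the flat list by 7 assumes every block yields exactly seven strings, and the output above still ends with the county. A sketch that builds one list per div and optionally drops a trailing county could look like the following. The helper name clean_address and its drop_county parameter are hypothetical, the HTML is trimmed from the samples above, and "html.parser" replaces "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed copies of two sample blocks; title attributes removed for brevity
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a href="http://www.younityclothing.com">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def clean_address(div, drop_county=True):
    """Join one div's strings, skipping 'Website' links and, optionally, a trailing county."""
    # Work per div so a block with more or fewer strings cannot shift its neighbours
    parts = [s for s in div.stripped_strings if "Website" not in s]
    if drop_county and parts and parts[-1].endswith("County"):
        parts.pop()
    # Re-attach the comma that sits in its own text node
    return " ".join(parts).replace(" ,", ",")


soup = BeautifulSoup(html, "html.parser")
addresses = [clean_address(div) for div in soup.select(".bizgrid_hdr_address")]
for address in addresses:
    print(address)
```

Passing drop_county=False keeps the county and reproduces the format shown above.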

【Comments】: