Web Scraping with Beautiful Soup: multiple duplicate tags (answers)

【Question title】: Web Scraping with Beautiful Soup: multiple duplicate tags
【Posted】: 2026-01-09 12:10:01
【Question description】:

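This is my first time web scraping, and I am following a tutorial. I am using this website to scrape information. I am trying to get the text "89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720". I noticed that my div class_=alert tag contains multiple ul and li tags, so I don't know how to grab a specific one. This is what I tried, but I keep getting different text from another ul/li group.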

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.pickyourownchristmastree.org/ORxmasnw.php').text

soup = BeautifulSoup(source, 'lxml')

noble_ridge = soup.find('div', class_='alert')
information = noble_ridge.ul.li.text
print(information)
# print(soup.prettify())


C:\Users\name\anaconda3\envs\Scraping\python.exe C:/Users/name/PycharmProjects/Scraping/Christmas_tree_farms.py
If the name of the farm is blue with an underline; that's a link to their website. Click on it for the most current hours and information.

Process finished with exit code 0

【Question comments】:

Hello, dear Zman3, good day. Many thanks for picking up Curey Schafer's scraper. Great, really great. Keep up the great work, it rocks.

Tags: web-scraping beautifulsoup pycharm

【Solution 1】:

import requests
from bs4 import BeautifulSoup


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
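    # First element on the page that matches span.farm
    target = soup.select_one("span.farm")
    # Index 5 is the sixth node after the span; rsplit drops its last two words
    goal = list(target.next_elements)[5].rsplit(" ", 2)[0]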
    print(goal)


main("https://www.pickyourownchristmastree.org/ORxmasnw.php")

Output:

89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720.
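
For anyone trying to follow what next_elements and rsplit are doing, here is a minimal offline sketch. The markup, the farm name "Example Farm" and the trailing "Open: daily" are assumptions made up for illustration, not the real page. Instead of the fixed index [5], it takes the first text node that mentions "Phone:", which may hold up better if the page layout shifts.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup that only imitates the shape of one farm entry.
SAMPLE_HTML = (
    '<div><span class="farm">Example Farm</span> - trees, wreaths'
    '<ul><li>89426 Green Mountain Road, Astoria, OR 97103. '
    'Phone: 503-325-9720. Open: daily</li></ul></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
target = soup.select_one("span.farm")

# next_elements yields tags and text nodes in document order after the span.
# Take the first text node that mentions "Phone:" rather than a fixed index.
address_line = next(
    el for el in target.next_elements
    if isinstance(el, NavigableString) and "Phone:" in el
)

# rsplit(" ", 2) splits twice from the right; [0] keeps everything before
# the last two space-separated words.
goal = address_line.rsplit(" ", 2)[0]
print(goal)
```

How many trailing words to drop depends on the real text, so the 2 in rsplit would need adjusting for other entries.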

Using re:

import requests
import re


def main(url):
    r = requests.get(url)
    match = [item.group(1) for item in re.finditer(r'>(\d.+\d{4})\.', r.text)]
    print(match[0])


main("https://www.pickyourownchristmastree.org/ORxmasnw.php")

Output:

89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720
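
To see why that pattern lands on the address, here is a small sketch that runs the same regex against a made-up HTML string (the sample markup is an assumption, not the real page source). The pattern requires a digit right after a ">", then greedily consumes the rest of the line and backs off to the last place where four digits are followed by a period.

```python
import re

# Hypothetical two-line sample; only the second list item starts with a digit.
SAMPLE_HTML = (
    "<ul><li>If the name is underlined, it is a link.</li></ul>\n"
    "<ul><li>89426 Green Mountain Road, Astoria, OR 97103. "
    "Phone: 503-325-9720. Open: daily</li></ul>\n"
)

# Same pattern as in the answer: '>' + digit ... four digits, then a period.
pattern = re.compile(r'>(\d.+\d{4})\.')

# group(1) is the captured text without the leading '>' and trailing '.'.
matches = [m.group(1) for m in pattern.finditer(SAMPLE_HTML)]
print(matches[0])
```

Because ".+" is greedy, a later "dddd." on the same line would stretch the match further, so this approach is sensitive to what else appears in the entry.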

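【Comments】:

Sorry, I ran into problems with yours and was looking for something more like @atinjanki's. Since I am new to this, I wanted to keep it close to the tutorial.
@Zman3 Which problems did you run into? I have run both snippets without any issues.
Hmm, it must have been a typo. It works now. I am a bit confused about how it works and how to modify it. Would you mind chatting in a room?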
@Zman3 chat.*.com/rooms/212113/python-scraping

【Solution 2】:

There are many ul tags inside noble_ridge.

Using

noble_ridge.ul

takes you to the first ul tag found. See the image below.

The text you expect, "89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720", is under the next ul tag.

So, if you want to go straight there, you can use:

noble_ridge.find_all('ul')[1].li.text
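
The difference between the two lookups can be tried offline with a rough sketch like this one. The HTML below is a hypothetical stand-in that only imitates a div.alert holding several ul lists; it is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div.alert that contains two separate <ul> lists.
SAMPLE_HTML = """
<div class="alert">
  <ul><li>General note about farm links.</li></ul>
  <ul><li>89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720.</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
alert = soup.find("div", class_="alert")

# Attribute-style access (.ul) behaves like find("ul"): first match only.
first_item = alert.ul.li.get_text(strip=True)

# find_all returns every match in document order, so [1] is the second list.
second_item = alert.find_all("ul")[1].li.get_text(strip=True)

print(first_item)
print(second_item)
```

A fixed index such as [1] only works while the page keeps the same order of lists, which is why searching for the text can be the safer route.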

Or you can loop over all the tags and look for your text, for example:

import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.pickyourownchristmastree.org/ORxmasnw.php').text

soup = BeautifulSoup(source, 'lxml')

noble_ridge = soup.find('div', class_='alert')

# Every ul inside the div, in document order
ultags = noble_ridge.find_all('ul')

temp = '89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720'

for tag in ultags:
    litags = tag.find_all('li')
    #print(litags)
    for li in litags:
        tx = li.get_text()
        #print(tx)
        # str.find returns -1 when temp is not a substring of tx
        if tx.find(temp) > -1:
            print(tag)

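This gives you the ul tag that contains the text.

【Comments】:

Could you explain what the -1 is about?
@Zman3 tx.find(temp) looks for temp (a substring) inside tx. The find() function returns -1 when the specified substring is not found. Read more here - docs.python.org/2/library/string.html#string.find
OK, that makes sense. Thanks!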
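
A quick way to see the -1 behaviour for yourself, as a small sketch using the address string from the question (the search terms "Astoria" and "Portland" are just examples picked for illustration):

```python
tx = "89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720"

# find() returns the index of the first occurrence of the substring...
found_at = tx.find("Astoria")

# ...and -1 when the substring does not occur at all.
missing = tx.find("Portland")

print(found_at, missing)

# When only a yes/no answer is needed, the "in" operator reads more directly.
print("Astoria" in tx, "Portland" in tx)
```

Checking "> -1" matters because a match at the very start of the string returns index 0, which is falsy if tested directly.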