Using Beautiful Soup to extract specific groups of links from a blogspot website

【Question title】: Using Beautiful Soup to extract specific groups of links from a blogspot website
【Posted】: 2021-02-16 16:02:38
【Question】:

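I want to extract every year 7 link from a school website. In the archive they are easy to find with ctrl + f "year-7". With BeautifulSoup it is not that easy, though. Or I am doing it wrong.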

import requests
from bs4 import BeautifulSoup

URL = '~school URL~'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

This gives me every link in the site archive. Every link that matters to me looks more or less like this:

~school URL~blogspot.com/2020/10/mathematics-activity-year-x.html

I tried storing "(link.get('href'))" in a variable and searching it for "year-x", but that did not work.

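Any ideas on how to search for it? Blogspot's search is terrible. I am doing this to help kids in a poor area find their lessons more easily, because the posts were all just left on the site for the next school year, and there are hundreds of untagged links for different school years. I am trying to at least compile a list of links for each school year to help them.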

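【Comments】:

So if I understand correctly, you want to get all the links for year-7?
Yes! And then do the same for year 8, year 9...
Can you edit your question to include some other HTML links (what you want and what you don't want)?
It would be something like this: ~school URL~blogspot.com/2020/10/geography-activity-year-1.html ~school URL~blogspot.com/2020/10/history-activity-year-3.html ~school URL~blogspot.com/2020/10/english-activity-year-8.html and so on. It is all mixed together.
What is the actual HTML markup?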

Tags: python beautifulsoup automation

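【Solution 1】:

If I understand correctly, you want to extract the year from the links. Try using a regex to extract the year.

In your case that would be: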

import re
from bs4 import BeautifulSoup

txt = """<a href="blogspot.com/2020/10/mathematics-activity-year-x.html"></a>"""
soup = BeautifulSoup(txt, "html.parser")

years = []

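for tag in soup.find_all("a"):
    link = tag.get("href")
    # Skip anchors without an href and links that contain no "year-" part.
    match = re.search(r"year-.?", link) if link else None
    if match:
        years.append(match.group())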

print(years)

Output:

['year-x']

EDIT: Try using a CSS selector to select every href that ends with year-7.html

...
for tag in soup.select('a[href$="year-7.html"]'):
    print(tag)
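
For reference, here is a self-contained version of the selector approach, as a minimal sketch. The HTML snippet and the example.blogspot.com URLs are made-up placeholders rather than the real school site. The `$=` operator matches the end of the attribute value, so only links ending in year-7.html are kept, and `<a>` tags without an href are skipped automatically.

```python
from bs4 import BeautifulSoup

# Hypothetical archive markup (assumption); the real page will look different.
html = """
<a href="https://example.blogspot.com/2020/10/geography-activity-year-1.html">Geography</a>
<a href="https://example.blogspot.com/2020/10/mathematics-activity-year-7.html">Maths</a>
<a href="https://example.blogspot.com/2020/10/english-activity-year-8.html">English</a>
<a href="https://example.blogspot.com/2020/10/history-activity-year-7.html">History</a>
<a name="no-href">anchor without an href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# a[href$="..."] selects <a> tags whose href ends with the given text.
year_7_links = [tag["href"] for tag in soup.select('a[href$="year-7.html"]')]

for url in year_7_links:
    print(url)
```

To collect another year, only the suffix in the selector needs to change, for example `'a[href$="year-8.html"]'`.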

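【Comments】:

Sorry if I was not clear, I will try to rephrase: after I print every URL in the site archive, I have activities for every year, like this: ~URL~.blogspot.com/2020/10/math-activity-year-2.html ~URL~.blogspot.com/2020/10/math-activity-year-9.html ~URL~.blogspot.com/2020/10/math-activity-year-5.html times hundreds... lots of links. So I want to search for every URL that contains "year-7", so that I can collect all the year 7 links in one place. How can I do that?
I managed to finish this project using pandas, but thanks for trying with me!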
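
The per-year list asked for in the comments can also be built in one pass with the standard library alone. This is only a sketch under assumptions, not the pandas solution the asker ended up with (that code was never posted). The `hrefs` list, the `group_by_year` helper, and the example.blogspot.com URLs are hypothetical. The sketch assumes every relevant link ends in `year-<number>.html`.

```python
import re
from collections import defaultdict

# Hypothetical hrefs (assumption), standing in for the values that
# link.get('href') returns while looping over soup.find_all('a').
hrefs = [
    "https://example.blogspot.com/2020/10/math-activity-year-2.html",
    "https://example.blogspot.com/2020/10/math-activity-year-9.html",
    "https://example.blogspot.com/2020/10/history-activity-year-7.html",
    "https://example.blogspot.com/2020/10/math-activity-year-7.html",
    None,  # find_all('a') can yield tags that have no href at all
    "https://example.blogspot.com/p/about.html",
]

# Capture the year number at the very end of the URL.
YEAR_PATTERN = re.compile(r"year-(\d+)\.html$")


def group_by_year(links):
    """Return a dict mapping each year number to the list of matching links."""
    grouped = defaultdict(list)
    for href in links:
        if not href:  # skip None and empty strings
            continue
        match = YEAR_PATTERN.search(href)
        if match:
            grouped[int(match.group(1))].append(href)
    return dict(grouped)


links_by_year = group_by_year(hrefs)

for year in sorted(links_by_year):
    print(f"year-{year}:")
    for url in links_by_year[year]:
        print("   ", url)
```

Anchoring the pattern with `\.html$` keeps year-1 from also matching year-10 or year-11, which a plain substring test such as `"year-1" in href` would not do.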