Python BeautifulSoup not extracting every URL

【Question title】: Python BeautifulSoup not extracting every URL
【Posted】: 2022-01-09 17:58:15
【Question】:

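I am trying to find all of the URLs on this page: https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments

More specifically, I want the hyperlink under each "Subject Code". However, when I run my code, almost no links are extracted.

I would like to know why this happens and how to fix it.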

from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"

page = requests.get(url)
soup = BeautifulSoup(page.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

This is my first attempt at web scraping.

【Comments】:

Tags: python url beautifulsoup extract

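【Solution 1】:

The site has anti-bot protection; just add a User-Agent to your request headers. When something goes wrong, don't forget to inspect your soup.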

from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
ua={'User-Agent':'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_8_2) AppleWebKit/531.2 (KHTML, like Gecko) Chrome/26.0.869.0 Safari/531.2'}
r = requests.get(url, headers=ua)
soup = BeautifulSoup(r.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))
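
The loop above prints the raw href values, which can include None (anchors without an href) and, possibly, relative paths. As a minimal sketch, assuming the page uses relative links, the values could be resolved against the page URL with the standard library's urljoin. The helper name absolute_links and the sample href values are hypothetical, not taken from the page.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve href values against page_url, skipping missing or empty ones."""
    return [urljoin(page_url, href) for href in hrefs if href]


page_url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
# Hypothetical href values, standing in for
# [a.get('href') for a in soup.find_all('a')]
sample = [None, "/cs/courseschedule?pname=subjarea&tname=subj-department&dept=CPSC", ""]
for link in absolute_links(page_url, sample):
    print(link)
```

In the answer's code, the list passed in would be built from soup.find_all('a') rather than typed by hand.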

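Before the User-Agent header was added, the soup contained this message:

Sorry for the inconvenience.

We have detected excessive or unusual web requests coming from your browser and cannot determine whether these requests are automated.

To proceed to the requested page, please complete the captcha below.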
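
One way to "check your soup" automatically is to look for a tell-tale phrase before parsing any links. The following is only a sketch: the function name looks_like_block_page and the marker phrase "captcha" are assumptions based on the message quoted above, and the real block page may be worded differently.

```python
# Assumed marker phrases; adjust them to whatever the block page actually says.
BLOCK_MARKERS = ("captcha",)


def looks_like_block_page(html, markers=BLOCK_MARKERS):
    """Return True if the HTML text contains any of the marker phrases."""
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


# Made-up snippets for illustration, not real responses from the site.
blocked = "<p>To proceed to the requested page, please complete the CAPTCHA below.</p>"
normal = '<table id="mainTable"><tr><td>CPSC</td></tr></table>'
print(looks_like_block_page(blocked))  # True
print(looks_like_block_page(normal))   # False
```

In the code above, this check would be called on r.text right after requests.get, so that a block page is reported instead of silently yielding almost no links.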

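【Comments】:

【Solution 2】:

I would use nth-child(1) to restrict matching to the first column of the table selected by id, then simply extract the .text. If the text contains *, supply a default string meaning no courses are offered; otherwise, append the retrieved course identifier to the base query string: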

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments', headers=headers)
soup = bs(r.content, 'lxml')
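# Default value for subjects marked with '*' (no courses offered)
no_course = ''
base = 'https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-department&dept='
# Map the text of each first-column cell of #mainTable to its department URL
course_info = {i.text:(no_course if '*' in i.text else base + i.text) for i in soup.select('#mainTable td:nth-child(1)')}
print(course_info)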
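
The string concatenation above works when the cell text is a clean subject code. As a hedged variant, the URL could instead be built with urlencode, which also escapes any unexpected characters. The helper name department_url, the whitespace stripping, and the sample cell values are assumptions for illustration; only the endpoint and query parameters come from the answer's base URL.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the answer's base URL.
ENDPOINT = "https://courses.students.ubc.ca/cs/courseschedule"


def department_url(cell_text, no_course=""):
    """Return the department URL for one first-column cell, or no_course if starred."""
    code = cell_text.strip()  # assumption: cells may carry surrounding whitespace
    if "*" in code:
        return no_course
    query = urlencode({"pname": "subjarea", "tname": "subj-department", "dept": code})
    return f"{ENDPOINT}?{query}"


# " CPSC " and "XXXX *" are made-up cell values, not data taken from the page.
print(department_url(" CPSC "))
print(repr(department_url("XXXX *")))
```

It could replace the conditional expression inside the dictionary comprehension, for example {i.text: department_url(i.text) for i in soup.select('#mainTable td:nth-child(1)')}.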

【Comments】: