Scraping URLs from multiple webpages

【Question title】: Scraping urls from multiple webpages
【Posted】: 2020-05-28 11:15:30
【Question description】:

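I am trying to extract URLs from multiple webpages (2 in this example), but for some reason my output is a repeated list of the URLs extracted from the first page. What am I doing wrong?

My code: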

import requests
from bs4 import BeautifulSoup

# URLs of books in scope
urls = []
for pn in range(2):
    baseUrl = 'https://www.goodreads.com'
    path = '/shelf/show/bestsellers?page='+str(pn+1)
    page = requests.get(baseUrl + path).text
    print(baseUrl+path)
    soup = BeautifulSoup(page, "html.parser")
    for link in soup.find_all('a',attrs={'class':"leftAlignedImage"}):
        # Skip author links; keep only the book links
        if link['href'].startswith('/author/show/'):
            pass
        else:
            u=baseUrl+link['href']
            urls.append(u)
for u in urls:
    print(u)

Output:

https://www.goodreads.com/shelf/show/bestsellers?page=1
https://www.goodreads.com/shelf/show/bestsellers?page=2
https://www.goodreads.com/book/show/5060378-the-girl-who-played-with-fire
https://www.goodreads.com/book/show/968.The_Da_Vinci_Code
https://www.goodreads.com/book/show/4667024-the-help
https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo
https://www.goodreads.com/book/show/3.Harry_Potter_and_the_Sorcerer_s_Stone
.
.
.
https://www.goodreads.com/book/show/4588.Extremely_Loud_Incredibly_Close
https://www.goodreads.com/book/show/36809135-where-the-crawdads-sing
.
.
.
https://www.goodreads.com/book/show/4588.Extremely_Loud_Incredibly_Close
https://www.goodreads.com/book/show/36809135-where-the-crawdads-sing

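【Comments on the question】:

attrs={'class':"elementList",'class':"leftAlignedImage"} looks suspicious. A Python dict cannot contain the same key twice.
You get the same page both times. The page=2 URL parameter does nothing; the same page is simply loaded again.
@AndrejKesely That is not the problem, but you are right, I forgot to remove the first class and have just edited it. The problem remains: even though requests fetches a different URL on the second loop iteration, it still seems to use the first URL.

Tags: html python-3.x web-scraping beautifulsoup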

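【Solution 1】:

You get duplicate URLs because you are loading the same page twice. If you are not logged in, the site shows only the first page of bestsellers, even when you set page=2.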
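One way to confirm this diagnosis is to record the links collected from each page separately and check whether any two pages produced exactly the same list. The sketch below is only an illustration: the helper name `find_repeated_pages` and the `links_by_page` mapping are hypothetical and do not appear in the original code, and the sample data is a shortened stand-in for the real scrape results.

```python
def find_repeated_pages(links_by_page):
    """Return (page, earlier_page) pairs whose link lists are identical.

    links_by_page is a hypothetical dict mapping a page number to the
    list of URLs scraped from that page, in the order they were found.
    """
    first_page_with = {}  # tuple of links -> first page that produced it
    repeats = []
    for page in sorted(links_by_page):
        key = tuple(links_by_page[page])
        if key in first_page_with:
            repeats.append((page, first_page_with[key]))
        else:
            first_page_with[key] = page
    return repeats


# Shortened sample data mimicking the output in the question:
# page 2 returned the same links as page 1.
sample = {
    1: ["/book/show/968.The_Da_Vinci_Code", "/book/show/4667024-the-help"],
    2: ["/book/show/968.The_Da_Vinci_Code", "/book/show/4667024-the-help"],
}
for page, earlier in find_repeated_pages(sample):
    print(f"page {page} returned the same links as page {earlier}")
```

If every page after the first is reported as a repeat, the server is most likely ignoring the page parameter for the current (logged-out) session rather than the loop reusing an old URL.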

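To fix this, you have to modify your code either to log in before loading the pages, or to pass cookies that you import from a browser in which you are already logged in.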
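The cookie approach could be sketched roughly as follows, assuming you copy the `Cookie` request header from a logged-in browser session (for example from the developer tools). The helper names `cookies_from_header` and `fetch_shelf_page` are hypothetical, and the cookie names in the usage comment are placeholders, since the answer does not say which cookies the site actually requires. The `session` argument is expected to be a `requests.Session`.

```python
from http.cookies import SimpleCookie


def cookies_from_header(header):
    """Turn a raw 'name=value; name2=value2' Cookie header into a dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


def fetch_shelf_page(session, page_number, cookies):
    """Fetch one bestsellers shelf page, sending the given cookies.

    session is assumed to be a requests.Session (anything with a
    compatible get() method works).
    """
    url = f"https://www.goodreads.com/shelf/show/bestsellers?page={page_number}"
    response = session.get(url, cookies=cookies, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


# Usage (needs network access and a valid cookie string from a logged-in
# browser; the cookie names below are placeholders, not the site's real ones):
#   import requests
#   cookies = cookies_from_header("session_id=...; other_cookie=...")
#   with requests.Session() as session:
#       html = fetch_shelf_page(session, 2, cookies)
```

The returned HTML can then be passed to BeautifulSoup exactly as in the question. Whether the imported cookies are sufficient, and how long they stay valid, depends on the site and is not something this sketch can guarantee.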

【Comments】:

Good catch! I did not expect that only the first page is shown when not logged in. I will look into it.