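【Posted】: 2020-05-28 11:15:30
【Question】:
I am trying to extract URLs from multiple web pages (2 in this example), but for some reason my output is a duplicated list of the URLs extracted from the first page. What am I doing wrong?
My code: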
import requests
from bs4 import BeautifulSoup

# URLs of books in scope
urls = []
for pn in range(2):
    baseUrl = 'https://www.goodreads.com'
    path = '/shelf/show/bestsellers?page=' + str(pn + 1)
    page = requests.get(baseUrl + path).text
    print(baseUrl + path)
    soup = BeautifulSoup(page, "html.parser")
    for link in soup.findAll('a', attrs={'class': "leftAlignedImage"}):
        # Skip author links; keep only book links
        if link['href'].startswith('/author/show/'):
            pass
        else:
            u = baseUrl + link['href']
            urls.append(u)
for u in urls:
    print(u)
Output:
https://www.goodreads.com/shelf/show/bestsellers?page=1
https://www.goodreads.com/shelf/show/bestsellers?page=2
https://www.goodreads.com/book/show/5060378-the-girl-who-played-with-fire
https://www.goodreads.com/book/show/968.The_Da_Vinci_Code
https://www.goodreads.com/book/show/4667024-the-help
https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo
https://www.goodreads.com/book/show/3.Harry_Potter_and_the_Sorcerer_s_Stone
.
.
.
https://www.goodreads.com/book/show/4588.Extremely_Loud_Incredibly_Close
https://www.goodreads.com/book/show/36809135-where-the-crawdads-sing
.
.
.
https://www.goodreads.com/book/show/4588.Extremely_Loud_Incredibly_Close
https://www.goodreads.com/book/show/36809135-where-the-crawdads-sing
【Comments】:
-
attrs={'class':"elementList",'class':"leftAlignedImage"} looks suspicious. A Python dict cannot contain the same key twice. -
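To illustrate the point about duplicate keys, here is a minimal sketch using the attrs dict quoted in the comment above. It only shows standard Python dict behavior: when a literal repeats a key, the later value silently replaces the earlier one, so only one class would ever reach the search.

```python
# A dict literal with a repeated key keeps only the last value for that key.
attrs = {'class': "elementList", 'class': "leftAlignedImage"}

print(attrs)       # {'class': 'leftAlignedImage'}
print(len(attrs))  # 1
```

No error or warning is raised for this at runtime, which makes the mistake easy to miss.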
You are getting the same page both times.
The page=2 URL parameter does nothing; the server just returns the same page. -
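@AndrejKesely That isn't the problem, but you're right: I forgot to remove the first class and have just edited the question. The issue remains that even though requests fetches a different URL on the second iteration, it still seems to use the first URL.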
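One way to check the claim that both requests return the same page is to compare the book links extracted from each response. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, the names LinkCollector, book_links and same_listing are hypothetical helpers, and the three HTML strings are made-up stand-ins for real responses.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.css_class in classes and href:
            self.hrefs.append(href)


def book_links(html, css_class="leftAlignedImage"):
    """Return matching links in document order, skipping author links."""
    collector = LinkCollector(css_class)
    collector.feed(html)
    return [h for h in collector.hrefs if not h.startswith("/author/show/")]


def same_listing(html_a, html_b):
    """True when both documents yield exactly the same list of book links."""
    return book_links(html_a) == book_links(html_b)


# Hypothetical stand-ins for the text of two responses.
page_1 = '<a class="leftAlignedImage" href="/book/show/968.The_Da_Vinci_Code"></a>'
page_2 = '<a class="leftAlignedImage" href="/book/show/968.The_Da_Vinci_Code"></a>'
page_3 = '<a class="leftAlignedImage" href="/book/show/4667024-the-help"></a>'

print(same_listing(page_1, page_2))  # True
print(same_listing(page_1, page_3))  # False
```

With real data, the two arguments would be the .text of the responses for page=1 and page=2. If same_listing returns True for them, the server ignored the page parameter and the loop itself is not at fault.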
Tags: html python-3.x web-scraping beautifulsoup