[Posted]: 2022-01-05 02:49:30
[Question]:
I am trying to get the image URLs of all the books on this page https://www.nb.co.za/en/books/0-6-years with Beautiful Soup.
Here is my code:
from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get('https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
            tag.parent.name == 'div' and
            'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

cover = soup.find_all('div', class_="img-container")
print(cover)
Here is my output:
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
What I would like to get:
https://www.nb.co.za/en/helper/ReadImage/25929.jpg
My questions are:
- How do I get only the data-src?
- How do I get the image's file extension?
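A minimal sketch of one way to answer both questions, using only the HTML snippet shown in the output above (so it runs offline). The first part pulls just the `data-src` attribute from each lazy-loaded `<img>` inside a `div.img-container`. For the extension: the `data-src` URL itself carries none, so the commented-out second part assumes the `/en/helper/ReadImage/...` endpoint is reachable and returns a `Content-Type` header, which `mimetypes.guess_extension` can map to an extension such as `.jpg` — that behavior of the server is an assumption, not something the question confirms.

```python
from bs4 import BeautifulSoup
import mimetypes
import requests

# No trailing slash here, since data-src already starts with "/"
baseurl = "https://www.nb.co.za"

# The fragment printed in the question's output, used as a stand-in
# for the live page so the extraction logic can be shown offline.
html = '''
<div class="img-container">
  <a href="/en/view-book/?id=9780798182539">
    <img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
  </a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Select only <img> tags that actually have a data-src attribute,
# then read that attribute instead of src (which is the loading gif).
image_urls = [baseurl + img['data-src']
              for img in soup.select('div.img-container img[data-src]')]
print(image_urls)  # ['https://www.nb.co.za/en/helper/ReadImage/25929']

# The URL has no extension, so one option is to fetch the image and
# ask the server what it served back (network call, hence commented out;
# assumes the endpoint returns a Content-Type such as image/jpeg):
# resp = requests.get(image_urls[0])
# ext = mimetypes.guess_extension(resp.headers['Content-Type'])
```

On the live page the same list comprehension can be run against the full `soup` built from `r.content`; the CSS selector replaces the manual `my_filter` walk through parent tags.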
[Discussion]:
Tags: python web-scraping beautifulsoup