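Extracting images from multiple urls

【Question title】: Extracting images from multiple urls
【Posted】: 2019-12-08 09:49:25
【Question】:

I want to loop over a list of urls and extract an image from each page. In some cases, however, the image does not exist and the url differs from the pattern I normally see.

For example, whenever my code encounters a url like this, I get an error message.

Here is the code I wrote: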

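import urllib.request

import pandas as pd
import requests
from bs4 import BeautifulSoup

file = pd.read_csv(path)  # CSV with 'name' and 'link' columns
for index, row in file.iterrows():
    site = row['link']
    response = requests.get(site)
    soup = BeautifulSoup(response.text, 'html.parser')
    pics = soup.find('img')  # first <img> tag on the page, or None if there is none
    pic_url = pics['src']
    # Save the image under the id at the end of the link
    urllib.request.urlretrieve(pic_url, 'C:\\Users\\User\\test\\pictures\\' + str(site.split('/')[-1]) + '.jpg')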

Here is a sample of my data:

name            link
 one            https://boxrec.com/en/proboxer/844760
 two            https://boxrec.com/en/proboxer/838706
 three          https://boxrec.com/en/proboxer/879108
 four           https://boxrec.com/en/proboxer/745266

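Here is my error message:

ValueError: unknown url type: '/build/images/main/avatar.jpeg'

Update: I tried adding try/except to catch the error and continue, but then I started getting this error:

TypeError: 'NoneType' object is not subscriptable

I then updated my code to this: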

try:
    pic_url = pics['src']
except:
    image = 'https://chapters.theiia.org/central-mississippi/About/ChapterOfficers/_w/person-placeholder_jpg.jpg'
    urllib.request.urlretrieve(image, 'C:\\Users\\User\\test\\pictures\\' + str(site.split('/')[-1]) + '.jpg')
try:
    urllib.request.urlretrieve(pic_url, 'C:\\Users\\User\\test\\pictures\\' + str(site.split('/')[-1]) + '.jpg')
except:
    image = 'https://chapters.theiia.org/central-mississippi/About/ChapterOfficers/_w/person-placeholder_jpg.jpg'
    urllib.request.urlretrieve(image, 'C:\\Users\\User\\test\\pictures\\' + str(site.split('/')[-1]) + '.jpg')

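But this produces many duplicates, and in some cases it saves a blank picture for ids whose picture actually exists.

【Comments】:

Tags: python pandas beautifulsoup

【Solution 1】:

If you just want to avoid the error and move on to the other valid images, you can wrap the call in try: except: continue.

Something like: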

try:
    urllib.request.urlretrieve(...)
except ValueError:
    continue
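
The continue matters here: it has to skip every later step that depends on the one that failed. The sketch below is modelled on the updated snippet in the question and shows how a pic_url left over from the previous row can end up saved under the current row's id, which is one possible cause of the duplicates. It touches neither the network nor the disk: the dictionary stands in for the pictures folder, and the function names and sample values are made up for illustration.

```python
PLACEHOLDER = "placeholder.jpg"  # stands in for the placeholder image URL


def save_with_fallthrough(pages):
    """Mimic two separate try/except blocks; pages is a list of (page_id, src or None)."""
    saved = {}      # page_id -> image that ends up in the file
    pic_url = None
    for page_id, src in pages:
        pics = None if src is None else {"src": src}  # soup.find('img') may return None
        try:
            pic_url = pics["src"]
        except TypeError:
            saved[page_id] = PLACEHOLDER
        # This step still runs after a failure, with pic_url left over
        # from the previous row, and overwrites the placeholder.
        if pic_url is not None:
            saved[page_id] = pic_url
    return saved


def save_with_continue(pages):
    """Same loop, but a failed lookup skips the rest of the iteration."""
    saved = {}
    for page_id, src in pages:
        pics = None if src is None else {"src": src}
        try:
            pic_url = pics["src"]
        except TypeError:
            saved[page_id] = PLACEHOLDER
            continue
        saved[page_id] = pic_url
    return saved


pages = [("844760", "https://example.com/a.jpg"), ("838706", None)]
print(save_with_fallthrough(pages))  # second id gets the first id's picture
print(save_with_continue(pages))     # second id gets the placeholder
```

In the first variant the second id receives a copy of the first id's picture; in the second it keeps the placeholder.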

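【Comments】:

Tried this, but many images end up duplicated several times.

【Solution 2】:

Simply put it in a try/except block inside the for loop, so that whenever an exception occurs the loop moves on to the next item in the list.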

import urllib.request

import pandas as pd
import requests
from bs4 import BeautifulSoup

file = pd.read_csv(path)
for index, row in file.iterrows():
    site = row['link']
    try:
        response = requests.get(site)
        soup = BeautifulSoup(response.text, 'html.parser')
        pics = soup.find('img')
        pic_url = pics['src']
        urllib.request.urlretrieve(pic_url, 'C:\\Users\\User\\test\\pictures\\' + str(site.split('/')[-1]) + '.jpg')
    except Exception:
        # Any failure (no <img>, relative url, network error) skips this row
        continue
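
Catching every exception silently makes it hard to tell afterwards which pages were skipped and why. A small variation on the same loop, sketched here under the assumption that the request, parse, and save steps are wrapped in a single callable (download_one and fake_download are hypothetical names), records the skipped links instead of discarding them.

```python
def download_all(links, download_one):
    """Call download_one(link) for each link and return (saved, skipped)."""
    saved, skipped = [], []
    for site in links:
        try:
            download_one(site)
        except Exception as exc:
            # Remember which link failed and with what kind of error
            skipped.append((site, type(exc).__name__))
            continue
        saved.append(site)
    return saved, skipped


def fake_download(site):
    # Stand-in for the real download; fails for one id to exercise the except branch
    if site.endswith("838706"):
        raise ValueError("unknown url type: '/build/images/main/avatar.jpeg'")


saved, skipped = download_all(
    [
        "https://boxrec.com/en/proboxer/844760",
        "https://boxrec.com/en/proboxer/838706",
    ],
    fake_download,
)
print(saved)
print(skipped)
```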

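【Comments】:

【Solution 3】:

'/build/images/main/avatar.jpeg' is a relative path. It is the default avatar, which you can filter out; if you would rather keep it, you can convert it to a full path. The code below does this conversion automatically. It uses the simplified_scrapy library.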

import urllib.request

import pandas as pd
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

file = pd.read_csv(path)
for index, row in file.iterrows():
    site = row['link']
    response = requests.get(site)
    doc = SimplifiedDoc(response.text)
    pics = doc.listImg(url=site)[0]  # first image, with its url resolved against the page url
    pic_url = pics.url
    urllib.request.urlretrieve(pic_url, 'C:\\Users\\User\\test\\pictures\\' + str(site.split('/')[-1]) + '.jpg')
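
If you would rather not add a dependency, the same relative-to-absolute conversion can be done with urljoin from the standard library. This is a minimal sketch, not part of the answer above: the helper name absolute_image_url and its skip_default flag are made up for illustration, and it assumes the default avatar always has the path shown in the error message.

```python
from urllib.parse import urljoin

DEFAULT_AVATAR = "/build/images/main/avatar.jpeg"  # path taken from the error message


def absolute_image_url(page_url, src, skip_default=True):
    """Return an absolute image url, or None if there is nothing worth downloading."""
    if not src:
        return None  # no <img> tag or an empty src attribute
    if skip_default and src == DEFAULT_AVATAR:
        return None  # filter out the default avatar
    # urljoin resolves relative paths against the page url and
    # leaves urls that are already absolute unchanged
    return urljoin(page_url, src)


page = "https://boxrec.com/en/proboxer/844760"
print(absolute_image_url(page, DEFAULT_AVATAR))
print(absolute_image_url(page, DEFAULT_AVATAR, skip_default=False))
```

Inside the loop, a None result would mean "skip this row" (or save a placeholder), and any other result can be passed to urllib.request.urlretrieve.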

【Comments】: