【Question Title】: Python Beautiful Soup: retrieve image from HTML
【Posted】: 2020-07-19 05:49:34
【Question】:

I am using Python's BeautifulSoup package to scrape the following page: https://www.nike.com/w/womens-shoes-5e1x6zy7ok

When I use the following code:

data = br.open("https://www.nike.com/w/womens-shoes-5e1x6zy7ok").read()
soup = BS(data)
shoes = soup.find_all('div', {'class':'product-card__body'})

I only get:

<picture><source media="0" srcset=""/><source media="1" srcset=""/><source media="2" srcset=""/><img alt="Nike Air Max 2090 Women's Shoe" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"/></picture>

However, if I copy the element directly from the site in the browser, I get much more information:

<picture><source srcset="product-card__body" media="(min-width: 1024px)"><source srcset="https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/b2bfaf14-ed59-48a7-b8ae-e684b1d605ce/air-max-270-react-se-womens-shoe-6bhhrf.jpg" media="(max-width: 1023px) and (-webkit-min-device-pixel-ratio: 2), (min-resolution: 192dpi)"><source srcset="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/b2bfaf14-ed59-48a7-b8ae-e684b1d605ce/air-max-270-react-se-womens-shoe-6bhhrf.jpg" media="(max-width: 1023px)"><img src="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/b2bfaf14-ed59-48a7-b8ae-e684b1d605ce/air-max-270-react-se-womens-shoe-6bhhrf.jpg" alt="Nike Air Max 270 React SE Women's Shoe"></picture>

How can I get the latter information using BeautifulSoup?

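【Comments】:

  • What information do you want to get? The titles of all images on the page? Just the first title?
  • Really all the information related to each shoe, including the shoe's image @Mendelg
  • What br.open are you using? What are those picture elements? I tried requests + bs4 and got what I expected from the link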

Tags: python html python-3.x web-scraping beautifulsoup


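【Solution 1】:

The product data is loaded via JavaScript from their API, so the static HTML contains only placeholder images. The initial products are embedded in the page as JSON (window.INITIAL_REDUX_STATE); this script extracts and prints them: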

import re
import json
import requests


url = 'https://www.nike.com/gb/w/womens-shoes-5e1x6zy7ok'
html_data = requests.get(url).text

# The initial product data is embedded in the page as a JSON object
# assigned to window.INITIAL_REDUX_STATE.
data = json.loads(re.search(r'window\.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

for p in data['Wall']['products']:
    print(p['title'])
    print(p['subtitle'])
    print(p['price']['currentPrice'], p['price']['currency'])
    # Ask for a larger rendition of the image by swapping the width parameter.
    print(p['colorways'][0]['images']['portraitURL'].replace('w_400', 'w_1920'))
    print('-' * 120)

Prints:

Nike Air VaporMax 2020 FK
Women's Shoe
189.95 GBP
https://static.nike.com/a/images/c_limit,w_1920,f_auto/t_product_v1/d4452769-d6ac-4121-8f98-96f7cb9e0f68/image.jpg
------------------------------------------------------------------------------------------------------------------------
Nike Air Max 90
Women's Shoe
114.95 GBP
https://static.nike.com/a/images/c_limit,w_1920,f_auto/t_product_v1/e4182f87-d936-4052-a14a-b3c8bd161a38/image.jpg
------------------------------------------------------------------------------------------------------------------------
NikeCourt Air Zoom GP Turbo
Women's Hard Court Tennis Shoe
124.95 GBP
https://static.nike.com/a/images/c_limit,w_1920,f_auto/t_product_v1/4ec4011a-1c46-42f4-9b4b-ff99fd9592f2/image.jpg
------------------------------------------------------------------------------------------------------------------------
Nike Air Zoom SuperRep Premium
Women's HIIT Class Shoe
114.95 GBP
https://static.nike.com/a/images/c_limit,w_1920,f_auto/t_product_v1/d058f141-eebb-4578-bc87-53867c9ee173/image.jpg
------------------------------------------------------------------------------------------------------------------------

...and so on.
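
A possible refinement, offered as a sketch rather than as part of the original answer: the non-greedy pattern `\{.*?\});` stops at the first `});` it meets, so it would cut the JSON short if that character sequence ever appeared inside a string value. One way around this is to find the assignment with a regex and let `json.JSONDecoder.raw_decode` parse exactly one JSON value from that position. The helper name `extract_initial_state` and the sample page fragment are hypothetical and exist only for illustration.

```python
import json
import re

# Matches the assignment itself; the JSON value starts right after it.
MARKER = re.compile(r'window\.INITIAL_REDUX_STATE\s*=\s*')


def extract_initial_state(html_text):
    """Return the object assigned to window.INITIAL_REDUX_STATE, or None."""
    match = MARKER.search(html_text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and the rest of the page).
    state, _end = json.JSONDecoder().raw_decode(html_text, match.end())
    return state


# Hand-written fragment (not real Nike data); the title deliberately
# contains '});' to show the case the non-greedy regex would truncate.
sample = ('<script>window.INITIAL_REDUX_STATE='
          '{"Wall":{"products":[{"title":"a});b"}]}};</script>')

state = extract_initial_state(sample)
print(state['Wall']['products'][0]['title'])  # prints: a});b
```

With the live page, `extract_initial_state(requests.get(url).text)` would take the place of the `json.loads(re.search(...))` line, assuming the page still embeds its state under that variable name.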

Edit: to print the products from all pages:

import re
import json
import requests


url = 'https://www.nike.com/gb/w/womens-shoes-5e1x6zy7ok'
html_data = requests.get(url).text

# The first page of products is embedded in the HTML as JSON.
data = json.loads(re.search(r'window\.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

for p in data['Wall']['products']:
    print(p['title'])
    print(p['subtitle'])
    print(p['price']['currentPrice'], p['price']['currency'])
    print(p['colorways'][0]['images']['portraitURL'].replace('w_400', 'w_1920'))
    print('-' * 120)

# Later pages come from the API as JSON; each response links to the next one.
next_page = data['Wall']['pageData']['next']
while next_page:
    u = 'https://www.nike.com' + next_page

    data = requests.get(u).json()
    for o in data['objects']:
        p = o['productInfo'][0]
        print(p['productContent']['title'])
        print(p['productContent']['subtitle'])
        print(p['merchPrice']['currentPrice'], p['merchPrice']['currency'])
        print(p['imageUrls']['productImageUrl'])
        print('-' * 120)

    # An empty 'next' value (or a missing 'pages' key) ends the loop.
    next_page = data.get('pages', {'next':''})['next']
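
The paging loop above can also be written as a small generator, which keeps "follow the next link" separate from "print a product". This is a sketch under assumptions, not the answer's own code: `iter_pages`, the `fetch_json` parameter and the `FAKE_API` dictionary are hypothetical names, and the fake data stands in for `requests.get(u).json()` so the example runs offline.

```python
def iter_pages(first_next, fetch_json, base='https://www.nike.com'):
    """Yield each JSON page, following 'next' links until one is empty."""
    next_page = first_next
    seen = set()  # guards against an API that keeps returning the same link
    while next_page and next_page not in seen:
        seen.add(next_page)
        data = fetch_json(base + next_page)
        yield data
        # A missing 'pages' key or an empty 'next' value ends the iteration.
        next_page = (data.get('pages') or {}).get('next', '')


# Made-up responses keyed by URL; they only mimic the 'objects' and
# 'pages' keys used in the loop above.
FAKE_API = {
    'https://www.nike.com/page2': {'objects': ['shoe-3', 'shoe-4'],
                                   'pages': {'next': '/page3'}},
    'https://www.nike.com/page3': {'objects': ['shoe-5'],
                                   'pages': {'next': ''}},
}

for page in iter_pages('/page2', FAKE_API.__getitem__):
    print(page['objects'])
```

Against the real site, `fetch_json` could be `lambda u: requests.get(u).json()` and `first_next` would be `data['Wall']['pageData']['next']`, as in the script above.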

【Solution 2】:

Try this:

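import requests
from bs4 import BeautifulSoup
...
url = '<your URL>'
user_agent = '<user-agent from your browser>'
req = requests.get(url, headers={'User-Agent': user_agent})
if not req.ok:
    # The request failed, so there is nothing to parse.
    raise SystemExit(f'Request failed with status {req.status_code}')
soup = BeautifulSoup(req.text, 'html.parser')
...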
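
Whether a browser User-Agent is enough can be checked by looking at the image sources in the returned HTML: if they are still `data:` URIs like the one in the question, the real images are being filled in later by JavaScript. The sketch below uses only the standard library's `html.parser` so that it runs without extra packages; `ImgSrcCollector`, `split_image_sources` and the `example.jpg` URL are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> are routed here as well.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def split_image_sources(html_text):
    """Return (real_urls, placeholders); placeholders are data: URIs."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    real = [s for s in collector.sources if not s.startswith('data:')]
    placeholders = [s for s in collector.sources if s.startswith('data:')]
    return real, placeholders


# The first <img> uses the placeholder from the question; the second URL
# is an invented example of a normally served image.
sample = (
    '<img alt="a" src="data:image/gif;base64,'
    'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"/>'
    '<img alt="b" src="https://static.nike.com/a/images/example.jpg">'
)

real, placeholders = split_image_sources(sample)
print(len(real), len(placeholders))  # prints: 1 1
```

If `split_image_sources(req.text)` reports only placeholders, changing headers will probably not help, and reading the embedded JSON as in Solution 1 is the more promising route.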
