【Question Title】: BeautifulSoup can't scrape an element
【Posted】: 2020-09-28 15:47:34
【Question】:

Hello, I am trying to scrape the following website: https://www.footlocker.co.uk/en/all/new/

I want to scrape the price and the 'href' from the following elements:

<span class=" fl-price--sale ">
    <meta itemprop="priceCurrency" content="GBP">
    <meta itemprop="price" content="84.99"><span>£ 84,99</span>
</span>

And this one (the href):

<a href="https://www.footlocker.co.uk/en/p/adidas-performance-don-issue-2-men-shoes-92815?v=314102617504#!searchCategory=all" data-product-click-link="314102617504" data-hash-key="searchCategory" data-hash-url="https://www.footlocker.co.uk/en/p/adidas-performance-don-issue-2-men-shoes-92815?v=314102617504" data-testid="fl-product-details-link-314102617504">

I tried this code:

import requests
from bs4 import BeautifulSoup

# Placeholder proxy; requests expects one entry per scheme.
proxies = {'http': 'ip:port', 'https': 'ip:port'}

r = requests.get('https://www.footlocker.co.uk/en/all/new/', proxies=proxies)

soup = BeautifulSoup(r.content, 'html.parser')

# It doesn't find it...
for a in soup.find_all('a'):
    try:
        if a['href'] == 'https://www.footlocker.co.uk/en/p/adidas-performance-don-issue-2-men-shoes-92815?v=314102617504#!searchCategory=all':
            print(a['href'])
    except KeyError:
        pass

# It doesn't find it either...
for price in soup.find_all('span', class_=' fl-price--sale '):
    print(price.text)

I tried scraping through a proxy, but it still won't find the elements (I think the HTML I get back is not right).

Thanks for your advice :-) (for educational purposes only)

【Comments】:

  • Are you sure ' fl-price--sale ' is supposed to have spaces at the beginning and the end?
  • Yes, I also checked without the spaces; you can verify it at the link.
  • Also, requests.get() does not execute JavaScript. If the page uses JavaScript to dynamically create the elements you are looking for, then requests will not work for you.
  • How can I scrape JavaScript-generated elements?
  • You have to use something that works like a real browser, such as Selenium.
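The whitespace question raised above can be checked offline: BeautifulSoup treats `class` as a multi-valued attribute, so `class_` matches individual class tokens, not the raw attribute string with its surrounding spaces. A minimal sketch (the HTML fragment is a made-up stand-in mirroring the markup in the question):

```python
from bs4 import BeautifulSoup

# Stand-in fragment with the same padded class attribute as the question.
html = '<span class=" fl-price--sale "><span>£ 84,99</span></span>'
soup = BeautifulSoup(html, 'html.parser')

# Matching the padded attribute string finds nothing...
print(len(soup.find_all('span', class_=' fl-price--sale ')))  # 0
# ...but matching the bare class token works.
print(len(soup.find_all('span', class_='fl-price--sale')))    # 1
```

So even with correct class matching, the real page still returns nothing here because the listing is rendered by JavaScript, as the comment above notes.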

Tags: python web-scraping beautifulsoup


【Solution 1】:

To get the name, link, and price of the products, you can use the following example:

import requests
from bs4 import BeautifulSoup


url = 'https://www.footlocker.co.uk/INTERSHOP/web/FLE/Footlocker-Footlocker_GB-Site/en_GB/-/GBP/ViewStandardCatalog-ProductPagingAjax?SearchParameter=____&sale=new&MultiCategoryPathAssignment=all&PageNumber={}'

for page in range(3):  # <--- increase the number of pages here
    print('Page {}...'.format(page))
    data = requests.get(url.format(page)).json()
    soup = BeautifulSoup(data['content'], 'html.parser')

    for d in soup.select('[data-request]'):
        s = BeautifulSoup(requests.get(d['data-request']).json()['content'], 'html.parser')
        
        print(s.select_one('[itemprop="name"]').text)
        print(s.select_one('[itemprop="price"]')['content'], s.select_one('[itemprop="priceCurrency"]')['content'])
        print(s.a['href'])
        print('-' * 80)

Prints:

Page 0...
adidas Performance Don Issue 2 - Men Shoes
84.99 GBP
https://www.footlocker.co.uk/en/p/adidas-performance-don-issue-2-men-shoes-92815?v=314102617504
--------------------------------------------------------------------------------
Nike Air Force 1 Crater - Women Shoes
94.99 GBP
https://www.footlocker.co.uk/en/p/nike-air-force-1-crater-women-shoes-98071?v=315349054502
--------------------------------------------------------------------------------
Jordan Jumpmcn Cl Iii Camo - Baby Tracksuits
39.99 GBP
https://www.footlocker.co.uk/en/p/jordan-jumpmcn-cl-iii-camo-baby-tracksuits-91611?v=318280390044
--------------------------------------------------------------------------------
Jordan 13 Retro - Grade School Shoes
99.99 GBP
https://www.footlocker.co.uk/en/p/jordan-13-retro-grade-school-shoes-952?v=316701533404
--------------------------------------------------------------------------------

...and so on.
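The two-step pattern in the answer (fetch JSON, then parse the HTML fragment carried in its 'content' field) can be illustrated offline with a made-up payload shaped like the endpoint's response; the product below is hypothetical:

```python
import json
from bs4 import BeautifulSoup

# Stand-in for the JSON the paging endpoint returns; the real response
# carries a much larger HTML fragment in 'content'.
payload = json.dumps({
    'content': (
        '<div><a href="https://www.footlocker.co.uk/en/p/example-shoe-1?v=1">'
        '<span itemprop="name">Example Shoe</span>'
        '<meta itemprop="price" content="84.99">'
        '<meta itemprop="priceCurrency" content="GBP">'
        '</a></div>'
    )
})

data = json.loads(payload)  # requests' .json() does this step for you
soup = BeautifulSoup(data['content'], 'html.parser')

print(soup.select_one('[itemprop="name"]').text)         # Example Shoe
print(soup.select_one('[itemprop="price"]')['content'])  # 84.99
print(soup.a['href'])
```

This is why the solution never needs a browser: the data arrives as server-rendered HTML inside a JSON envelope, so requests plus BeautifulSoup is enough.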

【Comments】:

  • You are the best
  • But where did you find that URL?
  • @MatteoBianchi If you click Load More, you will see the URL in the Chrome/Firefox developer tools (Network tab)