【Question Title】: Unable to scrape ajax loaded elements on a webpage python
【Posted】: 2019-08-07 05:47:09
【Description】:

I need to scrape the web page linked here. On that page there is a Cross References section I want to scrape, but when I collect the page content with python requests via the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

The resulting content does not contain the Cross References section, probably because it never gets loaded. I can scrape the rest of the html content, just not that section. When I do the same thing with selenium it works fine, which means selenium is able to find this element after it loads. Can someone guide me on how to get this done with python requests and beautifulsoup instead of selenium?
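One way to see why requests cannot find the section: JS-heavy pages often ship their initial data as JSON inside a `<script>` tag rather than as rendered HTML, and that is exactly what this page does with its `arrow-state` element (see Solution 1 below). A minimal sketch on stand-in HTML, where the JSON payload is a made-up sample:

```python
import json
from bs4 import BeautifulSoup

# Stand-in HTML: the real page embeds a much larger JSON blob in a
# <script id="arrow-state"> tag; the payload here is a toy sample.
html = '<script id="arrow-state" type="application/json">{"route": {"name": "pdp"}}</script>'

soup = BeautifulSoup(html, "html.parser")
# .text on a <script> tag returns its raw content, which we can parse as JSON.
state = json.loads(soup.select_one("#arrow-state").text)
print(state["route"]["name"])  # -> pdp
```

So the data is in the response body all along; it just is not part of the rendered DOM that a CSS selector for the visible section would match.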

【Comments】:

    Tags: ajax selenium web-scraping beautifulsoup python-requests


    【Solution 1】:

    The data is loaded through Javascript, but you can extract it with requests, BeautifulSoup and the json module:

    import json
    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
    
    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
        }
    
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
    
    t = soup.select_one('#arrow-state').text
    t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
    data = json.loads( t )
    
    d = None
    for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
        if item['componentName'] == 'PdpWrapper':
            d = item
            break
    
    if d:
        cross_reverence_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
        print(json.dumps(cross_reverence_product_tiles, indent=4))
    

    Prints:

    [
        {
            "partId": "16571604",
            "partNumber": "CGB3B1X5R1A475M055AC",
            "productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
            "productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
            "productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
            "manufacturerName": "TDK",
            "productLineTitle": "Capacitor Ceramic Multilayer",
            "productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
            "datasheetUrl": "",
            "lowestPrice": 0.0645,
            "lowestPriceFormatted": "$0.0645",
            "highestPrice": 0.3133,
            "highestPriceFormatted": "$0.3133",
            "stockFormatted": "1,875",
            "stock": 1875,
            "attributes": [],
            "buyingOptionType": "AddToCart",
            "numberOfAttributesToShow": 1,
            "rrClickTrackingUrl": null,
            "pricingDataPopulated": true,
            "sourcePartId": "V72:2272_06586404",
            "sourceCode": "ACNA",
            "packagingType": "Cut Strip",
            "unitOfMeasure": "",
            "isDiscontinued": false,
            "productTileHint": null,
            "tileSize": 1,
            "tileType": "1x1",
            "suplementaryClasses": "u-height"
        },
    
    ...and so on.
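Once `cross_reverence_product_tiles` is in hand, pulling out individual fields is plain dict access. A sketch on sample tiles shaped like the output above (the second tile's `stock` value is made up for illustration):

```python
# Sample tiles with the same keys as the JSON printed above (fields abridged;
# the zero stock value is a hypothetical example, not real data).
product_tiles = [
    {"partNumber": "CGB3B1X5R1A475M055AC", "manufacturerName": "TDK", "stock": 1875},
    {"partNumber": "CL10A475MP8NNNC", "manufacturerName": "Samsung", "stock": 0},
]

# Plain list comprehensions reshape the tiles into whatever you need.
part_numbers = [tile["partNumber"] for tile in product_tiles]
in_stock = [tile["partNumber"] for tile in product_tiles if tile["stock"] > 0]
print(part_numbers)
print(in_stock)
```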
    

    【Discussion】:

    • Thanks man, I got it. But could you explain the replace part? Thanks for the reply.
    • @A.Hamza When you do print( soup.select_one('#arrow-state').text ), you will see that the text is encoded: before the json module can parse it, the entities (&q;, &g;, etc.) need to be replaced with their respective characters.
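The decoding step discussed above can be demonstrated in isolation. This is the same replace chain as in the answer, applied to a tiny sample string (the sample is made up; the real blob is far larger):

```python
import json

# A toy sample using the page's entity-style escapes: &q; for '"',
# &l; for '<', &g; for '>' and &a; for '&'.
raw = "{&q;partNumber&q;: &q;LMK107BBJ475MKLT&q;, &q;note&q;: &q;1 &l; 2 &a;&a; 3 &g; 2&q;}"
decoded = (raw.replace("&q;", '"')
              .replace("&g;", ">")
              .replace("&l;", "<")
              .replace("&a;", "&"))
# After the substitutions the string is valid JSON.
data = json.loads(decoded)
print(data["partNumber"])  # -> LMK107BBJ475MKLT
```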
    【Solution 2】:

    To scrape the Cross References section, Selenium alone is enough: induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following Locator Strategies:

    • Using CSS_SELECTOR:

        print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
      
    • Using XPATH:

        print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
      
    • Note: You have to add the following imports:

        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
      
    • Console output:

        ['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
      

    【Discussion】:

    • As I mentioned in the question, I have already done it with selenium. I want to do it with requests and beautifulsoup. Thanks anyway.