【Question Title】: Unable to scrape ajax loaded elements on a webpage python
【Posted】: 2019-08-07 05:47:09
【Description】:

I need to scrape the web page linked here. On that page there is a Cross References section I want to scrape, but when I collect the page content with python requests via the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

The resulting content does not contain the Cross References section, probably because it never gets loaded. I can scrape the rest of the html content, just not that section. When I do the same thing with selenium it works fine, which means selenium is able to find this element after it loads. Can someone guide me on how to get this done with python requests and beautifulsoup instead of selenium?
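One way to see why requests cannot find the section: JS-heavy pages often ship their initial data as JSON inside a `<script>` tag rather than as rendered HTML, and that is exactly what this page does with its `arrow-state` element (see Solution 1 below). A minimal sketch on stand-in HTML, where the JSON payload is a made-up sample:

```python
import json
from bs4 import BeautifulSoup

# Stand-in HTML: the real page embeds a much larger JSON blob in a
# <script id="arrow-state"> tag; the payload here is a toy sample.
html = '<script id="arrow-state" type="application/json">{"route": {"name": "pdp"}}</script>'

soup = BeautifulSoup(html, "html.parser")
# .text on a <script> tag returns its raw content, which we can parse as JSON.
state = json.loads(soup.select_one("#arrow-state").text)
print(state["route"]["name"])  # -> pdp
```

So the data is in the response body all along; it just is not part of the rendered DOM that a CSS selector for the visible section would match.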

【Comments】:

    Tags: ajax selenium web-scraping beautifulsoup python-requests


    【Solution 1】:

    The data is loaded through Javascript, but you can extract it with requests, BeautifulSoup and the json module:

    import json
    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
    
    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
        }
    
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
    
    t = soup.select_one('#arrow-state').text
    t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
    data = json.loads( t )
    
    d = None
    for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
        if item['componentName'] == 'PdpWrapper':
            d = item
            break
    
    if d:
        cross_reverence_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
        print(json.dumps(cross_reverence_product_tiles, indent=4))
    

    Prints:

    [
        {
            "partId": "16571604",
            "partNumber": "CGB3B1X5R1A475M055AC",
            "productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
            "productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
            "productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
            "manufacturerName": "TDK",
            "productLineTitle": "Capacitor Ceramic Multilayer",
            "productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
            "datasheetUrl": "",
            "lowestPrice": 0.0645,
            "lowestPriceFormatted": "$0.0645",
            "highestPrice": 0.3133,
            "highestPriceFormatted": "$0.3133",
            "stockFormatted": "1,875",
            "stock": 1875,
            "attributes": [],
            "buyingOptionType": "AddToCart",
            "numberOfAttributesToShow": 1,
            "rrClickTrackingUrl": null,
            "pricingDataPopulated": true,
            "sourcePartId": "V72:2272_06586404",
            "sourceCode": "ACNA",
            "packagingType": "Cut Strip",
            "unitOfMeasure": "",
            "isDiscontinued": false,
            "productTileHint": null,
            "tileSize": 1,
            "tileType": "1x1",
            "suplementaryClasses": "u-height"
        },
    
    ...and so on.
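Once `cross_reverence_product_tiles` is in hand, pulling out individual fields is plain dict access. A sketch on sample tiles shaped like the output above (the second tile's `stock` value is made up for illustration):

```python
# Sample tiles with the same keys as the JSON printed above (fields abridged;
# the zero stock value is a hypothetical example, not real data).
product_tiles = [
    {"partNumber": "CGB3B1X5R1A475M055AC", "manufacturerName": "TDK", "stock": 1875},
    {"partNumber": "CL10A475MP8NNNC", "manufacturerName": "Samsung", "stock": 0},
]

# Plain list comprehensions reshape the tiles into whatever you need.
part_numbers = [tile["partNumber"] for tile in product_tiles]
in_stock = [tile["partNumber"] for tile in product_tiles if tile["stock"] > 0]
print(part_numbers)
print(in_stock)
```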
    

    【Discussion】:

    • Thanks man, I got it. But could you explain the replace part? Thanks for the reply.
    • @A.Hamza When you do print( soup.select_one('#arrow-state').text ), you will see that the text is encoded: before the json module can parse it, the entities (&q;, &g;, etc.) need to be replaced with their respective characters.
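The decoding step discussed above can be demonstrated in isolation. This is the same replace chain as in the answer, applied to a tiny sample string (the sample is made up; the real blob is far larger):

```python
import json

# A toy sample using the page's entity-style escapes: &q; for '"',
# &l; for '<', &g; for '>' and &a; for '&'.
raw = "{&q;partNumber&q;: &q;LMK107BBJ475MKLT&q;, &q;note&q;: &q;1 &l; 2 &a;&a; 3 &g; 2&q;}"
decoded = (raw.replace("&q;", '"')
              .replace("&g;", ">")
              .replace("&l;", "<")
              .replace("&a;", "&"))
# After the substitutions the string is valid JSON.
data = json.loads(decoded)
print(data["partNumber"])  # -> LMK107BBJ475MKLT
```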
    【Solution 2】:

    To scrape the Cross References section, Selenium alone is enough: induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following Locator Strategies:

    • Using CSS_SELECTOR:

        print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
      
    • Using XPATH:

        print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
      
    • Note: You have to add the following imports:

        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
      
    • Console output:

        ['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
      

    【Discussion】:

    • As I mentioned in the question, I have already done it with selenium. I want to do it with requests and beautifulsoup. Thanks anyway.