【Question title】: Web scraping for more pages
【Posted】: 2020-07-06 08:41:14
【Question】:

I am currently scraping a website and need the data that the site loads automatically as you scroll down the page. I am using BeautifulSoup and requests.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.monki.com/en/newin/view-all-new.html")
soup = BeautifulSoup(page.content, 'html.parser')
article_codes = []
for k in soup.find_all('div', attrs={"class": "producttile-details"}):
    article_code = k.find('span', attrs={'class': "articleCode"})
    print(article_code)
    if article_code is not None:  # skip tiles that have no article code
        article_codes.append(article_code.text)

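With this code I only get the data from the initially loaded page, but I want all the data that appears as the page loads more products.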

【Question comments】:

    Tags: python python-3.x web-scraping beautifulsoup python-requests


    【Answer 1】:

    The page uses JavaScript to load the additional pages. You can simulate those requests with the requests module.

    For example:

    import requests
    from bs4 import BeautifulSoup

    # Endpoint the page's JavaScript requests to load further products
    url = 'https://www.monki.com/en_eur/newin/view-all-new/_jcr_content/productlisting.products.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    }

    with requests.Session() as s:
        # Load the listing page first, presumably so the session gets the site's cookies
        s.get('https://www.monki.com/en_eur/newin/view-all-new.html', headers=headers)

        for page in range(0, 10):  # <-- adjust to required number of pages
            # The offset advances by 28 products per page
            response = s.get(url, params={'offset': page * 28}, headers=headers)
            soup = BeautifulSoup(response.content, 'html.parser')

            for product in soup.select('.o-product'):
                name = product.select_one('.product-name').get_text(strip=True)
                price = product.select_one('.price-tag').get_text(strip=True)
                link = product.select_one('.a-link')['href']

                print('{:<50} {:<10} {}'.format(name, price, link))
    

    This prints the products:

    NEW! Maxi smock dress                              €30        https://www.monki.com/en_eur/clothing/dresses/midi-dresses/product.midi-button-up-shirt-dress-black.0871799004.html
    NEW! Retro skater dress                            €20        https://www.monki.com/en_eur/clothing/dresses/mini-dresses/product.retro-skater-dress-white.0688447029.html
    NEW! Mozik block jeans                             €40        https://www.monki.com/en_eur/clothing/jeans/product.mozik-block-jeans-blue.0874088001.html
    NEW! Pack of two scrunchies                        €6         https://www.monki.com/en_eur/accessories/hair-accessories/product.pack-of-two-scrunchies-beige.0530296078.html
    NEW! Mini hand bag                                 €18        https://www.monki.com/en_eur/accessories/bags,-wallets-belts/bags/product.mini-hand-bag-black.0826291006.html
    NEW! Fitted crop top                               €10        https://www.monki.com/en_eur/clothing/tops/t-shirts/product.fitted-crop-top-purple.0906440002.html
    NEW! Tiered smock dress                            €30        https://www.monki.com/en_eur/clothing/dresses/midi-dresses/product.tiered-smock-dress-blue.0895277004.html
    NEW! Mini hand bag                                 €18        https://www.monki.com/en_eur/accessories/bags,-wallets-belts/bags/product.mini-hand-bag-beige.0826291008.html
    NEW! Fitted t-shirt                                €10        https://www.monki.com/en_eur/clothing/tops/t-shirts/product.fitted-t-shirt-purple.0905746002.html
    NEW! Shoulder pads t-shirt dress                   €25        https://www.monki.com/en_eur/clothing/dresses/mini-dresses/product.shoulder-pads-t-shirt-dress-beige.0929301002.html
    NEW! Yoko mid blue jeans                           €40        https://www.monki.com/en_eur/clothing/jeans/product.yoko-mid-blue-jeans-blue.0656425001.html
    NEW! Yoko classic blue jeans                       €40        https://www.monki.com/en_eur/clothing/jeans/product.yoko-classic-blue-jeans-blue.0807218001.html
    NEW! Pleated midi skirt                            €25        https://www.monki.com/en_eur/clothing/skirts/midi-skirts/product.pleated-midi-skirt-black.0562278003.html
    
    ... and so on.
    

    【Comments】:

    • But this only works for 10 pages. I want it to run through to the last page.
    • @WweCena Then increase the number 10 to a larger number.
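
Instead of guessing a large enough page count, the loop can stop once a page comes back with no products. The following is only a sketch under assumptions that are not confirmed by the answer: that the endpoint returns no `.o-product` elements once the offset passes the last product, and that 28 is the page size. The names `iter_all_products`, `make_monki_fetcher`, and the `max_pages` safety cap are hypothetical helpers introduced for illustration.

```python
def iter_all_products(fetch_products, page_size=28, max_pages=500):
    """Yield products page by page until a page comes back empty.

    fetch_products(offset) must return a list of products for that offset.
    max_pages is an arbitrary safety cap so the loop always terminates.
    """
    for page in range(max_pages):
        products = fetch_products(page * page_size)
        if not products:  # assumption: an empty page means we are past the last page
            break
        yield from products


def make_monki_fetcher():
    """Build a fetch_products(offset) callable around the endpoint from the answer."""
    # Imported here so iter_all_products stays usable without these packages.
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.monki.com/en_eur/newin/view-all-new/_jcr_content/productlisting.products.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    }
    session = requests.Session()
    # Same warm-up request as in the answer
    session.get('https://www.monki.com/en_eur/newin/view-all-new.html', headers=headers)

    def fetch_products(offset):
        response = session.get(url, params={'offset': offset}, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.select('.o-product')

    return fetch_products


# Usage (performs real network requests, so it is left commented out):
# for product in iter_all_products(make_monki_fetcher()):
#     print(product.select_one('.product-name').get_text(strip=True))
```

If the site instead repeats the last page for out-of-range offsets, the empty-page check never fires; in that case the loop would need to stop when it sees a product link it has already collected.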