【Question Title】: How to scrape a website if it has a "Load More" button to load more content on the page?
【Posted】: 2021-10-17 21:24:32
【Question Description】:
from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path=r'C:\Users\gkhat\Downloads\chromedriver.exe')
driver.get('https://www.allrecipes.com/recipes/233/world-cuisine/asian/indian/')
card_titles = driver.find_elements_by_class_name('card__detailsContainer')
button = driver.find_element_by_id('category-page-list-related-load-more-button')
for card_title in card_titles:
    rname = card_title.find_element_by_class_name('card__title').text
    print(rname)

    time.sleep(3)
    driver.execute_script("arguments[0].scrollIntoView(true);", button)
    driver.execute_script("arguments[0].click();", button)
    time.sleep(3)

driver.quit()

The website loads food cards after clicking the "Load More" button. The code above scrapes the recipe titles, but I want it to keep scraping titles after the "Load More" button is clicked. I tried going to the Network tab and clicking XHR, but none of the requests shows JSON. What should I do?
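One pattern the answers below rely on is keeping a count of cards already scraped and, after each "Load More" click, processing only the slice beyond that count. That bookkeeping can be isolated in a plain function and tested without a browser. A minimal sketch (the function name `take_new` and the simulated page lists are illustrative, not from the original post):

```python
def take_new(all_items, seen):
    """Return the items that appeared since the last check,
    plus the updated count of items already seen."""
    fresh = all_items[seen:]
    return fresh, len(all_items)

# Simulate three "Load More" clicks, each growing the card list.
seen = 0
pages = [["a", "b"], ["a", "b", "c", "d"], ["a", "b", "c", "d", "e"]]
collected = []
for cards in pages:
    fresh, seen = take_new(cards, seen)
    collected.extend(fresh)

print(collected)  # -> ['a', 'b', 'c', 'd', 'e']
```

In the Selenium version, `all_items` would be the result of `driver.find_elements_by_class_name(...)` after each click.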

【Comments】:

  • If you use Selenium, which is the second slowest way to scrape (after doing it manually), then you can find the button's ID. Another solution is to inspect the JSON response in the browser's Network tab when the button is clicked. Look at the request parameters: there may be a page variable you can change. If there is, you can make a simple HTTP request to fetch the JSON and increment the page number in a for loop.
  • @mama - The page variable returns a blank response. I did find the button's id, but I don't know what to do next, or how to loop so that I can keep scraping after the button is clicked.

Tags: python json selenium web-scraping python-requests


【Solution 1】:

I tried the code below for this. It works, but I'm not sure whether it is the best approach. FYI, I handled the email pop-ups manually; you will need to find a way to handle them.

from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.Chrome(executable_path="path")
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.allrecipes.com/recipes/233/world-cuisine/asian/indian/")

# Print the titles visible before any "Load More" clicks.
recipes = driver.find_elements_by_class_name("card__detailsContainer")
for rec in recipes:
    name = rec.find_element_by_tag_name("h3").get_attribute("innerText")
    print(name)

loadmore = driver.find_element_by_id("category-page-list-related-load-more-button")
j = 0  # number of lazily loaded cards already printed
try:
    while loadmore.is_displayed():
        loadmore.click()
        time.sleep(5)
        lrec = driver.find_elements_by_class_name("recipeCard__detailsContainer")
        # Only iterate over the cards added by this click.
        newlist = lrec[j:]
        for rec in newlist:
            name = rec.find_element_by_tag_name("h3").get_attribute("innerText")
            print(name)
        j = len(lrec)  # not len(lrec) + 1, which would skip a card
        time.sleep(5)
except StaleElementReferenceException:
    pass
driver.quit()
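The `except StaleElementReferenceException: pass` above simply ends the loop when the button reference goes stale. An alternative is to retry the failing operation a few times before giving up. A small framework-agnostic retry helper might look like the sketch below (the name `retry` and the `flaky` demo function are illustrative; the exception class is a parameter, so it is not tied to Selenium):

```python
import time


def retry(fn, attempts=3, exc=Exception, delay=0):
    """Call fn(), retrying up to `attempts` times on `exc`.
    Re-raises the last exception if every attempt fails."""
    for i in range(attempts):
        try:
            return fn()
        except exc:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Example: a flaky operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("stale")
    return "clicked"

print(retry(flaky, attempts=3, exc=ValueError))  # -> clicked
```

In the Selenium loop this could wrap `lambda: driver.find_element_by_id("category-page-list-related-load-more-button").click()` with `exc=StaleElementReferenceException`, re-finding the element on each attempt instead of reusing a stale reference.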

【Discussion】:

【Solution 2】:

Actually, there is a JSON endpoint that returns the data. But the JSON returns it as HTML, so you just need to parse it.

Note: you can change the chunk size, so that each "page" returns more than 24 items.

import requests
from bs4 import BeautifulSoup

size = 24  # items per "page"; can be raised to fetch more per request
page = 0

hasNext = True
while hasNext:
    page += 1
    print('\tPage: %s' % page)
    url = 'https://www.allrecipes.com/element-api/content-proxy/aggregate-load-more?sourceFilter%5B%5D=alrcom&id=cms%2Fonecms_posts_alrcom_2007692&excludeIds%5B%5D=cms%2Fallrecipes_recipe_alrcom_142967&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_231026&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_247233&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_246179&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_256599&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_247204&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_34591&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_245131&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_220560&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_212721&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_236563&excludeIds%5B%5D=cms%2Fallrecipes_recipe_alrcom_14565&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_8189766&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_8188886&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_8189135&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_2052087&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_7986932&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_2040338&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_280310&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_142967&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_14565&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_228957&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_46822&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_72349&page={page}&orderBy=Popularity30Days&docTypeFilter%5B%5D=content-type-recipe&docTypeFilter%5B%5D=content-type-gallery&size={size}&pagesize={size}&x-ssst=iTv629LHnNxfbQ1iVslBTZJTH69zVWEa&variant=food'.format(size=size, page=page)
    jsonData = requests.get(url).json()

    hasNext = jsonData['hasNext']

    # The payload's 'html' field is an HTML fragment; parse the titles out of it.
    soup = BeautifulSoup(jsonData['html'], 'html.parser')
    cardTitles = soup.find_all('h3', {'class': 'recipeCard__title'})
    for title in cardTitles:
        print(title.text.strip())
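The `while hasNext` loop above is a generic cursor pattern: keep requesting pages until the server reports there are no more. It can be factored into a small generator that takes any fetch function, which also makes the termination logic testable without touching the network. A sketch, where `paginate` and `fake_fetch` are illustrative names and `fetch(page)` stands in for the `requests.get(url).json()` call:

```python
def paginate(fetch, start=1):
    """Yield page payloads until one reports hasNext == False."""
    page = start
    while True:
        data = fetch(page)
        yield data
        if not data.get("hasNext"):
            break
        page += 1

# Fake server: three pages, the last one reports hasNext == False.
def fake_fetch(page):
    return {"hasNext": page < 3, "html": "<h3>page %d</h3>" % page}

pages = list(paginate(fake_fetch))
print(len(pages))  # -> 3
```

With the real endpoint you would pass `lambda p: requests.get(url_template.format(size=size, page=p)).json()` as `fetch` and parse each payload's `'html'` field as the answer does.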
            
    

【Discussion】:
