How to optimize scraping a dynamically loaded site?

【Question title】: How to optimize scraping a dynamically loaded site?
【Posted】: 2021-04-17 03:02:53
【Question】:

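I am trying to collect all the shoes on https://www.flightclub.com/ with Python. Because the site is loaded dynamically, I am using the Selenium web driver. The problem is that loading the pages and running the script takes a long time. Is there a way to optimize this code so that I don't have to call time.sleep(5) to wait for each page to load, and the code runs much faster?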

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

# URL of the listing we want to scrape
url = "https://www.flightclub.com/adidas/adidas-yeezy"
# Initiate the webdriver. The parameter is the path to the ChromeDriver executable.
driver = webdriver.Chrome(executable_path=r'.\ChromeDriver\chromedriver_win32\chromedriver.exe')
result = []
for i in range(1, 15):
    # Load page i of the listing
    driver.get(url + "?page=" + str(i))

    # this is just to ensure that the page is loaded
    time.sleep(5)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")

    temp = soup.find_all('a', class_='sc-12adlsx-0 iSXeRZ')
    result.extend(temp)
    print("Result len: "+ str(len(result)))

shoes = []
for res in result:
    try:
        print("------------------------------------------------------------------")
        print("Title: "+res.find('img', class_='sc-htpNat ipJcZu')['alt'])
        print("Price: "+str(res.find('div', class_='yszfz8-5 kbsRqK').text.split()[0]) +  " USD")
        print("Picture: "+res.find('img', class_='sc-htpNat ipJcZu')['src'])
        print("Link: "+"https://www.flightclub.com" + res.get('href'))
    except (AttributeError, TypeError, KeyError, IndexError):
        print("Shoe not found")
print(f"\nFound total shoes: {len(result)}")
driver.quit()

【Comments】:

Selenium already has a dynamic HTML parser. I don't see the point of using bs4. Just use Selenium to do the selecting.
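
A rough sketch of what that comment suggests, reading the values straight from the live DOM instead of handing the page source to bs4. The class names are copied from the question and are an assumption (the site may have changed them); `to_record` and `extract_shoes` are hypothetical helper names, and `driver` is assumed to be an already created Selenium WebDriver with a listing page loaded.

```python
def to_record(title, price_text, picture, link):
    """Turn the raw values read from one product card into a dict."""
    parts = price_text.split()
    return {
        "title": title,
        # First token of the price text, as in the question's code.
        "price": (parts[0] + " USD") if parts else None,
        "picture": picture,
        "link": link,
    }


def extract_shoes(driver):
    """Collect one record per product card using Selenium only."""
    # Imported here so to_record() stays usable without Selenium installed.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    shoes = []
    # Assumption: selectors reuse the class names shown in the question.
    for card in driver.find_elements(By.CSS_SELECTOR, "a.sc-12adlsx-0"):
        try:
            img = card.find_element(By.CSS_SELECTOR, "img.sc-htpNat")
            price = card.find_element(By.CSS_SELECTOR, "div.yszfz8-5")
        except NoSuchElementException:
            continue  # card without an image or a price: skip it
        shoes.append(to_record(
            img.get_attribute("alt"),
            price.text,
            img.get_attribute("src"),
            # get_attribute("href") returns the resolved, absolute URL,
            # so the site prefix does not need to be prepended.
            card.get_attribute("href"),
        ))
    return shoes
```

Note that this only removes the bs4 step; the page still has to finish rendering before `extract_shoes(driver)` is called.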

Tags: python python-3.x selenium-webdriver web-scraping python-requests

【Solution 1】:

You can use explicit waits.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Waits up to 10 seconds for the element to appear. Returns as soon as it is
# found and raises TimeoutException if it never shows up.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)
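
Applied to the loop in the question, the explicit wait would take the place of the fixed time.sleep(5), so each page is read as soon as its product links show up. This is only a sketch and has not been run against the live site: the CSS selector reuses a class name from the question and may no longer match, `page_urls` and `collect_pages` are hypothetical helper names, and `driver` is assumed to be an existing Selenium WebDriver.

```python
# Assumption: class name copied from the question; the site may have changed it.
PRODUCT_LINK = "a.sc-12adlsx-0"


def page_urls(base_url, first, last):
    """Build the paginated listing URLs for pages first..last (inclusive)."""
    return [f"{base_url}?page={i}" for i in range(first, last + 1)]


def collect_pages(driver, urls, timeout=10):
    """Load each URL and return the HTML of every page that rendered products."""
    # Imported here so page_urls() stays usable without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    pages = []
    for url in urls:
        driver.get(url)
        try:
            # Returns as soon as one product link is in the DOM,
            # instead of always sleeping for a fixed 5 seconds.
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, PRODUCT_LINK))
            )
        except TimeoutException:
            continue  # nothing appeared within the timeout: skip this page
        pages.append(driver.page_source)
    return pages
```

For the question's 14 pages this would be called as `collect_pages(driver, page_urls(url, 1, 14))`, and each returned HTML string can then be parsed with BeautifulSoup exactly as before.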

【Comments】:

【Solution 2】:

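To answer your question: removing time.sleep will make your code run faster, but you should first check whether the site responds quickly enough, because the achievable speed depends on how fast the site responds and loads. If the page source is read before the products have rendered, the results will simply be missing. Also keep in mind that a poor internet connection can slow scraping down.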
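
To make the point about response time concrete: an explicit wait is essentially a polling loop, so the time it takes is bounded below by how fast the site renders and above by the timeout you choose. The following is a simplified plain-Python stand-in for that idea, not Selenium's actual implementation; `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # ready: stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

On a fast connection the condition is met after one or two polls and the loop returns almost at once; on a slow one it keeps polling up to the timeout. A fixed time.sleep(5) costs 5 seconds in both cases, and still is not a guarantee that the page is ready.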

【Comments】: