【Question title】: How to optimize scraping a dynamically loaded site?
【Posted】: 2021-04-17 03:02:53
【Question description】:

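I am trying to use Python to collect all the shoes listed on https://www.flightclub.com/. Because the site loads its content dynamically, I am using the Selenium web driver. The problem is that loading each page takes a long time, so the whole script is slow. Is there a way to optimize this code so that I don't have to call time.sleep(5) to wait for each page to load, and the code runs much faster?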

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

# URL of the page we want to scrape
url = "https://www.flightclub.com/adidas/adidas-yeezy"
# Start the webdriver. The parameter is the path to the ChromeDriver executable.
driver = webdriver.Chrome(executable_path=r'.\ChromeDriver\chromedriver_win32\chromedriver.exe')
result = []
for i in range(1, 15):
    # Load page i of the listing.
    driver.get(url + "?page=" + str(i))

    # This is just to ensure that the page is loaded.
    time.sleep(5)
    html = driver.page_source
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, "html.parser")

    temp = soup.find_all('a', class_='sc-12adlsx-0 iSXeRZ')
    result.extend(temp)
    print("Result len: "+ str(len(result)))

shoes = []
for res in result:
    try:
        print("------------------------------------------------------------------")
        print("Title: "+res.find('img', class_='sc-htpNat ipJcZu')['alt'])
        print("Price: "+str(res.find('div', class_='yszfz8-5 kbsRqK').text.split()[0]) +  " USD")
        print("Picture: "+res.find('img', class_='sc-htpNat ipJcZu')['src'])
        print("Link: "+"https://www.flightclub.com" + res.get('href'))
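    # A missing tag makes find() return None, which raises one of these.
    except (AttributeError, TypeError, IndexError, KeyError):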
        print("Shoe not found")
print(f"\nFound total shoes: {len(result)}")
driver.quit()

【Question comments】:

  • Selenium already has a dynamic HTML parser. I don't see the point of using bs4. Just use Selenium to select the elements.

Tags: python python-3.x selenium-webdriver web-scraping python-requests


【Solution 1】:

You can use explicit waits:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Waits a maximum of 10 seconds, or until that element is found on the page
element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )

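【Comments】:

【Solution 2】:

To answer your question: removing time.sleep will make your code run faster, but you should check whether the site responds quickly enough, because the speed then depends on how fast the site responds and loads. Also keep in mind that, when scraping, a poor internet connection can slow the code down as well.

【Comments】: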
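
To make the trade-off concrete, here is a simplified, standalone sketch of the polling idea behind a wait that replaces a fixed sleep. It is not Selenium's implementation: wait_until and its parameters are hypothetical names chosen for illustration. The point is that the call returns as soon as the condition holds, so a fast page costs little time, while a slow site or connection is still bounded by the timeout.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition becoming truthy.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        # Short pause between checks, rather than one long fixed sleep.
        time.sleep(poll_interval)
```

A fixed time.sleep(5) always costs 5 seconds per page. A loop like this costs roughly the time the page actually needs, rounded up to the next poll.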
