【Posted】: 2021-04-17 03:02:53
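【Question】:
I am trying to use Python to collect all the shoes listed on https://www.flightclub.com/. Because the site loads its content dynamically, I am using the Selenium WebDriver. The problem is that each page takes a long time to load, so the whole script is slow. Is there a way to optimize this code so that I don't have to call time.sleep(5) to wait for each page to load, and the code runs much faster?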
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# URL of the page we want to scrape
url = "https://www.flightclub.com/adidas/adidas-yeezy"

# Initiate the webdriver. The parameter is the path to the ChromeDriver binary.
driver = webdriver.Chrome(executable_path=r'.\ChromeDriver\chromedriver_win32\chromedriver.exe')

result = []
for i in range(1, 15):
    driver.get(url + "?page=" + str(i))
    # This is just to ensure that the page is loaded
    time.sleep(5)
    html = driver.page_source
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(html, "html.parser")
    temp = soup.find_all('a', class_='sc-12adlsx-0 iSXeRZ')
    result.extend(temp)
    print("Result len: " + str(len(result)))

for res in result:
    try:
        print("------------------------------------------------------------------")
        print("Title: " + res.find('img', class_='sc-htpNat ipJcZu')['alt'])
        print("Price: " + str(res.find('div', class_='yszfz8-5 kbsRqK').text.split()[0]) + " USD")
        print("Picture: " + res.find('img', class_='sc-htpNat ipJcZu')['src'])
        print("Link: " + "https://www.flightclub.com" + res.get('href'))
    except (AttributeError, TypeError, KeyError, IndexError):
        # A card is missing its image, price, or link
        print("Shoe not found")

print(f"\nFound total shoes: {len(result)}")
driver.quit()
【Discussion】:
- Selenium already has a dynamic HTML parser. I don't see the point of using bs4 here. Just do the selecting with Selenium.
Tags: python python-3.x selenium-webdriver web-scraping python-requests