Issue while web scraping with Selenium and Python

【Question title】: Issue while web scraping with Selenium and Python
【Posted】: 2020-02-22 21:17:38
【Question】:

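I am trying to scrape this website:

https://maroof.sa/BusinessType/BusinessesByTypeList?bid=14&sortProperty=BestRating&DESC=True

It has a button that loads more content when clicked, without changing the URL. I wrote code that first loads all the content, then extracts the URLs of all the entries I need, and then visits each link to scrape the data.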

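from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://maroof.sa/BusinessType/BusinessesByTypeList?bid=26&sortProperty=BestRating&DESC=True"
driver = webdriver.Chrome()
driver.get(url)
# button = driver.find_element_by_xpath('//*[@id="loadMore"]/button')
num = 1
while num <= 507:  # hard-coded number of "load more" clicks
    sleep(4)
    # Wait until the "load more" button is clickable, then click it.
    button = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="loadMore"]/button')))
    button.click()
    print(num)
    num += 1
# Collect the href of every entry once the whole list has been loaded.
links = [l.get_attribute('href') for l in WebDriverWait(driver, 40).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="list"]/a')))]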

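It seems to work, but sometimes it does not click the button that loads more content. Instead it accidentally clicks something else and raises an error, and I have to start over. Can you help me?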

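【Comments】:

If it throws an error, just use try/except. You could initialize a boolean and loop until it is true, retrying the call until it no longer raises an error and triggers the except branch.
Have you tried using requests?
No, I haven't tried requests, but I will now.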
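
The try/except suggestion in the first comment could be sketched as a small retry helper. This is only an illustration: the name `retry` and its parameters `attempts`, `delay`, and `exceptions` are hypothetical and do not come from the thread or from Selenium.

```python
import time


def retry(action, attempts=5, delay=0.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out.

    Hypothetical helper: returns the result of the first successful call and
    re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except exceptions as error:
            # Remember the failure, pause, and try again.
            last_error = error
            time.sleep(delay)
    raise last_error
```

With Selenium, `action` could be a function that waits for the button and clicks it, and `exceptions` could be narrowed to something like `ElementClickInterceptedException` from `selenium.common.exceptions`, so that unrelated errors are not silently retried.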

Tags: python selenium-webdriver web-scraping lazy-loading webdriverwait

【Solution 1】:

To scrape the website by clicking the button that loads more content, you need to induce WebDriverWait for element_to_be_clickable(). You can use the following Locator Strategy:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://maroof.sa/BusinessType/BusinessesByTypeList?bid=26&sortProperty=BestRating&DESC=True')
while True:
    try:
        # Scroll the "load more" button into view, then click it once it is clickable.
        driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//button[@class='btn btn-primary']"))))
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@class='btn btn-primary']"))).click()
    except TimeoutException:
        # No button showed up within the timeout, so assume everything is loaded.
        break
# Print the href of every entry in the fully loaded list.
print([l.get_attribute('href') for l in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="list"]/a')))])
driver.quit()
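
Because the question mentions having to start over after an error, one possible complement for the later step of visiting each link is to record finished URLs on disk so a rerun skips them. The following is a minimal sketch under stated assumptions: `load_done`, `save_done`, `scrape_all`, the `scrape_one` callback, and the JSON checkpoint file are all hypothetical names for illustration and are not part of the answer above.

```python
import json
from pathlib import Path


def load_done(path):
    """Return the set of URLs already recorded in the checkpoint file."""
    file = Path(path)
    if not file.exists():
        return set()
    return set(json.loads(file.read_text(encoding="utf-8")))


def save_done(path, done):
    """Write the set of finished URLs to the checkpoint file as a JSON list."""
    Path(path).write_text(json.dumps(sorted(done)), encoding="utf-8")


def scrape_all(links, scrape_one, checkpoint_path):
    """Call scrape_one(url) for every link that is not in the checkpoint yet.

    The checkpoint is rewritten after each successful URL, so a crash loses
    at most the page that was being processed.
    """
    done = load_done(checkpoint_path)
    results = {}
    for url in links:
        if url in done:
            continue  # already scraped in an earlier run
        results[url] = scrape_one(url)
        done.add(url)
        save_done(checkpoint_path, done)
    return results
```

Here `scrape_one` would be whatever function opens a single link and extracts its data. In this sketch the returned results only cover the current run, so they would need to be stored separately if earlier runs matter.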

【Discussion】:

I will try this solution.
@HosamGamal Great, let me know how your run goes.