Selenium: how do I click show button, scrape hrefs, then click show button again?

【Question title】: Selenium: how do I click show button, scrape hrefs, then click show button again?
【Posted】: 2020-07-28 17:12:04
【Question】:

Link to the page I want to scrape:

https://www.nytimes.com/reviews/dining

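Because this page has a "Show More" button, I need Selenium to click that button automatically and repeatedly, and then somehow use Beautiful Soup to collect the link to every restaurant review on the page. In the photo below, the link I want to harvest is inside https://...onigiri.html">.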

Code so far:

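from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()  # not defined in the snippet as posted
url = "https://www.nytimes.com/reviews/dining"
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
driver.get(url)

# Clicks the first <button> on the page once
for i in range(1):
  button = driver.find_element_by_tag_name("button")
  button.click()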

How can I use WebDriverWait and BeautifulSoup [BeautifulSoup(driver.page_source, 'html.parser')] to accomplish this?

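【Comments】:

Can you be more specific about what you are struggling with? By the way, you probably don't need BeautifulSoup.
What have you tried? Have you looked at other examples that use WebDriverWait? Which links do you want to scrape? You can most likely get them with Selenium alone, without BeautifulSoup at all.
@AMC Yes! I just added a photo to my question to further clarify which links I want to scrape.
@Code-Apprentice I tried looking at the WebDriverWait documentation. There are things like find_element_by_tag_name, x_path and css_selector, but I'm not sure how to apply what I found on the internet to my particular problem.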

Tags: python selenium-webdriver web-scraping beautifulsoup

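【Solution 1】:

Go to https://www.nytimes.com/reviews/dining, press F12, then press Ctrl+Shift+C and select the Show More element. Then copy the element's XPath, as shown in the picture:

For how to find an XPath, see: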

https://www.techbeamers.com/locate-elements-selenium-python/#locate-element-by-xpath

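import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def executeTest():
    global driver
    driver.get('https://www.nytimes.com/reviews/dining')
    time.sleep(7)  # crude fixed wait for the page to finish loading
    # Replace 'Your_Xpath' with the XPath copied from the browser's dev tools
    element = driver.find_element(By.XPATH, 'Your_Xpath')
    element.click()
    time.sleep(3)  # wait for the newly loaded items

def startWebDriver():
    global driver
    options = Options()
    options.add_argument("--disable-infobars")
    # `options=` replaces the deprecated `chrome_options=` keyword
    driver = webdriver.Chrome(options=options)

if __name__ == "__main__":
    startWebDriver()
    executeTest()
    driver.quit()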
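
The script above stops after the click and does not collect any links. The remaining step could look like the sketch below, assuming the rendered HTML is available as a string (for example from driver.page_source). The helper name extract_links, the "/dining/" filter and the sample HTML are illustrative assumptions, not taken from the real page markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url, must_contain="/dining/"):
    """Return unique absolute hrefs found in `html`, in document order.

    `must_contain` is a hypothetical filter; adjust it after inspecting
    the real page markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute
    for anchor in soup.find_all("a", href=True):
        href = urljoin(base_url, anchor["href"])  # resolve relative links
        if must_contain in href and href not in links:
            links.append(href)
    return links


# Made-up HTML standing in for driver.page_source.
sample_html = """
<ol>
  <li><a href="/2020/07/28/dining/first-review.html">First</a></li>
  <li><a href="https://www.nytimes.com/2020/07/21/dining/second-review.html">Second</a></li>
  <li><a href="/2020/07/28/dining/first-review.html">First again</a></li>
  <li><a href="/section/food">Not a review</a></li>
  <li><a>No href</a></li>
</ol>
"""

links = extract_links(sample_html, "https://www.nytimes.com/reviews/dining")
print(links)
```

With Selenium this would presumably be called as extract_links(driver.page_source, driver.current_url) once the clicking has finished.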

【Comments】:

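【Solution 2】:

This is a lazy-loading application. To click the Show More button you need an infinite loop: scroll down the page so the button can be found, click it, wait a while for the page to load, and then store the values in a list. Compare the length of the list before and after each round; if the two lengths match, nothing new was loaded, so exit the infinite loop.

Code: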

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium import webdriver
import time

driver=webdriver.Chrome()
driver.get("https://www.nytimes.com/reviews/dining")
# Accept the cookie banner by clicking its button
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.XPATH,"//button[text()='Accept']"))).click()
listhref=[]

while True:
    # Scroll to the bottom so the lazily loaded items and the button render
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    elements=WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"a.css-gg4vpm")))
    lenlistbefore=len(listhref)
    for ele in elements:
        href=ele.get_attribute("href")
        if href not in listhref:
            listhref.append(href)

    lenlistafter = len(listhref)

    # No new links since the last click: everything has been loaded
    if lenlistbefore==lenlistafter:
        break

    try:
        button=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,"//button[text()='Show More']")))
    except TimeoutException:
        break  # the button is gone, so there is nothing more to load
    # Click through JavaScript so an overlay cannot intercept the click
    driver.execute_script("arguments[0].click();", button)
    time.sleep(2)  # give the new items time to load
print(len(listhref))
print(listhref)

Note: I got a list count of 499.
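
The stopping rule used above (keep clicking until a round adds no new links) can be separated from the browser calls, which makes it possible to check the logic without opening a browser. A minimal sketch follows; collect_until_stable, FakePage and max_rounds are hypothetical names introduced here, and the fake page only stands in for Selenium.

```python
def collect_until_stable(get_links, load_more, max_rounds=1000):
    """Call get_links() repeatedly, calling load_more() between rounds.

    Stops when a round adds no new link, when load_more() returns False
    (no button left to click), or after max_rounds rounds.
    """
    seen = set()   # fast membership test
    ordered = []   # keeps first-seen order
    for _ in range(max_rounds):
        before = len(ordered)
        for link in get_links():
            if link and link not in seen:
                seen.add(link)
                ordered.append(link)
        if len(ordered) == before:
            break  # nothing new appeared, so the listing is exhausted
        if not load_more():
            break  # the "Show More" button is gone
    return ordered


class FakePage:
    """Stand-in for the browser: reveals two more items per click."""

    def __init__(self, total):
        self.total = total
        self.visible = 2
        self.clicks = 0

    def get_links(self):
        shown = min(self.visible, self.total)
        return [f"https://example.com/review-{i}" for i in range(shown)]

    def load_more(self):
        if self.visible >= self.total:
            return False
        self.visible += 2
        self.clicks += 1
        return True


page = FakePage(total=5)
links = collect_until_stable(page.get_links, page.load_more)
print(len(links), page.clicks)
```

With Selenium, get_links could return the href attribute of every element matched by driver.find_elements(By.CSS_SELECTOR, "a.css-gg4vpm"), and load_more could click the Show More button and return False when the wait raises TimeoutException. Using a set for the membership test also avoids rescanning the whole list for every link.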

【Comments】:

Thank you so much! This works with an adjustment to this line: "WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.XPATH,"//button[text()='Accept']"))).click() listhref =[]". I basically just changed the XPATH to "//button[text()='Show More']", which was a simple and convenient fix.