Scraping JavaScript-rendered content with Selenium in Python: answers

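【Question title】: WebScraping JavaScript-Rendered Content using Selenium in Python
【Posted】: 2020-03-27 09:37:43
【Question】:

I am very new to web scraping and have been trying to use Selenium to simulate a browser visiting the Texas public contracting web page and then download the embedded PDFs. The website is: http://www.txsmartbuy.com/sp.

So far, I have successfully used Selenium to select an option in one of the drop-down menus, "Agency Name", and to click the search button. My Python code is listed below.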

import os
os.chdir("/Users/fsouza/Desktop") #Setting up directory

from bs4 import BeautifulSoup #Downloading pertinent Python packages
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

chromedriver = "/Users/fsouza/Desktop/chromedriver" #Setting up Chrome driver
driver = webdriver.Chrome(executable_path=chromedriver)
driver.get("http://www.txsmartbuy.com/sp")
delay = 3 #Seconds

WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "//select[@id='agency-name-filter']/option[69]")))    
health = driver.find_element_by_xpath("//select[@id='agency-name-filter']/option[68]")
health.click()
search = driver.find_element_by_id("spBtnSearch")
search.click()

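Once I get to the results page, I am stuck.

First, I cannot access any of the resulting links through the HTML page source. If I manually inspect an individual link in Chrome, however, I do find the tags (<a href...) belonging to each result. I am guessing this is because the content is rendered by JavaScript.

Second, even if Selenium could see these individual tags, they have no class or id. I thought the best way to reach them would be to call the <a> tags by the order in which they appear (see the code below), but that did not work either. Instead, the call clicked some other "visible" tag (something in the footer, which I do not need).

Third, assuming these things did work, how can I work out the number of <a> tags shown on the page (in order to loop this code over every result)?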

driver.execute_script("document.getElementsByTagName('a')[27].click()")

Thanks for looking at this, and please forgive any silliness on my part, given that I am just getting started.

【Comments】:

Tags: python selenium-webdriver web-scraping webdriverwait window-handles

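【Solution 1】:

To scrape the JavaScript-rendered content using Selenium, you need to:

Induce WebDriverWait for the desired element_to_be_clickable().
Induce WebDriverWait for visibility_of_all_elements_located().
Open each link in a new tab using Ctrl and click() through ActionChains.
Induce WebDriverWait for the new tab to open, then switch to it and scrape.
Switch back to the main page.

Code block: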

  from selenium import webdriver
  from selenium.webdriver.common.by import By
  from selenium.webdriver.support.ui import WebDriverWait
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.common.action_chains import ActionChains
  from selenium.webdriver.common.keys import Keys
  import time

  options = webdriver.ChromeOptions() 
  options.add_argument("start-maximized")
  options.add_experimental_option("excludeSwitches", ["enable-automation"])
  options.add_experimental_option('useAutomationExtension', False)
  driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
  driver.get("http://www.txsmartbuy.com/sp")
  windows_before  = driver.current_window_handle
  WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']"))).click()
  WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']//option[contains(., 'Health & Human Services Commission - 529')]"))).click()
  WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@id='spBtnSearch']/i[@class='icon-search']"))).click()
  for link in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table/tbody//tr/td/strong/a"))):
      ActionChains(driver).key_down(Keys.CONTROL).click(link).key_up(Keys.CONTROL).perform()
      WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
      windows_after = driver.window_handles
      new_window = [x for x in windows_after if x != windows_before][0]
      driver.switch_to.window(new_window)  # switch_to_window() was deprecated and later removed
      time.sleep(3)
      print("Focus on the newly opened tab and here you can scrape the page")
      driver.close()
      driver.switch_to.window(windows_before)
  driver.quit()

Console output:

  Focus on the newly opened tab and here you can scrape the page
  Focus on the newly opened tab and here you can scrape the page
  Focus on the newly opened tab and here you can scrape the page
  .
  .
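
The step in the loop above that picks out the newly opened tab can be isolated into a small helper, which makes it easier to check without a browser. This is only a sketch: `pick_new_handle` is a hypothetical name, not part of Selenium, and it assumes window handles are plain strings, as `driver.window_handles` returns them.

```python
from typing import Iterable, Optional


def pick_new_handle(known: Iterable[str], current: Iterable[str]) -> Optional[str]:
    """Return the first handle in `current` that is not in `known`, or None."""
    known_set = set(known)
    for handle in current:
        if handle not in known_set:
            return handle
    return None


# Hypothetical usage inside the loop above (not executed here):
#   new_window = pick_new_handle([windows_before], driver.window_handles)
#   driver.switch_to.window(new_window)

if __name__ == "__main__":
    # Stand-in handle values; real handles are opaque strings from the driver.
    print(pick_new_handle(["main"], ["main", "tab-1"]))  # prints: tab-1
```

Returning None instead of indexing with `[0]` lets the caller decide what to do when no new tab appeared, rather than failing with an IndexError.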

【Comments】:

This was very helpful. I now understand what is going on. Thank you!

【Solution 2】:

To get the <a> tags in the results, use the following XPath:

//tbody//tr//td//strong//a

After clicking the search button, you can extract them in a loop. First, wait for all the matching elements with visibility_of_all_elements_located:

search.click()

elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//tbody//tr//td//strong//a")))

print(len(elements))

for element in elements:
    get_text = element.text 
    print(get_text)
    url_number = element.get_attribute('onclick').replace('window.open("/sp/', '').replace('");return false;', '')
    get_url = 'http://www.txsmartbuy.com/sp/' +url_number
    print(get_url)

One of the results:

IFB HHS0006862, Blanket, San Angelo Canteen Resale. 529-96596. http://www.txsmartbuy.com/sp/HHS0006862
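
The chained replace() calls above depend on the onclick attribute matching one exact string. A slightly more tolerant variant could pull the path out with a regular expression. This is a minimal sketch, assuming the attribute looks like `window.open("/sp/<id>");return false;` as the code above implies; `url_from_onclick` and `BASE_URL` are illustrative names, not part of Selenium or the site.

```python
import re
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "http://www.txsmartbuy.com"  # site root taken from the question

# Assumed attribute shape: window.open("/sp/<id>");return false;
_OPEN_RE = re.compile(r"""window\.open\(\s*["']([^"']+)["']""")


def url_from_onclick(onclick: Optional[str]) -> Optional[str]:
    """Build an absolute URL from a window.open(...) onclick value, or return None."""
    if not onclick:
        return None
    match = _OPEN_RE.search(onclick)
    if match is None:
        return None
    return urljoin(BASE_URL, match.group(1))


# Hypothetical usage with the elements from the loop above (not executed here):
#   get_url = url_from_onclick(element.get_attribute('onclick'))
```

get_attribute() returns None when the attribute is missing, so the helper returns None in that case rather than raising on `.replace`.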

【Comments】: