BeautifulSoup, Selenium and Python: parsing by a tag

【Question title】: BeautifulSoup, Selenium and Python, parsing by a tag
【Posted】: 2019-05-19 18:21:15
【Question】:

I am trying to parse data from this website:

https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010

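Specifically, I am trying to get the data under Criterion(ITC). The text I want is CC+ECT.

The information I want appears to be in this HTML:

<a class="js-glossary" data-leg="CC+ECT">

I am new to web scraping. I tried the techniques taught in tutorials, but they did not work. I had heard about Selenium and tried that too, but this code does not work either.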

from selenium import webdriver
from bs4 import BeautifulSoup
import requests

driver = webdriver.Firefox(executable_path = r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
data = soup.find_all("a", attrs= {"class":"js-glossary"})

The code produces an empty list. I also read that I can extract data by treating a soup tag as a dictionary. In this case:

data["data-leg"]

Am I on the right track, or am I way off?

【Question comments】:

Tags: python selenium selenium-webdriver beautifulsoup webdriverwait

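【Solution 1】:

The text you are trying to get is generated dynamically by JavaScript, so it is not yet in the page source at the moment you read it. To get it, you need to wait for it to appear: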

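from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox(executable_path = r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
# Poll for up to 5 seconds until the first div after the "criterion(itc)" label has non-empty text.
# find_element(By.XPATH, ...) works on Selenium 3 and 4; find_element_by_xpath was removed in Selenium 4.3.
text = WebDriverWait(driver, 5).until(lambda driver: driver.find_element(By.XPATH, '//div[.="criterion(itc)"]/following-sibling::div').text)
print(text)
#  'CC + ECT'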
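
To make the waiting step less opaque, here is a minimal sketch of the polling idea behind the WebDriverWait call above: call a function repeatedly until it returns something truthy or a timeout expires. It is not Selenium's implementation. The names wait_until and WaitTimeout, and the iterator that simulates a page filling in its text late, are hypothetical and exist only for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never returned a truthy value in time."""


def wait_until(condition, timeout=5.0, poll=0.5, ignored=(LookupError,)):
    """Call condition() repeatedly until it returns a truthy value.

    Exceptions listed in `ignored` are treated as "not ready yet",
    much as WebDriverWait ignores NoSuchElementException by default.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = condition()
            if value:
                return value
        except ignored:
            pass
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


# Hypothetical stand-in for a page whose text is filled in by JavaScript
# shortly after load: the first two reads are empty, the third succeeds.
reads = iter(["", "", "CC + ECT"])
text = wait_until(lambda: next(reads), timeout=2.0, poll=0.01)
print(text)
```

This is also why the lambda in the answer returns .text rather than the element itself: an empty string is falsy, so the wait keeps polling until the text has actually been filled in.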

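【Comments】:

Thank you! Your script works well. I will have to read more about the syntax you used, because it is quite advanced for me.
You can read more about waits in the Selenium documentation.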

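【Solution 2】:

It looks like you were close. If you are using Selenium, you may not even need Beautiful Soup. With Selenium, you need to induce WebDriverWait for the desired element to become visible, and you can use the following solution:

Code block: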

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path = r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='lbl' and text()='criterion(itc)']//following::div[1]/a"))).get_attribute("innerHTML"))

Console output:

                                CC + ECT
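
On the dictionary-style access raised in the question: in BeautifulSoup, find_all returns a list of tags, so the attribute has to be read from each tag (tag["data-leg"]) rather than from the list itself. The sketch below shows the same idea using only the standard library's html.parser, so it runs without extra packages. The rendered_html string is a hypothetical stand-in for the page source after JavaScript has run, modeled on the anchor quoted in the question, and GlossaryLinkParser is an illustrative name.

```python
from html.parser import HTMLParser


class GlossaryLinkParser(HTMLParser):
    """Collect the data-leg attribute of every <a class="js-glossary"> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        classes = (attr_map.get("class") or "").split()
        if "js-glossary" in classes and attr_map.get("data-leg") is not None:
            self.values.append(attr_map["data-leg"])


# Hypothetical snippet standing in for driver.page_source after rendering.
rendered_html = """
<div class="lbl">criterion(itc)</div>
<div><a class="js-glossary" data-leg="CC+ECT">CC + ECT</a></div>
<div><a class="other" data-leg="ignored">x</a></div>
"""

parser = GlossaryLinkParser()
parser.feed(rendered_html)
print(parser.values)
```

Either way, the parsing only finds the anchor if the HTML was captured after the page finished rendering, which is what the explicit wait is for.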

【Comments】: