How to scrape a website with an ul-li dropdown in Python? (Answers)

【Question title】: How to scrape a website with an ul-li dropdown in Python?
【Posted】: 2020-07-14 18:28:21
【Question description】:

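Following the question "Scraping a specific website with a search box and javascripts in Python", I am trying to get company ratings from https://www.msci.com/esg-ratings/. Essentially, I want to type a company name into the search box, select every option for that name in the dropdown menu (here "RIO TINTO LIMITED" and "RIO TINTO PLC" for "rio tinto"), and then read the rating shown in the top-right corner.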

However, I am having trouble with the ul-li dropdown menu of suggested companies:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('-headless')
options.add_argument('-no-sandbox')
options.add_argument('-disable-dev-shm-usage')
options.add_argument('window-size=1920,1080')

wd = webdriver.Chrome(options=options)
wd.get('https://www.msci.com/esg-ratings')

WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="_esgratingsprofile_keywords"]'))).send_keys("RIO TINTO")
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ui-id-1"]/li[1]'))).click()
#WebDriverWait(wd,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"#_esgratingsprofile_esg-ratings-profile-header > div.esg-ratings-profile-header-ratingdata > div.ratingdata-container > div.ratingdata-outercircle.esgratings-profile-header-yellow > div")))
print(wd.find_element_by_xpath('//*[@id="_esgratingsprofile_esg-ratings-profile-header"]/div[2]/div[1]/div[2]/div'))

(This code raises ElementClickInterceptedException.)

How can I get the required data for both "RIO TINTO LIMITED" and "RIO TINTO PLC"?

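【Comments】:

Are you using headless because the site is generated dynamically from scripts?
@RobertHarvey I am working in Google Colab, and the webdriver cannot start there without headless.
@gostinnaya See my answer, and ask if anything is unclear.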

Tags: javascript python selenium web-scraping selenium-chromedriver

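【Solution 1】:

I am having trouble with the ul-li dropdown menu of suggested companies

This is expected, because the element you are targeting is rendered by a dynamic script. You have to avoid options.add_argument('-headless') to get around this.

There is also a problem here: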

print(wd.find_element_by_xpath('//*[@id="_esgratingsprofile_esg-ratings-profile-header"]/div[2]/div[1]/div[2]/div'))

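Here you are trying to print the located element. Since the target element is an icon rendered with CSS, print() cannot output it. Instead, save it, for example as a .png file: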

# Save a screenshot of the rating element; wd is the driver created in the question
with open('filename.png', 'wb') as file:
    file.write(wd.find_element_by_xpath('//*[@id="_esgratingsprofile_esg-ratings-profile-header"]/div[2]/div[1]/div[2]/div').screenshot_as_png)

Then use the file as needed.
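As a complement to the screenshot approach above, here is a minimal sketch of inspecting the element rather than printing the WebElement object. `describe_element` is a hypothetical helper name. `text` and `get_attribute` are standard Selenium WebElement members. Whether the rating can actually be read from the text or the class names is an assumption about the page's markup, so check it in the browser's inspector first.

```python
def describe_element(element):
    """Return the visible text and CSS classes of a Selenium WebElement.

    print(element) only shows the object's repr, so the properties are
    read explicitly here.
    """
    return {
        "text": element.text.strip(),
        # get_attribute returns None when the attribute is absent
        "classes": (element.get_attribute("class") or "").split(),
    }
```

With the driver from the question, this could be called as describe_element(wd.find_element_by_xpath(...)) using the same XPath as in the screenshot example.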

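【Comments】:

Without the "headless" argument, the Chrome webdriver cannot start in Google Colab (WebDriverException: DevToolsActivePort file doesn't exist). With it, the ElementClickInterceptedException follows. So how can I avoid options.add_argument('-headless')?
Update: I fixed it by adding the window-size argument, time.sleep, and element = wd.find_element_by_xpath('//*[@id="ui-id-1"]/li[1]') followed by wd.execute_script("arguments[0].click();", element). Thanks!
@gostinnaya Glad you solved it. If this answer helped with your question, please mark it as accepted by clicking the check mark next to the answer. See here for more information.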
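The fix described in the update comment above can be sketched as a small helper. `js_click` and its `pause` parameter are hypothetical names introduced for illustration. `find_element` and `execute_script` are standard Selenium WebDriver methods, and the string "xpath" is the value of `By.XPATH`. The sleep duration is an assumption and may need tuning.

```python
import time


def js_click(driver, xpath, pause=1.0):
    """Click the element at `xpath` via JavaScript.

    A JavaScript click is dispatched on the element directly, so it is not
    blocked when another element overlaps the target, which is the situation
    ElementClickInterceptedException reports for a native click.
    """
    time.sleep(pause)  # crude wait for the suggestion list to render
    element = driver.find_element("xpath", xpath)  # "xpath" == By.XPATH
    driver.execute_script("arguments[0].click();", element)
    return element
```

In the question's script, js_click(wd, '//*[@id="ui-id-1"]/li[1]') would replace the second WebDriverWait(...).click() line. Reloading the page and repeating the call with li[2] is a plausible way to reach the second suggestion, though that is untested here.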