Python Selenium: data does not load (website security) - Answers

【Question title】: Python Selenium Data does not load (website Security)
【Posted】: 2020-11-30 11:33:25
【Question description】:

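Please find below the code with which I am trying to download/scrape a CSV file. This is only the first stage of testing, and it already fails even though no error is raised: the data does not load in the browser driven by geckodriver.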

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

# Raw string, so the backslashes in the Windows path are not read as escape sequences
driver = webdriver.Firefox(executable_path=r"C:\Py378\prj14\geckodriver.exe")

driver.get("https://www.nseindia.com/market-data/live-equity-market")
time.sleep(5)

element_dorpdown = Select(driver.find_element_by_id("equitieStockSelect"))
element_dorpdown.select_by_index(44)   #Updated with help of @PDHide in the comments
time.sleep(5)

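The code runs without errors, but because of the website's security settings the data for the selected option does not load. Even when I select and update the option manually in the Selenium-driven window, the table is not updated, as if no selection had been made. (Maybe the site detects that it is being driven by Selenium and expects certain headers, but I am not sure.) Also, when I try to click "Download in CSV", it times out.

I need to download the CSV for F&O once the selection succeeds (as shown above). Please help.

I can browse the website in a normal (installed) browser, but when I drive those same browsers from Python (Selenium) it fails. How can I get past the security?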

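【Comments】:

Add the HTML DOM.
@PDHide, thanks for the reply, but I don't know how to do that; I am only just learning this. Could you help with this?
Copy what you get when you press F12. The Select class only works with select tags.
@PDHide, OK, from what I understood of your suggestion, I have updated the code this way: element_dorpdown = Select(driver.find_element_by_class_name("no-border-radius")). But it still does not update the option. (Maybe it detects the Selenium driver and needs headers, but I am not sure; just guessing.)
@PDHide, when I try to update the option, or even download the CSV manually from the page (without using Python), the page just times out. Are you sure you are trying this through Selenium? My page cannot be updated after the first step of fetching the web page.

Tags: python-3.x selenium drop-down-menu geckodriver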

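【Solution 1】:

I tried running the code (with Chrome, but that should not matter), or rather a slight variation of it, so that I could get a better picture of what was happening. Note that I use implicitly_wait instead of time.sleep calls, which waste time. Here I am just trying to select the second option: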

from selenium import webdriver
from selenium.webdriver.support.ui import Select

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

try:
    driver.implicitly_wait(3) # wait up to 3 seconds before calls to find elements time out
    driver.get("https://www.nseindia.com/market-data/live-equity-market")
    select = Select(driver.find_element_by_id("equitieStockSelect"))
    select.select_by_index(1)
finally:
    input('pausing...')
    driver.quit()

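As you can see, there was no problem selecting the second option. However, the new table fails to load [screenshot not available].

At this point I manually reloaded the page and got the following result [screenshot not available]. My conclusion is that the website detects that the browser is being run under automation and blocks access.

Update

So the data can be retrieved with requests. I used the Chrome inspector to watch the network XHR requests, then selected the second option (NIFTY NEXT 50) and observed the AJAX request being made.

In this case the URL is https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%20NEXT%2050. However, you must first fetch the initial page with a requests Session instance, presumably so that the cookies set by that page are sent along with the API request: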

import requests

try:
    s = requests.Session()
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
    s.headers.update(headers)
    # You have to first retrieve the initial page:
    resp = s.get('https://www.nseindia.com/market-data/live-equity-market')
    resp.raise_for_status()
    #print(resp.text)
    resp = s.get('https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%20NEXT%2050')
    resp.raise_for_status()
    data = resp.json()
    print(data)
except Exception as e:
    print(e)

Prints:

{'name': 'NIFTY NEXT 50', 'advance': {'declines': '25', 'advances': '24', 'unchanged': '1'}, 'timestamp': '27-Nov-2020 16:00:00', 'data': [{'priority': 1, 'symbol': 'NIFTY NEXT 50', 'identifier': 'NIFTY NEXT 50', 'open': 30316.45,  etc. (data too long) }
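
Since the original goal was a CSV file, the 'data' list in the response above can be written out with the standard csv module. The following is only a minimal sketch: the helper name rows_to_csv and the sample rows are hypothetical, and the sample mirrors just the few keys visible in the truncated output above, not the full response.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text.

    The header is the union of all keys, in first-seen order, so rows
    that lack a key simply get an empty cell in that column.
    """
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample shaped like the 'data' list of the JSON response.
sample = {
    "name": "NIFTY NEXT 50",
    "data": [
        {"priority": 1, "symbol": "NIFTY NEXT 50", "open": 30316.45},
        {"priority": 0, "symbol": "EXAMPLE", "open": 100.0},
    ],
}

csv_text = rows_to_csv(sample["data"])
print(csv_text)
```

With the real response, rows_to_csv(data['data']) could be written to a file opened with newline=''. Any nested value inside a row would be written as its plain string form, so such columns may need flattening first.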

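Update 2

In general, to work out the URL for any index, for example index 44, look at the option value for that index, in this case "SECURITIES IN F&O", and assign it to the variable option_value in the following program: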

from urllib.parse import quote_plus

option_value = 'SECURITIES IN F&O'

url = 'https://www.nseindia.com/api/equity-stockIndices?index=' + quote_plus(option_value)
print(url)

Prints:

https://www.nseindia.com/api/equity-stockIndices?index=SECURITIES+IN+F%26O

The URL above is the one to use.
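
This thread shows the same query in two encodings: quote_plus writes spaces as "+", while the URL observed in the browser writes them as "%20". The sketch below builds both forms and checks that they decode to the same option value. The helper name build_index_url and its plus_for_spaces parameter are hypothetical, and whether the server accepts both forms is an assumption based on the two URLs quoted in this thread.

```python
from urllib.parse import quote, quote_plus, unquote_plus

API_BASE = "https://www.nseindia.com/api/equity-stockIndices?index="


def build_index_url(option_value, plus_for_spaces=True):
    """Return the API URL for a drop-down option value.

    plus_for_spaces=True encodes spaces as '+', as quote_plus does;
    False encodes them as '%20'. '&' becomes '%26' in both cases.
    """
    if plus_for_spaces:
        encoded = quote_plus(option_value)
    else:
        encoded = quote(option_value, safe="")
    return API_BASE + encoded


url_plus = build_index_url("SECURITIES IN F&O")
url_pct = build_index_url("SECURITIES IN F&O", plus_for_spaces=False)
print(url_plus)
print(url_pct)

# Both query values decode back to the same option value.
decoded_plus = unquote_plus(url_plus[len(API_BASE):])
decoded_pct = unquote_plus(url_pct[len(API_BASE):])
print(decoded_plus == decoded_pct)
```

Either URL can then be passed to s.get(...) on the Session from the earlier example, in place of the NIFTY NEXT 50 URL.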

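【Comments】:

Thanks @Booboo. Yes, I updated the page/question a few minutes ago. I can change the selection, but there is no help online on how to get past the web page's security. I could not find any blog or Selenium module on this.
I don't know how to use Python's requests library, and I am not sure whether it offers a way around the same security.
I have updated the answer to use requests. Sorry, I accidentally deleted a few images from the earlier version of the answer.
Thank you very much @Booboo, I really appreciate your solution. I have one more request: the original post was about "SECURITIES IN F&O", i.e. index_value(44). Since I am new to the library, could you suggest what to put in the line [resp = s.get......] for that case?
The URL would be: https://www.nseindia.com/api/equity-stockIndices?index=SECURITIES%20IN%20F%26O. See the updated answer.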