Save a whole web page instead of basic HTML with Python requests for scraping

【Question title】: Save a whole web page instead of basic html with python requests for scraping
【Posted】: 2020-08-17 10:31:30
【Question】:

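So I want to scrape this page with Beautiful Soup: https://www.nseindia.com/option-chain#optionchain_equity, and I am fetching it with the requests module. But I guess requests only saves the basic HTML and not the main table on that page. Downloading the page from Chrome as "Webpage, Complete" works, but how can I automate that in Python? Also, without these headers the request times out, so I guess they are necessary. Code: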

import requests

url = "https://www.nseindia.com/option-chain#optionchain_equity"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/80.0.3987.149 Safari/537.36',
           'accept-language': 'en,gu;q=0.9,hi;q=0.8', 'accept-encoding': 'gzip, deflate, br'}
response = requests.get(url, headers=headers, timeout=5)
file = open("nse.html", "w")
file.write(response.text)

【Comments】:

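What do you mean by basic HTML?
Remember to call file.close() when you are done; opening the file in a with block may be more convenient.
Does this answer your question? Web scraping program cannot find element which I can see in the browser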

Tags: python python-3.x web-scraping python-requests

【Solution 1】:

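If you are mainly after the table data, note that it is loaded through an AJAX call, so it is not part of the HTML that requests receives.

The following script fetches that data directly and saves it to a JSON file.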

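import json

import requests

# Browser-like user-agent; the question notes that requests without such headers time out.
headers = {'user-agent':"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}

# JSON endpoint that the page calls to fill the option-chain table.
res = requests.get("https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY", headers=headers, timeout=10)
res.raise_for_status()  # stop on an HTTP error instead of trying to parse an error page

with open("data.json", "w") as f:
    json.dump(res.json(), f)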
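Since the asker ultimately wants a table, the saved JSON can be flattened into rows. The sketch below is only an illustration: it assumes the response has a "records" object holding a "data" list, where each entry carries "strikePrice", "expiryDate" and optional "CE"/"PE" sub-objects with "lastPrice" and "openInterest". That layout is an assumption, not something the answer confirms, and the helper names flatten_option_chain and write_rows_csv are hypothetical.

```python
import csv


def flatten_option_chain(payload):
    """Turn the option-chain JSON into one flat dict per strike/expiry entry."""
    rows = []
    # Assumed layout: payload["records"]["data"] is a list of dicts, each with
    # "strikePrice", "expiryDate" and optional "CE" (call) / "PE" (put) dicts.
    for entry in payload.get("records", {}).get("data", []):
        call = entry.get("CE") or {}
        put = entry.get("PE") or {}
        rows.append({
            "strikePrice": entry.get("strikePrice"),
            "expiryDate": entry.get("expiryDate"),
            "call_lastPrice": call.get("lastPrice"),
            "call_openInterest": call.get("openInterest"),
            "put_lastPrice": put.get("lastPrice"),
            "put_openInterest": put.get("openInterest"),
        })
    return rows


def write_rows_csv(rows, path):
    """Write the flattened rows to a CSV file; does nothing for an empty list."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

To use it, load data.json with json.load, pass the resulting dict to flatten_option_chain, and hand the rows to write_rows_csv. If the real payload uses different keys, the missing fields simply come out as None, which makes a mismatch easy to spot.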

【Comments】:

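This works, but is there any way to get the JavaScript-rendered HTML of the page, including the table? The old version of the site did not render the table dynamically, so I already have code ready to scrape the table.
You can use selenium to download the HTML with the table.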
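A minimal sketch of the Selenium suggestion, assuming Selenium 4 and a local Chrome install. The function names render_page and save_html are hypothetical, and so is the "table" CSS selector used to decide when the page is ready; the real page may need a more specific selector, extra headers, or a longer timeout.

```python
from pathlib import Path


def render_page(url, wait_css="table", timeout=20):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so save_html below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    # "--headless=new" needs a recent Chrome; older versions use "--headless".
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until an element matching wait_css exists (assumed selector).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out


def save_html(html, path):
    """Write the rendered HTML to disk as UTF-8 and return the path."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target
```

Calling save_html(render_page("https://www.nseindia.com/option-chain"), "nse.html") would then produce a file that existing Beautiful Soup code can parse, provided the site serves the table to a headless browser at all.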

【Solution 2】:

If you want to save the whole web page, you could look for something like a headless Chrome API:

Download file through Google Chrome in headless mode

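To interpret (render) a web page, plain Python will not help: it only handles the response as a stream read from a file. What you want is file reading plus web-browser behaviour, and a headless Chrome API is the way to go. ...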
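One way to try headless Chrome without any extra Python packages is to call the browser binary and capture the DOM it prints with --dump-dom. This is a rough sketch, not the answer's own method: the binary name "google-chrome" is an assumption (it may be "chrome", "chromium" or a full path on your system), and the function names are hypothetical.

```python
import subprocess


def build_dump_dom_command(url, chrome_binary="google-chrome"):
    """Build the argument list that asks headless Chrome to print the DOM."""
    # chrome_binary is an assumed name; adjust it to your installation.
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def dump_dom(url, chrome_binary="google-chrome", timeout=60):
    """Run headless Chrome and return the serialized DOM it writes to stdout."""
    result = subprocess.run(
        build_dump_dom_command(url, chrome_binary),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError if Chrome exits with an error
        timeout=timeout,  # raise TimeoutExpired if Chrome hangs
    )
    return result.stdout
```

Chrome prints the DOM once the page has loaded, which may be before late AJAX content such as this table arrives, so a driver that can wait for a specific element is often the more reliable choice.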

【Comments】: