Why does python and my web browser show different codes for the same link? Answers

[Question title]: Why does python and my web browser show different codes for the same link?
[Posted]: 2016-07-25 23:26:50
[Question]:

Take the URL https://www.google.cl/#q=stackoverflow as an example. Inspecting the first link in the search results with Chrome Developer Tools shows the following HTML code:

Now, if I run this code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

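# Fetch the page over plain HTTP; no JavaScript is executed here.
response = urlopen("https://www.google.cl/#q=stackoverflow")
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(response, "html.parser")
print(soup.prettify())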

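I cannot find the same element in the output. In fact, I do not find any of the links from the Google search results. The same happens when I use the requests module. Why is this, and is there something I can do to get the same result a web browser gets?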

[Comments on the question]:

This is a dynamically loaded page.
@MoonCheesez Is there a way to get the real HTML code, the way Chrome does?

Tags: python html

[Solution 1]:

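The HTML is generated dynamically, probably by a modern single-page JavaScript framework such as Angular or React (or even just plain JavaScript). A plain HTTP request only returns the initial markup, so you need to use Selenium or PhantomJS to actually drive a browser to the site before you parse the DOM.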
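A related detail, added here as a side note rather than as part of the original answer: the search term in the question's URL sits after the `#`, in the URL fragment. Fragments are handled inside the browser and are not sent with the HTTP request, so the server never sees `q=stackoverflow`. The minimal sketch below uses only the standard library to show which part of the URL a non-browser client actually requests.

```python
from urllib.parse import urldefrag

url = "https://www.google.cl/#q=stackoverflow"

# Split the URL into the part that is requested and the fragment after '#'.
requested, fragment = urldefrag(url)

# The fragment is meant for client-side JavaScript and is not sent to the server.
print("requested from server:", requested)
print("kept in the browser:  ", fragment)
```

In a browser, the page's JavaScript reads the fragment and loads the results afterwards, which is why the search results only show up in the rendered DOM.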

Here is some skeleton code using Selenium.

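from selenium import webdriver
from bs4 import BeautifulSoup

# Start a real browser so the page's JavaScript runs (needs Chrome and a matching driver).
driver = webdriver.Chrome()
driver.get("http://google.com")

# Read the DOM as it stands after scripts have modified it.
html = driver.execute_script("return document.documentElement.innerHTML")
driver.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(html, "html.parser")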

See the Selenium documentation for more on running Selenium, configuring it, and so on:

http://selenium-python.readthedocs.io/

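Edit: you may need to add a wait before grabbing the HTML, because some elements of the page can take a second or so to load. The Python Selenium documentation on explicit waits is here: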

http://selenium-python.readthedocs.io/waits.html
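To make the idea of an explicit wait concrete, here is a plain-Python sketch of the polling loop behind it. It is not Selenium's implementation: `wait_until`, its parameters, and `fake_element_lookup` are hypothetical names chosen for illustration, and the lookup is a stand-in so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page element that only appears on the third check.
attempts = {"count": 0}


def fake_element_lookup():
    attempts["count"] += 1
    return ["<h3>result</h3>"] if attempts["count"] >= 3 else []


found = wait_until(fake_element_lookup, timeout=2.0, poll_interval=0.01)
print(found, "after", attempts["count"], "checks")
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, "h3")`, where `By` comes from `selenium.webdriver.common.by` and the `h3` selector is only a guess at what the page contains. In practice Selenium's own `WebDriverWait`, described in the linked documentation, covers this and is the better choice.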

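Another source of complexity is that parts of the page may stay hidden until the user interacts with it. In that case, you need to script Selenium to interact with the page in a specific way before grabbing the HTML.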
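A minimal sketch of that interact-then-read pattern follows. The helper `html_after_click` and the `#more` selector are hypothetical, and `FakeDriver` and `FakeElement` are stand-ins (not Selenium classes) so the example runs without a browser. The helper relies only on `find_element`, `click`, and `page_source`, which a Selenium WebDriver also provides.

```python
def html_after_click(driver, css_selector):
    """Click the first element matching css_selector, then return the current DOM as HTML."""
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR constant.
    driver.find_element("css selector", css_selector).click()
    return driver.page_source


# --- Stand-ins so the sketch runs without a browser (not Selenium classes) ---
class FakeElement:
    def __init__(self, driver):
        self._driver = driver

    def click(self):
        # Simulate the page revealing hidden content after the click.
        self._driver.page_source = "<html><body><div id='details'>now visible</div></body></html>"


class FakeDriver:
    page_source = "<html><body><button id='more'>More</button></body></html>"

    def find_element(self, by, value):
        return FakeElement(self)


html = html_after_click(FakeDriver(), "#more")
print(html)
```

On a real page the revealed content may itself load asynchronously, so a wait between the click and reading `page_source` is often needed.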

[Discussion]: