How to get data from inspect element of a webpage using Python: answers

【Question title】: How to get data from inspect element of a webpage using Python
【Posted】: 2014-09-21 12:50:37
【Question description】:

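I want to use Python to get data from Inspect Element. I can download the page source using BeautifulSoup, but now I need the text shown in the webpage's Inspect Element. I would be grateful if you could suggest how to do this.

Edit: By Inspect Element I mean that in Google Chrome, right-clicking gives an option called Inspect Element, which shows the code related to each element of that particular page. I want to extract that code, or just its text strings.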

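【Question comments】:

You need to describe more clearly what you are trying to do. What is "Inspect Element"? Please give an example of what you want to achieve.
This doesn't use Python, but if you right-click the blue highlighted line in the inspector, Chrome lets you Copy as HTML.
Is there another way to do this? I have to do it on many pages. Also, as I understand it, Copy as HTML only works on a single line. @Andrew Johnson
Can't you just extract everything from the downloaded HTML?
Correct. Copy as HTML only gives you the selected element on one page. Below I'll provide a simple web crawler that automatically gives you similar output from Python.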

Tags: python html extract

【Solution 1】:

I want to update Jason S's answer. I could not start phantomjs on OS X:

driver = webdriver.PhantomJS()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    self.service.start()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 74, in start
    raise WebDriverException("Unable to start phantomjs with ghostdriver.", e)
selenium.common.exceptions.WebDriverException: Message: Unable to start phantomjs with ghostdriver.

I solved it by downloading the executables, as described in the answer here, and passing the path to the binary:

driver = webdriver.PhantomJS("phantomjs-2.0.0-macosx/bin/phantomjs")

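【Comments】:

【Solution 2】:

If you want to fetch a web page automatically from Python in a way that runs its JavaScript, you should look into Selenium. It can drive a web browser automatically (even a headless browser such as PhantomJS, so you don't need to have a window open).

To get the HTML, you need to evaluate some JavaScript. Simple sample code, adapt as needed: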

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://google.com")

# This will get the initial html - before javascript
html1 = driver.page_source

# This will get the html after on-load javascript
html2 = driver.execute_script("return document.documentElement.innerHTML;")

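Note 1: If you want one or more specific elements, you really have two options: parse the HTML in Python, or write more specific JavaScript that returns what you want.

Note 2: If you actually need specific information from Chrome's tools, and not just the dynamically generated HTML, then you'll need a way to hook into Chrome itself. There's no way around that.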
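
As an illustration of the first option in Note 1 (parsing the HTML in Python), here is a minimal sketch that uses only the standard library's html.parser. The class TextById, the helper text_by_id, and the sample markup are hypothetical names invented for this example; in practice the string would come from driver.page_source or from the execute_script call.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element with a given id."""

    # Void elements never get a closing tag, so they must not change depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while we are inside the target element
        self.done = False    # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID or self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, target_id):
    """Return the whitespace-normalized text of the element with this id."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical markup standing in for the HTML returned by Selenium
sample_html = """
<html><body>
  <div id="header">Site title</div>
  <div id="content">
    <p>Hello, <b>world</b>!</p>
    <br>
    <p>Second line.</p>
  </div>
</body></html>
"""

print(text_by_id(sample_html, "content"))
```

This sketch assumes reasonably well-formed HTML; for messy real-world pages a dedicated parser such as BeautifulSoup is likely to be more forgiving.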

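【Comments】:

Thank you very much. It works well, except that I had to add the location of phantomjs.exe on the second line, like this: driver = webdriver.PhantomJS(executable_path=phantomjs_path)
Hi, thanks for your help. I implemented this code inside a class, but it does not turn the JavaScript into HTML (it works fine on the command line). Could you help me with this?
I get these when I run the code. UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead. FileNotFoundError: [Errno 2] No such file or directory: 'phantomjs'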

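【Solution 3】:

BeautifulSoup can be used to parse an HTML document and extract whatever you want from it. It is not designed for downloading. You can find the elements you want by their class and id.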
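
A minimal sketch of looking elements up by id and class might look like the following. It assumes the bs4 package is installed; the HTML string, the id "main", and the class names "title" and "summary" are made up for illustration and stand in for a page you have already downloaded.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page you have already downloaded
html = """
<div id="main">
  <h1 class="title">Example page</h1>
  <p class="summary">First paragraph.</p>
  <p class="summary">Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up a single element by its id
main = soup.find(id="main")

# Look up elements by class (class_ avoids the Python keyword "class")
title = soup.find("h1", class_="title").get_text()
summaries = [p.get_text() for p in main.find_all("p", class_="summary")]

print(title)
print(summaries)
```

Note that find returns None when nothing matches, so real code should check the result before calling get_text on it.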

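【Comments】:

【Solution 4】:

Inspect Element shows all of the page's HTML. For pages that are not modified by JavaScript after loading, this is the same HTML you get by fetching the page with urllib.

Do something like this: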

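from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = "http://example.com"  # placeholder: the page to fetch
tag_name = "p"              # placeholder: the tag whose text you want

# Download the raw HTML (no JavaScript is executed)
html = urlopen(URL).read()

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of tags, so extract the text of each one
for tag in soup.find_all(tag_name):
    print(tag.get_text())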

【Comments】:

Can you update this? It no longer works in 2021.