【Question title】: HTML elements missing from Selenium page source, but can be found using BeautifulSoup
【Posted】: 2019-10-21 01:49:26
【Question】:

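So far I have looked at this and this.

I am trying to parse HTML source with Selenium. To make things easier (or so I thought), I extracted the HTML from the web page I want to parse and put it into a local HTML file.

BeautifulSoup has no problem reading the HTML, but for some reason Selenium just does not see it.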

HTML:

<html><head>
<meta http-equiv="Cache-control" content="no-cache">
<title>NOTICE TO CORES USERS</title>
</head>
<body>
<center>
<b>

<h1>Welcome</h1>
Hours of Operation<br><br>
Monday-Friday 6:00am - 10:00pm<br>
Saturday 8:00am - 6:00pm<br>
Sunday 12:00 noon - 6:00pm<br>
<br>
<br>
<h4><font color="purple"><p><b><u>CORES ORACLE UPGRADE</u></b><br>
<br>
<font color="BLACK">
<font color="RED">Due to a recent technical upgrade, CORES is experiencing a number of issues. We are aware of these issues and our teams are working to resolve them. Corporate Registry will provide updates when available. Corporate Registry apologizes for any inconvenience.
<br>
<br>
<font color="RED">Effective February 3, 2019, <font color="BLACK"> Corporate Registry will send annual return reminders by email to corporations, non-profit organizations, limited liability partnerships, and cooperatives where there is an email address on record.<br>
<br>
Annual return reminders will be emailed about two weeks before the annual return is due.  The reminders will continue to be sent by regular mail
when there is no e-mail address on file or when there is a notice because the previous year's annual return has not been filed.  Directors of Alberta corporations will continue to receive copies of the outstanding annual return notice by regular mail.
<br>
<br>
</font></font></font></font></p><h4><font color="BLACK"><font color="RED"><font color="RED"><font color="BLACK"><font color="purple"><p><b><u>EXTENDED OUTAGE DATES</u></b><br>
<font color="RED"></font></p><p align="CENTRE"><font color="RED">
FULL DAY outages to allow for technical preventive maintenance are as follows:
<br>
<br>
<font color="BLACK">Sunday, May 12, 2019<br>
<font color="BLACK">Sunday, June 9, 2019<br>
<font color="BLACK">Sunday, July 14, 2019<br>
<font color="BLACK">Sunday, August 11, 2019<br>
<font color="BLACK">Sunday, September 8, 2019<br>
<font color="BLACK">Sunday, October 13, 2019<br>
<font color="BLACK">Sunday, November 10, 2019<br>
<font color="BLACK">Sunday, December 8, 2019<br>

</font></font></font></font></font></font></font></font></font></p><p align="CENTRE"></p><h5><font color="RED"><font color="BLACK"><font color="BLACK"><font color="BLACK"><font color="PURPLE">Updated: April 30, 2019</font></font></font></font></font></h5><p></p><font color="RED"><font color="BLACK"><font color="BLACK"><font color="BLACK"><font color="PURPLE">

<br>
<br>
<form action="cr_login.menu_frame" method="post">
<input type="hidden" name="p_default_menu" value="5">
<input type="hidden" name="p_system" value="CR">
<input type="hidden" name="p_accreditation" value="1">
<input type="hidden" name="p_spuid" value="30825">
<input type="hidden" name="p_userid" value="A02526">
<input type="submit" value="Continue">
</form>
</font></font></font></font></font></font></font></font></font></font></h4></font></h4></b></center><b><font color="purple"><font color="RED"><font color="RED"><font color="purple"><font color="RED"><font color="BLACK"><font color="BLACK"><font color="BLACK"><font color="PURPLE">


</font></font></font></font></font></font></font></font></font></b></body></html>

Python:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

chrome_options = Options()
chrome_options.binary_location = '/usr/bin/google-chrome'
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(chrome_options=chrome_options)
#driver.get('file:/c/Users/lanes/learning/product-repo/backend/functions/src/cores-scraper/oracle_upgrade.html')
driver.implicitly_wait(5)
driver.get('file:oracle_upgrade.html')

print('page source:', driver.page_source)

soup = BeautifulSoup(open('oracle_upgrade.html', 'r'), 'html.parser')

print('\nsoup:', soup)
    
driver.close()

Output:

page source: <html><head></head><body></body></html>

soup: <html><head>
<meta content="no-cache" http-equiv="Cache-control"/>
<title>NOTICE TO CORES USERS</title>
</head>
<body>
<center>
<b>
<h1>Welcome</h1>
Hours of Operation<br/><br/>
Monday-Friday 6:00am - 10:00pm<br/>
Saturday 8:00am - 6:00pm<br/>
Sunday 12:00 noon - 6:00pm<br/>
<br/>
<br/>
<h4><font color="purple"><p><b><u>CORES ORACLE UPGRADE</u></b><br/>
<br/>
<font color="BLACK">
<font color="RED">Due to a recent technical upgrade, CORES is experiencing a number of issues. We are aware of these issues and our teams are working to resolve them. Corporate Registry
will provide updates when available. Corporate Registry apologizes for any inconvenience.
<br/>
<br/>
<font color="RED">Effective February 3, 2019, <font color="BLACK"> Corporate Registry will send annual return reminders by email to corporations, non-profit organizations, limited liability partnerships, and cooperatives where there is an email address on record.<br/>
<br/>
Annual return reminders will be emailed about two weeks before the annual return is due.  The reminders will continue to be sent by regular mail
when there is no e-mail address on file or when there is a notice because the previous year's annual return has not been filed.  Directors of Alberta corporations will continue to receive copies of the outstanding annual return notice by regular mail.
<br/>
<br/>
</font></font></font></font></p><h4><font color="BLACK"><font color="RED"><font color="RED"><font color="BLACK"><font color="purple"><p><b><u>EXTENDED OUTAGE DATES</u></b><br/>
<font color="RED"></font></p><p align="CENTRE"><font color="RED">
FULL DAY outages to allow for technical preventive maintenance are as follows:
<br/>
<br/>
<font color="BLACK">Sunday, May 12, 2019<br/>
<font color="BLACK">Sunday, June 9, 2019<br/>
<font color="BLACK">Sunday, July 14, 2019<br/>
<font color="BLACK">Sunday, August 11, 2019<br/>
<font color="BLACK">Sunday, September 8, 2019<br/>
<font color="BLACK">Sunday, October 13, 2019<br/>
<font color="BLACK">Sunday, November 10, 2019<br/>
<font color="BLACK">Sunday, December 8, 2019<br/>
</font></font></font></font></font></font></font></font></font></p><p align="CENTRE"></p><h5><font color="RED"><font color="BLACK"><font color="BLACK"><font color="BLACK"><font color="PURPLE">Updated: April 30, 2019</font></font></font></font></font></h5><p></p><font color="RED"><font color="BLACK"><font color="BLACK"><font color="BLACK"><font color="PURPLE">
<br/>
<br/>
<form action="cr_login.menu_frame" method="post">
<input name="p_default_menu" type="hidden" value="5"/>
<input name="p_system" type="hidden" value="CR"/>
<input name="p_accreditation" type="hidden" value="1"/>
<input name="p_spuid" type="hidden" value="30825"/>
<input name="p_userid" type="hidden" value="A02526"/>
<input type="submit" value="Continue"/>
</form>
</font></font></font></font></font></font></font></font></font></font></h4></font></h4></b></center><b><font color="purple"><font color="RED"><font color="RED"><font color="purple"><font color="RED"><font color="BLACK"><font color="BLACK"><font color="BLACK"><font color="PURPLE">
</font></font></font></font></font></font></font></font></font></b></body></html>

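Question:

Why doesn't Selenium "see" the HTML in the body the way Soup does?

【Comments】:

  • Can you load the page in non-headless mode to make sure the data is loaded when you try to open the page? My guess is that driver.get('file:oracle_upgrade.html') is not loading the page completely.
  • @supputuri Running the code without the --headless arg results in the same error; not sure why.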

Tags: python selenium selenium-webdriver selenium-chromedriver


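【Answer 1】:

I was able to reproduce your problem.

You are being fooled because driver.get() does not actually raise an error if it fails to load your file. Instead, driver.page_source will contain an almost empty document. I'm not sure where your file is located, but I think you are just missing that the file URI should start with file:// rather than file:

The following code works for me: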

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# `options=` replaces the deprecated `chrome_options=` keyword
driver = webdriver.Chrome(options=chrome_options)
driver.get('file:///Users/jimmy/src/stackoverflow/html-elements-missing-from-selenium-page-source-but-can-be-found-using-beautifu/oracle_upgrade.html')
print('page source:', driver.page_source)
driver.close()
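
One way to avoid hand-writing the URI is to let the standard library build it. This is only a sketch: it assumes the HTML file sits in the current working directory, and `to_file_uri` is a hypothetical helper name, not part of Selenium.

```python
from pathlib import Path


def to_file_uri(path):
    """Return an absolute file:// URI for a local path (illustrative helper)."""
    # resolve() makes the path absolute; as_uri() requires an absolute path
    # and percent-encodes characters such as spaces.
    return Path(path).resolve().as_uri()


uri = to_file_uri("oracle_upgrade.html")
print(uri)
# driver.get(uri)  # assumes a `driver` created as in the code above
```

The resulting string begins with file:/// followed by the absolute path, which is the form that loaded correctly above.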

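You should not expect the output to be exactly the same as the input, because Chrome will "fix" your HTML for you. For example, if you forgot a tag, it will politely add it to the source without complaining.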
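
Since a failed load yields a near-empty document rather than an exception, and a successful load yields normalized HTML rather than the file's exact text, one option is to check the page source for a known piece of content instead of comparing it to the file. A minimal sketch follows; `page_loaded` and the choice of marker string are assumptions made for illustration.

```python
import re


def page_loaded(page_source, marker):
    """Return True if the source has body content and contains `marker`."""
    # Extract whatever sits between <body ...> and </body>.
    match = re.search(r"<body[^>]*>(.*?)</body>", page_source,
                      re.DOTALL | re.IGNORECASE)
    body = match.group(1).strip() if match else ""
    return bool(body) and marker in page_source


blank = "<html><head></head><body></body></html>"
loaded = ("<html><head><title>NOTICE TO CORES USERS</title></head>"
          "<body><h1>Welcome</h1></body></html>")

print(page_loaded(blank, "NOTICE TO CORES USERS"))   # False
print(page_loaded(loaded, "NOTICE TO CORES USERS"))  # True
```

In a script, the same check could be applied to driver.page_source right after driver.get(), raising an error when it returns False.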

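【Comments】:

  • I tried adjusting the file path like so: 'file:///c/Users/path/to/the/file.html' and it was able to read the HTML file as expected. Who knew it was something so simple? The three / are the key.
  • This is why I always prefer to check the page in non-headless mode first, and then in headless mode.
【Answer 2】:

I cannot reproduce your problem using:

  • Python 3.7.3
  • Selenium 3.141.0 (according to the pip show command)
  • ChromeDriver 74.0.3729.6
  • Chrome 74.0.3729.169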


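So I would recommend:

  • Upgrade to the latest version of the selenium package using pip, for example:

    pip install --upgrade selenium
    
  • Cross-check your Chrome and ChromeDriver versions - they must match 100%

  • You can also try getting the page source via an XPath expression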

    # find_element_by_xpath was removed in Selenium 4; this form works in 3.x and 4.x
    print(driver.find_element("xpath", "/html").get_attribute("innerHTML"))
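
The version cross-check can also be done from Python rather than by eye. The sketch below assumes ChromeDriver reports its version as a dotted number optionally followed by a build hash in parentheses, and it treats two versions as compatible when their first three numeric components agree, as with the 74.0.3729.x pair listed above. `same_build` is a hypothetical helper, and the capability keys named in the trailing comment are assumptions to verify against your Selenium and ChromeDriver versions.

```python
def same_build(browser_version, driver_version, parts=3):
    """Compare the leading `parts` numeric components of two version strings."""
    def leading(version):
        # Keep only the dotted number, dropping any trailing "(hash...)" text.
        number = version.strip().split()[0]
        return number.split(".")[:parts]
    return leading(browser_version) == leading(driver_version)


print(same_build("74.0.3729.169", "74.0.3729.6 (255758ec-refs/branch-heads/3729)"))  # True
print(same_build("78.0.3904.70", "74.0.3729.6"))  # False

# With a live session the two strings are typically available as
# driver.capabilities["browserVersion"] and
# driver.capabilities["chrome"]["chromedriverVersion"] (assumed keys).
```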
    
