Getting inner HTML - Selenium, BeautifulSoup, Python - Answers

【Question title】: Getting inner HTML - Selenium, BeautifulSoup, Python
【Posted】: 2016-03-21 08:05:44
【Question】:

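This is a complete edit of the question, because judging by the answers I clearly asked it badly, so I will try to be clearer.

I have an object that I want to scrape. With the code I use on my laptop, I can get it to work without any problem. When I moved to PythonAnywhere, I could no longer get the information I am looking for.

The code that works on my system is: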

from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import csv
import time
import re

#68 lines of code for another section of the site above this working well on my system and on pythonanywhere.

pageSource = driver.page_source
bsObj = BeautifulSoup(pageSource)

try:
    parcel_number = bsObj.find(id="mParcelnumbersitusaddress_mParcelNumber")
    s_parcel_number = parcel_number.get_text()
except AttributeError as e:
    s_parcel_number = "Parcel Number not found"

# same kind of code (all working) that gets 10 more pieces of data

# Tax Year
try:
    pause = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "TaxesBalancePaymentCalculator")))
    taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]
except IndexError as e:
    s_taxes_owed_2015_yr = "No taxes due"

This code works fine on my laptop with Firefox. On PythonAnywhere, if I print the page source of the page I am scraping, I get the following where my table should be:

<table border="0" cellpadding="5" cellspacing="0" class="WithBorder" width="100%">
<tbody><tr>
<td id="TaxesBalancePaymentCalculator"><!--DONT_PRINT_START-->
<span class="InputFieldTitle" id="mTabGroup_Taxes_mTaxChargesBalancePaymentInjected_mReportProcessingNote">Please wait while your current taxes are calculated.</span><img src="images/progress.gif"/> <!--DONT_PRINT_FINISH--></td>
</tr> <!--DONT_PRINT_START-->
<script type="text/javascript">
                                function TaxesBalancePaymentCalculator_ScriptLoaded( pPageContent )
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = pPageContent;
                                }
                                function results_ready()
                                {
                                    element('pay_button_area').style.display = 'block';
                                    element('pay_button_area2').style.display = 'block';
                                    element('pay_additional_things_area').style.display = 'block';
                                }
                                var no_taxes_calculator = '&amp;nbsp;&lt;' + 'span class="MessageTitle"&gt;The tax balance calculator is not available.&lt;' + '/span&gt;';
                                function no_taxes_calculator_available()
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = no_taxes_calculator;
                                }
                                function invalid()
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = no_taxes_calculator;
                                }
                                loadScript( 'injected/TaxesBalancePaymentCalculator.aspx?parcel_number=15-720-01-01-00-0-00-000' );
                                </script><script id="injected_taxesbalancepaymentcalculator_ScriptTag" type="text/javascript"></script>
<tr id="pay_button_area" style="DISPLAY: none">
<td id="pay_button_area2">
<table border="0" cellpadding="2" cellspacing="0">
<tbody><tr>

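Playing around, I found that if I get the innerHTML (as a str) of the element that is filled in here:

element('TaxesBalancePaymentCalculator').innerHTML = pPageContent;

that part holds my data. The problem is that I cannot call findAll on a string, and I need certain rows from the table:

taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]

I need help getting that element as an object (not a string) so that I can use it to extract my data. I have tried more things than I can list here. I would really appreciate some help.

Thanks in advance.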

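【Comments】:

I don't remember any findAll method in Python. It is a bs4 method... is bs4 imported in your code? What are you trying to do with bsObj?
Yes, it is a bs4 method, and I have imported bs4, a few hundred lines higher up. I am trying to get information from a table in the inner HTML.
According to the documentation, driver.get_attribute returns a string, hence the error.
@Raymond, I'm afraid the bs4 module works a bit differently... you should read more at crummy.com/software/BeautifulSoup/bs4/doc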

Tags: python html selenium beautifulsoup html-parsing

【Solution 1】:

I think this may be a difference in page loading speed. At the start of your code you have

pageSource = driver.page_source
bsObj = BeautifulSoup(pageSource)

So at this point you create the BeautifulSoup object from the page content as it is at that moment. Later, you do this:

pause = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "TaxesBalancePaymentCalculator")))
taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]

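So you tell WebDriver to wait until something appears, and then you query the BeautifulSoup object you created earlier. But that BeautifulSoup object still holds the page source from the start of the script, not the new page source containing the element you waited for.

After the wait completes, try recreating bsObj from the new page source.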

【Comments】:

【Solution 2】:

As @Steve pointed out in the comments, get_attribute returns a string, not an HTML element. Try replacing that line with one of the find_element_by_* methods. You can read more in the documentation: http://selenium-python.readthedocs.org/api.html#selenium.webdriver.remote.webelement.WebElement.find_element_by_tag_name

Besides that, you are using BeautifulSoup the wrong way. You need to create a bs4 object by passing the HTML as an argument, and then call findAll on that object:

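from bs4 import BeautifulSoup

soup = BeautifulSoup(html_as_plain_text, "html.parser")
for element in soup.findAll(id="mGrid_RealDataGrid"):
    # do your thing with each matching element, for example:
    print(element.get_text())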

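【Comments】:

【Solution 3】:

From what I see in your code, you want to get the innerHTML of an element and feed it to BeautifulSoup for further parsing. First, you probably want outerHTML so that the resulting HTML includes the element itself, and, most importantly, you need to initialize the "soup" object: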

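from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

# find_element(By.ID, ...) is available in both older and current Selenium releases
demo_div = driver.find_element(By.ID, 'TaxesBalancePaymentCalculator')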
demo_html = demo_div.get_attribute('outerHTML')

soup = BeautifulSoup(demo_html, "html.parser")  # < YOU ARE MISSING THIS PART
s_taxes_owed_2015_yr = soup.find_all(id="mGrid_RealDataGrid")[1].find_all('tr')[1].find_all('td')[0].get_text()
print(s_taxes_owed_2015_yr)
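
The chained indexing above raises IndexError whenever the table is missing from the HTML, which is what happens while the cell still shows the "Please wait" placeholder. As a hedged sketch, the lookup could be wrapped in a small helper that returns None in that case. The name `cell_text` and its parameters are hypothetical; the default indices simply mirror the ones used in the question.

```python
from bs4 import BeautifulSoup


def cell_text(html, grid_id="mGrid_RealDataGrid", grid=1, row=1, col=0):
    """Return the text of one table cell, or None if it is not in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        # Same lookup as in the answer: nth grid, nth row, nth cell.
        cell = soup.find_all(id=grid_id)[grid].find_all("tr")[row].find_all("td")[col]
    except IndexError:
        # The grid, row, or cell is not there (for example, not loaded yet).
        return None
    return cell.get_text(strip=True)
```

It could be called as `cell_text(demo_html) or "No taxes due"`, which keeps the fallback string from the question in one place.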

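【Comments】:

Looks good, but I still get an out-of-range error, because the table never loads in the Firefox browser on PythonAnywhere.
@Raymond That is a separate problem. Let's avoid fixing multiple problems in one thread. If you need help, consider creating a separate question with the details. Thanks.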