Not able to scrape data using BeautifulSoup: answers

【Question title】: Not able to scrape data using BeautifulSoup
【Posted】: 2018-07-31 14:15:54
【Question description】:

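I am using Selenium to log in to a web page and fetch it for scraping, and I am able to get the page. I searched the HTML for the table I want to scrape. Here it is: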

<table cellspacing="0" class=" tablehasmenu table hoverable sensors" id="table_devicesensortable">

Here is the script:

rawpage = driver.page_source  # store the page source in a variable
souppage = BeautifulSoup(rawpage, 'html.parser')  # parse the page
tbody = souppage.find('table', attrs={'id': 'table_devicesensortable'})  # look up the table

I can get the parsed page in the souppage variable, but I am not able to scrape the table and store it in the tbody variable.

【Question comments】:

Tags: python selenium web-scraping beautifulsoup

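【Solution 1】:

Based on the HTML you shared, to scrape the <table> you need to combine WebDriverWait with an expected_conditions clause set to presence_of_element_located. To achieve that, you can use either of the following code blocks:

Using class: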

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 20 seconds for the table to be present in the DOM
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class=' tablehasmenu table hoverable sensors' and @id='table_devicesensortable']")))
rawpage = driver.page_source  # store the page source in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parse the page
# BeautifulSoup splits class on whitespace, so the leading space is dropped in this lookup
tbody = souppage.find("table", {"class": "tablehasmenu table hoverable sensors"})

Using id:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 20 seconds for the table to be present in the DOM
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class=' tablehasmenu table hoverable sensors' and @id='table_devicesensortable']")))
rawpage = driver.page_source  # store the page source in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parse the page
tbody = souppage.find("table", {"id": "table_devicesensortable"})  # look up the table by id

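【Comments】:

I tried both of these approaches before posting, but neither worked.
I have updated my answer per your requirement to wait for the dynamically loaded content so the data can be scraped. Let me know the status.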
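
To illustrate why the lookup can fail without a wait, here is a minimal sketch that parses two snapshots of the page source: one taken before the table is rendered and one after. The HTML strings and the helper name find_sensor_table are assumptions made up for illustration; only the table id and class come from the question.

```python
from bs4 import BeautifulSoup

TABLE_ID = "table_devicesensortable"  # id taken from the question

# Hypothetical snapshots of driver.page_source (illustration only)
BEFORE_RENDER = "<html><body><div id='content'></div></body></html>"
AFTER_RENDER = (
    "<html><body><div id='content'>"
    "<table cellspacing='0' class=' tablehasmenu table hoverable sensors' "
    "id='table_devicesensortable'><tr><td>Ping</td></tr></table>"
    "</div></body></html>"
)


def find_sensor_table(html):
    """Return the <table> Tag with the expected id, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", id=TABLE_ID)


early = find_sensor_table(BEFORE_RENDER)  # table not in the source yet
late = find_sensor_table(AFTER_RENDER)    # table present in the source

print(early)        # None
print(late["id"])   # table_devicesensortable
```

If page_source is read while the page still looks like the first snapshot, find() returns None no matter which attribute is used for the lookup, which matches the symptom described in the question.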

【Solution 2】:

The required table might be generated dynamically, so you need to wait until it appears on the page:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait

tbody = wait(driver, 10).until(EC.presence_of_element_located((By.ID, "table_devicesensortable")))

Also note that there is no need to use BeautifulSoup, since Selenium has enough built-in methods and properties to do the same job for you, e.g.

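# find_elements_by_tag_name was removed in Selenium 4.3; find_elements(By.TAG_NAME, ...) works in both old and new versions
headers = tbody.find_elements(By.TAG_NAME, "th")
rows = tbody.find_elements(By.TAG_NAME, "tr")
cells = tbody.find_elements(By.TAG_NAME, "td")
cell_values = [cell.text for cell in cells]
etc...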
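
If you would rather keep BeautifulSoup for the extraction step, the same idea can be sketched as follows once the table is present in the page source. The sample rows and the helper name table_to_rows are hypothetical; only the table id and class are taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup; only the id and class come from the question.
SAMPLE_HTML = """
<table cellspacing="0" class=" tablehasmenu table hoverable sensors" id="table_devicesensortable">
  <tr><th>Sensor</th><th>Status</th></tr>
  <tr><td>Ping</td><td>Up</td></tr>
  <tr><td>HTTP</td><td>Down</td></tr>
</table>
"""


def table_to_rows(html, table_id):
    """Return (headers, rows) for the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"table {table_id!r} not found in the page source")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip the header row, which has no <td> cells
            rows.append(cells)
    return headers, rows


headers, rows = table_to_rows(SAMPLE_HTML, "table_devicesensortable")
print(headers)
print(rows)
```

In a real run you would pass driver.page_source instead of SAMPLE_HTML, after waiting for the table as shown above.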

【Comments】:

Thanks for the information.

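【Solution 3】:

I searched Stack Overflow for this problem and came across this post:

BeautifulSoup returning none when element definitely exists

Reading the answer provided by luiyzheng gave me the hint that the data might be fetched dynamically. So the table was probably created dynamically, which is why I could not find it.

So the workaround is:

I added a delay before storing the page source.

The code looks like this: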

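import time
from bs4 import BeautifulSoup

time.sleep(4)  # fixed delay so the dynamically created table has time to load
rawpage = driver.page_source  # store the page source in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parse the page
tbody = souppage.find("table", {"id": "table_devicesensortable"})  # look up the table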

I hope this helps someone.

【Comments】:

Using time.sleep() goes against all best practices. This answer will help you.
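
The difference between a fixed sleep and an explicit wait can be sketched in plain Python. This is not Selenium's implementation, only an illustration of the polling idea; wait_until, table_is_present and the attempts counter are hypothetical names made up for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose table appears on the third check.
attempts = {"count": 0}


def table_is_present():
    attempts["count"] += 1
    return "table found" if attempts["count"] >= 3 else None


result = wait_until(table_is_present, timeout=2.0, poll_interval=0.01)
print(result, attempts["count"])
```

Unlike time.sleep(4), a polling wait returns as soon as the condition holds and fails loudly if it never does, so it neither wastes time on fast pages nor silently proceeds on slow ones.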