Extracting Table data using Selenium and Python into pandas dataframe

[Question title]: Extracting Table data using Selenium and Python into pandas dataframe
[Posted]: 2020-02-09 08:03:59
[Question description]:

I have already extracted data from a table using the BeautifulSoup library, with the following code:

    table_tag = soup.find("table", {"class": "a-keyvalue prodDetTable"})
    if table_tag is not None:
        # parse_table is the asker's own helper (not shown in the question)
        table = parse_table(table_tag)
        df = pd.DataFrame(table)

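This works: I get the table and parse it into a dataframe. Now I am trying to do something similar on a different website using Selenium, and this is the code I have so far: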

driver = webdriver.Chrome()
i = "DCD710S2"
base_url = str("https://www.lowes.com/search?searchTerm=" + str(i))
driver.get(base_url)
table = driver.find_element_by_xpath("//*[@id='collapseSpecs']/div/div/div[1]/table/tbody")

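So I can reach the table, and I have tried getAttribute(innerHTML) and a few other getAttribute calls, but I cannot get the table into pandas as it is. Any suggestions on how to handle this with Selenium?

Here is what the HTML looks like: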

[Question comments]:

Tags: python-3.x pandas selenium beautifulsoup selenium-chromedriver

[Solution 1]:

Use pandas to get the tables. Try the code below.

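import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
i = "DCD710S2"
base_url = "https://www.lowes.com/search?searchTerm=" + i
driver.get(base_url)
time.sleep(3)  # crude fixed wait for the page to finish rendering
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div#collapseSpecs")
# read_html returns a list of DataFrames, one per <table> inside the div;
# wrapping the markup in StringIO avoids the literal-HTML deprecation
# warning in recent pandas versions
table = pd.read_html(StringIO(str(div)))
print(table[0])
print(table[1])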

Output:

                                      0                     1
0                     Battery Amp Hours                   1.3
1                     Tool Power Output               189 UWO
2                  Side Handle Included                    No
3             Number of Clutch Settings                    15
4                             Case Type                  Soft
5                           Series Name                   NaN
6                    Tool Weight (lbs.)                   2.2
7                  Tool Length (Inches)                   7.5
8                   Tool Width (Inches)                   2.0
9                  Tool Height (Inches)                  7.75
10  Forward and Reverse Switch Included                   Yes
11                            Sub-Brand                   NaN
12                         Battery Type  Lithium ion (Li-ion)
13                      Battery Voltage           12-volt max
14                     Charger Included                   Yes
15                       Variable Speed                   Yes
                                   0               1
0                 Maximum Chuck Size          3/8-in
1       Number of Batteries Included               2
2                   Battery Warranty  3-year limited
3                Maximum Speed (RPM)          1500.0
4            Bluetooth Compatibility              No
5              Charge Time (Minutes)              40
6                  App Compatibility              No
7                     Works with iOS              No
8                          Brushless              No
9   CA Residents: Prop 65 Warning(s)             Yes
10                     Tool Warranty  3-year limited
11                            UNSPSC        27112700
12                Works with Android              No
13                  Battery Included             Yes
14                       Right Angle              No
15               Wi-Fi Compatibility              No

If you want a single dataframe, try this.

import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
i = "DCD710S2"
base_url = "https://www.lowes.com/search?searchTerm=" + i
driver.get(base_url)
time.sleep(3)  # crude fixed wait for the page to finish rendering
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div#collapseSpecs")
table = pd.read_html(StringIO(str(div)))
# stack the two spec tables into one frame with a fresh 0..n-1 index
frames = [table[0], table[1]]
result = pd.concat(frames, ignore_index=True)
print(result)
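The concatenated frame keeps the default integer column labels 0 and 1 that appear in the output above. As a minimal sketch, the columns can be renamed and the frame indexed by spec name so individual values are easy to look up. The two small frames below are stand-ins for table[0] and table[1] with a few values copied from the output above, and the column names spec_name and spec_value are hypothetical.

```python
import pandas as pd

# Stand-ins for table[0] and table[1]; a few rows copied from the output above.
first = pd.DataFrame([["Battery Amp Hours", "1.3"], ["Case Type", "Soft"]])
second = pd.DataFrame([["Maximum Chuck Size", "3/8-in"], ["Charge Time (Minutes)", "40"]])

result = pd.concat([first, second], ignore_index=True)

# Hypothetical column names replacing the default labels 0 and 1.
result.columns = ["spec_name", "spec_value"]

# A Series keyed by spec name allows direct lookups.
specs = result.set_index("spec_name")["spec_value"]
print(specs["Maximum Chuck Size"])  # prints 3/8-in
```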

A Selenium-only option that builds the pandas dataframe.

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

spec_name = []
spec_item = []
driver = webdriver.Chrome()
i = "DCD710S2"
base_url = "https://www.lowes.com/search?searchTerm=" + i
driver.get(base_url)
# wait up to 20 seconds for the spec tables to be present in the DOM
tables = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[@id='collapseSpecs']//table"))
)
for table in tables:
    # the find_element(s)_by_xpath helpers were removed in later Selenium 4
    # releases; find_element(s)(By.XPATH, ...) is the equivalent call
    for row in table.find_elements(By.XPATH, ".//tr"):
        spec_name.append(row.find_element(By.XPATH, "./th").get_attribute("textContent"))
        spec_item.append(row.find_element(By.XPATH, "./td/span").get_attribute("textContent"))

df = pd.DataFrame({"Spec_Name": spec_name, "Spec_Title": spec_item})

print(df)
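The question asks whether getAttribute can feed pandas directly. One possible route, sketched here under assumptions, is to read the element's outerHTML and hand that markup to pd.read_html, which skips the row-by-row loop. The helper name element_to_frame is hypothetical, and the sketch assumes the located element is the table itself rather than its tbody, since read_html looks for table tags.

```python
from io import StringIO

import pandas as pd


def element_to_frame(table_element):
    """Parse one <table> WebElement into a DataFrame via its outerHTML."""
    # outerHTML includes the <table> tag itself, which read_html searches for;
    # innerHTML omits it.
    html = table_element.get_attribute("outerHTML")
    # read_html returns a list; a single <table> yields a single frame.
    return pd.read_html(StringIO(html))[0]


# Usage against a live page (assumes a driver and a locator like the one above):
# table = driver.find_element(By.XPATH, "//div[@id='collapseSpecs']//table")
# df = element_to_frame(table)
```

Like the answer's first snippet, this relies on an HTML parser such as lxml being installed for read_html.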

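[Comments]:

KunduK - Thank you so much, I think I can use this. I was just wondering whether Selenium has anything that would work with .getAttribute().
@Slavisha84: So you are after a Selenium solution as well?
Yes, I wanted to see whether I could write it as simply as possible without involving Beautiful Soup, but if Selenium has nothing simple, I will just use Beautiful Soup. I have seen a few people do this by first counting the rows and columns, then building the table with Selenium and reading row one, column one, and so on, but that is too complicated and confusing.
@Slavisha84: Updated with a Selenium option as well.
This works really well for this item. But when I switch to a different model, i = ["DCL510"], I get the error: NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"./th"} I compared the HTML for the two search terms and they look exactly the same to me. Why am I getting the error for "DCL510"?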
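The cause of the DCL510 error is not established in this thread. One possible explanation, which is an assumption because that page's HTML is not shown, is that at least one row lacks a th (or td) child, so find_element raises. A defensive sketch is to use find_elements, which returns an empty list instead of raising, and skip incomplete rows. The helper name collect_specs is hypothetical, and it reads the whole td text rather than the inner span used in the answer.

```python
# Same string value as selenium.webdriver.common.by.By.XPATH, spelled out here
# so the helper can be exercised without a browser.
XPATH = "xpath"


def collect_specs(rows):
    """Return (name, value) pairs, skipping rows without both a th and a td."""
    pairs = []
    for row in rows:
        # find_elements returns [] when nothing matches, so no exception here.
        headers = row.find_elements(XPATH, "./th")
        cells = row.find_elements(XPATH, "./td")
        if not headers or not cells:
            continue  # incomplete row: skip it instead of failing
        name = headers[0].get_attribute("textContent").strip()
        value = cells[0].get_attribute("textContent").strip()
        pairs.append((name, value))
    return pairs


# Usage with the tables located in the answer above:
# pairs = [p for t in tables for p in collect_specs(t.find_elements(XPATH, ".//tr"))]
# df = pd.DataFrame(pairs, columns=["Spec_Name", "Spec_Title"])
```

Collecting pairs rather than two parallel lists also keeps names and values aligned if a row is skipped.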

[Solution 2]:

You need to install lxml to use this:

pip install lxml

Code:

import pandas as pd

i = "DCD710S2"
base_url = str("https://www.lowes.com/search?searchTerm=" + str(i))

df = pd.read_html(base_url)

print(df)
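Note that pd.read_html returns a list of DataFrames, one per table it finds, so the print above shows a list rather than a single frame. Whether the spec tables are present in the HTML that read_html fetches from that URL is not confirmed here; pages that build their tables with JavaScript would need the Selenium route. The sketch below uses made-up inline HTML, not the real page, to show picking one table out of several with the match parameter.

```python
from io import StringIO

import pandas as pd

# Two made-up tables standing in for a fetched page (not the real site markup).
html = """
<table>
  <tr><th>Battery Amp Hours</th><td>1.3</td></tr>
  <tr><th>Case Type</th><td>Soft</td></tr>
</table>
<table>
  <tr><th>Store</th><td>Pickup</td></tr>
</table>
"""

# Without match, every table on the page comes back.
all_tables = pd.read_html(StringIO(html))

# With match, only tables whose text matches the string or regex are returned.
spec_tables = pd.read_html(StringIO(html), match="Battery Amp Hours")
spec_df = spec_tables[0]
print(len(all_tables), len(spec_tables))
```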

[Comments]: