【Question Title】: Extracting Table data using Selenium and Python into pandas dataframe
【Posted】: 2020-02-09 08:03:59
【Question】:

I have already used the BeautifulSoup library to extract data from a table, with the following code:

        if soup.find("table", {"class":"a-keyvalue prodDetTable"}) is not None:
            table = parse_table(soup.find("table", {"class":"a-keyvalue prodDetTable"}))
            df = pd.DataFrame(table)

That works: I get the table and parse it into a dataframe. Now I am trying to do something similar on a different website using Selenium, and this is my code so far:

driver = webdriver.Chrome()
i = "DCD710S2"
base_url = str("https://www.lowes.com/search?searchTerm=" + str(i))
driver.get(base_url)
table = driver.find_element_by_xpath("//*[@id='collapseSpecs']/div/div/div[1]/table/tbody")

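So I can locate the table, and I tried getAttribute(innerHTML) and a few other getAttribute variants, but I cannot get the table as-is into pandas. Any suggestions on how to handle this with Selenium?

This is what the HTML looks like: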

【Question Comments】:

    Tags: python-3.x pandas selenium beautifulsoup selenium-chromedriver


    【Solution 1】:

    Use pandas to get the tables. Try the code below.

    import pandas as pd
    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup
    driver = webdriver.Chrome()
    i = "DCD710S2"
    base_url = str("https://www.lowes.com/search?searchTerm=" + str(i))
    driver.get(base_url)
    time.sleep(3)
    html=driver.page_source
    soup=BeautifulSoup(html,'html.parser')
    div=soup.select_one("div#collapseSpecs")
    table=pd.read_html(str(div))
    print(table[0])
    print(table[1])
    

    Output

                                          0                     1
    0                     Battery Amp Hours                   1.3
    1                     Tool Power Output               189 UWO
    2                  Side Handle Included                    No
    3             Number of Clutch Settings                    15
    4                             Case Type                  Soft
    5                           Series Name                   NaN
    6                    Tool Weight (lbs.)                   2.2
    7                  Tool Length (Inches)                   7.5
    8                   Tool Width (Inches)                   2.0
    9                  Tool Height (Inches)                  7.75
    10  Forward and Reverse Switch Included                   Yes
    11                            Sub-Brand                   NaN
    12                         Battery Type  Lithium ion (Li-ion)
    13                      Battery Voltage           12-volt max
    14                     Charger Included                   Yes
    15                       Variable Speed                   Yes
                                       0               1
    0                 Maximum Chuck Size          3/8-in
    1       Number of Batteries Included               2
    2                   Battery Warranty  3-year limited
    3                Maximum Speed (RPM)          1500.0
    4            Bluetooth Compatibility              No
    5              Charge Time (Minutes)              40
    6                  App Compatibility              No
    7                     Works with iOS              No
    8                          Brushless              No
    9   CA Residents: Prop 65 Warning(s)             Yes
    10                     Tool Warranty  3-year limited
    11                            UNSPSC        27112700
    12                Works with Android              No
    13                  Battery Included             Yes
    14                       Right Angle              No
    15               Wi-Fi Compatibility              No
    

    If you want a single dataframe, try this.

    import pandas as pd
    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup
    driver = webdriver.Chrome()
    i = "DCD710S2"
    base_url = str("https://www.lowes.com/search?searchTerm=" + str(i))
    driver.get(base_url)
    time.sleep(3)
    html=driver.page_source
    soup=BeautifulSoup(html,'html.parser')
    div=soup.select_one("div#collapseSpecs")
    table=pd.read_html(str(div))
    frames = [table[0], table[1]]
    result=pd.concat(frames,ignore_index=True)
    print(result)
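
The two frames returned by read_html both carry the default integer column labels 0 and 1, which is why concat can stack them row-wise. As a minimal sketch, the snippet below uses two small hand-built frames as stand-ins for table[0] and table[1]; the column names Spec_Name and Spec_Value are illustrative assumptions, not something the page provides.

```python
import pandas as pd

# Hand-built stand-ins for table[0] and table[1]; the real frames come from pd.read_html.
left = pd.DataFrame([["Battery Amp Hours", "1.3"], ["Case Type", "Soft"]])
right = pd.DataFrame([["Maximum Chuck Size", "3/8-in"], ["Brushless", "No"]])

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1.
result = pd.concat([left, right], ignore_index=True)

# Illustrative column names (an assumption; the scraped tables have no header row).
result.columns = ["Spec_Name", "Spec_Value"]

# Index by spec name so a single value can be looked up directly.
specs = result.set_index("Spec_Name")["Spec_Value"]
print(specs["Case Type"])
```

Naming the columns before further processing may make later lookups easier to read than positional access through 0 and 1.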
    

    Selenium-only option that builds a pandas DataFrame.

    import pandas as pd
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium import webdriver
    spec_name=[]
    spec_item=[]
    driver = webdriver.Chrome()
    i = "DCD710S2"
    base_url = str("https://www.lowes.com/search?searchTerm=" + str(i))
    driver.get(base_url)
    tables=WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@id='collapseSpecs']//table")))
    for table in tables:
        for row in table.find_elements_by_xpath(".//tr"):
            spec_name.append(row.find_element_by_xpath('./th').get_attribute('textContent'))
            spec_item.append(row.find_element_by_xpath('./td/span').get_attribute('textContent'))
    
    df = pd.DataFrame({"Spec_Name":spec_name,"Spec_Title":spec_item})
    
    print(df)
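
On the getAttribute idea from the question: one possible route, sketched here under assumptions, is to hand the element's HTML to pandas directly. pd.read_html only looks for <table> elements, so the HTML of a <tbody> (which is what the XPath in the question selects) probably needs to be wrapped in a <table> first. The helper names tbody_html_to_frame and spec_frame_from_driver are hypothetical, and the sample markup is a stand-in for the real page.

```python
from io import StringIO

import pandas as pd


def tbody_html_to_frame(tbody_html: str) -> pd.DataFrame:
    """Parse the HTML of a <tbody> (or bare <tr> rows) into a DataFrame."""
    # read_html only searches for <table> elements, so wrap the fragment in one.
    wrapped = f"<table>{tbody_html}</table>"
    # StringIO avoids the deprecation warning that newer pandas versions
    # emit when given a literal HTML string.
    return pd.read_html(StringIO(wrapped))[0]


def spec_frame_from_driver(driver) -> pd.DataFrame:
    """Hypothetical helper: read the first spec table from a loaded page."""
    from selenium.webdriver.common.by import By  # lazy import; needs selenium

    tbody = driver.find_element(
        By.XPATH, "//*[@id='collapseSpecs']/div/div/div[1]/table/tbody"
    )
    return tbody_html_to_frame(tbody.get_attribute("outerHTML"))


# Stand-in markup for a quick check without a browser.
sample = (
    "<tbody>"
    "<tr><th>Battery Amp Hours</th><td><span>1.3</span></td></tr>"
    "<tr><th>Case Type</th><td><span>Soft</span></td></tr>"
    "</tbody>"
)
print(tbody_html_to_frame(sample))
```

With a live driver, spec_frame_from_driver(driver) would be called after driver.get(base_url) and after waiting for the table to load.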
    

    【Comments】:

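    • KunduK - Thank you very much, I think I can use this. I was just wondering whether Selenium has anything that works with .getAttribute().
    • @Slavisha84: So you want a Selenium solution as well?
    • Yes, I wanted to see whether I could write it as simply as possible without involving Beautiful Soup, but if Selenium has nothing simple, I will just use Beautiful Soup. I have seen a few people do it by first counting the rows and columns, then building the table with Selenium and reading it cell by cell starting from the first row and first column, but that is too complicated and confusing.
    • @Slavisha84: Updated with a Selenium option as well.
    • This works great for this item. But when I switch to a different model with i = ["DCL510"], I get an error: NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"./th"} I compared the HTML for the two search terms and they look identical to me. Why am I getting this error for "DCL510"?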
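
The thread does not establish why "DCL510" fails. One plausible cause, stated here only as an assumption, is that some row on that page has no th (or no td) child, so find_element raises. A defensive sketch uses find_elements, which returns an empty list instead of raising, and skips incomplete rows. The function names are hypothetical, and the locator string "xpath" is used in place of By.XPATH so the snippet has no import-time dependency on selenium.

```python
XPATH = "xpath"  # same string value as selenium's By.XPATH


def row_to_pair(row):
    """Return (name, value) for a table row, or None if a cell is missing."""
    # find_elements returns an empty list when nothing matches,
    # whereas find_element raises NoSuchElementException.
    names = row.find_elements(XPATH, "./th")
    values = row.find_elements(XPATH, "./td")
    if not names or not values:
        return None
    # textContent of the td includes the text of any nested span.
    return (
        names[0].get_attribute("textContent").strip(),
        values[0].get_attribute("textContent").strip(),
    )


def rows_to_pairs(rows):
    """Collect pairs from rows, skipping rows that lack a th or td."""
    pairs = (row_to_pair(row) for row in rows)
    return [pair for pair in pairs if pair is not None]
```

With the loop from the answer, rows_to_pairs(table.find_elements(XPATH, ".//tr")) would yield a list of tuples that pd.DataFrame can take together with a columns argument.
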
    【Solution 2】:

    You need to install lxml to use this:

    pip install lxml 
    

    Code:

    import pandas as pd
    
    i = "DCD710S2"
    base_url = str("https://www.lowes.com/search?searchTerm=" + str(i))
    
    df = pd.read_html(base_url)
    
    print(df)
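
Two things are worth keeping in mind with this approach. pd.read_html returns a list of DataFrames (one per <table> found), not a single frame. It also downloads the raw HTML without running JavaScript, so it may raise "No tables found" if a page builds its tables client-side; whether that applies to this particular site is not confirmed here. The sketch below uses stand-in markup to show the list return value and the match parameter for picking a table by its text.

```python
from io import StringIO

import pandas as pd

# Stand-in markup with two tables; a real page would be fetched first.
html = """
<table><tr><th>Battery Type</th><td>Lithium ion (Li-ion)</td></tr></table>
<table><tr><th>Maximum Chuck Size</th><td>3/8-in</td></tr></table>
"""

# One DataFrame per <table> element in the input.
tables = pd.read_html(StringIO(html))
print(len(tables))

# match keeps only the tables whose text matches the given string or regex.
chuck = pd.read_html(StringIO(html), match="Chuck")
print(chuck[0])
```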
    

    【Comments】:
