Scraping with selenium where table has same class names

【Question title】: Scraping with selenium where table has same class names
【Posted】: 2024-05-21 21:55:01
【Question】:

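I'm trying to parse a table using selenium and Beautiful Soup, but I'm having trouble locating the cells by class and extracting their values. Every column seems to have the same class name, which makes it harder. Here is part of the HTML I'm trying to parse:

This is what the table looks like:

So far my code looks like this: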

driver = webdriver.Chrome()
driver.get(base_url)
driver.implicitly_wait(100)
driver.find_elements_by_class_name("plp-pod__image")[0].click()
first = driver.find_elements_by_class_name("col-6 specs__cell specs__cell--label")[0].getText()
first

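So basically I open the Chrome browser, load the page for the item I'm looking for, then look for every element with the class "col-6 specs__cell specs__cell--label" and try to get the text from the first one that appears. I'm trying to do this for all 5 dimensions and their values.

When I run my code, I get this error: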

    ---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-27-2e124acf6be5> in <module>
      3 driver.implicitly_wait(100)
      4 driver.find_elements_by_class_name("plp-pod__image")[0].click()
----> 5 first = driver.find_elements_by_class_name("col-6 specs__cell specs__cell--label")[0].getText()

IndexError: list index out of range

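Any idea how to parse these elements so that all 5 dimensions and their values end up in a pandas DataFrame?

I tried combining both of your suggestions like this: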

from selenium.common.exceptions import NoSuchElementException, NoSuchFrameException
i = "Marshalltown PT164BR"
base_url = f"https://www.homedepot.com/s/" + i +"?NCNI-5"

driver = webdriver.Chrome()
driver.get(base_url)
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".plp-pod__image"))).click()
#%%

groups = driver.find_elements_by_class_name("specs__group")
data = {}
for group in groups:
    if "placeholder" not in group.get_attribute("class"):
        specs = group.find_elements_by_class_name("specs__cell")
        dimension = specs[0].text.strip()
        value = float(specs[1].text.replace("in","").strip())
        #print(dimension,":",value)
        if dimension not in data:
            data[dimension] = []
        data[dimension].append(value)
print(data)
data_frame = pd.DataFrame(data=data)
print(data_frame)

Then I ran it against the web page and the item I'm using as a test, but it doesn't seem to read the right class, and it gives me this error:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-3-1f3f99bc45ee> in <module>
      5         specs = group.find_elements_by_class_name("specs__cell")
      6         dimension = specs[0].text.strip()
----> 7         value = float(specs[1].text.replace("in","").strip())
      8         #print(dimension,":",value)
      9         if dimension not in data:

ValueError: could not convert string to float:

【Comments】:

Tags: python python-3.x selenium-webdriver xpath selenium-chromedriver

【Solution 1】:

Following up on the previous post, if I use this HTML:

<html>
<head></head>
<body>
<div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Blade Length (in.)</div>
    <div class="col-6 specs__cell">16 in</div>
</div>
<div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Blade Width (in.)</div>
    <div class="col-6 specs__cell">4.5</div>
</div>
<div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Product Height (in.)</div>
    <div class="col-6 specs__cell">3.63 in</div>
</div>
<div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Product Length (in.)</div>
    <div class="col-6 specs__cell">16 in</div>
</div>
<div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Product Width (in.)</div>
    <div class="col-6 specs__cell">4.5 in</div>
</div>
<div class="specs__group placeholder" style="min-height: 39px;">
    ??
</div>
</body>
</html>

You can build a dictionary or a DataFrame:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, NoSuchFrameException
from selenium.webdriver.common.by import By

base_url = "file:///C:/Users/.../blade.html"

driver = webdriver.Chrome()
driver.get(base_url)
# find_elements(By.CLASS_NAME, ...) is the current form of
# find_elements_by_class_name, which was removed in Selenium 4.3.
groups = driver.find_elements(By.CLASS_NAME, "specs__group")
data = {}
for group in groups:
    # Skip the empty placeholder group at the end of the table.
    if "placeholder" not in group.get_attribute("class"):
        # Each group holds two cells: the label, then the value.
        specs = group.find_elements(By.CLASS_NAME, "specs__cell")
        dimension = specs[0].text.strip()
        # Drop the "in" unit so the remaining text converts to a number.
        value = float(specs[1].text.replace("in","").strip())
        #print(dimension,":",value)
        if dimension not in data:
            data[dimension] = []
        data[dimension].append(value)
print(data)
data_frame = pd.DataFrame(data=data)
print(data_frame)
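
A likely reason the call in the question returned an empty list (hence the IndexError) is that a class-name locator expects a single class, while "col-6 specs__cell specs__cell--label" is three classes separated by spaces. A minimal sketch, assuming you want elements that carry all three classes, is to turn the attribute value into a compound CSS selector first. The helper name class_attr_to_css is hypothetical and not part of Selenium, and it does not escape unusual characters in class names.

```python
def class_attr_to_css(class_attr: str) -> str:
    """Turn a space-separated class attribute into a compound CSS selector."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class attribute is empty")
    # ".a.b.c" matches elements that have all of the classes a, b and c.
    return "".join("." + name for name in classes)


selector = class_attr_to_css("col-6 specs__cell specs__cell--label")
print(selector)  # .col-6.specs__cell.specs__cell--label

# Assumed usage with a live driver (not executed here):
# labels = driver.find_elements(By.CSS_SELECTOR, selector)
```

The code above sidesteps the issue by locating on one class at a time ("specs__group", then "specs__cell").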

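【Comments】:

I think this solution is exactly what I need. However, when I load the page and run the rest of the code, I get this error: ValueError: could not convert string to float. Any idea why that happens?
I've also added my example and your code to the original post.
Try text_value = specs[1].text.replace("in","").strip(); print(text_value). The output you want is a number, but you will get a string like "Steel", because the "Details" table on the site also uses the class name specs__group. Check len(groups): you want about 5 and you get 22. Narrow your path down to the "Dimensions" table, something like: id specsContainer > class specs__table[0] ... Test whether you are getting what you want by printing to the console.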
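
The check suggested in the last comment can be sketched as a small parser that returns None for cells such as "Steel" instead of raising ValueError. This is only an illustration: the name parse_inches and the regular expression are assumptions, not part of the original answer, and a cell like "Grade 304" would still yield a number, so it does not replace narrowing the locator to the "Dimensions" table.

```python
import re
from typing import Optional

# First integer or decimal number in the text, e.g. "16", "4.5", "3.63".
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_inches(text: str) -> Optional[float]:
    """Return the first number found in text as a float, or None if there is none."""
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


for raw in ["16 in", "4.5", "3.63 in", "Steel", ""]:
    print(repr(raw), "->", parse_inches(raw))
```

Inside the loop, a None result could be used to skip the row rather than appending it to data.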

【Solution 2】:

Here is the code to get the product dimensions.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome()
i = "Marshalltown PT164BR"
base_url ="https://www.homedepot.com/s/" + i +"?NCNI-5"
driver.get(base_url)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".plp-pod__image"))).click()
dimension_types = []
dimension_sizes = []
# Rows of the first specs table that follows the "Dimensions" heading.
elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "(//h4[text()='Dimensions']/following::div[contains(@class,'specs__table')])[1]/div")))
for ele in elements:
    # Skip placeholder rows, which have no label/value cells.
    if "placeholder" not in ele.get_attribute("class"):
        # The label cell, then the value cell that immediately follows it.
        dimension_type = ele.find_element(By.XPATH, ".//div[@class='col-6 specs__cell specs__cell--label']").get_attribute("textContent")
        dimension_size = ele.find_element(By.XPATH, ".//div[@class='col-6 specs__cell specs__cell--label']/following-sibling::div[1]").get_attribute("textContent")
        dimension_types.append(dimension_type)
        dimension_sizes.append(dimension_size)

df = pd.DataFrame({"DimensionSize": dimension_sizes, "DimensionType": dimension_types})
print(df)

Console output:

    DimensionSize       DimensionType
0         16 in    Blade Length (in.)
1           4.5     Blade Width (in.)
2       3.63 in  Product Height (in.)
3         16 in  Product Length (in.)
4        4.5 in   Product Width (in.)
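
The XPath in this answer works because it anchors on the h4 heading text and takes only the first specs table after it, so rows from other tables that share the same class names are left out. A minimal sketch of the same expression with the heading as a parameter follows. The function name spec_rows_xpath is hypothetical, and whether other sections of the page (for example "Details") are also introduced by an h4 is an assumption that would need checking against the live page.

```python
def spec_rows_xpath(heading: str) -> str:
    """Build the XPath for the rows of the first specs table after an <h4> heading."""
    if "'" in heading:
        # The heading is embedded in a single-quoted XPath string literal.
        raise ValueError("heading must not contain a single quote")
    return (
        f"(//h4[text()='{heading}']"
        "/following::div[contains(@class,'specs__table')])[1]/div"
    )


print(spec_rows_xpath("Dimensions"))

# Assumed usage with a live driver (not executed here):
# rows = driver.find_elements(By.XPATH, spec_rows_xpath("Dimensions"))
```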

【Comments】:

Excellent, Kunduk. This is the one. Thank you so much. I understand every line of the code, and it's perfect.