【Question title】: Scraping with selenium where table has same class names
【Posted】: 2024-05-21 21:55:01
【Question】:

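I am trying to parse a table with Selenium and Beautiful Soup, but I am having trouble locating the cells and extracting their values. Every column seems to share the same class names, which makes it harder. Here is part of the HTML I am trying to parse:

This is what the table looks like:

So far my code looks like this: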

driver = webdriver.Chrome()
driver.get(base_url)
driver.implicitly_wait(100)
driver.find_elements_by_class_name("plp-pod__image")[0].click()
first = driver.find_elements_by_class_name("col-6 specs__cell specs__cell--label")[0].getText()
first

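So basically I open the Chrome browser, load the page for the item I am looking for, then look for all elements with the class "col-6 specs__cell specs__cell--label" and try to get the text of the first match. I am trying to extract all 5 dimensions and their values.

When I run my code, I get this error: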

    ---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-27-2e124acf6be5> in <module>
      3 driver.implicitly_wait(100)
      4 driver.find_elements_by_class_name("plp-pod__image")[0].click()
----> 5 first = driver.find_elements_by_class_name("col-6 specs__cell specs__cell--label")[0].getText()

IndexError: list index out of range

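Any idea how to parse these elements so that all 5 dimensions and their values end up in a pandas DataFrame?

I tried combining both of your suggestions like this: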

from selenium.common.exceptions import NoSuchElementException, NoSuchFrameException
i = "Marshalltown PT164BR"
base_url = f"https://www.homedepot.com/s/" + i +"?NCNI-5"

driver = webdriver.Chrome()
driver.get(base_url)
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 
".plp-pod__image"))).click()
#%%

groups = driver.find_elements_by_class_name("specs__group")
data = {}
for group in groups:
    if "placeholder" not in group.get_attribute("class"):
        specs = group.find_elements_by_class_name("specs__cell")
        dimension = specs[0].text.strip()
        value = float(specs[1].text.replace("in","").strip())
        #print(dimension,":",value)
        if dimension not in data:
            data[dimension] = []
        data[dimension].append(value)
print(data)
data_frame = pd.DataFrame(data=data)
print(data_frame)

I then ran it against the web page and the item I am using as a test, but it does not seem to read the right classes, and it gives me this error:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-3-1f3f99bc45ee> in <module>
      5         specs = group.find_elements_by_class_name("specs__cell")
      6         dimension = specs[0].text.strip()
----> 7         value = float(specs[1].text.replace("in","").strip())
      8         #print(dimension,":",value)
      9         if dimension not in data:

ValueError: could not convert string to float:

【Discussion】:

    Tags: python python-3.x selenium-webdriver xpath selenium-chromedriver


    【Solution 1】:

    Following on from the previous post, if I use this HTML:

    <html>
    <head></head>
    <body>
    <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
        <div class="col-6 specs__cell specs__cell--label">Blade Length (in.)</div>
        <div class="col-6 specs__cell">16 in</div>
    </div>
    <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
        <div class="col-6 specs__cell specs__cell--label">Blade Width (in.)</div>
        <div class="col-6 specs__cell">4.5</div>
    </div>
    <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
        <div class="col-6 specs__cell specs__cell--label">Product Height (in.)</div>
        <div class="col-6 specs__cell">3.63 in</div>
    </div>
    <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
        <div class="col-6 specs__cell specs__cell--label">Product Length (in.)</div>
        <div class="col-6 specs__cell">16 in</div>
    </div>
    <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
        <div class="col-6 specs__cell specs__cell--label">Product Width (in.)</div>
        <div class="col-6 specs__cell">4.5 in</div>
    </div>
    <div class="specs__group placeholder" style="min-height: 39px;">
        ??
    </div>
    </body>
    </html>
    

    you can build a dictionary or a DataFrame:

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    base_url = "file:///C:/Users/.../blade.html"
    
    driver = webdriver.Chrome()
    driver.get(base_url)
    # find_elements_by_class_name is gone from newer Selenium 4 releases;
    # find_elements(By.CLASS_NAME, ...) takes a single class name.
    groups = driver.find_elements(By.CLASS_NAME, "specs__group")
    data = {}
    for group in groups:
        # Skip the empty placeholder group
        if "placeholder" not in group.get_attribute("class"):
            specs = group.find_elements(By.CLASS_NAME, "specs__cell")
            dimension = specs[0].text.strip()  # label cell
            value = float(specs[1].text.replace("in","").strip())  # value cell, e.g. "16 in"
            #print(dimension,":",value)
            if dimension not in data:
                data[dimension] = []
            data[dimension].append(value)
    print(data)
    data_frame = pd.DataFrame(data=data)
    print(data_frame)
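
    A note on the IndexError in the question: a class-name lookup expects a single class name, so passing the whole attribute value "col-6 specs__cell specs__cell--label" matches nothing, and indexing the empty result with [0] fails. A minimal sketch of turning such an attribute value into a compound CSS selector follows; the helper name class_attr_to_css is hypothetical, and it assumes the class names contain no characters that need CSS escaping.

```python
def class_attr_to_css(class_attr: str) -> str:
    """Turn a space-separated class attribute into a compound CSS selector.

    Hypothetical helper for illustration; assumes plain class names
    that need no CSS escaping.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute is empty")
    # ".a.b.c" matches elements that carry all of the classes a, b and c
    return "".join("." + name for name in names)


selector = class_attr_to_css("col-6 specs__cell specs__cell--label")
print(selector)
```

    The resulting string could then be passed to a CSS lookup such as driver.find_elements(By.CSS_SELECTOR, selector) instead of a class-name lookup.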
    

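    【Comments】:

    • I think this solution is exactly what I need. However, when I load the page and run the rest of the code, I get this error: ValueError: could not convert string to float. Any idea why that happens?
    • I have also added my example and your code to the original post.
    • Try text_value = specs[1].text.replace("in","").strip() followed by print(text_value). The output you want should be a number, but you will get a string like "Steel", because the "Details" table on the site also uses the class name specs__group. Check len(groups): you want about 5 and you get 22. Narrow your path down to the "Dimensions" table, e.g. id specsContainer > class specs__table[0] ... and check that you get what you want by printing to the console.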
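    As the last comment points out, some cells hold text rather than a number, so a bare float() call raises ValueError. A minimal sketch of a more tolerant conversion follows; parse_measurement and the sample rows are hypothetical (the "Material" label in particular is an assumption), and it only recognises a plain number with an optional trailing "in" unit.

```python
import re
from typing import Optional

# A number, optionally followed by an "in" / "in." unit, and nothing else
_MEASUREMENT = re.compile(r"\s*([-+]?\d+(?:\.\d+)?)\s*(?:in\.?)?\s*")


def parse_measurement(text: str) -> Optional[float]:
    """Return the numeric part of strings like '16 in' or '4.5', else None."""
    match = _MEASUREMENT.fullmatch(text)
    return float(match.group(1)) if match else None


# Hypothetical (label, value) pairs, as they might come out of the spec cells
rows = [
    ("Blade Length (in.)", "16 in"),
    ("Blade Width (in.)", "4.5"),
    ("Material", "Steel"),  # assumed label; non-numeric value is skipped
    ("", ""),               # empty cells are skipped as well
]

data = {}
for label, raw in rows:
    value = parse_measurement(raw)
    if value is not None:
        data[label] = value

print(data)
```

    Skipping unparseable rows hides the symptom only; scoping the lookup to the "Dimensions" table, as the comment suggests, is still what keeps unrelated rows out.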
    【Solution 2】:

    Here is the code to get the product dimensions.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd
    
    driver = webdriver.Chrome()
    i = "Marshalltown PT164BR"
    base_url = "https://www.homedepot.com/s/" + i + "?NCNI-5"
    driver.get(base_url)
    # Open the first product in the search results
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".plp-pod__image"))).click()
    dimension_types = []
    dimension_sizes = []
    # Rows of the first specs table that follows the "Dimensions" heading
    elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "(//h4[text()='Dimensions']/following::div[contains(@class,'specs__table')])[1]/div")))
    label_xpath = ".//div[@class='col-6 specs__cell specs__cell--label']"
    for ele in elements:
        # Skip the empty placeholder row
        if "placeholder" not in ele.get_attribute("class"):
            # find_element(By.XPATH, ...) replaces the older find_element_by_xpath
            dimension_type = ele.find_element(By.XPATH, label_xpath).get_attribute("textContent")
            dimension_size = ele.find_element(By.XPATH, label_xpath + "/following-sibling::div[1]").get_attribute("textContent")
            dimension_types.append(dimension_type)
            dimension_sizes.append(dimension_size)
    
    df = pd.DataFrame({"DimensionSize": dimension_sizes, "DimensionType": dimension_types})
    print(df)
    

    Console output:

        DimensionSize       DimensionType
    0         16 in    Blade Length (in.)
    1           4.5     Blade Width (in.)
    2       3.63 in  Product Height (in.)
    3         16 in  Product Length (in.)
    4        4.5 in   Product Width (in.)
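
    The XPath in this answer limits the search to the first specs__table block that follows the "Dimensions" heading, which is why only the five dimension rows come back. A minimal sketch of parameterising that expression by heading text follows; the helper name specs_rows_xpath is hypothetical, and using it for other tables (such as the "Details" table mentioned in the comments on Solution 1) assumes their headings are h4 elements as well.

```python
def specs_rows_xpath(heading: str) -> str:
    """Build the row XPath for the specs table that follows a given heading.

    Hypothetical helper for illustration; the heading is placed inside a
    single-quoted XPath string literal, so it must not contain a single quote.
    """
    if "'" in heading:
        raise ValueError("heading must not contain a single quote")
    return (
        f"(//h4[text()='{heading}']"
        "/following::div[contains(@class,'specs__table')])[1]/div"
    )


print(specs_rows_xpath("Dimensions"))
```

    For "Dimensions" this yields the same expression that is passed to presence_of_all_elements_located in the code above.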
    

    【Comments】:

    • Excellent, Kunduk. This is the one. Thank you so much. I understand every line of the code, and it is perfect.