Extract button link text from a website with Python Selenium: answers

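【Question title】: Extract button link text from a website (Python, Selenium)
【Posted】: 2019-05-24 06:13:27
【Question description】:

Here is the link from which I want to extract the button link text, but I am unable to do so. After the site opens, I select an option from "Select Product". Suppose I select the first option, "Acrylic Coatings"; three types then appear, namely "Primers", "Intermediates" and "Finishes". I want to extract that text, but I have not been able to.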

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('~/chromedriver.exe')

driver.get('http://www.asianpaintsppg.com/applications/protective_products.aspx')
lst_name = ['Acrylic Coatings','Glass Flake Coatings']

for i in lst_name:
    print(i)
    driver.find_element_by_xpath("//select[@name='txtProduct']/option[text()="+"'"+str(i)+"'"+"]").click()
    page = requests.get("http://www.asianpaintsppg.com/applications/protective_products.aspx")
    soup = BeautifulSoup(page.content, 'html.parser')
    for div in soup.findAll('table', attrs={'id':'dataLstSubCat'}):
      print(div.find('a')['href'])

But I get empty values here. Any help would be appreciated.

【Question comments】:

I think this will help you: Previously asked similar question

Tags: python-3.x selenium-webdriver web-scraping beautifulsoup web-crawler

【Solution 1】:

There is a way to get the subcategories without using selenium. Try a POST request as shown below.

import requests
from bs4 import BeautifulSoup

url = "http://www.asianpaintsppg.com/applications/protective_products.aspx"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['txtProduct'] = '2' #This is the dropdown number
    res = s.post(url,data=payload)
    sauce = BeautifulSoup(res.text,"lxml")
    subcat = [item.text for item in sauce.select("[id^='dataLstSubCat_']")]
    print(subcat)

Output you may get:

['Primers', 'Intermediates', 'Finishes']
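
To unpack what the snippet above does: an ASP.NET Web Forms page typically expects its hidden form fields (such as __VIEWSTATE) to come back with every POST, so the code copies every named input from the first response and then overrides only the dropdown value. Here is a rough sketch of just that payload-building step, using only the standard library instead of BeautifulSoup. SAMPLE_HTML, its field values, InputCollector and build_payload are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down stand-in for the form an .aspx page might return.
SAMPLE_HTML = """
<form method="post" action="protective_products.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="txtSearch" />
</form>
"""


class InputCollector(HTMLParser):
    """Collect the name/value pair of every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # A missing value attribute becomes "", like i.get('value', '').
            self.fields[name] = attributes.get("value") or ""


def build_payload(html, product_value):
    """Copy the form's inputs, then set the dropdown value to submit."""
    collector = InputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload["txtProduct"] = product_value
    return payload


payload = build_payload(SAMPLE_HTML, "2")
print(payload)
```

Sending this dictionary with s.post(url, data=payload) on the same Session is what makes the server return the page with the subcategory links filled in.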

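【Comments】:

Much better there :-)
@SIM It would be much appreciated if you could explain what your code does. Thanks.

【Solution 2】:

You want .text rather than href, plus a wait condition that gives the page time to update: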

#dataLstSubCat a

Then extract .text in a loop or comprehension:

items = [item.text for item in soup.select('#dataLstSubCat a')]

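You can do everything with selenium. You need one wait condition to make sure the content is present, and an additional wait condition for the text to change after iteration 1. I use time.sleep, which is suboptimal.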

items = [item.text for item in  WebDriverWait(driver,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#dataLstSubCat a")))]

Additional imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

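You can probably do all of this with an initial GET followed by a POST request, since the page appears to use __doPostBack (.aspx), where the value from the dropdown above is used to return the child items.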
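
For context, __doPostBack in ASP.NET Web Forms fills two hidden fields, __EVENTTARGET and __EVENTARGUMENT, and then submits the form. A minimal sketch of building the matching POST body is below. It assumes the dropdown control is the event target and is named txtProduct, which has not been checked against the live page; build_postback_payload and the sample field values are hypothetical.

```python
def build_postback_payload(form_fields, event_target, event_argument="", overrides=None):
    """Return a copy of form_fields shaped like a __doPostBack submission."""
    payload = dict(form_fields)  # leave the caller's dictionary untouched
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    payload.update(overrides or {})
    return payload


# Hypothetical hidden fields scraped from the initial GET response.
form_fields = {"__VIEWSTATE": "abc123", "__EVENTTARGET": "", "__EVENTARGUMENT": ""}

postback = build_postback_payload(
    form_fields,
    event_target="txtProduct",        # assumed control name
    overrides={"txtProduct": "2"},    # dropdown value to select
)
print(postback)
```

The result would be passed as data= to a POST on the same requests.Session that made the GET, so that cookies and view state stay consistent.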

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time

driver = webdriver.Chrome() #'~/chromedriver.exe')
driver.get('http://www.asianpaintsppg.com/applications/protective_products.aspx')

lst_name = ['Acrylic Coatings','Glass Flake Coatings']

for i in lst_name:
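    # find_element(By.XPATH, ...) works in Selenium 3 and 4; find_element_by_xpath is gone in recent Selenium 4 releases
    driver.find_element(By.XPATH, f"//select[@name='txtProduct']/option[text()='{i}']").click()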
    items = [item.text for item in  WebDriverWait(driver,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#dataLstSubCat a")))]
    print(items)
    time.sleep(2)
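
The fixed time.sleep(2) could be swapped for a wait that returns as soon as the link text differs from the previous iteration. WebDriverWait.until accepts any callable that takes the driver and returns a truthy value once the condition holds, so a predicate like text_changed below fits that shape. To keep the sketch runnable without a browser, poll_until and FakeDriver are hypothetical stand-ins for WebDriverWait and a real driver, and the "Item A"/"Item B" values are placeholders, not real site data.

```python
import time


def text_changed(read_items, previous):
    """Build a predicate in the shape WebDriverWait.until expects."""
    def _predicate(driver):
        current = read_items(driver)
        # Truthy only once there is content and it differs from last time.
        return current if current and current != previous else False
    return _predicate


def poll_until(driver, predicate, timeout=5.0, interval=0.1):
    """Tiny stand-in for WebDriverWait(driver, timeout).until(predicate)."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)


class FakeDriver:
    """Hypothetical driver whose page shows stale content for a few reads."""

    def __init__(self, stale, fresh, reads_until_update=3):
        self.stale = stale
        self.fresh = fresh
        self.reads_left = reads_until_update

    def read(self):
        if self.reads_left > 0:
            self.reads_left -= 1
            return self.stale
        return self.fresh


previous = ["Primers", "Intermediates", "Finishes"]
driver = FakeDriver(stale=previous, fresh=["Item A", "Item B"])
items = poll_until(driver, text_changed(lambda d: d.read(), previous), timeout=2, interval=0.01)
print(items)
```

With a real driver, read_items would be something like lambda d: [a.text for a in d.find_elements(By.CSS_SELECTOR, "#dataLstSubCat a")], passed to WebDriverWait(driver, 5).until(...). Note that this wait times out if two products really do share the same subcategories, and that handling of stale element references is left out.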

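【Comments】:

It gives [] / an empty list.
Did you use the wait condition?
No, I did not use any wait condition. How do I implement it in my code?
Try it as shown above first, since you may be trying to access the DOM before it has updated.
Your code gives this output: "Acrylic Coatings ['Primers', 'Intermediates', 'Finishes'] Glass Flake Coatings ['Primers', 'Intermediates', 'Finishes']". That is correct for "Acrylic Coatings", but it does not give the proper output for "Glass Flake Coatings".

【Solution 3】:

Use the code below, which prints the subcategory link text for each selected product.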

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome('~/chromedriver.exe')
driver.get('http://www.asianpaintsppg.com/applications/protective_products.aspx')
lst_name = ['Acrylic Coatings','Glass Flake Coatings']

for i in lst_name:

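    # find_element(By.XPATH, ...) works in Selenium 3 and 4; find_element_by_xpath is gone in recent Selenium 4 releases
    driver.find_element(By.XPATH, f"//select[@name='txtProduct']/option[text()='{i}']").click()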
    elements=WebDriverWait(driver, 10).until(expected_conditions.presence_of_all_elements_located((By.XPATH, '//table[@id="dataLstSubCat"]//tr//td//a[starts-with(@id,"dataLstSubCat_LnkBtnSubCat_")]')))
    for ele in elements:
        print(ele.text)

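【Comments】:

This is not the output I expected; please refer to the question.
@deepesh: Sorry. Please try the updated code.
This gives "Primers Intermediates Finishes" as output for the first item in the list and also "Primers Intermediates Finishes" for the second item in lst_name, which is not actually the case.
That is because, if you look at your code, it is in a loop: each iteration selects the dropdown value and prints the result.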