【Question Title】: Attempting to generate links from all products on a website using Selenium
【Posted】: 2019-04-28 05:19:44
【Question】:

The main goal of the script is to generate links for all of the products available on the website; the products are separated by category.

The problem I'm running into is that I can only generate links for one category (infusion), specifically the one URL I have saved. The second category, or URL, that I would like to include is: https://www.vatainc.com/wound-care.html

Is there a way to loop through multiple category URLs with the same effect as the script I already have?

Here is my code:

import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup

all_product = []

url = "https://www.vatainc.com/infusion.html?limit=all"
service = service.Service('/Users/Jon/Downloads/chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)
links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]


for link in links:
    html = requests.get(link).text
    soup = BeautifulSoup(html, "html.parser")
    products = soup.findAll("div", {"class": "product-view"})
    print(links)

Here is some of the output; this one URL yields around 52 links.

['https://www.vatainc.com/infusion/0705-vascular-access-ultrasound-phantom-1616.html', 'https://www.vatainc.com/infusion/0751-simulated-ultrasound-blood.html', 'https://www.vatainc.com/infusion/body-skin-shell-0242.html', 'https://www.vatainc.com/infusion/2366-advanced-four-vein-venipuncture-training-aidtm-dermalike-iitm-latex-free-1533.html',

【Question Discussion】:

    Tags: python selenium selenium-webdriver beautifulsoup


    【Solution 1】:

    You could simply loop over the two URLs. But if you are looking for a way to extract the category URLs first and then loop through them, this works:

    import time
    import csv
    from selenium import webdriver
    import selenium.webdriver.chrome.service as service
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    
    root_url = 'https://www.vatainc.com/'
    service = service.Service(r'C:\chromedriver_win32\chromedriver.exe')  # raw string avoids backslash-escape issues on Windows
    service.start()
    capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
    driver = webdriver.Remote(service.service_url, capabilities)
    driver.get(root_url)
    time.sleep(2)
    
    # Grab the urls, but only keep the ones of interest
    urls = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//ol[contains(@class, 'nav-primary')]/li/a")]
    urls = [ x for x in urls if 'html' in x ] 
    
    # It produces duplicates, so drop those and include ?limit=all to query all products
    urls_list = pd.Series(urls).drop_duplicates().tolist()
    urls_list = [ x +'?limit=all' for x in urls_list]
    
    driver.close()
    
    
    all_product = []
    
    # loop through those urls and the links to generate a final product list
    for url in urls_list:
    
        print ('Url: '+url)
        driver = webdriver.Remote(service.service_url, capabilities)
        driver.get(url)
        time.sleep(2)
        links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
    
    
        for link in links:
            html = requests.get(link).text
            soup = BeautifulSoup(html, "html.parser")
            products = soup.findAll("div", {"class": "product-view"})
            all_product.append(link)
            print(link)
    
        driver.close()
    

    This generates a list of 303 links.
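
    Since csv is imported in both scripts but never used, here is a minimal sketch of how the collected links could be written to disk once the loop finishes (the filename is just an example, not anything from the original scripts):

    import csv

    # one link per row; 'product_links.csv' is an arbitrary example filename
    with open('product_links.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['link'])  # header row
        writer.writerows([link] for link in all_product)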

    【Discussion】:

      【Solution 2】:

      Just use a simple for loop to iterate over the two URLs:

      import time
      import csv
      from selenium import webdriver
      import selenium.webdriver.chrome.service as service
      import requests
      from bs4 import BeautifulSoup
      
      all_product = []
      
      urls = ["website", "website2"]
      service = service.Service('/Users/Jonathan/Downloads/chromedriver.exe')
      service.start()
      capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
      driver = webdriver.Remote(service.service_url, capabilities)
      for url in urls:
          driver.get(url)
          time.sleep(2)
          links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]

          # this loop must stay nested inside the url loop, otherwise only the
          # last category's links get processed
          for link in links:
              html = requests.get(link).text
              soup = BeautifulSoup(html, "html.parser")
              products = soup.findAll("div", {"class": "product-view"})
              all_product.append(link)
              print(link)
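
      Note that both answers call find_elements_by_xpath, which works in Selenium 3 but was removed in Selenium 4. If you are on a newer Selenium, here is a minimal sketch of the same lookup with the current API (the rest of the script is unchanged):

      from selenium.webdriver.common.by import By

      # Selenium 4 replaces driver.find_elements_by_xpath(...) with
      # driver.find_elements(By.XPATH, ...)
      links = [x.get_attribute('href')
               for x in driver.find_elements(By.XPATH, "//*[contains(@class, 'product-name')]/a")]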
      

      【Discussion】:
