【Question Title】: Attempting to generate links from all products on a website using Selenium
【Posted】: 2019-04-28 05:19:44
【Question】:

The main goal of the script is to generate links for all of the products available on the website; the products are separated by category.

The problem I'm running into is that I can only generate links for one category (infusion), specifically the one URL I have saved. The second category, or URL, that I would like to include is: https://www.vatainc.com/wound-care.html

Is there a way to loop through multiple category URLs with the same effect as the script I already have?

Here is my code:

import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup

all_product = []

url = "https://www.vatainc.com/infusion.html?limit=all"
service = service.Service('/Users/Jon/Downloads/chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)
links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]


for link in links:
    html = requests.get(link).text
    soup = BeautifulSoup(html, "html.parser")
    products = soup.findAll("div", {"class": "product-view"})
    print(links)

Here is some of the output; this one URL yields around 52 links.

['https://www.vatainc.com/infusion/0705-vascular-access-ultrasound-phantom-1616.html', 'https://www.vatainc.com/infusion/0751-simulated-ultrasound-blood.html', 'https://www.vatainc.com/infusion/body-skin-shell-0242.html', 'https://www.vatainc.com/infusion/2366-advanced-four-vein-venipuncture-training-aidtm-dermalike-iitm-latex-free-1533.html',

【Question Discussion】:

    Tags: python selenium selenium-webdriver beautifulsoup


    【Solution 1】:

    You could simply loop over the two URLs. But if you are looking for a way to extract the category URLs first and then loop through them, this works:

    import time
    import csv
    from selenium import webdriver
    import selenium.webdriver.chrome.service as service
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    
    root_url = 'https://www.vatainc.com/'
    service = service.Service(r'C:\chromedriver_win32\chromedriver.exe')  # raw string avoids backslash-escape issues on Windows
    service.start()
    capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
    driver = webdriver.Remote(service.service_url, capabilities)
    driver.get(root_url)
    time.sleep(2)
    
    # Grab the urls, but only keep the ones of interest
    urls = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//ol[contains(@class, 'nav-primary')]/li/a")]
    urls = [ x for x in urls if 'html' in x ] 
    
    # It produces duplicates, so drop those and include ?limit=all to query all products
    urls_list = pd.Series(urls).drop_duplicates().tolist()
    urls_list = [ x +'?limit=all' for x in urls_list]
    
    driver.close()
    
    
    all_product = []
    
    # loop through those urls and the links to generate a final product list
    for url in urls_list:
    
        print ('Url: '+url)
        driver = webdriver.Remote(service.service_url, capabilities)
        driver.get(url)
        time.sleep(2)
        links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
    
    
        for link in links:
            html = requests.get(link).text
            soup = BeautifulSoup(html, "html.parser")
            products = soup.findAll("div", {"class": "product-view"})
            all_product.append(link)
            print(link)
    
        driver.close()
    

    This generates a list of 303 links.
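
    Since csv is imported in both scripts but never used, here is a minimal sketch of how the collected links could be written to disk once the loop finishes (the filename is just an example, not anything from the original scripts):

    import csv

    # one link per row; 'product_links.csv' is an arbitrary example filename
    with open('product_links.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['link'])  # header row
        writer.writerows([link] for link in all_product)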

    【Discussion】:

      【Solution 2】:

      Just use a simple for loop to iterate over the two URLs:

      import time
      import csv
      from selenium import webdriver
      import selenium.webdriver.chrome.service as service
      import requests
      from bs4 import BeautifulSoup
      
      all_product = []
      
      urls = ["website", "website2"]
      service = service.Service('/Users/Jonathan/Downloads/chromedriver.exe')
      service.start()
      capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
      driver = webdriver.Remote(service.service_url, capabilities)
      for url in urls:
          driver.get(url)
          time.sleep(2)
          links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]

          # this loop must stay nested inside the url loop, otherwise only the
          # last category's links get processed
          for link in links:
              html = requests.get(link).text
              soup = BeautifulSoup(html, "html.parser")
              products = soup.findAll("div", {"class": "product-view"})
              all_product.append(link)
              print(link)
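
      Note that both answers call find_elements_by_xpath, which works in Selenium 3 but was removed in Selenium 4. If you are on a newer Selenium, here is a minimal sketch of the same lookup with the current API (the rest of the script is unchanged):

      from selenium.webdriver.common.by import By

      # Selenium 4 replaces driver.find_elements_by_xpath(...) with
      # driver.find_elements(By.XPATH, ...)
      links = [x.get_attribute('href')
               for x in driver.find_elements(By.XPATH, "//*[contains(@class, 'product-name')]/a")]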
      

      【Discussion】:
