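Loop for Web Scraping in Selenium

【Question title】: Loop For Webscraping In Selenium
【Posted】: 2021-08-01 13:57:27
【Question】:

I need to scrape all the products on this page: website

To do that, I need to click on each picture and then scrape the data inside.

I have managed to write the script that scrapes the data on a single product page.

I need to extract the name, price, description...

Below is my code: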

import scrapy
from scrapy_selenium import SeleniumRequest
from scrapy.selector import Selector
from selenium.webdriver.common.keys import Keys
from scrapy_splash import SplashRequest
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which

class AsoswomennewSpider(scrapy.Spider):
    name = 'asoswomennew'
    allowed_domains = ['www.asos.com']
    start_urls = ['https://www.asos.com/monki/monki-lisa-cropped-vest-top-with-ruched-side-in-black/prd/23590636?colourwayid=60495910&cid=2623']

    # The methods must be indented so that they belong to the spider class
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        chrome_option = Options()
        chrome_option.add_argument("--headless")
        chrome_path = which("chromedriver")
        driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_option)
        driver.set_window_size(1920, 1080)
        driver.get('https://www.asos.com/monki/monki-lisa-cropped-vest-top-with-ruched-side-in-black/prd/23590636?colourwayid=60495910&cid=2623')

    def parse(self, response):
        yield {
            'name': response.xpath("//div[@class='product-hero']/h1/text()").get(),
            # This selector is an XPath expression, so it goes through xpath(), not css()
            'price': response.xpath('//*[contains(@class, "current-price")]').get(),
            'description': response.xpath("//div[@class='product-description']/ul/li/span/text()").getall(),
            'about_me': response.xpath("//div[@class='about-me']/p/text()").getall(),
            'brand_description': response.xpath("//div[@class='brand-description']/p/text()").getall()
        }

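Now I need to loop over every picture and run the script above for each one.

pictures to loop

Can anyone help me?

Thanks!

P.S. My start_url needs to be changed to this: 'https://www.asos.com/women/new-in/new-in-clothing/cat/?cid=2623&nlid=ww|new+in|new+products|clothing'

That is the main (listing) page, so I need a callback URL for each item.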

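【Comments】:

Tags: python selenium web-scraping xpath scrapy

【Solution 1】:

I can see that the products are wrapped in article tags.

Each article tag contains an a tag, which holds the link to that product.

You can get the a tag inside each article tag on the main page and store them in a list. Let's say the list is named products_list.

After driver.get(), it would look like this: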

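from selenium.webdriver.common.by import By  # find_elements_by_css_selector was removed in Selenium 4.3

products_list = driver.find_elements(By.CSS_SELECTOR, 'article a')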

Then extract the href value of each a tag in the list and store them in another list named products_links:

products_links = []
for each in products_list:
    products_links.append(each.get_attribute('href'))
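
The loop above keeps every href it finds. As a minimal sketch, assuming an article may contain more than one a tag or an a tag without an href (not confirmed for this site), the same step could skip empty values and duplicates. The helper name collect_links is hypothetical:

```python
def collect_links(elements):
    """Return the unique, non-empty href values of the given elements, in page order."""
    seen = set()
    links = []
    for element in elements:
        href = element.get_attribute("href")
        # Skip anchors without an href and links that were already collected
        if href and href not in seen:
            seen.add(href)
            links.append(href)
    return links
```

With the names from this answer, it would be called as products_links = collect_links(products_list).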

Now all you have to do is iterate over products_links, open each link, and parse the data you need, just as you did for the single product.
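
That final step could be sketched roughly as follows. This is an illustration under assumptions, not a tested scraper for the site: the function name scrape_products and the PRODUCT_FIELDS mapping are hypothetical, the XPaths are taken from the question with the trailing /text() dropped (Selenium locators have to select elements rather than text nodes), and whether they still match the live pages is not verified.

```python
# XPaths adapted from the question; /text() is removed because Selenium
# returns elements, and the text is read from each element afterwards.
PRODUCT_FIELDS = {
    "name": "//div[@class='product-hero']/h1",
    "price": "//*[contains(@class, 'current-price')]",
    "description": "//div[@class='product-description']/ul/li/span",
    "about_me": "//div[@class='about-me']/p",
    "brand_description": "//div[@class='brand-description']/p",
}


def scrape_products(driver, links, fields=PRODUCT_FIELDS):
    """Open each link with the driver and collect the text of every matched element."""
    items = []
    for link in links:
        driver.get(link)
        item = {"url": link}
        for field, xpath in fields.items():
            # The plain string "xpath" is the value of Selenium's By.XPATH constant
            elements = driver.find_elements("xpath", xpath)
            item[field] = [element.text for element in elements]
        items.append(item)
    return items
```

It would be called as scrape_products(driver, products_links) once the links have been collected. Every field comes back as a list of strings, so a single-valued field such as name is the first entry of its list when the element is found.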

【Comments】: