【Question Title】: Python web scraping - Loop through all categories and subcategories
【Posted】: 2018-05-14 00:33:05
【Question】:

I am trying to retrieve all the categories and subcategories of a retail website. Once I am inside a category, I can use BeautifulSoup to extract every product in it. However, I am struggling with the loop over the categories. I am using this as a test site: https://www.uniqlo.com/us/en/women

How do I loop through each category, and each subcategory, listed on the left-hand side of the site? The problem is that you have to click a category before the site shows all of its subcategories. I would like to extract all the products in each category/subcategory into a CSV file. Here is what I have so far:

import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myurl = 'https://www.uniqlo.com/us/en/women/'
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

product_list = []

# Find all <li> elements whose class starts with "grid-tile"
containers = page_soup.findAll("li", {"class": lambda L: L and L.startswith('grid-tile')})

for container in containers:
    product_container = container.findAll("div", {"class": "product-swatches"})
    product_names = product_container[0].findAll("li")

    for product_item in product_names:
        try:
            # The product name is stored in the image's alt attribute
            product_name = product_item.a.img.get("alt")
            product_mod_name = product_name.split(',')[0].lstrip()
            print(product_mod_name)
            product_list.append([product_mod_name])
        except AttributeError:
            continue

with open('products.csv', 'a', newline='') as file:
    writer = csv.writer(file)
    for row in product_list:
        writer.writerow(row)
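The `lambda` class filter used above can be sanity-checked offline against a toy HTML fragment (hypothetical markup, not the real site, just to illustrate the prefix matching and the `alt` parsing):

```python
from bs4 import BeautifulSoup

# Toy HTML fragment (hypothetical markup) mimicking the
# grid-tile / product-swatches structure the code above targets
html = """
<ul>
  <li class="grid-tile-1">
    <div class="product-swatches">
      <ul><li><a href="#"><img alt="Sweater, red"></a></li></ul>
    </div>
  </li>
  <li class="other-tile"></li>
</ul>
"""

page = BeautifulSoup(html, "html.parser")

# The lambda matches any class value that starts with "grid-tile";
# elements with no class (c is None) or other classes are skipped
tiles = page.find_all("li", {"class": lambda c: c and c.startswith("grid-tile")})
print(len(tiles))                     # 1

alt = tiles[0].a.img.get("alt")
print(alt.split(",")[0].lstrip())     # Sweater
```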

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    You can try the script below. It traverses the different categories and subcategories of products and parses each product's title and price. Several products share the same name, the only difference between them being the color, so don't count those as duplicates. I wrote the script very compactly, so expand it to whatever level of verbosity you are comfortable with:

    import requests
    from bs4 import BeautifulSoup
    
    res = requests.get('https://www.uniqlo.com/us/en/women')
    soup = BeautifulSoup(res.text, "lxml")
    
    for items in soup.select("#category-level-1 .refinement-link"):
        page = requests.get(items['href'])
        broth = BeautifulSoup(page.text,"lxml")
    
        for links in broth.select("#category-level-2 .refinement-link"):
            req = requests.get(links['href'])
            sauce = BeautifulSoup(req.text,"lxml")
    
            for data in sauce.select(".product-tile-info"):
                title = data.select(".name-link")[0].text
                price = ' '.join([item.text for item in data.select(".product-pricing span")])
                print(title.strip(),price.strip())
    

    The results look like this:

    WOMEN CASHMERE CREW NECK SWEATER $79.90
    Women Extra Fine Merino Crew Neck Sweater $29.90 $19.90
    WOMEN KAWS X PEANUTS LONG-SLEEVE HOODED SWEATSHIRT $19.90
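To produce the products.csv file the question asks for, the printed (title, price) pairs can be written out with the standard csv module. Here is a minimal, network-free sketch using sample rows in place of live scraping:

```python
import csv

# Sample rows standing in for the (title, price) pairs scraped above
rows = [
    ("WOMEN CASHMERE CREW NECK SWEATER", "$79.90"),
    ("Women Extra Fine Merino Crew Neck Sweater", "$29.90 $19.90"),
]

# Write a header plus one row per product
with open("products.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])
    writer.writerows(rows)

# Read it back to confirm: header + 2 data rows
with open("products.csv", newline="") as f:
    print(sum(1 for _ in csv.reader(f)))  # 3
```

Inside the scraping loop you would call `writer.writerow((title.strip(), price.strip()))` instead of `print`, keeping the file open for the duration of the crawl.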
    

    【Discussion】:

    • Thank you!! I spent 2 weeks on this problem and still couldn't figure it out. Let me try this now.
    • If it works, please make sure to accept it as the answer. Thanks.