【Question title】: Scraping through every product on retailer website
【Posted】: 2017-03-17 15:25:12
【Question】:

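We are trying to scrape every product in every category on the Forever 21 website. Given a product page, we know how to extract the information we need, and given a category, we can extract every product in it. However, we don't know how to crawl through every product category. Here is our code, which takes a given category and gets every product: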

import requests
from bs4 import BeautifulSoup
#import re

params = {"action": "getcategory",
          "br": "f21",
          #"category": re.compile('\S+'),
          "category": "dress",
          "pageno": 1,
          "pagesize": "",
          "sort": "",
          "fsize": "",
          "fcolor": "",
          "fprice": "",
          "fattr": ""}

url = "http://www.forever21.com/Ajax/Ajax_Category.aspx"
js = requests.get(url, params=params).json()
soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")
i = 0  # product counter
j = 0  # page counter

# Keep requesting pages until one comes back with no products.
while len(soup.select("div.item_pic a")) != 0:
    for a in soup.select("div.item_pic a"):
        #print(a["href"])
        i = i + 1

    params["pageno"] = params["pageno"] + 1
    j = j + 1
    js = requests.get(url, params=params).json()
    soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")

print(i)
print(j)

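As you can see in the comments, we tried using a regular expression for the category, but had no success. i and j are just product and page counters. Any suggestions on how to modify or add to this code to get every product category?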

【Question comments】:

    Tags: python web-scraping beautifulsoup web-crawler python-requests


    【Solution 1】:

    You can scrape the category page and get all the subcategories from the navigation menu:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = "http://www.forever21.com/Product/Category.aspx?br=f21&category=app-main"
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"})
    
    soup = BeautifulSoup(response.content, "html.parser")
    menues = [li["class"][0] for li in soup.select("#has_sub .white nav ul > li")]
    print(menues)
    

    Prints:

    [u'women-new-arrivals', u'want_list', u'dress', u'top_blouses', u'outerwear_coats-and-jackets', u'bottoms', u'intimates_loungewear', u'activewear', u'swimwear_all', u'acc', u'shoes', u'branded-shop-women-clothing', u'sale_women|women', u'women-new-arrivals-clothing-dresses', u'women-new-arrivals-clothing-tops', u'women-new-arrivals-clothing-outerwear', u'women-new-arrivals-clothing-bottoms', u'women-new-arrivals-clothing-intimates-loungewear', u'women-new-arrivals-clothing-swimwear', u'women-new-arrivals-clothing-activewear', u'women-new-arrivals-accessories|women-new-arrivals', u'women-new-arrivals-shoes|women-new-arrivals', u'promo-web-exclusives', u'promo-best-sellers-app', u'backinstock-women', u'promo-shop-by-outfit-women', u'occasion-shop-wedding', u'contemporary-main', u'promo-basics', u'21_items', u'promo-summer-forever', u'promo-coming-soon', u'dress_casual', u'dress_romper', u'dress_maxi', u'dress_midi', u'dress_mini', u'occasion-shop-dress', u'top_blouses-off-shoulder', u'top_blouses-lace-up', u'top_bodysuits-bustiers', u'top_graphic-tops', u'top_blouses-crop-top', u'top_t-shirts', u'sweater', u'top_blouses-sweatshirts-hoodies', u'top_blouses-shirts', u'top_plaids', u'outerwear_bomber-jackets', u'outerwear_blazers', u'outerwear_leather-suede', u'outerwear_jean-jackets', u'outerwear_lightweight', u'outerwear_utility-jackets', u'outerwear_trench-coats', u'outerwear_faux-fur', u'promo-jeans-refresh|bottoms', u'bottoms_pants', u'bottoms_skirt', u'bottoms_shorts', u'bottoms_shorts-active', u'bottoms_leggings', u'bottoms_sweatpants', u'bottom_jeans|', u'intimates_loungewear-bras', u'intimates_loungewear-panties', u'intimates_loungewear-bodysuits-slips', u'intimates_loungewear-seamless', u'intimates_loungewear-accessories', u'intimates_loungewear-sets', u'activewear_top', u'activewear_sports-bra', u'activewear_bottoms', u'activewear_accessories', u'swimwear_tops', u'swimwear_bottoms', u'swimwear_one-piece', u'swimwear_cover-ups', u'acc_features', 
u'acc_jewelry', u'acc_handbags', u'acc_glasses', u'acc_hat', u'acc_hair', u'acc_legwear', u'acc_scarf-gloves', u'acc_home-and-gift-items', u'shoes_features', u'shoes_boots', u'shoes_high-heels', u'shoes_sandalsflipflops', u'shoes_wedges', u'shoes_flats', u'shoes_oxfords-loafers', u'shoes_sneakers', u'Shoes_slippers', u'branded-shop-new-arrivals-women', u'branded-shop-women-clothing-dresses', u'branded-shop-women-clothing-tops', u'branded-shop-women-clothing-outerwear', u'branded-shop-women-clothing-bottoms', u'branded-shop-women-clothing-intimates', u'branded-shop-women-accessories|branded-shop-women-clothing', u'branded-shop-women-accessories-jewelry|', u'branded-shop-shoes-women|branded-shop-women-clothing', u'branded-shop-sale-women', u'/brandedshop/brandlist.aspx', u'promo-branded-boho-me', u'promo-branded-rare-london', u'promo-branded-selfie-leslie', u'sale-newly-added', u'sale_dresses', u'sale_tops', u'sale_outerwear', u'sale_sweaters', u'sale_bottoms', u'sale_intimates', u'sale_swimwear', u'sale_activewear', u'sale_acc', u'sale_shoes', u'the-outlet', u'sale-under-5', u'sale-under-10', u'sale-under-15']
    

    Note the values of the br and category GET parameters: f21 is the "Women" category, and app-main is the main page for a category.
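To connect this list back to the paging loop in the question, here is a minimal sketch of one way to combine the two. It is not code from the thread: the names `fetch_links`, `crawl_categories`, and `max_pages` are made up for illustration, and the rule that skips entries starting with `/` (such as `/brandedshop/brandlist.aspx`) is an assumption about which menu values are real category ids. Values containing `|` are passed through unchanged, since the thread does not say how the endpoint treats them.

```python
AJAX_URL = "http://www.forever21.com/Ajax/Ajax_Category.aspx"


def fetch_links(category, pageno, br="f21"):
    """Return the product hrefs found on one page of one category."""
    # Imported here so the paging logic below can be tried without these libraries.
    import requests
    from bs4 import BeautifulSoup

    # Same query parameters as in the question, with category and page filled in.
    params = {"action": "getcategory", "br": br, "category": category,
              "pageno": pageno, "pagesize": "", "sort": "",
              "fsize": "", "fcolor": "", "fprice": "", "fattr": ""}
    js = requests.get(AJAX_URL, params=params).json()
    soup = BeautifulSoup(js["CategoryHTML"], "html.parser")
    return [a["href"] for a in soup.select("div.item_pic a")]


def crawl_categories(categories, fetch=fetch_links, max_pages=200):
    """Map each category to its product hrefs, paging until an empty page."""
    products = {}
    for category in categories:
        # Assumption: path-like entries are links, not category ids.
        # Categories already crawled are skipped as duplicates.
        if category.startswith("/") or category in products:
            continue
        links = []
        # max_pages is a hypothetical safety cap in case a page never comes back empty.
        for pageno in range(1, max_pages + 1):
            page = fetch(category, pageno)
            if not page:
                break
            links.extend(page)
        products[category] = links
    return products
```

With the list from the snippet above, `crawl_categories(menues)` would return a dict keyed by category. Passing a different `fetch` function makes it possible to try the loop offline before pointing it at the live site.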

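    【Comments】:

    • Thanks for the help! Just to clarify, this only gets all the categories with br=f21, correct?
    • @TerryRossi Yes, the subcategories of the f21 category. You can also scrape the top-level categories from the main store page.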