[Question Title]: How to get all pages from web scraping
[Posted]: 2020-07-15 02:23:42
[Question]:

I am trying to get a list of all the shoes on every page of https://www.dickssportinggoods.com/f/all-mens-footwear, but I don't know what else to write in my code. Basically, I want to filter by one shoe brand across all pages of the site. For example, if I select New Balance, I want to print a list of every shoe of the brand I chose. Below is my code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
Url = 'https://www.dickssportinggoods.com/f/all-mens-footwear'
uClient = uReq(Url)
Page = uClient.read()
uClient.close()
page_soup = soup(Page, "html.parser")
for i in page_soup.findAll("div", {"class":"rs-facet-name-container"}):
    print(i.text)

[Comments]:

    Tags: python python-3.x list selenium-webdriver web-scraping


    [Solution 1]:

    The site updates its elements with a JS script, so BeautifulSoup alone won't be enough; you have to use browser automation.

    The code below does not work because the elements update after a few milliseconds: the page first shows all brands, then re-renders to show only the selected one. So use automation instead.

    Failing code:

    from bs4 import BeautifulSoup as soup
    import time
    from urllib.request import urlopen as uReq
    Url = 'https://www.dickssportinggoods.com/f/all-mens-footwear'
    url_st = 'https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&filterFacets=X_BRAND'
    
    brands_name = ['New Balance']  # brand(s) to filter on
    for idx, br in enumerate(brands_name):
        if idx==0:
            url_st += '%3A'+ '%20'.join(br.split(' '))
        else: 
            url_st += '%2C' + '%20'.join(br.split(' '))
    
    uClient = uReq(url_st)
    time.sleep(4)
    Page = uClient.read()
    uClient.close()
    
    page_soup = soup(Page, "html.parser") 
    for match in page_soup.find_all('div', class_='rs_product_description d-block'):
        print(match.text)
    

    Working code (selenium + bs4):

    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time
    from webdriver_manager.chrome import ChromeDriverManager
    
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(ChromeDriverManager().install())#, chrome_options=chrome_options)
    driver.set_window_size(1024, 600)
    driver.maximize_window()
    
    brands_name = ['New Balance']
    
    filter_facet ='filterFacets=X_BRAND'
    for idx, br in enumerate(brands_name):
        if idx==0:
            filter_facet += '%3A'+ '%20'.join(br.split(' '))
        else: 
            filter_facet += '%2C' + '%20'.join(br.split(' '))
    
    url = f"https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&{filter_facet}"        
    driver.get(url)
    time.sleep(4)
    page_soup = soup(driver.page_source, 'html.parser')  
    # find_element_* raises NoSuchElementException when the dialog is absent,
    # so guard the click with try/except rather than an if.
    try:
        driver.find_element_by_class_name('close').click()
    except Exception:
        pass
    for match in page_soup.find_all('div', class_='rs_product_description d-block'):
        print(match.text)
        
    page_num = page_soup.find_all('a', class_='rs-page-item')
    pnum = [int(pn.text) for pn in page_num if pn.text!='']
    if len(pnum)>=2:
        for pn in range(1, len(pnum)):
            url = f"https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber={pn}&{filter_facet}"
            driver.get(url)
            time.sleep(2)
            page_soup = soup(driver.page_source, "html.parser") 
            for match in page_soup.find_all('div', class_='rs_product_description d-block'):
                print(match.text)
    

    New Balance Men's 410v6 Trail Running Shoes
    New Balance Men's 623v3 Training Shoes
    .
    .
    .
    New Balance Men's Fresh Foam Beacon Running Shoes
    New Balance Men's Fresh Foam Cruz v2 SockFit Running Shoes
    New Balance Men's 470 Running Shoes
    New Balance Men's 996v3 Tennis Shoes
    New Balance Men's 1260 V7 Running Shoes
    New Balance Men's Fresh Foam Beacon Running Shoes
    

    I commented out headless Chrome because when the page opens you get a dialog; after closing it you can grab the product details. I could not manage that in headless automation (I can't explain why; my Selenium knowledge isn't great).

    Don't forget to install webdriver_manager with pip install webdriver_manager
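As a side note, the manual percent-encoding loop above ('%3A' is an encoded ':', '%2C' an encoded ',', '%20' a space) can be sketched with the standard library's urllib.parse.quote; the brand list here is just an example:

```python
from urllib.parse import quote

# Example brand list; any brands shown in the site's filter panel would work.
brands_name = ['New Balance', 'Nike']

# ':' separates the facet name from its values, ',' separates the values,
# and quote() percent-encodes spaces inside each brand name.
filter_facet = 'filterFacets=X_BRAND%3A' + '%2C'.join(
    quote(b) for b in brands_name
)
print(filter_facet)
# filterFacets=X_BRAND%3ANew%20Balance%2CNike
```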

    [Comments]:

    • If this answer helped, please don't forget to accept it by clicking the check mark. If not, let me know.
    [Solution 2]:

    You can click the filter button and check all the brands you want. You just need driver.find_element_by_xpath(); if you are using Selenium, you should know how to do that.
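A minimal sketch of that approach. The class name comes from the question's own code; whether clicking the matched element actually toggles the filter on the live page is an assumption:

```python
# Build an XPath that matches a facet entry by its visible brand text.
# 'rs-facet-name-container' is taken from the question's code; it has not
# been verified against the current page markup.
def brand_facet_xpath(brand):
    return ('//div[contains(@class, "rs-facet-name-container")]'
            f'[normalize-space()="{brand}"]')

print(brand_facet_xpath("New Balance"))

# With a live Selenium driver it would be used like:
# driver.find_element_by_xpath(brand_facet_xpath("New Balance")).click()
```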

    [Comments]:

    • I am using Beautiful Soup
    [Solution 3]:

    The page builds the links you want with JavaScript, so you can't scrape them directly; you need to replicate the page's request. In this case the page sends a POST request:

    Request URL: https://prod-catalog-product-api.dickssportinggoods.com/v1/search
    Request Method: POST
    Status Code: 200 OK
    Remote Address: [2600:1400:d:696::25db]:443
    Referrer Policy: no-referrer-when-downgrade
    

    Check the request headers with your browser's inspect-element tool so you can replicate the POST request.

    This is the URL the POST request is sent to:

    https://prod-catalog-product-api.dickssportinggoods.com/v1/search
    

    This is the POST payload the browser sends:

    {selectedCategory: "12301_1714863", selectedStore: "1406", selectedSort: 1,…}
    isFamilyPage: true
    pageNumber: 0
    pageSize: 48
    searchTypes: []
    selectedCategory: "12301_1714863"
    selectedFilters: {X_BRAND: ["New Balance"]}   #<--- this is the info that you want to get
    selectedSort: 1
    selectedStore: "1406"
    storeId: 15108
    totalCount: 3360
    

    The page may also require certain headers, so make sure to mimic the request the browser sends.
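A sketch of replicating that POST with the standard library. The payload fields are copied from the capture above; the store and category IDs may be session-specific, the headers shown are a guess at the minimum, and the response schema would need to be inspected before parsing:

```python
import json
from urllib.request import Request, urlopen

URL = "https://prod-catalog-product-api.dickssportinggoods.com/v1/search"

# Payload copied from the captured browser request above; the category and
# store IDs may differ per session or per store.
payload = {
    "isFamilyPage": True,
    "pageNumber": 0,
    "pageSize": 48,
    "searchTypes": [],
    "selectedCategory": "12301_1714863",
    "selectedFilters": {"X_BRAND": ["New Balance"]},
    "selectedSort": 1,
    "selectedStore": "1406",
    "storeId": 15108,
}

# Mimic a browser; the API may reject requests without these headers.
headers = {
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0",
}

def fetch_page(page_number=0):
    """POST one page of results and return the parsed JSON response."""
    body = dict(payload, pageNumber=page_number)
    req = Request(URL, data=json.dumps(body).encode(), headers=headers)
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Example (network access required; inspect the returned JSON to find
# where the product descriptions live):
# data = fetch_page(0)
```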

    [Comments]:
