【Question Title】:web scraping from daraz.pk giving error AttributeError: 'NoneType' object has no attribute 'find_all'
【Posted】:2021-02-05 08:02:07
【Question Description】:

I am trying to scrape data from the website daraz.pk. This is the code I have written so far in a Jupyter notebook:

import requests
from bs4 import BeautifulSoup as soup
from time import sleep
#url of the website we want to scrape which in this case is the url of daraz.pk for swimsuits
my_url = "https://www.daraz.pk/catalog/?spm=a2a0e.home.search.1.35e349376res9Z&q=swimsuits&_keyori=ss&from=search_history&sugg=swimsuits_0_1"
page = requests.get(my_url)
pagesrc = soup(page.text, 'html.parser')
#making a container to save all the data in
container = pagesrc.find('div', {'class':'c1_t2i'})
#our gallery is the product-item
gallery = container.find_all('div', {'class':'c2prKC'})
sleep(1)

This is the error I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-4fa8d4bc2410> in <module>
      6 container = pagesrc.find('div', {'class':'c1_t2i'})
      7 #our gallery is the product-item
----> 8 gallery = container.find_all('div', {'class':'c2prKC'})
      9 sleep(1)

AttributeError: 'NoneType' object has no attribute 'find_all'

I am very new to web scraping. I tried to follow a Stack Overflow answer that came up on another question about the same topic, but it did not help. Here is that question:

Python error: 'NoneType' object has no attribute 'find_all'

Any help would be greatly appreciated!
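For context, the error itself just means that `pagesrc.find(...)` returned `None`: the HTML that `requests` receives contains no `<div class="c1_t2i">`, so there is nothing to call `find_all` on. A minimal offline sketch of the failure and the usual guard (the sample HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

# What requests actually gets back: no product grid, only a JS bootstrap.
static_html = '<html><body><script>window.pageData={}</script></body></html>'
pagesrc = BeautifulSoup(static_html, 'html.parser')

container = pagesrc.find('div', {'class': 'c1_t2i'})
print(container)  # None -> chaining .find_all on it raises AttributeError

# Guard before chaining so a missing element yields an empty list instead:
gallery = container.find_all('div', {'class': 'c2prKC'}) if container else []
print(gallery)  # []
```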

【Question Discussion】:

    Tags: html python-3.x web-scraping beautifulsoup python-requests


    【Solution 1】:

    The data you are looking for is embedded in the page as JavaScript/JSON. To parse it, you can use the following example:

    import re
    import json
    import requests 
    
    
    url = 'https://www.daraz.pk/catalog/?spm=a2a0e.home.search.1.35e349376res9Z&q=swimsuits&_keyori=ss&from=search_history&sugg=swimsuits_0_1'
    html_doc = requests.get(url).text
    data = json.loads(re.search(r'window\.pageData=({.*})', html_doc).group(1))
    
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    
    # print some data to screen:
    for item in data['mods']['listItems']:
        print(item['name'])
        print(item['price'])
        print(item['image'])
        print('-' * 80)
    

    Prints:

    Toppick Sexy Deep V neck Swimwear Women Print Backless Bandage Cut Out Monokini Badpak One Piece Swimsuit Women bathing suit
    2699.00
    https://static-01.daraz.pk/p/ce73ff4ac121ad3f401a753d548c641b.jpg
    --------------------------------------------------------------------------------
    Toppick Sexy Deep V neck Swimwear Women Print Backless Bandage Cut Out Monokini Badpak One Piece Swimsuit Women bathing suit
    2699.00
    https://static-01.daraz.pk/p/2329f166ddb75a6942d010e7abb14e66.jpg
    --------------------------------------------------------------------------------
    Soft One Piece Bandage Push Up Bikini Women's Swimwear Swimsuit Bathing Suit Bottoms Thong Summer Beach Triangle Costume Small Size Black Color
    1900.00
    https://static-01.daraz.pk/p/3867c6255fdc5becb8b347ca78ee20a8.jpg
    --------------------------------------------------------------------------------
    Swimwear for Women  One Piece Sleeveless Women’ s Swimsuit Pure Color Bathing Suit
    1510.00
    https://static-01.daraz.pk/p/9e748d76cbeee1da8bcedb412f60902f.png
    --------------------------------------------------------------------------------
    Women Sexy Swimwear Bra God Save Queens Letters(S)
    1199.00
    https://static-01.daraz.pk/p/a60750bd34113a2cf55dd31c715cb47c.jpg
    --------------------------------------------------------------------------------
    
    ...and so on.
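    One caveat with this approach: `re.search` returns `None` when `window.pageData` is absent (for example if Daraz changes the page layout or blocks the request), and `.group(1)` would then raise the same kind of `AttributeError` as in the question. A hedged sketch with a guard, run against a made-up inline sample instead of a live request:

```python
import re
import json

# Made-up stand-in for the html_doc you would fetch with requests:
html_doc = '''<script>window.pageData={"mods":{"listItems":[
    {"name": "Example swimsuit", "price": "2699.00",
     "image": "https://static-01.daraz.pk/p/example.jpg"}]}}</script>'''

# re.S lets .* span newlines inside the embedded JSON
match = re.search(r'window\.pageData=({.*})', html_doc, re.S)
if match is None:
    raise SystemExit('window.pageData not found - page layout may have changed')

data = json.loads(match.group(1))
for item in data['mods']['listItems']:
    print(item['name'], item['price'])
```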
    

    【Discussion】:

      【Solution 2】:

      This is because the website is loaded dynamically with JavaScript (almost half of my answers on Stack Overflow mention this XD). You can use Selenium to get around it:

      from selenium import webdriver
      from time import sleep
      my_url = "https://www.daraz.pk/catalog/?spm=a2a0e.home.search.1.35e349376res9Z&q=swimsuits&_keyori=ss&from=search_history&sugg=swimsuits_0_1"
      driver = webdriver.Chrome()
      driver.get(my_url)
      sleep(4)
      page = driver.page_source
      driver.close()
      

      So here is the complete code:

      import requests
      from bs4 import BeautifulSoup as soup
      from time import sleep
      from selenium import webdriver
      
      #url of the website we want to scrape which in this case is the url of daraz.pk for swimsuits
      my_url = "https://www.daraz.pk/catalog/?spm=a2a0e.home.search.1.35e349376res9Z&q=swimsuits&_keyori=ss&from=search_history&sugg=swimsuits_0_1"
      
      driver = webdriver.Chrome()
      driver.get(my_url)
      sleep(4)
      
      page = driver.page_source
      driver.close()
      
      pagesrc = soup(page, 'html.parser')
      
      #making a container to save all the data in
      container = pagesrc.find('div', {'class':'c1_t2i'})
      
      #our gallery is the product-item
      gallery = container.find_all('div', {'class':'c2prKC'})
      
      sleep(1)
      

      Output:

      >>> container
      <div class="c1_t2i" data-qa-locator="general-products" data-spm="list"><div class="c2prKC" data-aplus-ae="x1_60b98c2b" data-item-id="174400115" data-qa-locator="product-item" data-sku-simple="" data-spm-anchor-id="a2a0e.searchlist.list.i0.53fd981cjRpsMW"...title="Pakistan">Pakistan</span></div></div></div></div></div></div>
      
      >>> gallery
      [<div class="c2prKC" data-aplus-ae="x1_60b98c2b" data-item-id="174400115" data-qa-locator="product-item" data-sku-simple="" data-spm-anchor-id="a2a0e.searchlist.list.i0.53fd981cjRpsMW" data-tracking="product-card"><div class...title="Pakistan">Pakistan</span></div></div></div></div></div>]
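      A fixed `sleep(4)` is fragile: on a slow connection the grid may not have rendered yet, and on a fast one you wait longer than needed. With Selenium, an explicit wait (`WebDriverWait`) that polls for the product grid is more robust. Since this sketch cannot assume a browser is installed, the Selenium part is shown as comments and the parsing helper is exercised on a small made-up sample of the rendered markup:

```python
from bs4 import BeautifulSoup

def extract_products(page_source):
    """Parse rendered page source; return [] instead of raising if the grid is missing."""
    src = BeautifulSoup(page_source, 'html.parser')
    container = src.find('div', {'class': 'c1_t2i'})
    if container is None:
        return []
    return container.find_all('div', {'class': 'c2prKC'})

# With a real browser you would obtain page_source like this (requires selenium + Chrome):
#
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
#
# driver = webdriver.Chrome()
# driver.get(my_url)
# # wait up to 10 s for the product grid instead of a fixed sleep(4)
# WebDriverWait(driver, 10).until(
#     EC.presence_of_element_located((By.CSS_SELECTOR, 'div.c1_t2i')))
# page_source = driver.page_source
# driver.quit()

# Exercise the parser on a minimal sample of the rendered markup:
sample = ('<div class="c1_t2i">'
          '<div class="c2prKC">item 1</div>'
          '<div class="c2prKC">item 2</div>'
          '</div>')
items = extract_products(sample)
print(len(items))                                    # 2
print(extract_products('<div>no grid here</div>'))   # []
```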
      

      【Discussion】:
