【Question Title】:web scraping from daraz.pk giving error AttributeError: 'NoneType' object has no attribute 'find_all'
【Posted】:2021-02-05 08:02:07
【Question Description】:

I am trying to scrape data from the website daraz.pk. This is the code I have written so far in a Jupyter notebook:

import requests
from bs4 import BeautifulSoup as soup
from time import sleep
#url of the website we want to scrape which in this case is the url of daraz.pk for swimsuits
my_url = "https://www.daraz.pk/catalog/?spm=a2a0e.home.search.1.35e349376res9Z&q=swimsuits&_keyori=ss&from=search_history&sugg=swimsuits_0_1"
page = requests.get(my_url)
pagesrc = soup(page.text, 'html.parser')
#making a container to save all the data in
container = pagesrc.find('div', {'class':'c1_t2i'})
#our gallery is the product-item
gallery = container.find_all('div', {'class':'c2prKC'})
sleep(1)

This is the error I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-4fa8d4bc2410> in <module>
      6 container = pagesrc.find('div', {'class':'c1_t2i'})
      7 #our gallery is the product-item
----> 8 gallery = container.find_all('div', {'class':'c2prKC'})
      9 sleep(1)

AttributeError: 'NoneType' object has no attribute 'find_all'

I am very new to web scraping. I tried to follow a Stack Overflow answer that came up on another question about the same topic, but it did not help. Here is that question:

Python error: 'NoneType' object has no attribute 'find_all'

Any help would be greatly appreciated!
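For context, the error itself just means that `pagesrc.find(...)` returned `None`: the HTML that `requests` receives contains no `<div class="c1_t2i">`, so there is nothing to call `find_all` on. A minimal offline sketch of the failure and the usual guard (the sample HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

# What requests actually gets back: no product grid, only a JS bootstrap.
static_html = '<html><body><script>window.pageData={}</script></body></html>'
pagesrc = BeautifulSoup(static_html, 'html.parser')

container = pagesrc.find('div', {'class': 'c1_t2i'})
print(container)  # None -> chaining .find_all on it raises AttributeError

# Guard before chaining so a missing element yields an empty list instead:
gallery = container.find_all('div', {'class': 'c2prKC'}) if container else []
print(gallery)  # []
```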

【Question Discussion】:

    Tags: html python-3.x web-scraping beautifulsoup python-requests


    【Solution 1】:

    The data you are looking for is embedded in the page as JavaScript/JSON. To parse it, you can use the following example:

    import re
    import json
    import requests 
    
    
    url = 'https://www.daraz.pk/catalog/?spm=a2a0e.home.search.1.35e349376res9Z&q=swimsuits&_keyori=ss&from=search_history&sugg=swimsuits_0_1'
    html_doc = requests.get(url).text
    data = json.loads(re.search(r'window\.pageData=({.*})', html_doc).group(1))
    
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    
    # print some data to screen:
    for item in data['mods']['listItems']:
        print(item['name'])
        print(item['price'])
        print(item['image'])
        print('-' * 80)
    

    Prints:

    Toppick Sexy Deep V neck Swimwear Women Print Backless Bandage Cut Out Monokini Badpak One Piece Swimsuit Women bathing suit
    2699.00
    https://static-01.daraz.pk/p/ce73ff4ac121ad3f401a753d548c641b.jpg
    --------------------------------------------------------------------------------
    Toppick Sexy Deep V neck Swimwear Women Print Backless Bandage Cut Out Monokini Badpak One Piece Swimsuit Women bathing suit
    2699.00
    https://static-01.daraz.pk/p/2329f166ddb75a6942d010e7abb14e66.jpg
    --------------------------------------------------------------------------------
    Soft One Piece Bandage Push Up Bikini Women's Swimwear Swimsuit Bathing Suit Bottoms Thong Summer Beach Triangle Costume Small Size Black Color
    1900.00
    https://static-01.daraz.pk/p/3867c6255fdc5becb8b347ca78ee20a8.jpg
    --------------------------------------------------------------------------------
    Swimwear for Women  One Piece Sleeveless Women’ s Swimsuit Pure Color Bathing Suit
    1510.00
    https://static-01.daraz.pk/p/9e748d76cbeee1da8bcedb412f60902f.png
    --------------------------------------------------------------------------------
    Women Sexy Swimwear Bra God Save Queens Letters(S)
    1199.00
    https://static-01.daraz.pk/p/a60750bd34113a2cf55dd31c715cb47c.jpg
    --------------------------------------------------------------------------------
    
    ...and so on.
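    One caveat with this approach: `re.search` returns `None` when `window.pageData` is absent (for example if Daraz changes the page layout or blocks the request), and `.group(1)` would then raise the same kind of `AttributeError` as in the question. A hedged sketch with a guard, run against a made-up inline sample instead of a live request:

```python
import re
import json

# Made-up stand-in for the html_doc you would fetch with requests:
html_doc = '''<script>window.pageData={"mods":{"listItems":[
    {"name": "Example swimsuit", "price": "2699.00",
     "image": "https://static-01.daraz.pk/p/example.jpg"}]}}</script>'''

# re.S lets .* span newlines inside the embedded JSON
match = re.search(r'window\.pageData=({.*})', html_doc, re.S)
if match is None:
    raise SystemExit('window.pageData not found - page layout may have changed')

data = json.loads(match.group(1))
for item in data['mods']['listItems']:
    print(item['name'], item['price'])
```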
    

    【Discussion】:

      【Solution 2】:

      This is because the website is loaded dynamically with JavaScript (almost half of my answers on Stack Overflow mention this XD). You can use Selenium to get around it:

      from selenium import webdriver
      from time import sleep
      my_url = "https://www.daraz.pk/catalog/?spm=a2a0e.home.search.1.35e349376res9Z&q=swimsuits&_keyori=ss&from=search_history&sugg=swimsuits_0_1"
      driver = webdriver.Chrome()
      driver.get(my_url)
      sleep(4)
      page = driver.page_source
      driver.close()
      

      So here is the complete code:

      import requests
      from bs4 import BeautifulSoup as soup
      from time import sleep
      from selenium import webdriver
      
      #url of the website we want to scrape which in this case is the url of daraz.pk for swimsuits
      my_url = "https://www.daraz.pk/catalog/?spm=a2a0e.home.search.1.35e349376res9Z&q=swimsuits&_keyori=ss&from=search_history&sugg=swimsuits_0_1"
      
      driver = webdriver.Chrome()
      driver.get(my_url)
      sleep(4)
      
      page = driver.page_source
      driver.close()
      
      pagesrc = soup(page, 'html.parser')
      
      #making a container to save all the data in
      container = pagesrc.find('div', {'class':'c1_t2i'})
      
      #our gallery is the product-item
      gallery = container.find_all('div', {'class':'c2prKC'})
      
      sleep(1)
      

      Output:

      >>> container
      <div class="c1_t2i" data-qa-locator="general-products" data-spm="list"><div class="c2prKC" data-aplus-ae="x1_60b98c2b" data-item-id="174400115" data-qa-locator="product-item" data-sku-simple="" data-spm-anchor-id="a2a0e.searchlist.list.i0.53fd981cjRpsMW"...title="Pakistan">Pakistan</span></div></div></div></div></div></div>
      
      >>> gallery
      [<div class="c2prKC" data-aplus-ae="x1_60b98c2b" data-item-id="174400115" data-qa-locator="product-item" data-sku-simple="" data-spm-anchor-id="a2a0e.searchlist.list.i0.53fd981cjRpsMW" data-tracking="product-card"><div class...title="Pakistan">Pakistan</span></div></div></div></div></div>]
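      A fixed `sleep(4)` is fragile: on a slow connection the grid may not have rendered yet, and on a fast one you wait longer than needed. With Selenium, an explicit wait (`WebDriverWait`) that polls for the product grid is more robust. Since this sketch cannot assume a browser is installed, the Selenium part is shown as comments and the parsing helper is exercised on a small made-up sample of the rendered markup:

```python
from bs4 import BeautifulSoup

def extract_products(page_source):
    """Parse rendered page source; return [] instead of raising if the grid is missing."""
    src = BeautifulSoup(page_source, 'html.parser')
    container = src.find('div', {'class': 'c1_t2i'})
    if container is None:
        return []
    return container.find_all('div', {'class': 'c2prKC'})

# With a real browser you would obtain page_source like this (requires selenium + Chrome):
#
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
#
# driver = webdriver.Chrome()
# driver.get(my_url)
# # wait up to 10 s for the product grid instead of a fixed sleep(4)
# WebDriverWait(driver, 10).until(
#     EC.presence_of_element_located((By.CSS_SELECTOR, 'div.c1_t2i')))
# page_source = driver.page_source
# driver.quit()

# Exercise the parser on a minimal sample of the rendered markup:
sample = ('<div class="c1_t2i">'
          '<div class="c2prKC">item 1</div>'
          '<div class="c2prKC">item 2</div>'
          '</div>')
items = extract_products(sample)
print(len(items))                                    # 2
print(extract_products('<div>no grid here</div>'))   # []
```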
      

      【Discussion】:
