How to scrape data from Shopee using Beautiful Soup: answers

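【Question title】: How to scrape data from Shopee using Beautiful Soup
【Posted】: 2020-09-15 08:26:37
【Question description】:

I am currently a student learning BeautifulSoup, and my lecturer has asked me to scrape data from Shopee, but I cannot scrape the product details. Right now I am trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I only want the product names and prices. Can anyone tell me why I cannot scrape the data with BeautifulSoup?

Here is my code: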

from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

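【Question comments】:

Add a code snippet showing what you have tried so far; otherwise how can anyone help you?
Hi, welcome to SO. Please add the input you have tried so far and the expected output. stackoverflow.com/help/minimal-reproducible-example
I apologize for my ignorance @RajuBhaya
@Ana You can check my answer; mnm has also correctly explained the DOM elements. I added both a Selenium way and a BeautifulSoup way.

Tags: python web-scraping beautifulsoup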

【Solution 1】:

Please post your code so that we can help.

Or you can start like this.. :)

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReg


my_url = "<url>"
uClient = uReg(my_url)
page_html = uClient.read()

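【Comments】:

Don't post incomplete answers. I understand you are trying to help the OP, but what you posted is basically not much help. Ideally, you should offer this kind of help as a comment! See what I mean?
Thanks for the advice. Sorry, I am still new, so I am a bit unclear about what I should do @mnm
@Ana I would like to explain the answer to you, but I cannot, because this question is now closed and not accepting answers. If you are still interested in understanding and learning, I suggest you Ask a New Q. I also suggest you look at the selenium library for browser automation. This library is the key to solving your problem. When you ask the new Q, tag me in it with @mnm and I will be notified. Please show some effort in the new Q, for example by citing existing posts related to your Q and explaining where you are stuck now. Hope this helps.
@mnm I have asked a new question about this > stackoverflow.com/q/62145548/13632126
@Ana, I noticed that the other Q has been deleted, and I also found that this original Q has been reopened. So I have posted my answer. Hope this helps.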

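【Solution 2】:

This problem is a bit tricky (for a Python beginner) because it involves a combination of selenium (for headless browsing) and beautifulsoup (for HTML data extraction). It is made harder because the Document Object Model (DOM) is built by JavaScript. We know JavaScript is involved because we get an empty result from the site when we access it with beautifulsoup alone, for example with for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'): print(item_n.get_text())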
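
To make the empty result concrete, here is a minimal sketch that needs no network access. Both HTML strings are made up for illustration (they are not Shopee's real markup): the first stands for what a plain HTTP request receives before any script has run, the second for the same page after a browser has executed the script.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: roughly what a script-driven page looks like
# before any JavaScript has run. The product list is not in it yet.
static_html = """
<html>
  <body>
    <div id="main"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
"""

# Hypothetical markup after a browser has executed the script.
rendered_html = """
<html>
  <body>
    <div id="main">
      <div class="_1NoI8_ _16BAGk">NESCAFE GOLD Refill (170g)</div>
    </div>
  </body>
</html>
"""


def product_names(html):
    """Return the text of every div carrying the product-name classes."""
    soup = BeautifulSoup(html, "html.parser")
    return [d.get_text() for d in soup.find_all("div", class_="_1NoI8_ _16BAGk")]


static_names = product_names(static_html)
rendered_names = product_names(rendered_html)
print(static_names)    # nothing to find in the unrendered page
print(rendered_names)  # the name appears once the DOM has been built
```

The parser is doing its job in both cases; the difference is only in which HTML it is handed.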

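Therefore, to extract data from a web page that uses a scripting language to control its DOM, we have to use selenium for headless browsing (this tells the website that a browser is accessing it). We also have to use some kind of delay parameter, which gives the page's scripts time to render the content and makes the visit look more like a human one. The WebDriverWait() function in the selenium library helps with this.

I will now present snippets of code that explain the process.

First, import the necessary libraries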

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep

Next, initialize the settings for the headless browser. I am using Chrome.

# create object for chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'

# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", { 
    "profile.default_content_setting_values.notifications": 2
    })
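# invoke the webdriver
# (executable_path is the Selenium 3 keyword; Selenium 4 takes the driver path through a Service object)
browser = webdriver.Chrome(executable_path = r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                          options = chrome_options)
browser.get(base_url)
delay = 5 # seconds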

Next, I declare empty list variables to hold the data.

# declare empty lists
item_cost, item_init_cost, item_loc = [],[],[]
item_name, items_sold, discount_percent = [], [], []
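while True:
    try:
        # WebDriverWait only blocks when .until() is called; wait for the document to finish loading
        WebDriverWait(browser, delay).until(
            lambda d: d.execute_script("return document.readyState") == "complete")
        print ("Page is ready")
        sleep(5)  # extra pause so the scripts can render the product list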
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")

        # find_all() returns a list of elements.
        # Go through all of them, pick the ones you need, and then call get_text()
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)

        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)

        # find initial item cost
        for item_ic in soup.find_all('div', class_ = '_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)
        # find total number of items sold/month
        for items_s in soup.find_all('div',class_ = '_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)

        # find item discount percent
        for dp in soup.find_all('span', class_ = 'percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)
        # find item location
        for il in soup.find_all('div', class_ = '_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)

        break # leave the loop once the page has been parsed
    except TimeoutException:
        print ("Loading took too much time! Trying again")

After that, I use the zip function to combine the different lists.

rows = zip(item_name, item_init_cost,discount_percent,item_cost,items_sold,item_loc)
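
One thing worth knowing about zip here, shown as a small sketch with made-up values: it stops at the shortest list. If, say, one product has no discount badge, discount_percent ends up shorter than item_name and the last rows are silently dropped. itertools.zip_longest keeps every row, although it only pads the end of the short list and cannot tell which product the value was missing for.

```python
from itertools import zip_longest

# Made-up sample data: three products, but only two discount values.
item_name = ["Item A", "Item B", "Item C"]
item_cost = ["29.90", "21.90", "16.48"]
discount_percent = ["27%", "18%"]

# zip() stops at the shortest input, so "Item C" disappears.
short_rows = list(zip(item_name, item_cost, discount_percent))

# zip_longest() keeps all rows and fills the gaps with a placeholder.
padded_rows = list(zip_longest(item_name, item_cost, discount_percent, fillvalue=""))

print(len(short_rows), len(padded_rows))
```

If the columns must line up exactly, extracting all fields from one product card at a time is the more robust design, since each row is then built from a single element.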

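Finally, I write this data to disk,

import csv
newFilePath = 'shopee_item_list.csv'
# newline="" stops the csv module from writing blank lines between rows on Windows
with open(newFilePath, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)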
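
A possible variation, given as a sketch rather than as part of the original answer: write a header row first so the columns are labelled, and read the file back to check it. The column names, the two sample rows and the temporary directory are assumptions made for the example.

```python
import csv
import os
import tempfile

# Hypothetical column names and sample rows, for illustration only.
header = ["name", "price"]
rows = [
    ("MAGGI Hot Cup - Asam Asam Laksa (60g)", "1.69"),
    ("NESCAFE GOLD Refill (170g)", "21.90"),
]

# Write into a temporary directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "shopee_item_list.csv")

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)   # column labels first
    writer.writerows(rows)    # then all data rows in one call

# Read the file back to confirm what was written.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back)
```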

As a good practice, it is wise to close the headless browser once the task is done. So I coded it as,

# close the automated browser
browser.close()

Result

Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor
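
All of these values come back as strings. If numbers are needed for sorting or arithmetic, something along the following lines could convert them. This is only a sketch: parse_price and parse_sold are hypothetical helper names, and the formats are inferred from the sample output above (an optional RM prefix, an optional k suffix meaning thousands). A count without the k suffix, such as "850 sold/month", is an assumed format that does not appear in the output.

```python
import re


def parse_price(text):
    """Turn 'RM40.70' or '29.90' into a float."""
    return float(text.replace("RM", "").replace(",", "").strip())


def parse_sold(text):
    """Turn '9k sold/month' or '850 sold/month' into an int."""
    match = re.match(r"\s*([\d.]+)\s*(k?)", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("unrecognised sold count: " + repr(text))
    number, suffix = match.groups()
    value = float(number)
    if suffix:
        value *= 1000  # 'k' means thousands
    # round() absorbs floating-point noise such as 2.3 * 1000
    return int(round(value))


print(parse_price("RM40.70"))
print(parse_sold("2.3k sold/month"))
```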

Note to readers

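The OP brought to my attention that the xpath does not work as given in my answer. I checked the website again two days later and noticed something odd: the class_ attribute of the div elements had indeed changed. I found a similar Q, but it was not much help. So for now I conclude that the div attributes on the Shopee website can change again. I leave this as an open problem to be solved.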
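
One possible way around the changing class names, sketched under assumptions: select on an attribute that describes the content instead of on a generated class. The data-sqe="name" attribute is the one another answer in this thread selects on; the two HTML strings below are invented to imitate a product card before and after a class rename, and there is no guarantee that Shopee keeps this attribute stable either.

```python
from bs4 import BeautifulSoup

# Hypothetical product card with the class names used in the answer above.
html_before = (
    '<a href="/p1"><div data-sqe="name">'
    '<div class="_1NoI8_ _16BAGk">MAGGI PAZZTA Cheese Macaroni 70g</div>'
    '</div></a>'
)

# The same card after the site regenerated its class names (made-up value).
html_after = (
    '<a href="/p1"><div data-sqe="name">'
    '<div class="xYz123">MAGGI PAZZTA Cheese Macaroni 70g</div>'
    '</div></a>'
)


def names_by_class(html):
    """Select product names through the generated class names."""
    soup = BeautifulSoup(html, "html.parser")
    return [d.get_text() for d in soup.find_all("div", class_="_1NoI8_ _16BAGk")]


def names_by_attribute(html):
    """Select product names through the data-sqe attribute instead."""
    soup = BeautifulSoup(html, "html.parser")
    return [d.get_text() for d in soup.select('div[data-sqe="name"] > div')]


print(names_by_class(html_after))      # breaks after the rename
print(names_by_attribute(html_after))  # still finds the name
```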

Note to the OP

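Ana, the code above works for only one page, i.e. only for https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to sharpen your skills further by working out how to scrape data from multiple pages under the sales tab. Your hints are the 1/9 at the top right of the page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint is to look at urljoin from the urlparse library (urllib.parse in Python 3). Hope this gets you started.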
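
As a starting point for that exercise, here is a sketch of how the page URLs could be generated. It assumes that the page query parameter is zero-based and that the 1/9 indicator means pages 0 to 8; neither is confirmed here. It also swaps in urlsplit and urlencode from urllib.parse rather than urljoin, since only the query string changes; page_url is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page):
    """Return base_url with its 'page' query parameter set to the given number."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))  # e.g. {'page': '0', 'sortBy': 'sales'}
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


base_url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"

# Assumption: nine pages in total, numbered 0 to 8.
urls = [page_url(base_url, n) for n in range(9)]

for u in urls:
    print(u)
```

Each URL can then be passed to browser.get() in turn, with the parsing code run once per page.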

Useful resources

XPATH tutorial

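【Comments】:

Thank you very much for the guidance. I would like to ask about this part --> chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default'). I changed it to my local Chrome path, but it shows the following --> chrome_options.add_argument('C:\Users\ACER\AppData\Local\Google\Chrome\User Data\Default') ^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape. Could you tell me where I went wrong in this part? Sorry to bother you again.
@Ana, I am happy to help beginners. You must develop a habit of asking questions, but there is no need to apologize to everyone. You are a beginner and we know it. You are bound to make mistakes, and that is normal; we all make them at some point. That said, if you look closely, I used double backslashes \\ in my code whereas you used single backslashes \, hence the error. The error is telling you that it cannot interpret \. So the Q is: why \\ and not \? The answer is that in a Python string a single backslash starts an escape sequence (here \U is read as the start of a Unicode escape), so the backslashes in a directory path have to be doubled.
So change C:\Users\ACER\AppData\Local\Google\Chrome\User Data\Default in that line of code to C:\\Users\\ACER\\AppData\\Local\\Google\\Chrome\\User Data\\Default. Try it, and if it does not work, I suggest simply commenting out that line of code.
Thank you. It works fine, but it only shows the product name for 1 product......
@Ana Can you show me the code you ran to get the product information?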
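
On the backslash error discussed in the comments above, a short sketch of the usual ways to write a Windows path in Python without triggering the unicodeescape error. The path is the one from the comment; the variable names are illustrative only.

```python
from pathlib import PureWindowsPath

# 1. Double every backslash so that none of them starts an escape sequence.
escaped = 'C:\\Users\\ACER\\AppData\\Local\\Google\\Chrome\\User Data\\Default'

# 2. Use a raw string (r'...'), in which backslashes are taken literally.
raw = r'C:\Users\ACER\AppData\Local\Google\Chrome\User Data\Default'

# 3. Use forward slashes, which Windows path handling also understands.
forward = 'C:/Users/ACER/AppData/Local/Google/Chrome/User Data/Default'

print(escaped == raw)                        # both spell the same string
print(str(PureWindowsPath(forward)) == raw)  # same path once normalised

# The Chrome option still needs its 'user-data-dir=' prefix in front of the path.
user_data_arg = 'user-data-dir=' + raw
print(user_data_arg)
```

Note that the failing call quoted in the comment also dropped the user-data-dir= prefix, so restoring it is worth checking alongside the backslashes.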

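【Solution 3】:

The page content is loaded asynchronously via ajax after the first request, so sending a single request and getting the source of the desired page does not seem possible.

You should simulate a browser instead; then you can get the page source and use beautifulsoup on it. See the code:

The Beautiful Soup way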

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
search = soup.select_one('.shop-search-result-view')
products = search.find_all('a')

for p in products:
    name = p.select('div[data-sqe="name"] > div')[0].get_text()
    price = p.select('div > div:nth-child(2) > div:nth-child(2)')[0].get_text()
    product = p.select('div > div:nth-child(2) > div:nth-child(4)')[0].get_text()
    print('name: ' + name)
    print('price: ' + price)
    print('product: ' + product + '\n')

However, selenium on its own is also a good way to get everything you want. See the example below:

The Selenium way

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

# find_element_by_* was removed in Selenium 4; find_element(By..., ...) works in both 3 and 4
search = driver.find_element(By.CSS_SELECTOR, '.shop-search-result-view')
products = search.find_elements(By.CSS_SELECTOR, 'a')

for p in products:
    name = p.find_element(By.CSS_SELECTOR, 'div[data-sqe="name"] > div').text
    price = p.find_element(By.CSS_SELECTOR, 'div > div:nth-child(2) > div:nth-child(2)').text
    product = p.find_element(By.CSS_SELECTOR, 'div > div:nth-child(2) > div:nth-child(4)').text
    print('name: ' + name)
    print('price: ' + price.replace('\n', ' | '))
    print('product: ' + product + '\n')

【Comments】: