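This problem is a bit tricky for Python beginners because it combines selenium (for headless browsing) with beautifulsoup (for extracting data from HTML). It is harder still because the page's document object model (DOM) is built by JavaScript. We know JavaScript is involved because parsing the page with beautifulsoup alone, without a browser, gives an empty result, for example:
for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_n.get_text())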
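So, to extract data from a web page that uses a scripting language to control its DOM, we have to drive a real browser with selenium (this tells the website that a browser is accessing it). We also need some kind of delay, which gives the scripts time to render the content and makes the visit look more like a human one. Selenium's WebDriverWait class helps with this.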
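To make the waiting idea concrete, here is a minimal pure-Python sketch of the polling pattern that an explicit wait follows: call a condition repeatedly until it returns something truthy or a timeout expires. The helper name `wait_until`, its parameters, and the simulated page are hypothetical, and this is not Selenium's implementation; in Selenium the corresponding call is `WebDriverWait(driver, timeout).until(condition)`.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Poll condition() until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate a page whose content only shows up on the third check.
attempts = {"count": 0}


def page_has_items():
    attempts["count"] += 1
    return attempts["count"] >= 3


found = wait_until(page_has_items, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

The point to take away is that the waiting happens inside the polling call. Merely constructing a `WebDriverWait` object does not pause anything until `until()` is called on it.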
I now present snippets of the code that explain the process.
First, import the necessary libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
Next, initialize the settings for the headless browser. I am using Chrome.
# create object for chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'
# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})
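# invoke the webdriver
# (executable_path works on older Selenium releases; recent Selenium 4
# releases expect a Service object instead)
browser = webdriver.Chrome(executable_path = r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options = chrome_options)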
browser.get(base_url)
delay = 5  # seconds
Next, I declare empty list variables to hold the data, then fill them by parsing the rendered page.
# declare empty lists
item_cost, item_init_cost, item_loc = [], [], []
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
        # wait up to `delay` seconds until the browser reports the document as loaded;
        # without .until(...), WebDriverWait does not wait and never times out
        WebDriverWait(browser, delay).until(
            lambda d: d.execute_script("return document.readyState") == "complete")
        print("Page is ready")
        # give the page's scripts a few more seconds to render the item list
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")
        # find_all() returns a list of matching elements;
        # loop over them and call get_text() on each one you need
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)
        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)
        # find initial item cost
        for item_ic in soup.find_all('div', class_ = '_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)
        # find total number of items sold/month
        for items_s in soup.find_all('div', class_ = '_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)
        # find item discount percent
        for dp in soup.find_all('span', class_ = 'percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)
        # find item location
        for il in soup.find_all('div', class_ = '_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)
        break  # leave the loop once the page has been parsed
    except TimeoutException:
        print("Loading took too much time! Trying again")
After that, I use the zip function to combine the items of the different lists into rows.
rows = zip(item_name, item_init_cost,discount_percent,item_cost,items_sold,item_loc)
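One caveat worth knowing: zip stops at the shortest input, so if one list comes back shorter than the others (for instance, if some items show no discount), the trailing rows are silently dropped. The sketch below uses made-up values, not data from the site, and shows `itertools.zip_longest` as one way to keep every row.

```python
from itertools import zip_longest

# Made-up sample data; the discount list is one entry short.
names = ["Item A", "Item B", "Item C"]
prices = ["29.90", "21.90", "16.48"]
discounts = ["27%", "18%"]

# zip() truncates to the shortest list.
short_rows = list(zip(names, prices, discounts))

# zip_longest() pads the missing values instead of dropping rows.
full_rows = list(zip_longest(names, prices, discounts, fillvalue=""))

print(len(short_rows))
print(len(full_rows))
print(full_rows[-1])
```

Padding keeps the row count intact, but it cannot tell which item was actually missing a value. If alignment matters, extracting all fields from each item's own container element is likely the safer design.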
Finally, I write the data to disk:
import csv
newFilePath = 'shopee_item_list.csv'
# newline="" prevents blank lines between rows on Windows
with open(newFilePath, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
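A variation that may be handy is to write a header row first and pass all rows at once with `writerows`. The sketch below uses made-up rows and an in-memory buffer so that it runs without touching the disk. The column names are my own labels, not something the site provides.

```python
import csv
import io

# Hypothetical column labels, in the same order as the zip() call.
header = ["name", "initial_cost", "discount", "cost", "sold", "location"]

# Made-up rows for illustration only.
sample_rows = [
    ("Item A", "RM10.00", "10%", "9.00", "1k sold/month", "Selangor"),
    ("Item B", "RM20.00", "25%", "15.00", "2k sold/month", "Selangor"),
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)         # header row first
writer.writerows(sample_rows)   # then all data rows in one call

csv_text = buffer.getvalue()
print(csv_text)
```

To write to a real file, replace the buffer with `open(newFilePath, "w", newline="")` inside a `with` statement.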
As good practice, it is wise to close the headless browser once the task is complete. So I code it as:
# close the automated browser
browser.close()
Result
Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor
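Note to readers
The OP brought to my notice that the xpath was not working as given in my answer. I checked the website again after two days and noticed a strange phenomenon: the class_ attribute of the div elements had indeed changed. I found a similar Q, but it did not help much. So for now, I conclude that the div attributes on the Shopee website can change again. I leave this as an open problem to be solved later.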
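Note to OP
Ana, the above code works for just one page, i.e. it works only for the web page https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by working out how to scrape data from multiple pages under the sales tag. Your hint is the 1/9 shown at the top right of this page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint is to look at urljoin in the urlparse library (urllib.parse in Python 3). Hope this gets you started.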
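As a possible starting point for that exercise, here is a hedged sketch that builds the URL of each results page by changing only the page query parameter. It assumes that pages are numbered from 0 and that there are 9 of them (going by the 1/9 indicator); both are assumptions to check against the live site. The function name `page_url` is hypothetical, and the sketch uses urlsplit and urlencode from urllib.parse rather than urljoin.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

base_url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"


def page_url(url, page):
    """Return `url` with its 'page' query parameter set to `page`."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


total_pages = 9  # assumption taken from the "1/9" indicator
page_urls = [page_url(base_url, n) for n in range(total_pages)]

for url in page_urls[:3]:
    print(url)
```

Each URL could then be passed to browser.get() in a loop around the scraping code. urljoin becomes more useful if you instead follow the relative links behind the page-number anchors at the bottom of the page.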
Useful resources