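【Posted】: 2021-09-09 20:14:08
【Problem description】:
I am trying to scrape the Amazon website to get data about its products. I use Selenium with Firefox and BeautifulSoup4 to get each product's name, price, and currency.
However, the final list containing all the results ends up with duplicated data. All the results are the same, and I don't know why.
Here is my code: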
import json

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

url = 'https://www.amazon.com.br/'

option = Options()
option.headless = True
driver = webdriver.Firefox(options=option)

driver.get(url)
driver.find_element_by_id('twotabsearchtextbox').send_keys('teclado mecânico')
driver.find_element_by_id('nav-search-submit-button').click()

products_html = driver.find_elements_by_xpath("//div[@class='a-section a-spacing-medium']")
products_list = [{'title': '', 'image': '', 'price': '', 'currency': ''}] * len(products_html)

for i in range(len(products_list)):
    html_content = products_html[i].get_attribute('innerHTML')
    soup = BeautifulSoup(html_content, 'lxml')

    title = soup.find('span', class_='a-size-base-plus a-color-base a-text-normal')
    image = soup.find('img', class_='s-image')
    price = soup.find('span', class_='a-price-whole')
    decimal = soup.find('span', class_='a-price-fraction')
    currency = soup.find('span', class_='a-price-symbol')

    products_list[i]['title'] = title.text if title else ''
    products_list[i]['image'] = image['src'] if image else ''
    products_list[i]['price'] = price.text + decimal.text if price else ''
    products_list[i]['currency'] = currency.text if currency else ''

driver.quit()

with open('data.json', 'w') as data:
    json.dump(products_list, data, indent=4)
A few lines of my JSON file:
[
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
As you can see, the JSON is full of the same data.
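One plausible cause, sketched here without Selenium or the real page: the expression `[{...}] * len(products_html)` repeats a reference to a single dict rather than creating a new dict per slot, so every assignment inside the loop would overwrite the same object. The `scraped` rows and the `fill` helper below are hypothetical stand-ins for the scraped values, not part of the code above.

```python
# Hypothetical stand-in for the values scraped from each product card.
scraped = [("Keyboard A", "100,00"), ("Keyboard B", "200,00"), ("Keyboard C", "300,00")]


def fill(products, rows):
    """Write each row into the dict stored at the same index (illustrative helper)."""
    for i, (title, price) in enumerate(rows):
        products[i]["title"] = title
        products[i]["price"] = price
    return products


# List multiplication copies the reference, so every slot is the same dict object.
aliased = fill([{"title": "", "price": ""}] * len(scraped), scraped)

# A list comprehension evaluates the dict literal once per iteration,
# so each slot holds its own dict.
independent = fill([{"title": "", "price": ""} for _ in scraped], scraped)

print(aliased[0] is aliased[1])           # True: one shared dict
print([p["title"] for p in aliased])      # every entry shows the last row written
print(independent[0] is independent[1])   # False: separate dicts
print([p["title"] for p in independent])  # one title per row
```

If this is indeed the cause, building `products_list` with a comprehension, or appending a new dict on each loop iteration, should give one distinct entry per product.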
【Discussion】:
Tags: python json selenium web-scraping beautifulsoup