【Question title】: Webscraping: Problem with dictionary inside list, json with duplicated data
【Posted】: 2021-09-09 20:14:08
【Question】:

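I'm trying to scrape the Amazon website to get data about its products. I use Selenium (Firefox) and BeautifulSoup4 to get each product's name, price, and currency.

However, the final list that should hold all the results ends up filled with duplicated data. Every entry is identical, and I don't know why.

Here is my code: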

import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

url = 'https://www.amazon.com.br/'

option = Options()
option.headless = True
driver = webdriver.Firefox(options=option)

driver.get(url)

driver.find_element_by_id('twotabsearchtextbox').send_keys('teclado mecânico')
driver.find_element_by_id('nav-search-submit-button').click()

products_html = driver.find_elements_by_xpath("//div[@class='a-section a-spacing-medium']")
products_list = [{'title': '', 'image': '', 'price': '', 'currency': ''}] * len(products_html)

for i in range(len(products_list)):
    html_content = products_html[i].get_attribute('innerHTML')
    soup = BeautifulSoup(html_content, 'lxml')
    
    title = soup.find('span', class_='a-size-base-plus a-color-base a-text-normal')
    image = soup.find('img', class_='s-image')
    price = soup.find('span', class_='a-price-whole')
    decimal = soup.find('span', class_='a-price-fraction')
    currency = soup.find('span', class_='a-price-symbol')

    products_list[i]['title'] = title.text if title else ''
    products_list[i]['image'] = image['src'] if image else ''
    products_list[i]['price'] = price.text + decimal.text if price else ''
    products_list[i]['currency'] = currency.text if currency else ''

driver.quit()

with open('data.json', 'w') as data:
    json.dump(products_list, data, indent=4)

A few lines of my JSON file:

[
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },

As you can see, the JSON is full of the same data.

【Comments】:

    Tags: python json selenium web-scraping beautifulsoup


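    【Solution 1】:

    When you create products_list like this, you are not creating N distinct dictionaries. You are creating a list that holds N references to a single dictionary, so modifying any one of them modifies all of them.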
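
To make the aliasing concrete, here is a minimal sketch that does not need Selenium. The names `template`, `shared`, and `independent` are illustrative and do not come from the question; the list comprehension at the end is shown only as a contrast, not as the fix the answer proposes.

```python
# Multiplying a list repeats references to the SAME dict object.
template = {'title': '', 'price': ''}
shared = [template] * 3

shared[0]['title'] = 'keyboard'    # write through the first reference only
print(shared[1]['title'])          # keyboard
print(shared[0] is shared[1])      # True

# For contrast: a list comprehension evaluates the dict literal once per
# iteration, so each slot holds its own dictionary.
independent = [{'title': '', 'price': ''} for _ in range(3)]
independent[0]['title'] = 'keyboard'
print(independent[0] is independent[1])  # False
```

Because every slot in `shared` points at one object, the last assignment made in the loop is what ends up in every entry of the JSON file.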

    Instead, create products_list as an empty list:

    products_list = []
    

    Then append a new dictionary on each iteration of the loop.

    products_list.append({
        'title': title.text if title else '',
        'image': image['src'] if image else '',
        'price': price.text + decimal.text if price else '',
        'currency': currency.text if currency else ''
    })
    
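
As a rough end-to-end sketch of the append approach, the loop below runs without a browser. `raw_products` is a hypothetical stand-in for the values that the `soup.find(...)` calls would return for each product card (with `None` where an element is missing); it is not real scraped data. The extra check on `decimal` is an added assumption, covering a card that has a whole price but no fraction element.

```python
import json

# Hypothetical stand-in for the fields extracted from each product card.
raw_products = [
    {'title': 'Keyboard A', 'price': '732,', 'decimal': '00', 'currency': 'R$'},
    {'title': 'Keyboard B', 'price': None, 'decimal': None, 'currency': None},
]

products_list = []  # start empty instead of pre-filling with one shared dict
for raw in raw_products:
    price, decimal = raw['price'], raw['decimal']
    # The dict literal below creates a NEW dictionary on every iteration.
    products_list.append({
        'title': raw['title'] or '',
        'price': price + decimal if price and decimal else '',
        'currency': raw['currency'] or '',
    })

print(json.dumps(products_list, indent=4))
```

Each entry is now a separate object, so filling in one product no longer overwrites the others.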
