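【Question Title】: Trouble grabbing HTML variable using Python scraping
【Posted】: 2018-11-17 23:52:11
【Question】:

This code is meant to scrape values from a website and write them out so they can be plotted. I am trying to use it to chart graphics card prices over time.

I am using BeautifulSoup and everything works, except that the prices are not written out correctly.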

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.newegg.ca/Product/ProductList.aspx?Submit=ENE&N=100007708%20601210955%20601203901%20601294835%20601295933%20601194948&IsNodeId=1&bop=And&Order=BESTSELLING&PageSize=96"

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("div",{"class":"item-container"})

filename = "GPU Prices.csv"
f = open(filename, "w")

header = "Price,Product Brand,Product Name,Shipping Cost\n"

f.write(header)

for container in containers:
    price_container = container.findAll("li", {"class":"price-current"})
    price = price_container[0].text.strip()

    brand = container.div.div.a.img["title"]

    title_container = container.findAll("a", {"class":"item-title"})
    product_name = title_container[0].text

    shipping_container = container.findAll("li", {"class":"price-ship"})
    shipping = shipping_container[0].text.strip()

    price
    f.write(price.replace(",", "") + "," + brand.replace(",", ".") + "," +  product_name.replace(",", " |") + "," +  shipping + "\n")

f.close()

After running it, the CSV file looks like this:

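【Comments】:

  • What do you mean by the price not printing correctly? Do you not want the available offers printed? Or is the problem how they are formatted in Excel?
  • If you look at the attached screenshot, the price column skips rows and contains entries such as "-" and "|". It is not uniform: extra lines are created, so everything else lines up but the price does not.

Tags: python html csv web-scraping beautifulsoup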


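【Solution 1】:

I suggest using the Python csv library to write the file. A csv.writer() object takes a list of items and automatically adds commas between them. If any item contains a comma, it automatically encloses that entry in quotes, which is the correct way to handle it. The file will then load correctly.

In addition, your price variable needs some trimming, because the scraped text carries extra characters around the price (spaces, line breaks, a dash and a pipe) that have to be removed first.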
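The trimming step can be tried on its own before running the scraper. The sketch below is only an illustration: `clean_price` is a hypothetical helper name, and the sample string is an assumption about what the scraped text might look like, not text captured from the actual page.

```python
def clean_price(raw_text):
    """Normalize a scraped price string (hypothetical helper, not part of the original answer)."""
    # Turn non-breaking spaces into ordinary spaces so the text displays normally.
    text = raw_text.replace("\xa0", " ")
    # str.strip() with an argument removes any of these characters from both ends only:
    # space, en dash, carriage return, newline and pipe. Characters inside the text stay.
    return text.strip(" –\r\n|")


# Assumed sample input; the real page text may differ.
sample = "|\n$1,079.99\xa0(5 Offers)\n–\r\n"
cleaned = clean_price(sample)
print(cleaned)
```

Because `strip()` only works on the ends of the string, the comma and the spaces inside the price are left alone.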

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

my_url = "https://www.newegg.ca/Product/ProductList.aspx?Submit=ENE&N=100007708%20601210955%20601203901%20601294835%20601295933%20601194948&IsNodeId=1&bop=And&Order=BESTSELLING&PageSize=96"

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div",{"class":"item-container"})
filename = "GPU Prices.csv"
header = ['Price', 'Product Brand', 'Product Name', 'Shipping Cost']

# newline='' lets the csv module control line endings itself
with open(filename, 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)

    for container in containers:
        price_container = container.findAll("li", {"class":"price-current"})
        # Replace non-breaking spaces, then strip stray spaces, dashes, line breaks and pipes from both ends
        price = price_container[0].text.replace('\xa0', ' ').strip(' –\r\n|')

        brand = container.div.div.a.img["title"]

        title_container = container.findAll("a", {"class":"item-title"})
        product_name = title_container[0].text

        shipping_container = container.findAll("li", {"class":"price-ship"})
        shipping = shipping_container[0].text.strip()

        # writerow() adds the separators and quotes any field that contains a comma
        csv_output.writerow([price, brand, product_name, shipping])

This gives you a GPU Prices.csv file that starts with:

Price,Product Brand,Product Name,Shipping Cost
"$1,079.99 (5 Offers)",ASUS,ASUS ROG GeForce GTX 1080 Ti DirectX 12 STRIX-GTX1080TI-O11G-GAMING 11GB 352-Bit GDDR5X PCI Express 3.0 HDCP Ready Video Card,$4.99 Shipping
$794.99 (5 Offers),ASUS,ASUS ROG GeForce GTX 1080 STRIX-GTX1080-A8G-GAMING 8GB 256-Bit GDDR5X PCI Express 3.0 HDCP Ready Video Card,$4.99 Shipping

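Note that the price in the second data row contains no comma, so it is not enclosed in quotes. This is correct, and applications such as Excel will handle it properly.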
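The quoting behaviour can be checked without touching the website. The following is a minimal sketch that writes two shortened rows (only the price and brand columns, taken from the output above) into an in-memory buffer and reads them back. The use of `io.StringIO` and the variable names are choices made for this illustration.

```python
import csv
import io

# Shortened rows based on the output above (two columns only, for illustration).
rows = [
    ["Price", "Product Brand"],
    ["$1,079.99 (5 Offers)", "ASUS"],
    ["$794.99 (5 Offers)", "ASUS"],
]

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back: the quoted price comes out as a single field again.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed[1][0])
```

Only the field that contains a comma is quoted on the way out, and csv.reader restores it to one field on the way in, which is why the columns stay aligned.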
