Trouble grabbing an HTML variable using Python scraping: answers

【Question title】: Trouble Grabbing HTML variable using python scraping
【Posted】: 2018-11-17 23:52:11
【Question description】:

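This code is meant to scrape data values from a website and record them. I am trying to use it to graph the price of graphics cards over time.

I am using BeautifulSoup and everything works fine, except that I cannot get the prices to print properly.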

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.newegg.ca/Product/ProductList.aspx?Submit=ENE&N=100007708%20601210955%20601203901%20601294835%20601295933%20601194948&IsNodeId=1&bop=And&Order=BESTSELLING&PageSize=96"

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("div",{"class":"item-container"})

filename = "GPU Prices.csv"
f = open(filename, "w")

header = "Price,Product Brand,Product Name,Shipping Cost\n"

f.write(header)

for container in containers:
    price_container = container.findAll("li", {"class":"price-current"})
    price = price_container[0].text.strip()

    brand = container.div.div.a.img["title"]

    title_container = container.findAll("a", {"class":"item-title"})
    product_name = title_container[0].text

    shipping_container = container.findAll("li", {"class":"price-ship"})
    shipping = shipping_container[0].text.strip()

    price
    f.write(price.replace(",", "") + "," + brand.replace(",", ".") + "," +  product_name.replace(",", " |") + "," +  shipping + "\n")

f.close()

After running it, the CSV file looks like this:

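【Question discussion】:

What do you mean by the prices not printing properly? Do you not want the available offers printed? Or is it about formatting them in Excel?
If you look at the attached screenshot, the price column skips rows and contains entries such as "-" and "|". Instead of being uniform, it creates extra lines, so everything else lines up but the cost does not.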

Tags: python html csv web-scraping beautifulsoup

【Solution 1】:

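I would recommend using the Python csv library to help you write the file. A csv.writer() object takes a list of items and automatically adds the commas between them. If any item contains a comma, it automatically wraps that entry in quotes, which is the correct way to handle it. The file will then load correctly.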
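As a minimal sketch of that quoting behaviour, the snippet below writes one row to an in-memory buffer instead of a file. The row contents are made-up sample values, not data taken from the site.

```python
import csv
import io

# Hypothetical sample row; only the first field contains a comma.
row = ["$1,079.99 (5 Offers)", "ASUS", "Example card", "$4.99 Shipping"]

# io.StringIO stands in for a file opened with newline=''.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(row)

# The writer quotes the field that contains a comma and leaves the rest bare.
line = buffer.getvalue()
print(line, end="")
```

The default csv.writer dialect ends each row with "\r\n", which is why the file in the full script is opened with newline=''.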

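Also, your price variable needs some trimming, because it contains a number of surrounding characters (non-breaking spaces, newlines, "–" and "|") that need to be removed first.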
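To illustrate the trimming on its own, here is a small sketch. The raw_price string is an assumed example of what the price element's text might look like; the real page markup may differ.

```python
# Hypothetical raw text of a price element (an assumption, not scraped output).
raw_price = "\r\n|\r\n$1,079.99\xa0(5 Offers)\r\n–\r\n"

# Turn non-breaking spaces into normal spaces, then strip any mix of
# spaces, dashes, carriage returns, newlines and pipes from both ends.
price = raw_price.replace("\xa0", " ").strip(" –\r\n|")

print(repr(price))
```

Note that strip() treats its argument as a set of characters to remove from both ends, not as a literal prefix or suffix, so the order of the characters in the argument does not matter.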

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

my_url = "https://www.newegg.ca/Product/ProductList.aspx?Submit=ENE&N=100007708%20601210955%20601203901%20601294835%20601295933%20601194948&IsNodeId=1&bop=And&Order=BESTSELLING&PageSize=96"

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div",{"class":"item-container"})
filename = "GPU Prices.csv"
header = ['Price', 'Product Brand', 'Product Name', 'Shipping Cost']

with open(filename, 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)

    for container in containers:
        price_container = container.findAll("li", {"class":"price-current"})
        price = price_container[0].text.replace('\xa0', ' ').strip(' –\r\n|')

        brand = container.div.div.a.img["title"]

        title_container = container.findAll("a", {"class":"item-title"})
        product_name = title_container[0].text

        shipping_container = container.findAll("li", {"class":"price-ship"})
        shipping = shipping_container[0].text.strip()

        csv_output.writerow([price, brand, product_name, shipping])

This gives you a GPU Prices.csv that begins:

Price,Product Brand,Product Name,Shipping Cost
"$1,079.99 (5 Offers)",ASUS,ASUS ROG GeForce GTX 1080 Ti DirectX 12 STRIX-GTX1080TI-O11G-GAMING 11GB 352-Bit GDDR5X PCI Express 3.0 HDCP Ready Video Card,$4.99 Shipping
$794.99 (5 Offers),ASUS,ASUS ROG GeForce GTX 1080 STRIX-GTX1080-A8G-GAMING 8GB 256-Bit GDDR5X PCI Express 3.0 HDCP Ready Video Card,$4.99 Shipping

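Note that the price in the second row contains no comma, so it is not wrapped in quotes. This is correct and will be handled properly by applications such as Excel.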
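One way to check this without opening Excel is to read the rows back with csv.reader. The sketch below uses an in-memory sample with shortened, made-up product names rather than the real file.

```python
import csv
import io

# Abbreviated sample in the same layout as the output file (product names are placeholders).
sample = (
    'Price,Product Brand,Product Name,Shipping Cost\r\n'
    '"$1,079.99 (5 Offers)",ASUS,Example card A,$4.99 Shipping\r\n'
    '$794.99 (5 Offers),ASUS,Example card B,$4.99 Shipping\r\n'
)

# newline='' mirrors how the csv module expects files to be opened.
rows = list(csv.reader(io.StringIO(sample, newline="")))

for row in rows:
    # Every row has four fields, whether or not the price was quoted.
    print(len(row), row[0])
```

Both the quoted and the unquoted price come back as a single field, so the columns stay aligned.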

【Discussion】: