【Posted】: 2019-02-04 06:27:17
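【Question】:
I am trying to scrape data from an Amazon product page. I have already retrieved the full markup with BeautifulSoup. I want to extract the necessary product details in the following JSON format: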
{
asin: string,
title: string,
price: number,
listPrice: number,
prime: boolean,
dimensions: {
height: number,
length: number,
width: number,
weight: number,
},
images: Array<string>,
attributes: Array<{ name: string, value: string }>,
categories: Array<{ node: string, title: string }>,
}
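As far as I understand, I first need to collect the details into a dictionary. However, I am not sure how to pick these specific tags out of the huge HTML so that I can convert them to a dict.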
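To make the target concrete, here is a minimal sketch of the dictionary shape described above and how it could be serialized with the standard `json` module. The helper name `empty_product` is hypothetical, and the `None` / empty-list placeholders are an assumption about how missing fields might be represented; only the ASIN is taken from the URL in the question.

```python
import json


def empty_product():
    """Return a dict with the fields from the target format, not yet filled in."""
    return {
        "asin": None,
        "title": None,
        "price": None,
        "listPrice": None,
        "prime": None,
        "dimensions": {"height": None, "length": None, "width": None, "weight": None},
        "images": [],      # list of image URLs
        "attributes": [],  # list of {"name": ..., "value": ...}
        "categories": [],  # list of {"node": ..., "title": ...}
    }


product = empty_product()
product["asin"] = "B00ILZH9BO"  # taken from the product URL

# json.dumps turns the dict into a JSON string; None becomes null.
print(json.dumps(product, indent=2))
```

Each field can then be filled in as the corresponding tag is located in the parsed page.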
Edit: my code looks like this:
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/B00ILZH9BO"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text,"lxml")
print(soup)
Edit 2: here is some of the HTML that contains the product details I need:
#######title#########
<span class="a-size-large" id="productTitle">
MagicBrite Complete Teeth Whitening Kit At Home Whitening
</span>
#########price#####
<span class="a-color-price">
<span class="p13n-sc-price">$11.85</span>
</span>
############images#########
<li class="a-spacing-small item"><span class="a-list-item">
<span class="a-declarative" data-action="thumb-action" data-thumb-action='{"thumbnailIndex":4,"variant":"PT04","index":4,"type":"image"}'>
<span class="a-button a-button-thumbnail a-button-toggle"><span class="a-button-inner"><input class="a-button-input" type="submit"/><span aria-hidden="true" class="a-button-text">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/I/51f8kCdwmqL._SS40_.jpg"/>
</span></span></span>
</span>
</span></li>
<li class="a-spacing-small item"><span class="a-list-item">
<span class="a-declarative" data-action="thumb-action" data-thumb-action='{"thumbnailIndex":5,"variant":"PT05","index":5,"type":"image"}'>
<span class="a-button a-button-thumbnail a-button-toggle"><span class="a-button-inner"><input class="a-button-input" type="submit"/><span aria-hidden="true" class="a-button-text">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/I/517mTOTBEiL._SS40_.jpg"/>
</span></span></span>
</span>
</span></li>
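For reference, a rough sketch of how the three snippets above might be turned into a dict with BeautifulSoup. The selectors are based only on the HTML shown here (abridged in `SAMPLE_HTML`), so they are assumptions and may not match the full live page; `extract_product` and `parse_price` are hypothetical helper names, and `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
import json
import re

from bs4 import BeautifulSoup

# Abridged copy of the snippets from the question.
SAMPLE_HTML = """
<span class="a-size-large" id="productTitle">
MagicBrite Complete Teeth Whitening Kit At Home Whitening
</span>
<span class="a-color-price">
<span class="p13n-sc-price">$11.85</span>
</span>
<li class="a-spacing-small item"><span class="a-list-item">
<span class="a-declarative" data-action="thumb-action">
<span aria-hidden="true" class="a-button-text">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/I/51f8kCdwmqL._SS40_.jpg"/>
</span></span></span></li>
<li class="a-spacing-small item"><span class="a-list-item">
<span class="a-declarative" data-action="thumb-action">
<span aria-hidden="true" class="a-button-text">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/I/517mTOTBEiL._SS40_.jpg"/>
</span></span></span></li>
"""


def parse_price(text):
    """Pull the first number out of a string such as '$11.85'."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def extract_product(html):
    """Collect title, price and thumbnail URLs into a dict."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find(id="productTitle")
    price_tag = soup.select_one("span.a-color-price span.p13n-sc-price")
    image_tags = soup.select("li.item span.a-button-text img[src]")
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "price": parse_price(price_tag.get_text()) if price_tag else None,
        "images": [img["src"] for img in image_tags],
    }


product = extract_product(SAMPLE_HTML)
print(json.dumps(product, indent=2))
```

Every lookup is guarded against a missing tag, since a page that omits a field would otherwise raise an `AttributeError` on `None`.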
【Comments】:
-
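Welcome to Stack Overflow! As a general rule when asking a question, provide as much detail as possible. Include any code you have written and any other relevant information, such as the HTML for this specific problem. Also, you should try searching Stack Overflow for similar questions that have already been answered, for example stackoverflow.com/questions/42184367/…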
-
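Anusha, thanks for the edit. However, providing the HTML you want to extract from would help encourage people to answer. The more information you provide, the more likely users are to answer the question.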
-
Thanks for the guidance! Actually, the HTML is very large and I don't know how to upload it.
-
You mention in the question that you want to extract product details. Just show the HTML structure where this information is located.
-
Added and edited! @JulianSilvestri
Tags: python json dictionary web-scraping beautifulsoup