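【Posted】:2015-03-08 08:49:20
【Question】:
I can't get the decompose() function to work in a crawler I built with Python and BeautifulSoup.
The problem is as follows. I am trying to get all the specification data from a product page on a website (you can see it in the page source):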
soup = soup_function('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html')
dt = soup.findAll('dt', {'class': 'product-specs--item-title'})
for i in range(0, len(dt)):
    dtRows = dt[i]
    dtRowsStrip = dtRows.text.strip()
    print(dtRows.text.strip())
    # print(dtRows)
    # dtRowsSplit = "".join(dtRowsStrip.split())
    # print(dtRowsSplit)
When I use:
> print(dtRows.text.strip())
I get this output:
Serie
Threads
Socket
Kloksnelheid
Fabrikantcode
Artikelnummer
Merk
Garantie
Garantietype
Serie
Serie
Socket
Socket
Codenaam
Codenaam
Threads
Threads
Turbo Frequency
Turbo Frequency
Multiplier unlocked
Multiplier unlocked
Cache
Cache
Geheugencontroller
Geheugencontroller
etc ....
The first block of output is correct. In the second block I get double values, because of the <a> tag inside the <dt> tag.
Here is an example:
<dt class="product-specs--item-title">
<a class="product-specs--help-icon js-tooltip" href="#spec_Serie" title="Zowel AMD als Intel produceren processoren in verschillende series. Een serie is bedoeld voor bepaald gebruik. Zo zijn Core i3 processoren geschikt voor internet & office werkzaamheden en Core i7 processoren voor veeleisende multitasking en gaming. Binnen een serie zijn verschillende modellen processoren verkrijgbaar. Van welke serie is deze processor onderdeel?"><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
<span>Serie</span>
</dt>
Can anyone help me remove the complete <a> tag?
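A minimal sketch of how decompose() might be applied here, assuming the page markup matches the <dt> example above. The inline HTML is a shortened, hypothetical stand-in for the real page (the tooltip text is elided and the second <dt> is made up to represent an entry without a help link), and html.parser is an assumed parser choice, since soup_function is not shown:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page; the title attribute is elided and
# the second <dt> is a made-up entry without a help link.
html = """
<dl>
<dt class="product-specs--item-title">
<a class="product-specs--help-icon js-tooltip" href="#spec_Serie" title="..."><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
<span>Serie</span>
</dt>
<dt class="product-specs--item-title">Merk</dt>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for dt in soup.find_all("dt", {"class": "product-specs--item-title"}):
    # decompose() removes the <a> tag and everything nested inside it,
    # so only the plain <span> (or bare text) remains in the <dt>.
    for link in dt.find_all("a"):
        link.decompose()
    titles.append(dt.get_text(strip=True))

print(titles)  # ['Serie', 'Merk']
```

Note that decompose() modifies the parsed tree in place, so the <a> tags are gone from soup for any later lookups as well.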
Additional information:
# If I use the following code:
soup = soup_function('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html')
for spec in soup.select('dt.product-specs--item-title'):
    print(spec.get_text(strip=True))
The output is as follows:
Serie
Threads
Socket
Kloksnelheid
Fabrikantcode
Artikelnummer
Merk
Garantie
Garantietype
SerieSerie
SocketSocket
CodenaamCodenaam
ThreadsThreads
Turbo FrequencyTurbo Frequency
Multiplier unlockedMultiplier unlocked
CacheCache
GeheugencontrollerGeheugencontroller
ProductieprocesProductieproces
Stroomverbruik maximaalStroomverbruik maximaal
KloksnelheidKloksnelheid
ProcessorkernenProcessorkernen
Type GPUType GPU
As you can see, starting with the second <dl> block I get double values.
Addition: Thanks... I just found that out as well. I know your code is better, but I just wanted to share my solution:
for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title span.product-specs--help-title'):
    print(spec.get_text(strip=True))
    parent = spec.find_parent('dt')
    value = parent.find_next_sibling("dd", {'class': 'product-specs--item-spec'})
    print(value.text.strip())
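The selector above only matches <dt> entries that contain a help link, so entries without one would be skipped. Below is a rough sketch of a variant that covers both cases and pairs each title with its <dd> value. The HTML is a hypothetical stand-in for the real page: the class names come from the code above, but the <dd> texts are placeholders, and html.parser is an assumed parser choice:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the <dd> texts are placeholders,
# not values taken from the real page.
html = """
<div class="product-specs">
<dl class="product-specs--list">
<dt class="product-specs--item-title">
<a class="product-specs--help-icon js-tooltip" href="#spec_Serie" title="..."><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
<span>Serie</span>
</dt>
<dd class="product-specs--item-spec">placeholder value A</dd>
<dt class="product-specs--item-title">Merk</dt>
<dd class="product-specs--item-spec">placeholder value B</dd>
</dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

specs = []  # list of (title, value) pairs; titles may repeat across blocks
for dt in soup.select("dt.product-specs--item-title"):
    # Use the <span> that is a direct child of the <dt> when present,
    # which ignores the duplicate title nested inside the <a> tag.
    label = dt.find("span", recursive=False)
    source = label if label is not None else dt
    title = source.get_text(strip=True)

    # The value is expected in the next matching <dd> sibling; guard
    # against a missing <dd> instead of raising AttributeError.
    dd = dt.find_next_sibling("dd", class_="product-specs--item-spec")
    value = dd.get_text(strip=True) if dd is not None else None
    specs.append((title, value))

for title, value in specs:
    print(title, "=", value)
```

A list of pairs is used rather than a dict because the output shown earlier repeats titles such as Serie and Socket in more than one block, and a dict would silently overwrite them.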
【Comments】:
Tags: python web-scraping beautifulsoup html-parsing web-crawler