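【Question title】: Malformed HTML with Python and Beautiful Soup
【Posted】: 2018-08-16 19:24:12
【Question】:

I am using Python 3.5 with bs4 4.6, Selenium 3.6 and PhantomJS to scrape this site. The script runs on my server, which is located in the US, and I want to scrape a German website. However, I have run into a problem. The HTML I download looks like this: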

<div class="col-md-40 product-highlights-container"><div class="product-filters"><select class="colorfilter__select"><option value="{&quot;ebootisId&quot;:&quot;HW102581-1&quot;,&quot;color&quot;:&quot;Midnight Black&quot;,&quot;colorCode&quot;:&quot;000000&quot;,&quot;colorGroup&quot;:&quot;Schwarz&quot;,&quot;colorGroupCode&quot;:&quot;000000&quot;,&quot;deliveryTime&quot;:&quot;2-3 Werktage&quot;,&quot;default&quot;:true,&quot;images&quot;:[{&quot;small&quot;:&quot;/img/dist/HW102581-1_ZU102869_S_1.png&quot;,&quot;medium&quot;:&quot;/img/dist/HW102581-1_ZU102869_M_1.png&quot;,&quot;large&quot;:&quot;/img/dist/HW102581-1_ZU102869_L_1.png&quot;}],&quot;storage&quot;:&quot;64&quot;,&quot;tariffs&quot;:{&quot;TF102910&quot;:{&quot;ebootisId&quot;:&quot;TF102910&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;},&quot;TF101415&quot;:{&quot;ebootisId&quot;:&quot;TF101415&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=telekom&amp;tarif=comfort-allnet&quot;}},&quot;stock&quot;:1086,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;,&quot;price&quot;:49,&quot;offer_id&quot;:&quot;5a8bf20d56b4537a4076868a&quot;,&quot;soldout&quot;:false}">Midnight Black</option><option value="{&quot;ebootisId&quot;:&quot;HW102581-2&quot;,&quot;color&quot;:&quot;Arctic Silver&quot;,&quot;colorCode&quot;:&quot;c7ccd0&quot;,&quot;colorGroup&quot;:&quot;Silber&quot;,&quot;colorGroupCode&quot;:&quot;c0c0c0&quot;,&quot;deliveryTime&quot;:&quot;2-3 
Werktage&quot;,&quot;default&quot;:false,&quot;images&quot;:[{&quot;small&quot;:&quot;/img/dist/HW102581-2_ZU102869_S_1.png&quot;,&quot;medium&quot;:&quot;/img/dist/HW102581-2_ZU102869_M_1.png&quot;,&quot;large&quot;:&quot;/img/dist/HW102581-2_ZU102869_L_1.png&quot;}],&quot;storage&quot;:&quot;64&quot;,&quot;tariffs&quot;:{&quot;TF102910&quot;:{&quot;ebootisId&quot;:&quot;TF102910&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;},&quot;TF101415&quot;:{&quot;ebootisId&quot;:&quot;TF101415&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&amp;speicher=64&amp;carrier=telekom&amp;tarif=comfort-allnet&quot;}},&quot;stock&quot;:503,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;,&quot;price&quot;:49,&quot;offer_id&quot;:&quot;5a8bf20d56b4537a4076868a&quot;,&quot;soldout&quot;:false}">Arctic Silver</option><option value="{&quot;ebootisId&quot;:&quot;HW102581-3&quot;,&quot;color&quot;:&quot;Orchid Grey&quot;,&quot;colorCode&quot;:&quot;9d9dad&quot;,&quot;colorGroup&quot;:&quot;Grau&quot;,&quot;colorGroupCode&quot;:&quot;dcdcdc&quot;,&quot;deliveryTime&quot;:&quot;2-3 
Werktage&quot;,&quot;default&quot;:false,&quot;images&quot;:[{&quot;small&quot;:&quot;/img/dist/HW102581-3_ZU102869_S_1.png&quot;,&quot;medium&quot;:&quot;/img/dist/HW102581-3_ZU102869_M_1.png&quot;,&quot;large&quot;:&quot;/img/dist/HW102581-3_ZU102869_L_1.png&quot;}],&quot;storage&quot;:&quot;64&quot;,&quot;tariffs&quot;:{&quot;TF102910&quot;:{&quot;ebootisId&quot;:&quot;TF102910&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;},&quot;TF101415&quot;:{&quot;ebootisId&quot;:&quot;TF101415&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&amp;speicher=64&amp;carrier=telekom&amp;tarif=comfort-allnet&quot;}},&quot;stock&quot;:500,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;,&quot;price&quot;:49,&quot;offer_id&quot

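Basically it is one long string of text, which makes it impossible for me to find the tags I want.

If I run it through an online beautifier or split the lines myself, it works fine, but that is not a viable solution.

I tried the prettify() function from bs4, but that did not help either.

Here is the relevant code: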

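from bs4 import BeautifulSoup as soup
from selenium import webdriver

# link, filename and path_to_pjs are defined elsewhere in the script
driver = webdriver.PhantomJS(executable_path=path_to_pjs)
driver.get(link)
f = open(filename, "wb")
f.write(driver.page_source.encode('utf-8'))
f.close()
driver.close()
ecj_data = open(filename, 'r', encoding='utf-8').read()
page_soup = soup(ecj_data, "lxml")
page_soup = page_soup.prettify()  # returns a new string; nothing is written back to the file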

【Comments】:

Tags: python html web-scraping beautifulsoup


【Solution 1】:

Your code can be changed as follows. It creates an output file named pretty.html that contains the prettified version of the HTML:

from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://tarife.mediamarkt.de/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet'
filename = 'output.html'

driver = webdriver.PhantomJS()  # optionally pass executable_path=path_to_pjs
driver.get(link)

with open(filename, "wb") as f_output:
    f_output.write(driver.page_source.encode('utf-8'))

page_soup = BeautifulSoup(driver.page_source, "lxml")

with open('pretty.html', 'w') as f_output:
    f_output.write(page_soup.prettify())

driver.close()

This gives you a <div> that starts like this:

<div class="col-md-40 product-highlights-container">
 <div class="product-filters">
  <select class="colorfilter__select">
   <option value='{"ebootisId":"HW102581-1","color":"Midnight Black","colorCode":"000000","colorGroup":"Schwarz","colorGroupCode":"000000","deliveryTime":"2-3 Werktage","default":true,"images":[{"small":"/img/dist/HW102581-1_ZU102869_S_1.png","medium":"/img/dist/HW102581-1_ZU102869_M_1.png","large":"/img/dist/HW102581-1_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=telekom&amp;tarif=comfort-allnet"}},"stock":1075,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet","price":49,"offer_id":"5a8bf20d56b4537a4076868a","soldout":false}'>
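
Beautiful Soup does not need line breaks to locate tags, so the single-line source can also be searched directly, and the `value` attributes shown above can be decoded as JSON. The following is a minimal sketch under stated assumptions: the markup is a shortened, hypothetical stand-in for the page (only the `color`, `stock` and `soldout` fields are kept), `read_color_options` is an illustrative helper name, and the built-in "html.parser" is used instead of lxml so the snippet has no extra dependency.

```python
import json

from bs4 import BeautifulSoup

# Single-line (minified) markup modelled on the page above.
# Assumption: the option values are shortened to three fields for readability.
html = (
    '<div class="product-filters"><select class="colorfilter__select">'
    '<option value="{&quot;color&quot;:&quot;Midnight Black&quot;,'
    '&quot;stock&quot;:1086,&quot;soldout&quot;:false}">Midnight Black</option>'
    '<option value="{&quot;color&quot;:&quot;Arctic Silver&quot;,'
    '&quot;stock&quot;:503,&quot;soldout&quot;:false}">Arctic Silver</option>'
    '</select></div>'
)


def read_color_options(soup):
    """Return the decoded JSON payload of every colour <option> (hypothetical helper)."""
    select = soup.find("select", class_="colorfilter__select")
    # The parser has already turned &quot; back into ", so each
    # attribute value is plain JSON text at this point.
    return [json.loads(option["value"]) for option in select.find_all("option")]


soup = BeautifulSoup(html, "html.parser")
colors = read_color_options(soup)

for entry in colors:
    print(entry["color"], entry["stock"], entry["soldout"])
```

In this sketch no call to prettify() is needed before searching; prettifying only changes how the markup is printed, not what the parser can find.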

【Comments】:

  • Thanks, this works like a charm. I ran into a small ASCII encoding problem, but I was able to solve it with with open(str(i)+'_pretty.html','wb') as f_output: f_output.write(page_soup.prettify(encoding='utf-8')).
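
The encoding problem mentioned in this comment most likely arises because open() in text mode falls back to the platform default encoding, which may not be able to represent German characters. Below is a hedged sketch of two ways to write the prettified output as UTF-8; the sample markup, the temporary directory and the file names are illustrative assumptions, and "html.parser" is used only to keep the snippet free of extra dependencies.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Assumption: a tiny sample document with a non-ASCII character.
soup = BeautifulSoup("<p>Größe: 64 GB</p>", "html.parser")

pretty_text = soup.prettify()                    # returns str
pretty_bytes = soup.prettify(encoding="utf-8")   # returns bytes, already encoded

out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, "pretty_text.html")
bytes_path = os.path.join(out_dir, "pretty_bytes.html")

# Option 1: text mode with an explicit encoding.
with open(text_path, "w", encoding="utf-8") as f_output:
    f_output.write(pretty_text)

# Option 2: binary mode with pre-encoded bytes (the approach from the comment).
with open(bytes_path, "wb") as f_output:
    f_output.write(pretty_bytes)
```

Either option avoids relying on the platform default encoding; which one to use is mostly a matter of taste.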