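【Posted】: 2018-08-16 19:24:12
【Problem description】:
I am using Python 3.5 with BeautifulSoup (bs4) 4.6, selenium 3.6, and PhantomJS to scrape this site. The script runs on my server in the US, and the site I want to scrape is German. However, I have run into a problem. The HTML I download looks like this: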
<div class="col-md-40 product-highlights-container"><div class="product-filters"><select class="colorfilter__select"><option value="{"ebootisId":"HW102581-1","color":"Midnight Black","colorCode":"000000","colorGroup":"Schwarz","colorGroupCode":"000000","deliveryTime":"2-3 Werktage","default":true,"images":[{"small":"/img/dist/HW102581-1_ZU102869_S_1.png","medium":"/img/dist/HW102581-1_ZU102869_M_1.png","large":"/img/dist/HW102581-1_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=telekom&tarif=comfort-allnet"}},"stock":1086,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet","price":49,"offer_id":"5a8bf20d56b4537a4076868a","soldout":false}">Midnight Black</option><option value="{"ebootisId":"HW102581-2","color":"Arctic Silver","colorCode":"c7ccd0","colorGroup":"Silber","colorGroupCode":"c0c0c0","deliveryTime":"2-3 Werktage","default":false,"images":[{"small":"/img/dist/HW102581-2_ZU102869_S_1.png","medium":"/img/dist/HW102581-2_ZU102869_M_1.png","large":"/img/dist/HW102581-2_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&speicher=64&carrier=vodafone&tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&speicher=64&carrier=telekom&tarif=comfort-allnet"}},"stock":503,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&speicher=64&carrier=vodafone&tarif=comfort-allnet","price":49,"offer_id":"5a8bf20d56b4537a4076868a","soldout":false}">Arctic Silver</option><option 
value="{"ebootisId":"HW102581-3","color":"Orchid Grey","colorCode":"9d9dad","colorGroup":"Grau","colorGroupCode":"dcdcdc","deliveryTime":"2-3 Werktage","default":false,"images":[{"small":"/img/dist/HW102581-3_ZU102869_S_1.png","medium":"/img/dist/HW102581-3_ZU102869_M_1.png","large":"/img/dist/HW102581-3_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&speicher=64&carrier=vodafone&tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&speicher=64&carrier=telekom&tarif=comfort-allnet"}},"stock":500,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&speicher=64&carrier=vodafone&tarif=comfort-allnet","price":49,"offer_id"
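It is basically one long string of text, so I cannot find the tags I want.
If I run it through an online beautifier or split the lines myself, it works fine, but that is not a viable solution.
I also tried the prettify() function from bs4, but that did not help either.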
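For reference, an HTML parser does not depend on line breaks, so a single-line document can usually be searched as it is. The following is a minimal sketch, assuming the `option` values hold JSON as in the excerpt above. The `html` string is a shortened, hypothetical stand-in for the real page source (with the quotes escaped as `&quot;`, which is how such attribute values are normally serialized), and `extract_color_options` is an illustrative helper name, not part of bs4.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the one-line page source.
html = (
    '<div class="product-filters"><select class="colorfilter__select">'
    '<option value="{&quot;color&quot;:&quot;Midnight Black&quot;,'
    '&quot;stock&quot;:1086}">Midnight Black</option>'
    '<option value="{&quot;color&quot;:&quot;Arctic Silver&quot;,'
    '&quot;stock&quot;:503}">Arctic Silver</option>'
    '</select></div>'
)


def extract_color_options(markup):
    """Return the decoded JSON payload of every colour <option>."""
    # The markup is parsed as-is; no beautifying or line splitting is needed.
    page = BeautifulSoup(markup, "html.parser")
    options = page.select("select.colorfilter__select option")
    # Each value attribute is a JSON document, so json.loads turns it into a dict.
    return [json.loads(option["value"]) for option in options]


color_options = extract_color_options(html)
for entry in color_options:
    print(entry["color"], entry["stock"])
```

The stdlib `html.parser` backend is used here only to keep the sketch free of extra dependencies; `"lxml"` should work the same way for this markup.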
This is the relevant code:
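from bs4 import BeautifulSoup as soup  # alias assumed from the soup(...) call below
from selenium import webdriver

# path_to_pjs, link and filename are defined elsewhere in the script
driver = webdriver.PhantomJS(executable_path=path_to_pjs)
driver.get(link)
# Save the rendered page source to disk as UTF-8
with open(filename, "wb") as f:
    f.write(driver.page_source.encode("utf-8"))
driver.quit()  # quit() also ends the PhantomJS process
# Read the file back and parse it
with open(filename, "r", encoding="utf-8") as f:
    ecj_data = f.read()
page_soup = soup(ecj_data, "lxml")
# prettify() returns a str, so page_soup is no longer a BeautifulSoup object
page_soup = page_soup.prettify()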
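One detail worth noting about the snippet above: `prettify()` returns a plain string, so rebinding `page_soup` to its result discards the parsed tree that `find()` and `select()` need. A minimal sketch of keeping the two apart, assuming a shortened, hypothetical `html` sample and the illustrative names `pretty_text` and `first_option`:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the downloaded page.
html = (
    '<div class="product-filters"><select class="colorfilter__select">'
    '<option>Midnight Black</option></select></div>'
)

# Keep the parsed tree in its own variable for searching.
page_soup = BeautifulSoup(html, "html.parser")

# prettify() only produces indented text for human reading; it is a str.
pretty_text = page_soup.prettify()

# Tag lookups still work because page_soup was not overwritten.
first_option = page_soup.find("option").get_text()

print(type(pretty_text).__name__)
print(first_option)
```

Under this arrangement the prettified text can be written to a file for inspection while all searching goes through `page_soup`.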
【Discussion】:
-
What is `link`? With the URL we could reproduce the problem.
Tags: python html web-scraping beautifulsoup