【Posted】: 2018-03-31 15:23:42
【Question】:
I have a lot of web scraping to do, so I switched to a headless browser hoping it would make things faster, but it didn't improve the speed by much.
I looked at this Stack Overflow post, but I don't understand the answer someone wrote there: Is Selenium slow, or is my code wrong?
Here is my slow code:
# followed this tutorial https://medium.com/@stevennatera/web-scraping-with-selenium-and-chrome-canary-on-macos-fc2eff723f9e
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'
options.add_argument('window-size=800x841')
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)

driver.get('https://poshmark.com/search?')
xpath = '//input[@id="user-search-box"]'
searchBox = driver.find_element_by_xpath(xpath)
brand = "anthropology"
style = "headband"
searchBox.send_keys(' '.join([brand, style]))
# equivalent of hitting the Enter key
searchBox.send_keys(Keys.ENTER)

url = driver.current_url
print(url)
response = requests.get(url)
print(response)
print(response.text)

# using Beautiful Soup to grab the listings
html = response.content
soup = BeautifulSoup(html, 'html.parser')
# 'a' as in links, i.e. anchor tags
anchor_tags = soup.find_all('a')
# href is the hyperlink
hyper_links = [link.get("href") for link in anchor_tags]
# some anchors have no href, so link.get("href") can return None;
# the `listing and` guard skips those, and a set removes duplicates
clothing_listings = set(listing for listing in hyper_links if listing and "listing" in listing)
print(len(clothing_listings))
print(clothing_listings)
# for some reason links containing "unlike" show up, so count those instead
clothing_listings = set(listing for listing in hyper_links if listing and "unlike" in listing)
print(len(clothing_listings))  # this matches the number of clothing items for that search

driver.quit()
Why does the scraping take so long?
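One cost that is independent of the browser is the BeautifulSoup pass itself: the code above parses the whole page just to collect `<a>` tags. A `SoupStrainer` restricts parsing to the anchors only, which is usually faster on large pages. A minimal offline sketch (the markup here is illustrative sample HTML, not Poshmark's actual structure):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Sample HTML standing in for a fetched search-results page.
html = b"""
<html><body>
  <a href="/listing/blue-headband-123">Blue headband</a>
  <a href="/listing/red-headband-456">Red headband</a>
  <a href="/about">About</a>
  <a>no href</a>
</body></html>
"""

# Parse only the <a> tags instead of building the full document tree.
only_anchors = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_anchors)

# A set comprehension filters and dedupes in one pass; the `h and`
# guard skips anchors without an href (get("href") returns None).
listing_links = {h for h in (a.get("href") for a in soup.find_all("a"))
                 if h and "listing" in h}
print(sorted(listing_links))  # prints ['/listing/blue-headband-123', '/listing/red-headband-456']
```

This replaces the `find_all` + list comprehension + `set()` sequence in the question with a single filtered pass; the result is the same set of listing links.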
【Discussion】:
-
Selenium is about the DOM. If you want something fast, use Python with lxml, or even better: C or Go. The main goal of a headless browser is not fast execution, but being able to scrape pages generated by JS, take screenshots...
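Following this comment's suggestion to drop the browser for a plain HTTP fetch plus a fast parser: `lxml.html` would be the fast choice, but even Python's standard library can extract links with no third-party dependency at all. A dependency-free sketch using `html.parser` from the stdlib (the sample HTML is illustrative):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for this start tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


html = '<a href="/listing/1">one</a> <a href="/other">x</a> <a href="/listing/2">two</a>'
collector = LinkCollector()
collector.feed(html)

listings = [h for h in collector.hrefs if "listing" in h]
print(listings)  # prints ['/listing/1', '/listing/2']
```

This only works when the links are present in the server-rendered HTML; if the page builds its results with JavaScript, a real browser (or the site's own JSON endpoints) is still needed, which is the trade-off the comment is pointing at.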
-
Awesome!!! Looks like you have headless working now, but you never responded to my answer on trouble running chrome headless browser
-
@DebanjanB Sorry, that was because I posted an answer but someone deleted it :/
-
@GillesQuenot do you understand the solution in the stackoverflow link?
-
@Bob Of course, you received a warning message from the review team because while this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. You never responded.
Tags: python-3.x selenium web-scraping beautifulsoup selenium-chromedriver