Web scraping is slow, but I'm not sure why: answers

【Question title】: Web-scraping slow but not sure why
【Posted】: 2018-03-31 15:23:42
【Question】:

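I have a lot of web scraping to do, so I switched to a headless browser hoping it would make things faster, but it hasn't improved the speed much.

I looked at this Stack Overflow post, but I don't understand the answer someone wrote: Is Selenium slow, or is my code wrong?

Here is my slow code: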

# followed this tutorial https://medium.com/@stevennatera/web-scraping-with-selenium-and-chrome-canary-on-macos-fc2eff723f9e
from selenium import webdriver
options = webdriver.ChromeOptions()
options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'
options.add_argument('window-size=800x841')
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://poshmark.com/search?')
xpath='//input[@id="user-search-box"]'
searchBox=driver.find_element_by_xpath(xpath)

brand="anthropology"

style="headband"

searchBox.send_keys(' '.join([brand,style]))

from selenium.webdriver.common.keys import Keys
# equivalent of hitting the Enter key
searchBox.send_keys(Keys.ENTER)

url=driver.current_url
print(url)
import requests
response=requests.get(url)
print(response)
print(response.text)

# using Beautiful Soup to grab the listings:
#print(response)
html=response.content
from bs4 import BeautifulSoup
from urllib.parse import urljoin

#print(html)
soup=BeautifulSoup(html,'html.parser')

# 'a' as in links or anchor tags
anchor_tags=soup.find_all('a')

# finding the hyperlinks
# href is the hyperlink
hyper_links=[link.get("href") for link in anchor_tags]
#print(hyper_links)

# (a more readable way to print the links)
# for link in soup.find_all("a"):
#     print(link.get("href"))

# keep only hrefs that exist and contain the word "listing"
# (the None check is needed because some anchors have no href)
# a set is used because some of the links are repeated
clothing_listings=set([listing for listing in hyper_links if listing and "listing" in listing])
print(len(clothing_listings))
print(set(clothing_listings))
print(len(set(clothing_listings)))

# for some reason links called "unlike" are showing up, so I'm getting rid of those
clothing_listings=set([listing for listing in hyper_links if listing and "unlike" in listing])
print(len(clothing_listings))  # this is the correct number of clothing items for that search





driver.quit()

Why does the scraping take so long?
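
One way to find out which step is slow is to time each stage separately. The following is a minimal sketch using only the standard library. The `timed` helper and the stage labels are hypothetical names that are not part of the code above, and the `time.sleep` calls are stand-ins for the real calls such as `driver.get(...)` and `requests.get(url)`.

```python
import time
from contextlib import contextmanager

timings = {}  # stage label -> elapsed seconds


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-ins for the real stages; replace the sleeps with the actual calls,
# e.g. webdriver.Chrome(...), driver.get(...), requests.get(url).
with timed("start browser"):
    time.sleep(0.05)
with timed("fetch with requests"):
    time.sleep(0.01)

# Print the stages from slowest to fastest.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds:.3f}s")
```

Wrapping the browser start-up, the page load, and the `requests` call in separate `timed` blocks should show where most of the time goes.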

【Comments on the question】:

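Selenium is meant for UI testing. If you want something fast, use Python with lxml, or even better, C or Go. The main goal of a headless browser is not execution speed but the ability to scrape sites whose pages are generated by JS, take screenshots...
Great!!! It looks like you have headless working now, but you still haven't responded to my answer on "trouble running chrome headless browser".
@DebanjanB Sorry, that's because I posted an answer but someone deleted it :/
@GillesQuenot Do you understand the solution in the Stack Overflow link?
@Bob Of course. You received a warning message from the review team because: "While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes." You never responded.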

Tags: python-3.x selenium web-scraping beautifulsoup selenium-chromedriver

【Solution 1】:

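You are already using requests to fetch the URL, so why not use it for the whole task? The part where you use selenium seems redundant: you only use it to open the link, and then you fetch the resulting URL with requests. All you have to do is pass the appropriate headers, which you can collect from the Network tab of the developer tools in Chrome or Firefox.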

rh = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://poshmark.com/search?',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

Modify the URL to search for a specific term:

query = 'anthropology headband'
url = 'https://poshmark.com/search?query={}&type=listings&department=Women'.format(query)

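Then use BeautifulSoup. You can also narrow down the links you scrape by using any attribute that is specific to the links you want. In your case, that is the class attribute covershot-con.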

import requests
from bs4 import BeautifulSoup

r = requests.get(url, headers=rh)
soup = BeautifulSoup(r.content, 'lxml')  # the 'lxml' parser requires the lxml package

links = soup.find_all('a', {'class': 'covershot-con'})

The result looks like this:

for i in links:
    print(i['href'])

/listing/Anthro-Beaded-Headband-5a78fb899a9455e90aef438e
/listing/NWT-ANTHROPOLOGIE-Twisted-Vines-Crystal-Headband-5abbfb4a07003ad2dc58142f
/listing/Anthropologie-Nicole-Co-White-Floral-Headband-59dea5adeaf0302a5600bc41
/listing/NWT-ANTHROPOLOGIE-Namrata-Spring-Blossom-Headband-5ab5509d72769b52ba31829e
.
.
.
/listing/Anthropologie-By-Lilla-Spiky-Blue-Headband-59064f2ffbf6f90bfb01b854
/listing/Anthropologie-Beaded-Headband-5ab2cfe79d20f01a73ab0ddb
/listing/Anthropologie-Floral-Hawaiian-Headband-59d09eb941b4e0e1710871ec
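
The hrefs printed above are relative paths. If absolute URLs are needed, for example to visit each listing, a minimal sketch using `urllib.parse.urljoin` could look like the following. The base URL is taken from the search URL above, the `to_absolute` helper name is hypothetical, and the sample hrefs are copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://poshmark.com"  # site root, taken from the search URL


def to_absolute(hrefs, base=BASE_URL):
    """Join relative hrefs onto the base URL, dropping empty values and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip anchors that have no href
        full = urljoin(base, href)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


sample = [
    "/listing/Anthro-Beaded-Headband-5a78fb899a9455e90aef438e",
    None,
    "/listing/Anthro-Beaded-Headband-5a78fb899a9455e90aef438e",
    "/listing/Anthropologie-Beaded-Headband-5ab2cfe79d20f01a73ab0ddb",
]
for link in to_absolute(sample):
    print(link)
```

Unlike a plain `set`, this keeps the links in the order they appear on the page.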

Edit (tips):

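Use selenium as a last resort (when all other methods fail). As @Gilles Quenot said, selenium is not meant for making web requests quickly.
Learn how to use the requests library (using headers, passing data, etc.). Their documentation page is enough to get started. It is sufficient for most scraping tasks, and it is fast.
Even for pages that require executing JS, you can still use requests if you can figure out how to execute the JS part with a library such as js2py.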
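
As one illustration of the second tip, here is a minimal sketch that reuses a `requests.Session`, so the headers are set once and the underlying connection can be reused across several searches. The `make_session` and `search_html` helper names are hypothetical, the header dict is trimmed to two entries from the `rh` dict above, and the URL pattern is the one used earlier in this answer.

```python
import requests

SEARCH_URL = "https://poshmark.com/search?query={}&type=listings&department=Women"


def make_session(headers):
    """Create a session that sends the same headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def search_html(session, query, timeout=10):
    """Fetch the search results page for `query` and return its HTML."""
    response = session.get(SEARCH_URL.format(query), timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


# Two headers copied from the rh dict; add the others as needed.
session = make_session({
    "accept-language": "en-US,en;q=0.9",
    "referer": "https://poshmark.com/search?",
})
```

Calling `search_html(session, 'anthropology headband')` and then running further queries on the same session avoids setting up a new connection for every request, where the server allows it.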

【Comments】:

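Thank you so much! I have a few questions. 1) Why can't I make the request with queryParameters={'query':'+'.join([brand,"headband"]),'type':'listings','department ':'Women'} response=requests.get(search,params=queryParameters)? Does that work together with rh? When I query this way I get a response, but not the HTML I'm looking for. I also don't know how to find the right headers in Chrome; which area should I highlight to look for them?
@Bob I tried passing the query parameters as data instead of hard-coding them in the URL. It didn't work, and I don't know why. Also, rh has nothing to do with this. rh is just a dict variable (short for request headers) that stores the headers I copied from Chrome's Network tab. See this: mkyong.com/computer-tips/…
Thanks! Do you know whether the request headers are static? Also, are you saying that when you tried queryParameters={'query':'+'.join([brand,"headband"]),'type':'listings','department ':'Women'} response=requests.get(search,params=queryParameters) it had no effect? Because when I try it, I do get a response, but for some reason the number of listings is much lower than it should be.
If you're asking whether they stay the same for a specific page, then yes, but probably not for very long. On the other hand, if you're asking whether these are generic headers that can be passed with a request to any page on any site, then no. You have to record the activity (also in the Network tab; please Google it) and then copy the headers the browser sends.
I meant the first question. Since it's such a dense dictionary, I was worried that it would change over time.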
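
One possible explanation for the `params` attempt discussed above, offered as an assumption rather than a confirmed diagnosis: the values are form-encoded when the query string is built, so the literal `+` produced by `'+'.join(...)` becomes `%2B` instead of acting as a space, and the trailing space in the `'department '` key ends up in the parameter name. The sketch below shows this with the standard library's `urllib.parse.urlencode`. `requests` is assumed to encode `params` in the same way, which can be checked by printing `response.url`.

```python
from urllib.parse import urlencode

brand = "anthropology"

# As written in the comment: '+' joins the words, and the key 'department '
# carries a trailing space.
as_posted = {
    "query": "+".join([brand, "headband"]),
    "type": "listings",
    "department ": "Women",
}

# Variant: join with a plain space and let the encoder produce the '+'.
adjusted = {
    "query": " ".join([brand, "headband"]),
    "type": "listings",
    "department": "Women",
}

print(urlencode(as_posted))
print(urlencode(adjusted))
```

The first line printed contains `anthropology%2Bheadband` and `department+=Women`, while the second contains `anthropology+headband` and `department=Women`. Whether the site then returns the expected listings for the second form is not something this sketch can confirm.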