How to not get blocked while scraping: answers

【Question title】: How to not get blocked while scraping
【Posted】: 2019-04-27 18:10:01
【Question】:

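I am trying to scrape Transfermarkt, a football website, but on every attempt I get blocked at the 7th request.

I tried changing the headers and the proxies, but I always get the same result.

These are some of the "experiments" I ran. Each proxy works when used on its own.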

import random
import requests

# Placeholder: the post elides the actual list ("here are a lot of user agents")
user_agent_list = [...]
# Defined in the post but unused below: each request picks a fresh User-Agent inline
headers = {'User-Agent': random.choice(user_agent_list)}
url = 'https://www.transfermarkt.es/jadon-sancho/profil/spieler/14'

# Three requests through the first proxy
proxies = {'http': 'http://121.121.117.227:3128'}
for _ in range(3):
    r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies=proxies)
    print(r)

# Changing proxy: three more requests
proxies = {'http': 'http://177.131.22.186:80'}
for _ in range(3):
    r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies=proxies)
    print(r)

# Here I get blocked (7th request)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies=proxies)
print(r)
# And I continue trying with other examples

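I should point out that the proxies have been verified, so each one works on its own. The prints show one response until I am blocked, and a different one after that. How can I solve this? Should I change some other parameter of get?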

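【Question comments】:

Tags: python web-scraping python-requests

【Solution 1】:

The main problem with your script is that you are trying to connect to an https server through an http-only proxy. You need to set a proxy for https: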

proxies={'https': 'https://x.y.z.a:b'}

In your case you only set an http proxy, so https requests do not go through it.

Note that the proxy servers you gave in your example do not support https.
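
The scheme-based lookup described in this answer can be sketched as follows. This is a simplified illustration, not the actual implementation inside requests: `proxy_for` is a hypothetical helper that looks only at the URL scheme, then at an `all` key, and ignores host-specific keys and environment variables.

```python
from urllib.parse import urlparse


def proxy_for(url, proxies):
    """Return the proxy URL that applies to `url`, or None for a direct connection.

    Hypothetical helper for illustration: it checks the URL scheme first,
    then an 'all' fallback key, and nothing else.
    """
    scheme = urlparse(url).scheme
    return proxies.get(scheme, proxies.get("all"))


url = "https://www.transfermarkt.es/jadon-sancho/profil/spieler/14"

# The question's setup: only an 'http' entry, so the https URL matches no proxy
# and the request would go out directly from the caller's own address.
http_only = {"http": "http://121.121.117.227:3128"}
print(proxy_for(url, http_only))

# With an 'https' entry present, the https URL is matched to a proxy.
# x.y.z.a:b is the answer's placeholder, not a real address.
with_https = {"http": "http://x.y.z.a:b", "https": "https://x.y.z.a:b"}
print(proxy_for(url, with_https))
```

If that is what happens here, rotating the `http` entry never changed the address the site saw, which would be consistent with being blocked after a fixed number of requests. Whether the proxy URL itself should start with `http://` or `https://` depends on how the proxy accepts connections; many proxies are reached over plain HTTP even when they tunnel https traffic, and the thread does not confirm which applies here.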

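【Comments】:

I have already tried 'https', and then both keys in the same dictionary, and I get requests.exceptions.ConnectionError(). With http I get at least 6 requests through.
That tells you the proxy you are using does not support https; you need to get a list of https proxy servers. When you use the http setting for an https server, your connection goes out directly (it does not use the http proxy server at all).
As far as I can tell, that is what I am doing: r=requests.get(url,headers={'User-Agent':random.choice(user_agent_list)},proxies={'https' : '202.49.183.168:46110'}) That proxy was taken from an https list.
That proxy is dead.
Try this one: 23.20.214.120:3128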
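
Since the comments come down to finding a proxy that is alive and handles https, a minimal sketch of checking candidates before use might look like this. `build_proxies`, `proxy_works` and `filter_live` are hypothetical helper names, and prefixing bare `host:port` addresses with `http://` is an assumption about how these proxies are reached, not something the thread states.

```python
def build_proxies(address):
    """Map both schemes to one proxy, adding 'http://' if no scheme is given.

    Assumption: a bare host:port address is reached over plain HTTP.
    """
    if "://" not in address:
        address = "http://" + address
    return {"http": address, "https": address}


def proxy_works(address, test_url, timeout=10):
    """Return True if a request to `test_url` through the proxy gets a 2xx/3xx reply."""
    import requests  # imported lazily so build_proxies can be used without it

    try:
        response = requests.get(
            test_url, proxies=build_proxies(address), timeout=timeout
        )
    except requests.exceptions.RequestException:
        # Covers dead proxies, refused connections, timeouts and proxy errors.
        return False
    return response.ok


def filter_live(addresses, test_url):
    """Keep only the proxies that pass proxy_works for the given https URL."""
    return [a for a in addresses if proxy_works(a, test_url)]
```

Calling `filter_live(["202.49.183.168:46110", "23.20.214.120:3128"], "https://www.transfermarkt.es/")` would try the two addresses mentioned in the comments; public proxies tend to be short-lived, so the result may well be an empty list.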