【发布时间】:2020-06-23 04:36:52
【问题描述】:
试图了解为什么在尝试从 www.gamestop.com 下载 index.html 时访问被拒绝。我已经想出了如何解决它。 https://www.gamestop.com/on/demandware.static/Sites-gamestop-us-Site/-/default/v1592871955944/js/main.js。我想知道是否有人理解为什么基本 url (www.gamestop.com) 被拒绝。
Code:
import requests
import http.client as http_client
import logging
headers = {
'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding':'gzip, deflate, br',
'accept-language':'en-US,en;q=0.9',
'cache-control':'max-age=0',
'connection':'keep-alive',
'dnt':'1',
'downlink':'10',
'ect':'4g',
'rtt':'50',
'sec-fetch-dest':'document',
'sec-fetch-mode':'navigate',
'sec-fetch-site':'none',
'sec-fetch-user':'?1',
'upgrade-insecure-requests':'1',
'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.410 3.97 Safari/537.36'
}
http_client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
r = requests.get('https://www.gamestop.com', headers=headers)
print(r.text)
print(r.status_code)
print(r.headers)
Output:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.gamestop.com:443
send: b'GET / HTTP/1.1\r\nHost: www.gamestop.com\r\nuser-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.410 3.97 Safari/537.36\r\naccept-encoding: gzip, deflate, br\r\naccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nconnection: keep-alive\r\naccept-language: en-US,en;q=0.9\r\ncache-control: max-age=0\r\ndnt: 1\r\ndownlink: 10\r\nect: 4g\r\nrtt: 50\r\nsec-fetch-dest: document\r\nsec-fetch-mode: navigate\r\nsec-fetch-site: none\r\nsec-fetch-user: ?1\r\nupgrade-insecure-requests: 1\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Server: AkamaiGHost
header: Mime-Version: 1.0
header: Content-Type: text/html
header: Content-Length: 265
header: Expires: Fri, 26 Jun 2020 19:54:19 GMT
header: Date: Fri, 26 Jun 2020 19:54:19 GMT
header: Connection: close
header: Server-Timing: cdn-cache; desc=HIT
header: Server-Timing: cdn-cache; desc=HIT
DEBUG:urllib3.connectionpool:https://www.gamestop.com:443 "GET / HTTP/1.1" 403 265
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://www.gamestop.com/" on this server.<P>
Reference #18.19e8d93f.1593201259.5c2b9d0
</BODY>
</HTML>
403
{'Server': 'AkamaiGHost', 'Mime-Version': '1.0', 'Content-Type': 'text/html', 'Content-Length': '265', 'Expires': 'Fri, 26 Jun 2020 19:54:19 GMT', 'Date': 'Fri, 26 Jun 2020 19:54:19 GMT', 'Connection': 'close', 'Server-Timing': 'cdn-cache; desc=HIT, edge; dur=1'}
【问题讨论】:
-
许多网站不允许跨域主机访问,因为这存在安全风险。也为了防止刮他们的服务器。由于您没有将自己标识为服务器的接受域,因此它会拒绝您的请求。
-
接受域是什么意思?我通过网络浏览器访问该站点没有问题,我通过任何不是被认可的网络浏览器的方式被拒绝。卷曲/wget/python。使用浏览器中的开发工具,我能够获得 curl / wget 的设置,以及标题但那些被拒绝。
-
对我来说,术语域可能是糟糕的命名法。 “起源”可能是更好的术语。话虽如此,删除面向对象的方法并仅使用您提供的标头发出请求,我收到了 200 响应和页面 HTML。您在示例 Web_Scrapper 类中指的是“self.headers”,但从未分配过它。这是因为这是快速拼凑的示例代码吗?还是直接来自您的源代码实现?
-
你能发布 request.get 你用吗?
-
谢谢你,给了我一些工作。
标签: python-3.x url web-scraping http-get access-denied