HTTP GET request access denied: answers

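【Question title】: HTTP GET request Access Denied
【Posted】: 2020-06-23 04:36:52
【Question description】:

I'm trying to understand why I get "Access Denied" when I try to download index.html from www.gamestop.com. I have figured out how to work around it: https://www.gamestop.com/on/demandware.static/Sites-gamestop-us-Site/-/default/v1592871955944/js/main.js. I'd like to know whether anyone understands why the base URL (www.gamestop.com) is rejected.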

Code:
import requests
import http.client as http_client
import logging

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'connection': 'keep-alive',
    'dnt': '1',
    'downlink': '10',
    'ect': '4g',
    'rtt': '50',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.410    3.97 Safari/537.36',
}

http_client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
r = requests.get('https://www.gamestop.com', headers=headers)
print(r.text)
print(r.status_code)
print(r.headers)

Output:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.gamestop.com:443
send: b'GET / HTTP/1.1\r\nHost: www.gamestop.com\r\nuser-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.410    3.97 Safari/537.36\r\naccept-encoding: gzip, deflate, br\r\naccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nconnection: keep-alive\r\naccept-language: en-US,en;q=0.9\r\ncache-control: max-age=0\r\ndnt: 1\r\ndownlink: 10\r\nect: 4g\r\nrtt: 50\r\nsec-fetch-dest: document\r\nsec-fetch-mode: navigate\r\nsec-fetch-site: none\r\nsec-fetch-user: ?1\r\nupgrade-insecure-requests: 1\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Server: AkamaiGHost
header: Mime-Version: 1.0
header: Content-Type: text/html
header: Content-Length: 265
header: Expires: Fri, 26 Jun 2020 19:54:19 GMT
header: Date: Fri, 26 Jun 2020 19:54:19 GMT
header: Connection: close
header: Server-Timing: cdn-cache; desc=HIT
header: Server-Timing: cdn-cache; desc=HIT
DEBUG:urllib3.connectionpool:https://www.gamestop.com:443 "GET / HTTP/1.1" 403 265
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
 
You don't have permission to access "http&#58;&#47;&#47;www&#46;gamestop&#46;com&#47;" on this server.<P>
Reference&#32;&#35;18&#46;19e8d93f&#46;1593201259&#46;5c2b9d0
</BODY>
</HTML>

403
{'Server': 'AkamaiGHost', 'Mime-Version': '1.0', 'Content-Type': 'text/html', 'Content-Length': '265', 'Expires': 'Fri, 26 Jun 2020 19:54:19 GMT', 'Date': 'Fri, 26 Jun 2020 19:54:19 GMT', 'Connection': 'close', 'Server-Timing': 'cdn-cache; desc=HIT, edge; dur=1'}

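【Question discussion】:

Many websites don't allow cross-origin host access because it is a security risk, and also to prevent scraping of their servers. Since you haven't identified yourself as an accepted domain for the server, it rejects your request.
What do you mean by accepted domain? I have no trouble reaching the site through a web browser; I get denied through anything that isn't a recognized web browser: curl, wget, Python. Using the browser's dev tools I was able to get the settings for curl / wget, as well as the headers, but those requests are denied too.
"Domain" was probably poor nomenclature on my part; "origin" may be the better term. That said, after removing the object-oriented approach and just making the request with the headers you provided, I received a 200 response and the page HTML. Your example Web_Scrapper class refers to "self.headers", but it is never assigned. Is that because this is quickly thrown-together example code, or is it straight from your actual source?
Can you post the requests.get call you used?
Thank you, that gives me something to work with.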
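
As a rough aid for the diagnosis discussed above, the sketch below checks whether a response looks like the block shown in the question's output (status 403, `Server: AkamaiGHost`, an "Access Denied" body) and decodes the HTML-entity-encoded reference string. The function name `describe_block` and the matching heuristics are assumptions made for illustration; they are not part of requests or of any CDN's API.

```python
import html
import re


def describe_block(status_code, headers, body):
    """Return the decoded reference ID if the response looks like the
    'Access Denied' page from the question, otherwise None.

    Heuristic only: it matches the status code, Server header, and body
    text seen in the question's output.
    """
    server = headers.get("Server", "")
    if status_code != 403 or "AkamaiGHost" not in server:
        return None
    # html.unescape turns &#32; into ' ', &#35; into '#', &#46; into '.'
    text = html.unescape(body)
    if "Access Denied" not in text:
        return None
    match = re.search(r"Reference\s+#(\S+)", text)
    return match.group(1) if match else ""


# Body fragment copied from the question's output
sample_body = (
    "<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n"
    "<H1>Access Denied</H1>\n"
    "Reference&#32;&#35;18&#46;19e8d93f&#46;1593201259&#46;5c2b9d0\n"
    "</BODY>\n</HTML>\n"
)
reference = describe_block(403, {"Server": "AkamaiGHost"}, sample_body)
print(reference)
```

With requests, the same helper could be called as `describe_block(r.status_code, r.headers, r.text)` right after the `requests.get` call in the question.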

Tags: python-3.x url web-scraping http-get access-denied

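【Solution 1】:

This is code from another project of mine. You can get around the block by using the Python fake-useragent package; search online to learn more about the modules used here.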

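from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Pick a random User-Agent string
ua = UserAgent()
user_agent = ua.random

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument(f"user-agent={user_agent}")

# Selenium 4 takes the driver path through Service;
# Selenium 3 used webdriver.Chrome(executable_path=..., options=...) instead.
driver = webdriver.Chrome(
    service=Service(r'C:\Users\ASHIK\Desktop\chromedriver.exe'),
    options=chrome_options,
)

driver.get("https://www.myntra.com/men?f=Categories%3ATshirts&p=1")
html_doc = driver.page_source
driver.quit()

# page_source is a single string, so write() is enough;
# the with block closes the file, so no explicit close() is needed.
with open('myntra-ecom.html', 'w', encoding='utf-8') as hfile:
    hfile.write(html_doc)

print("Html file Downloaded...")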
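
The core idea in this answer, sending a different User-Agent, can also be sketched without a browser. The example below builds request headers from a small hard-coded pool instead of `fake_useragent`; the pool contents and the helper name `build_headers` are hypothetical, and a changed User-Agent alone may not be enough if the site inspects other aspects of the client.

```python
import random

# Hypothetical pool for illustration; fake_useragent would supply such strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) "
    "Gecko/20100101 Firefox/77.0",
]


def build_headers(rng=random):
    """Return browser-like headers with a User-Agent picked from the pool.

    `rng` can be the random module or a random.Random instance, which
    makes the choice reproducible when a seed is supplied.
    """
    return {
        "user-agent": rng.choice(USER_AGENTS),
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
    }


headers = build_headers(random.Random(0))
print(headers["user-agent"])
```

The result could then be passed as `requests.get(url, headers=build_headers())`, in the same way as the `headers` dictionary in the question.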

【Discussion】: