Answers: Unable to understand the 403 error from HTML parsing using BeautifulSoup4 with Python 3.x

【Question title】: Unable to understand the 403 Error from HTML parsing using BeautifulSoup4 with Python3.x
【Posted】: 2018-06-05 04:02:34
【Question description】:

I am taking the Coursera course Python For Everyone, and I tried one of the problems from the textbook:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.py4e.com/book.htm'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

I don't understand the error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

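According to the full traceback, the error starts at line 18. From reading other SO posts and this similar question, it may have something to do with the SSL certificate and with the website thinking I am a bot. Why doesn't the code work?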

【Question discussion】:

You can add a header to your request.

Tags: python-3.x beautifulsoup ssl-certificate html-parsing urllib

【Solution 1】:

import requests
from bs4 import BeautifulSoup
url = 'https://www.py4e.com/book.htm'
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

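# Send the request with a browser-like User-Agent header
response = requests.get(url, headers=headers)
# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(response.content, "lxml")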

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

Output:

http://amzn.to/1KkULF3
book/index.htm
http://amzn.to/1KkULF3
http://amzn.to/1hLcoBy
http://amzn.to/1KkV42z
http://amzn.to/1fNOnbd
http://amzn.to/1N74xLt
http://do1.dr-chuck.net/py4inf/EN-us/book.pdf
http://do1.dr-chuck.net/py4inf/ES-es/book.pdf
https://twitter.com/fertardio
translations/KO/book_009_ko.pdf
http://www.xwmooc.net/python/
http://fanwscu.gitbooks.io/py4inf-zh-cn/
book_270.epub
translations/ES/book_272_es4.epub
https://www.gitbook.com/download/epub/book/fanwscu/py4inf-zh-cn
html-270/
html_270.zip
http://itunes.apple.com/us/book/python-for-informatics/id554638579?mt=13
http://www-personal.umich.edu/~csev/books/py4inf/ibooks//python_for_informatics.ibooks
http://www.py4inf.com/code
http://www.greenteapress.com/thinkpython/thinkCSpy/
http://allendowney.com/

Updated code using urllib:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.py4e.com/book.htm'

from urllib.request import Request, urlopen

# Send a browser-like User-Agent instead of urllib's default one
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

【Discussion】:

What if I keep urllib and want to use request.add_header?
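For the add_header variant asked about here, a minimal sketch that stays with urllib might look like the following. It assumes the same URL and the same 'Mozilla/5.0' value used in the answer; `fetch` is a hypothetical helper name, and the network call is left commented out so the snippet can be run offline.

```python
import urllib.request

url = 'https://www.py4e.com/book.htm'

# Build the Request object first, then attach the header with add_header()
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0')

# urllib stores header names via str.capitalize(), so the key becomes 'User-agent'
print(req.get_header('User-agent'))


def fetch(request):
    """Hypothetical helper: perform the request and return the raw bytes."""
    with urllib.request.urlopen(request) as response:
        return response.read()


# html = fetch(req)  # uncomment to actually contact the site
```

The resulting bytes can then be passed to BeautifulSoup(html, 'html.parser') exactly as in the answer's urllib version.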
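I removed the SSL certificate parts since they are no longer used. Can you explain why adding the header works?
@Vince The User-Agent string is sent whenever you browse a website from your computer or phone. When you scrape the web, the library sends its own default user agent, which some websites commonly block. Sending a custom User-Agent in the headers is therefore like tricking the website into believing you are visiting from a different device.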
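To see the default user agent mentioned above, here is a small sketch that inspects what urllib would send without making any network request. The helper name `with_browser_agent` is an assumption for illustration, and whether a given site blocks the default value depends on that site's own rules.

```python
import urllib.request

# A fresh opener carries the headers urllib sends when none are supplied
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The value has the form "Python-urllib/<version>"
print(default_headers['User-agent'])


def with_browser_agent(url, agent='Mozilla/5.0'):
    """Hypothetical helper: build a Request that overrides the default User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': agent})


req = with_browser_agent('https://www.py4e.com/book.htm')
print(req.get_header('User-agent'))
```

The first print shows the identifier a server sees by default, and the second shows the browser-like value that replaces it once the header is set on the request.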