【Question Title】: What have I done wrong parsing HTML with Python urllib2 and BeautifulSoup?
【Posted】: 2014-05-17 09:07:32
【Question】:

I'm trying to scrape some links from Google while learning Python.

import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://www.google.com.au/search?q=python')
html = response.read()
print html
response.close()

What am I doing wrong? I get the following error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-4-d990999e71f4> in <module>()
      9 
     10 import urllib2
---> 11 response = urllib2.urlopen('http://www.google.com.au/search?q=python')
     12 html = response.read()
     13 print html

C:\Python27\lib\urllib2.pyc in urlopen(url, data, timeout)
    124     if _opener is None:
    125         _opener = build_opener()
--> 126     return _opener.open(url, data, timeout)
    127 
    128 def install_opener(opener):

C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
    395         for processor in self.process_response.get(protocol, []):
    396             meth = getattr(processor, meth_name)
--> 397             response = meth(req, response)
    398 
    399         return response

C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
    508         if not (200 <= code < 300):
    509             response = self.parent.error(
--> 510                 'http', request, response, code, msg, hdrs)
    511 
    512         return response

C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
    433         if http_err:
    434             args = (dict, 'default', 'http_error_default') + orig_args
--> 435             return self._call_chain(*args)
    436 
    437 # XXX probably also want an abstract factory that knows when it makes

C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    367             func = getattr(handler, meth_name)
    368 
--> 369             result = func(*args)
    370             if result is not None:
    371                 return result

C:\Python27\lib\urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
    516 class HTTPDefaultErrorHandler(BaseHandler):
    517     def http_error_default(self, req, fp, code, msg, hdrs):
--> 518         raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    519 
    520 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

【Question Discussion】:

Tags: python html beautifulsoup


【Solution 1】:

It looks like Google does not allow this type of request: the default urllib2 User-Agent gets rejected with 403 Forbidden.

Try Requests or mechanize instead.

With either library you can easily manipulate your request headers (User-Agent, etc.). Check which one is easier and fits you better.
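A minimal sketch of the header trick described above, shown with Python 3's urllib.request (into which urllib2 was merged; under Python 2 the same names live in urllib2). The browser-like User-Agent string is just an illustrative value, and no request is actually sent here; the Request object is only built and inspected:

```python
from urllib.request import Request  # Python 2: from urllib2 import Request

# Google rejects the default urllib User-Agent with 403, so present a
# browser-like one instead (any common browser UA string will do).
req = Request(
    'http://www.google.com.au/search?q=python',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'},
)

# urlopen(req) would now send the request with the custom header attached;
# here we only confirm the header is in place (urllib stores keys capitalized).
print(req.get_header('User-agent'))
```

Passing the prepared Request object to urlopen, instead of a bare URL string, is what makes the custom headers take effect.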

【Discussion】:

  • What if I need to fetch the HTML of a page that sits behind an intranet and requires authentication with a username and password? Can mechanize and/or BeautifulSoup/urllib2 do that?
  • Sure. Read the mechanize docs, or look at Requests. I created a script that logs into a page and then does some things; I'll edit my post and add the Requests link.
  • Can you share the login script? I have the code below, but it doesn't work for the page I want to scrape — maybe it doesn't handle HTTPS, or something else is blocking it?

    import urllib2
    url = 'somesite'
    username = 'usr'
    password = 'pass'
    p = urllib2.HTTPPasswordMgrWithDefaultRealm()
    p.add_password(None, url, username, password)
    handler = urllib2.HTTPBasicAuthHandler(p)
    opener = urllib2.build_opener(handler)
    urllib2.install_opener(opener)
    page = urllib2.urlopen(url).read()
    print page

  • I can't share the whole script, but this is useful to get started:

    payload = {'Username': 'myusername', 'Password': 'mypassword'}
    s = requests.session()
    s.post(login_url, payload)

    For more, read the Requests documentation — it really is simple :)
  • Do you think that's better than mechanize or BeautifulSoup? I've already used BeautifulSoup to parse links and strings, so I'm new to mechanize and Requests. Should I keep using BS for the link handling, or is it easy enough with mechanize and Requests that BeautifulSoup isn't needed?
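The basic-auth snippet quoted in the discussion above, as a runnable sketch with Python 3 names (Python 2 keeps the same classes in urllib2). The URL and credentials are placeholders carried over from the comment; the opener is built and installed but nothing is fetched:

```python
import urllib.request  # Python 2: import urllib2 and use the same names there

url = 'http://somesite.example/'  # placeholder, as in the comment above
username = 'usr'
password = 'pass'

# Register the credentials for this URL (realm=None matches any realm).
p = urllib.request.HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)

# Build an opener that answers HTTP 401 challenges with basic auth,
# and install it so plain urlopen() calls use it too.
handler = urllib.request.HTTPBasicAuthHandler(p)
opener = urllib.request.build_opener(handler)
urllib.request.install_opener(opener)

# urllib.request.urlopen(url).read() would now authenticate automatically.
# Note: this only covers HTTP basic auth. A form-based login page needs a
# POST of the form fields instead (as in the requests.session() comment above).
```

One likely reason the commenter's version failed: HTTPBasicAuthHandler only responds to 401 challenges, so a site that uses an HTML login form rather than basic auth will never trigger it.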